Chapter 3 Displaying and Summarizing Quantitative Data


  • Chapter 3

    Displaying and Summarizing

    Quantitative Data

  • Objectives

    • Histogram

    • Stem-and-leaf

    plot

    • Dotplot

    • Shape

    • Center

    • Spread

    • Outliers

    • Mean

    • Median

    • Range

    • Interquartile

    range (IQR)

    • Percentile

    • 5-Number

    summary

    • Resistant

    • Variance

    • Standard

    Deviation

  • HISTOGRAM

    Quantitative Data

  • Histogram

    • To make a histogram we first need

    to organize the data using a

    quantitative frequency table.

    • Two types of quantitative data

    1. Discrete – use ungrouped frequency

    table to organize.

    2. Continuous – use grouped frequency

    table to organize.

  • Quantitative Frequency

    Tables – Ungrouped

    • Example: The at-rest pulse rates for 16 athletes at a meet were 57, 57, 56, 57, 58, 56, 54, 64, 53, 54, 54, 55, 57, 55, 60, and 58. Summarize the information with an ungrouped frequency distribution.

  • Quantitative Frequency

    Tables – Ungrouped

    • Example Continued

    Note: The (ungrouped)

    classes are the

    observed values

    themselves.

  • Quantitative Relative Frequency

    Tables - Ungrouped

    Note: The relative

    frequency for a

    class is obtained

    by computing f/n.
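The ungrouped table for the pulse-rate example can be sketched in Python. This is a minimal illustration, not the slide's own table: the function name `ungrouped_frequency_table` is a hypothetical helper, and the frequencies are simply tallied from the 16 values listed in the example.

```python
from collections import Counter

# At-rest pulse rates for the 16 athletes in the example.
pulse_rates = [57, 57, 56, 57, 58, 56, 54, 64, 53, 54, 54, 55, 57, 55, 60, 58]


def ungrouped_frequency_table(values):
    """Return rows of (value, frequency f, relative frequency f/n), sorted by value."""
    n = len(values)
    counts = Counter(values)
    return [(value, f, f / n) for value, f in sorted(counts.items())]


table = ungrouped_frequency_table(pulse_rates)

print(f"{'Pulse':>5} {'f':>3} {'f/n':>7}")
for value, f, rel in table:
    print(f"{value:>5} {f:>3} {rel:>7.4f}")
```

Each class is an observed value, and the relative frequency column is f divided by n = 16.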

  • Quantitative Frequency

    Tables – Grouped

    • Example: The weights of 30 female students majoring in Physical Education on a college campus are as follows: 143, 113, 107, 151, 90, 139, 136, 126, 122, 127, 123, 137, 132, 121, 112, 132, 133, 121, 126, 104, 140, 138, 99, 134, 119, 112, 133, 104, 129, and 123. Summarize the data with a frequency distribution using seven classes.

  • Quantitative Frequency Tables – Grouped

    Example Continued

    Histogram

    with 7 classes

    for the

    weights.

  • Quantitative Frequency Tables – Grouped

    Example Continued

    • Observations

    • From the histogram, the classes (intervals) are 85 – 95, 95 – 105, 105 – 115, etc., with corresponding frequencies of 1, 3, 4, etc.

    • We will use this information to construct the grouped frequency distribution.

  • Quantitative Frequency Tables – Grouped

    Example Continued

    • Observations (continued)

    • Observe that the upper

    class limit of 95 for the

    class 85 – 95 is listed as

    the lower class limit for the

    class 95 – 105.

    • Since the value of 95

    cannot be included in both

    classes, we will use the

    convention that the upper

    class limit is not included in

    the class.

  • Quantitative Frequency Tables – Grouped

    Example Continued

    • Observations (continued)

    • That is, the class 85 – 95

    should be interpreted as

    having the values 85 and

    up to 95 but not including

    the value of 95.

    • Using these observations,

    the grouped frequency

    distribution is constructed

    from the histogram and is

    given on the next slide.

  • Quantitative Frequency Tables – Grouped

    Example Continued

  • Quantitative Frequency Tables – Grouped

    Example Continued

    • Observations (continued)

    • In the grouped frequency distribution, the relative frequencies do not sum to exactly 1. This is due to rounding to four decimal places.

    • The same observation should be noted for the cumulative relative frequency column.
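A short Python sketch of the grouped tally for the 30 weights, assuming classes of width 10 starting at 85 (as read from the histogram) and the convention that the upper class limit is not included. The helper name `grouped_frequency_table` is hypothetical.

```python
# Weights of the 30 students from the example.
weights = [143, 113, 107, 151, 90, 139, 136, 126, 122, 127,
           123, 137, 132, 121, 112, 132, 133, 121, 126, 104,
           140, 138, 99, 134, 119, 112, 133, 104, 129, 123]


def grouped_frequency_table(values, lower, width, num_classes):
    """Tally values into classes [lower, lower + width), [lower + width, ...), etc.

    Returns rows of (lower limit, upper limit, frequency, relative frequency
    rounded to four decimal places). The upper limit is NOT part of the class.
    """
    freqs = [0] * num_classes
    for v in values:
        index = int((v - lower) // width)
        if not 0 <= index < num_classes:
            raise ValueError(f"{v} falls outside the classes")
        freqs[index] += 1
    n = len(values)
    rows = []
    for i, f in enumerate(freqs):
        lo = lower + i * width
        rows.append((lo, lo + width, f, round(f / n, 4)))
    return rows


rows = grouped_frequency_table(weights, lower=85, width=10, num_classes=7)

for lo, hi, f, rel in rows:
    print(f"{lo:>3} to < {hi:<3}  f = {f:<2}  f/n = {rel:.4f}")
print("Sum of rounded relative frequencies:", round(sum(r[3] for r in rows), 4))
```

The rounded relative frequencies add up to 0.9999 rather than 1, which is the rounding effect noted in the observations.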

  • Creating a Histogram

    It is an iterative process—try and try again.

    What bin size should you use?

    • Not too many bins with either 0 or 1 counts

    • Not overly summarized that you lose all the information

    • Not so detailed that it is no longer summary

    Rule of thumb: Start with 5 to 10 bins.

    Look at the distribution and refine your bins.

    (There isn’t a unique or “perfect” solution.)

  • Same data set, two histograms: one not summarized enough, one too summarized

  • Frequency Histogram vs Relative

    Frequency Histogram

    A frequency histogram is a histogram in which the horizontal scale represents the classes of data values and the vertical scale represents the frequencies.

  • Frequency Histogram vs Relative

    Frequency Histogram

    A relative frequency histogram has the same shape and horizontal scale as a frequency histogram, but the vertical scale is marked with relative frequencies.

  • Frequency Histogram vs Relative

    Frequency Histogram

  • Making Histograms on the

    TI-83/84

    Use of Stat Plots on the TI-83/84

    Raw Data: 548, 405, 375, 400, 475, 450, 412

    375, 364, 492, 482, 384, 490, 492

    490, 435, 390, 500, 400, 491, 945

    435, 848, 792, 700, 572, 739, 572

  • Frequency Table Data:

    Class Limits      Frequency
    350 to < 450      11
    450 to < 550      10
    550 to < 650      2
    650 to < 750      2
    750 to < 850      2
    850 to < 950      1

  • STEM AND LEAF PLOT

    Quantitative Data

  • Stem-and-Leaf Plots

    • What is a stem-and-leaf plot? A stem-and-leaf plot is a data plot that uses part of a data value as the stem to form groups or classes and part of the data value as the leaf.

    • Most often used for small or medium-sized data sets. For larger data sets, histograms do a better job.

    • Note: A stem-and-leaf plot has an advantage over a grouped frequency table or histogram, since a stem-and-leaf plot retains the actual data by showing them in graphic form.

  • Stemplots

    How to make a stemplot:

    1) Separate each observation into a stem,

    consisting of all but the final (rightmost) digit,

    and a leaf, which is that remaining final digit.

    Stems may have as many digits as needed.

    Use only one digit for each leaf—either round or

    truncate the data values to one decimal place

    after the stem.

    2) Write the stems in a vertical column with the

    smallest value at the top, and draw a vertical

    line at the right of this column.

    3) Write each leaf in the row to the right of its

    stem, in increasing order out from the stem.

    Original data: 9, 9, 22, 32, 33, 39, 39, 42, 49, 52, 58, 70

    STEM LEAVES

    Include a key showing how to read the stemplot: 0|9 = 9
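The three steps can be sketched in Python for non-negative whole numbers. The function name `stemplot` is a hypothetical helper, and listing stems that have no leaves (so gaps stay visible) is an assumption about layout, since the slide's own picture is not reproduced here.

```python
def stemplot(values):
    """Return text rows 'stem | leaves' for non-negative integers.

    Step 1: split each value into a stem (all but the final digit) and a leaf
            (the final digit).
    Step 2: list the stems in a column, smallest at the top.
    Step 3: write the leaves to the right of their stem in increasing order.
    """
    stems = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        stems.setdefault(stem, []).append(leaf)
    rows = []
    # Assumption: stems with no leaves are still listed so that gaps show up.
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        rows.append(f"{stem} | {leaves}".rstrip())
    return rows


data = [9, 9, 22, 32, 33, 39, 39, 42, 49, 52, 58, 70]
rows = stemplot(data)
print("\n".join(rows))
print("Key: 0|9 = 9")
```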

  • Stem-and-Leaf Plots

    • Example: Consider the following values

    – 96, 98, 107, 110, and 112. Construct

    a stem-and-leaf plot by using the units

    digits as the leaves.

  • Stem-and-Leaf Plot

    Stems and leaves for the data values.

    Stem-and-leaf plot for the data values.

    Stem Leaf

    09 6 8

    10 7

    11 0 2

    Key: 09|6 = 96

  • Your Turn: Stem-and-Leaf Plots

    • A sample of the number of admissions to a

    psychiatric ward at a local hospital during the

    full phases of the moon is as follows: 22, 30,

    21, 27, 31, 36, 20, 28, 25, 33, 21, 38, 32, 35,

    26, 19, 43, 30, 30, 34, 27, and 41.

    • Display the data in a stem-and-leaf plot with

    the leaves represented by the unit digits.

  • Stem-and-Leaf Plot

    Stem Leaf

    1 9

    2 0 1 1 2 5 6 7 7 8

    3 0 0 0 1 2 3 4 5 6 8

    4 1 3

    Key: 1|9 = 19

  • Variations of the StemPlot

    • Splitting Stems – (too few stems or classes) Split

    stems to double the number of stems when all the

    leaves would otherwise fall on just a few stems.

    • Each stem appears twice.

    • Leaves 0-4 go on the 1st stem and leaves 5-9 go on

    the 2nd stem.

    • Example: data – 120, 121, 121, 123, 124, 124, 125, 125, 125, 126, 126, 128, 129, 130, 132, 132, 133, 134, 134, 134, 135, 137, 138, 138, 138, 139

    StemPlot

    12 | 0 1 1 3 4 4 5 5 5 6 6 8 9
    13 | 0 2 2 3 4 4 4 5 7 8 8 8 9

    StemPlot (splitting stems)

    12 | 0 1 1 3 4 4
    12 | 5 5 5 6 6 8 9
    13 | 0 2 2 3 4 4 4
    13 | 5 7 8 8 8 9
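Splitting stems can be sketched the same way. In this illustrative Python version (the name `split_stemplot` is hypothetical), every stem is listed twice: leaves 0 to 4 go on the first copy and leaves 5 to 9 on the second.

```python
def split_stemplot(values):
    """Stemplot rows in which each stem appears twice: leaves 0-4, then leaves 5-9."""
    groups = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        half = 0 if leaf <= 4 else 1          # 0 = first copy of the stem, 1 = second
        groups.setdefault((stem, half), []).append(leaf)
    first_stem = min(groups)[0]
    last_stem = max(groups)[0]
    rows = []
    for stem in range(first_stem, last_stem + 1):
        for half in (0, 1):
            leaves = " ".join(str(leaf) for leaf in groups.get((stem, half), []))
            rows.append(f"{stem} | {leaves}".rstrip())
    return rows


data = [120, 121, 121, 123, 124, 124, 125, 125, 125, 126, 126, 128, 129,
        130, 132, 132, 133, 134, 134, 134, 135, 137, 138, 138, 138, 139]
split_rows = split_stemplot(data)
print("\n".join(split_rows))
```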

  • Stemplots versus Histograms

    Stemplots are quick-and-dirty histograms that can easily be done by hand, which makes them very convenient for back-of-the-envelope calculations. However, they are rarely found in scientific or lay publications.

  • Stemplots versus Histograms

    • Stem-and-leaf displays show the

    distribution of a quantitative variable,

    like histograms do, while preserving

    the individual values.

    • Stem-and-leaf displays contain all

    the information found in a histogram

    and, when carefully drawn, satisfy

    the area principle and show the

    distribution.

  • Stem-and-Leaf Example

    • Compare the histogram and stem-and-leaf

    display for the pulse rates of 24 women at

    a health clinic. Which graphical display do

    you prefer?

    5 | 6
    6 | 0 4 4 4
    6 | 8 8 8 8
    7 | 2 2 2 2
    7 | 6 6 6 6
    8 | 0 0 0 0 4 4
    8 | 8

    Key: 5|6 = 56

  • DOTPLOTS

    Quantitative Data

  • Dot Plots

    • What is a dot plot? A dot plot is a plot that displays a dot for each value in a data set along a number line. If there are multiple occurrences of a specific value, then the dots will be stacked vertically.

  • Dotplots

    • A dotplot is a simple display. It just places a dot along an axis for each case in the data.

    • The dotplot to the right shows Kentucky Derby winning times, plotting each race as its own dot.

    • You might see a dotplot displayed horizontally or vertically.

  • Dot Plot Example:

    • The following data shows the length of 50 movies in

    minutes. Construct a dot plot for the data.

    • 64, 64, 69, 70, 71, 71, 71, 72, 73, 73, 74, 74, 74, 74, 75, 75,

    75, 75, 75, 75, 76, 76, 76, 77, 77, 78, 78, 79, 79, 80, 80, 81,

    81, 81, 82, 82, 82, 83, 83, 83, 84, 86, 88, 89, 89, 90, 90, 92,

    94, 120

    Figure 2-5
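As a rough text-only sketch of the movie-length dot plot, the Python below prints one row per minute value with one dot per movie. It is laid out with the number line running down the page, which is one of the two orientations a dotplot can take; the helper name `dot_plot_rows` is hypothetical and this is not the figure from the slides.

```python
from collections import Counter

movie_lengths = [64, 64, 69, 70, 71, 71, 71, 72, 73, 73, 74, 74, 74, 74, 75, 75,
                 75, 75, 75, 75, 76, 76, 76, 77, 77, 78, 78, 79, 79, 80, 80, 81,
                 81, 81, 82, 82, 82, 83, 83, 83, 84, 86, 88, 89, 89, 90, 90, 92,
                 94, 120]


def dot_plot_rows(values):
    """Return (value, dots) pairs for every whole number from min to max.

    Values that never occur get an empty string, so gaps stay visible.
    """
    counts = Counter(values)
    return [(v, "•" * counts[v]) for v in range(min(values), max(values) + 1)]


dot_rows = dot_plot_rows(movie_lengths)
for value, dots in dot_rows:
    print(f"{value:>4} {dots}")
```

The long run of empty rows before 120 is the gap that marks that movie as a possible outlier.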

  • Dot Plots – Your Turn

    The following frequency

    distribution shows the

    number of defectives

    observed by a quality

    control officer over a 30

    day period. Construct a

    dot plot for the data.

  • Dot Plots – Solution

  • Ogive - Cumulative

    Frequency Curve

  • Cumulative Frequency and the Ogive

    • A histogram displays the distribution of a quantitative variable. It tells little about the relative standing (percentile, quartile, etc.) of an individual observation.

    • For this information, we use a Cumulative Frequency graph,

    called an Ogive (pronounced O-JIVE).

    • The Pth percentile of a distribution is a value such that P%

    of the data fall at or below it.

  • Cumulative Frequency

    • What is a cumulative frequency for a class? The

    cumulative frequency for a

    specific class in a frequency

    table is the sum of the

    frequencies for all values at or

    below the given class.
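A minimal Python sketch of cumulative frequencies, reusing the seven weight classes from the grouped example earlier in the chapter. The frequencies here were tallied from the raw weights listed there; pairing each cumulative frequency with its upper class limit gives points that could be plotted as a cumulative frequency graph.

```python
from itertools import accumulate

# Classes and frequencies for the 30 weights (upper limit not included in a class).
class_limits = [(85, 95), (95, 105), (105, 115), (115, 125),
                (125, 135), (135, 145), (145, 155)]
frequencies = [1, 3, 4, 6, 9, 6, 1]

# Running total: each entry is the sum of the frequencies at or below that class.
cumulative = list(accumulate(frequencies))

# (upper class limit, cumulative frequency) pairs.
ogive_points = [(upper, cf) for (_, upper), cf in zip(class_limits, cumulative)]

for (lo, hi), f, cf in zip(class_limits, frequencies, cumulative):
    print(f"{lo:>3} to < {hi:<3}  f = {f:<2}  cumulative f = {cf}")
```

The last cumulative frequency always equals n, here 30.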

  • Cumulative Frequency

  • Ogive

    • A line graph that depicts cumulative

    frequencies.

    • Used to Find Quartiles and

    Percentiles.

  • Example: Cumulative Frequency Curve

    • The frequencies of the scores of 80 students in a test are

    given in the following table. Complete the corresponding

    cumulative frequency table.

    • A suitable table is as follows:

  • Example continued

    • The information provided by a cumulative frequency table

    can be displayed in graphical form by plotting the cumulative

    frequencies given in the table against the upper class

    boundaries, and joining these points with a smooth curve.

    • The cumulative frequency curve corresponding to the data

    is as follows:

  • Percentiles

    • Explanation of the term – percentiles: Percentiles are numerical values that divide an ordered data set into 100 groups of values with at most 1% of the data values in each group.

    • The kth percentile is the number that falls above k% of the data.

  • Percentile Corresponding to a Given Data Value

    • The percentile corresponding to a given data value, say x, in a set is obtained by using the following formula.

    $$\text{Percentile} = \frac{\text{Number of values at or below } x}{\text{Number of values in data set}} \times 100\%$$
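The percentile formula can be tried out with a few lines of Python. This sketch assumes the "at or below" counting convention, and the function name `percentile_of` is a hypothetical helper; the data are the twelve values from the stemplot example.

```python
def percentile_of(x, data):
    """Percentile of x: share of data values at or below x, as a percentage."""
    at_or_below = sum(1 for value in data if value <= x)
    return at_or_below / len(data) * 100


data = [9, 9, 22, 32, 33, 39, 39, 42, 49, 52, 58, 70]

# 7 of the 12 values are at or below 39.
print(round(percentile_of(39, data), 1))
# Every value is at or below the maximum.
print(percentile_of(70, data))
```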

  • Think Before You Draw, Again

    • Remember the “Make a picture” rule?

    • Now that we have options for data

    displays, you need to Think carefully about

    which type of display to make.

    • Before making a stem-and-leaf display, a

    histogram, or a dotplot, check the

    • Quantitative Data Condition: The data

    are values of a quantitative variable

    whose units are known.

  • Shape, Center, and Spread

    • When describing a distribution,

    make sure to always tell about three

    things: shape, center, and spread…

    • Actually you should comment on

    four things when describing a

    distribution. The three above and

    any deviations from the shape.

    • These deviations from the shape are

    called ‘outliers’ and will be

    discussed later.

  • What is the Shape of the

    Distribution?

    1. Does the histogram have a single,

    central hump or several separated

    humps?

    2. Is the histogram symmetric?

    3. Do any unusual features stick out?

  • Humps

    1. Does the histogram have a single,

    central hump or several separated

    bumps?

    • Humps in a histogram are called

    modes or peaks.

    • A histogram with one main peak is

    dubbed unimodal; histograms with

    two peaks are bimodal; histograms

    with three or more peaks are called

    multimodal.

  • Humps (cont.)

    • A bimodal histogram has two apparent peaks:

  • Humps (cont.)

    • A histogram that doesn’t appear to have any mode and

    in which all the bars are approximately the same height

    is called uniform:

  • Uniform or Rectangular

    Distribution

    • A distribution in which every

    class has equal frequency. A

    uniform distribution is

    symmetrical with the added

    property that the bars are the

    same height.

  • Symmetry

    2. Is the histogram symmetric?

    • If you can fold the histogram along a vertical line

    through the middle and have the edges match

    pretty closely, the histogram is symmetric.

  • Symmetrical Distribution

    • In a symmetrical distribution, the data values are evenly distributed on both sides of the mean.

    • When the distribution is unimodal, the mean, the median, and the mode are all equal to one another and are located at the center of the distribution.

  • Symmetrical Distribution

  • Symmetry (cont.)

    • The (usually) thinner ends of a distribution are called the tails. If one tail stretches out farther than the other, the histogram is said to be skewed to the side of the longer tail.

    • In the figure below, the histogram on the left is said to be skewed left, while the histogram on the right is said to be skewed right.

  • Skewed Right Distribution

    • In a skewed right distribution, most of the data values fall to the left of the mean, and the “tail” of the distribution is to the right.

    • The mean is to the right of the median and the mode is to the left of the median.

  • Skewed Right Distribution

    Skewed Right

  • Skewed Left Distribution

    • In a skewed left distribution, most of the data values fall to the right of the mean, and the “tail” of the distribution is to the left.

    • The mean is to the left of the median and the mode is to the right of the median.

  • Skewed Left Distribution

    Skewed Left

  • Anything Unusual?

    3. Do any unusual features stick out?

    • Sometimes it’s the unusual features that tell us something interesting or exciting about the data.

    • You should always mention any stragglers, or outliers, that stand off away from the body of the distribution.

    • Are there any gaps in the distribution? If so, we might have data from more than one group.

  • Anything Unusual? (cont.)

    • The following histogram has outliers—

    there are three cities in the leftmost bar:

  • Deviations from the Overall Pattern

    • Outliers – An individual observation that falls outside the

    overall pattern of the distribution. Extreme Values –

    either high or low.

    • Causes:

    1. Data Mistake

    2. Special nature of some observations

  • Outliers

    An important kind of deviation is an outlier. Outliers are

    observations that lie outside the overall pattern of a

    distribution. Always look for outliers and try to explain them.

    The overall pattern is fairly

    symmetrical except for two

    states clearly not belonging

    to the main trend. Alaska

    and Florida have unusual

    representation of the

    elderly in their population.

    A large gap in the

    distribution is typically a

    sign of an outlier.

  • Numerical Data Properties

    Central Tendency

    (center)

    Variation

    (spread)

    Shape

  • Where is the Center of the

    Distribution?

    • If you had to pick a single number to

    describe all the data what would you pick?

    • It’s easy to find the center when a

    histogram is unimodal and symmetric—it’s

    right in the middle.

    • On the other hand, it’s not so easy to find

    the center of a skewed histogram or a

    histogram with more than one mode.

  • Measures of Central Tendency

    • A measure of central tendency for a collection of data values is a number that is meant to convey the idea of centralness for the data set.

    • The most commonly used measures of central tendency for sample data are the mean, median, and mode.

  • The Mean

    • Explanation of the term – mean: The mean of a set of numerical (data) values is the (arithmetic) average for the set of values.

    • NOTE: When computing the value of the mean, the data values can be population values or sample values.

    • Hence we can compute either the population mean or the sample mean.

  • The Mean

    • Explanation of the term – population mean: If the numerical values are from an entire population, then the mean of these values is called the population mean.

    • NOTATION: The population mean is usually denoted by the Greek letter µ (read as “mu”).

  • The Mean

    • Explanation of the term – sample mean: If the numerical values are from a sample, then the mean of these values is called the sample mean.

    • NOTATION: The sample mean is usually denoted by $\bar{x}$ (read as “x-bar”).

  • The Mean -- Example

    • Example: What is the mean of the following 11 sample values?

    3 8 6 14 0 -4

    0 12 -7 0 -10

  • The Mean -- Example (Continued)

    • Solution:

    $$\bar{x} = \frac{3 + 8 + 6 + 14 + 0 + (-4) + 0 + 12 + (-7) + 0 + (-10)}{11} = \frac{22}{11} = 2$$
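The same arithmetic can be checked in Python. This is only a sketch of the calculation: `mean` is a small hand-written helper, and the standard library's `statistics.mean` is used as a cross-check.

```python
import statistics

sample = [3, 8, 6, 14, 0, -4, 0, 12, -7, 0, -10]


def mean(values):
    """Arithmetic average: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


x_bar = mean(sample)
print(sum(sample), len(sample), x_bar)

# The standard library gives the same result.
print(statistics.mean(sample))
```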

  • The Mean

    • Nonresistant – The mean is sensitive to the influence of

    extreme values and/or outliers. Skewed distributions pull

    the mean away from the center towards the longer tail.

    • The mean is located at the balancing point of the histogram. For a skewed distribution, the mean is not a good measure of center.

  • The Mean

    • Nonresistant – Example

    • Example – Data: {1,2,3,4,5,6,7}

    • The mean is 4

    • Add an outlier {1,2,3,4,5,6,7,50}

    • New mean is 9.75 – large effect

  • Quick Tip:

    • When a data set has a large number of values, we sometimes summarize it as a frequency table. The frequencies represent the number of times each value occurs.

    • When the mean is calculated from a frequency table it is often an approximation, because the raw data is sometimes not known.

  • Calculating Means

    • TI-83/84 1-Var Stats

    • Using raw data

    • Using Frequency table data

  • Calculating Means on TI-83/84

    Raw Data: 548, 405, 375, 400, 475, 450, 412

    375, 364, 492, 482, 384, 490, 492

    490, 435, 390, 500, 400, 491, 945

    435, 848, 792, 700, 572, 739, 572

  • Calculating Means on TI-83/84

    • Grouped Frequency Table Data:

    Class Limits      Frequency
    350 to < 450      11
    450 to < 550      10
    550 to < 650      2
    650 to < 750      2
    750 to < 850      2
    850 to < 950      1
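To see why a mean computed from a frequency table is usually only an approximation, the following Python sketch compares the two. It assumes each class is represented by its midpoint (a common convention, not something stated on the slide), and `grouped_mean` is a hypothetical helper name.

```python
raw = [548, 405, 375, 400, 475, 450, 412,
       375, 364, 492, 482, 384, 490, 492,
       490, 435, 390, 500, 400, 491, 945,
       435, 848, 792, 700, 572, 739, 572]

classes = [(350, 450), (450, 550), (550, 650), (650, 750), (750, 850), (850, 950)]
frequencies = [11, 10, 2, 2, 2, 1]


def grouped_mean(classes, frequencies):
    """Approximate mean from a grouped table.

    Assumption: every value in a class is treated as if it sat at the class midpoint.
    """
    midpoints = [(lo + hi) / 2 for lo, hi in classes]
    total = sum(m * f for m, f in zip(midpoints, frequencies))
    return total / sum(frequencies)


approx_mean = grouped_mean(classes, frequencies)   # 14500 / 28
exact_mean = sum(raw) / len(raw)                   # 14453 / 28

print(f"Mean from raw data:       {exact_mean:.2f}")
print(f"Mean from grouped table:  {approx_mean:.2f}")
```

The two results are close but not equal, because the table no longer records where each value falls inside its class.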

  • The Median

    • Explanation of the term – median: The median of a set of numerical (data) values is that numerical value in the middle when the data set is arranged in order.

    • NOTE: When computing the value of the median, the data values can be population values or sample values.

    • Hence we can compute either the population median or the sample median.

  • Center of a Distribution -- Median

    • The median is the value with exactly half the data values

    below it and half above it.

    • It is the middle data

    value (once the data

    values have been

    ordered) that divides

    the histogram into

    two equal areas

    • It has the same units

    as the data

  • Quick Tip:

    • When the number of values in the data set is odd, the median will be the middle value in the ordered array.

    • When the number of values in the data set is even, the median will be the average of the two middle values in the ordered array.

  • The Median -- Example

    • Example: What is the median for the following sample values?

    3 8 6 14 0 -4

    2 12 -7 -1 -10

  • The Median -- Example (Continued)

    • Solution: First of all, we need to arrange the data set in order. The ordered set is:

    -10 -7 -4 -1 0 2 3 6 8 12 14

    6th value

  • The Median -- Example (Continued)

    • Solution (Continued): Since the number of values is odd, the median is in the 6th position of the ordered set. (To find the position, divide the number of data values by 2 and round up: 11/2 = 5.5 ⇒ 6.)

    • Thus, the value of the median is 2.

  • The Median -- Example

    • Example: Find the median age for the following eight college students.

    23 19 32 25 26 22 24 20

  • The Median – Example (continued)

    • Solution: First we have to order the values as shown below.

    19 20 22 23 24 25 26 32

  • The Median – Example (continued)

    • Solution (continued): Since there is an even number of ages, the median will be the average of the two middle values. (To find them, divide the number of data values by 2; that position and the next hold the two middle values: 8/2 = 4 ⇒ the 4th and 5th values.)

    • Thus, median = (23 + 24)/2 = 23.5.
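Both median examples follow the same recipe, which can be sketched in Python as below. The function is a hand-written illustration of the odd/even rule from the Quick Tip, not a library routine.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2                      # 0-based index
    if n % 2 == 1:
        return ordered[middle]           # odd n: the single middle value
    return (ordered[middle - 1] + ordered[middle]) / 2   # even n: average the middle pair


sample = [3, 8, 6, 14, 0, -4, 2, 12, -7, -1, -10]   # 11 values (odd)
ages = [23, 19, 32, 25, 26, 22, 24, 20]             # 8 values (even)

print(median(sample))
print(median(ages))
```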

  • The Median

    The median is the midpoint of a distribution: the number such that half of the observations are smaller and half are larger.

    1. Sort observations from smallest to largest. n = number of observations.

    2. If n is odd, the median is observation n/2 (round up) down the list.

    Example (n = 25): 0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1

    n/2 = 25/2 = 12.5, which rounds up to 13, so the median is the 13th observation.

    Median = 3.4

    3. If n is even, the median is the mean of the two center observations.

    Example (n = 24): 0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6

    n/2 = 12, so the two center observations are the 12th and 13th.

    Median = (3.3 + 3.4)/2 = 3.35

  • The Median

    • Resistant – The median is said to

    be resistant, because extreme

    values and/or outliers have little

    effect on the median.

    • Example – Data: {1,2,3,4,5,6,7}

    • The median is 4

    • Add an outlier {1,2,3,4,5,6,7,50}

    • New median is 4.5 – very little effect
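The contrast between a nonresistant mean and a resistant median is easy to reproduce. The sketch below uses Python's standard `statistics` module on the same two data sets, before and after adding the outlier 50.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7]
with_outlier = data + [50]

mean_before = statistics.mean(data)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(data)
median_after = statistics.median(with_outlier)

print(f"Mean:   {mean_before} -> {mean_after}")      # the mean jumps
print(f"Median: {median_before} -> {median_after}")  # the median barely moves
```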

  • The Mode

    • Explanation of the term – mode: The mode of a set of numerical (data) values is the most frequently occurring value in the data set.

  • Quick Tip:

    • If all the elements in the data set have the same frequency of occurrence, then the data set is said to have no mode.

    Example of data set with no mode.

  • Quick Tip:

    • If the data set has one value that occurs more frequently than the rest of the values, then the data set is said to be unimodal.

    Example of a unimodal data set.

  • Quick Tip:

    • If two data values in the set are tied for the highest frequency of occurrence, then the data set is said to be bimodal.

    Example of a bimodal set of data.

  • Summary Measures of Center

  • How Spread Out is the

    Distribution?

    • Variation matters, and Statistics is about

    variation.

    • Are the values of the distribution tightly

    clustered around the center or more

    spread out?

    • Always report a measure of spread along

    with a measure of center when describing

    a distribution numerically.

  • Measures of Spread

    • A measure of variability for a collection of data values is a number that is meant to convey the idea of spread for the data set.

    • The most commonly used measures of variability for sample data are the range, the interquartile range, and the variance or standard deviation.

  • Spread: Home on the Range

    • The range of the data is the difference

    between the maximum and minimum

    values:

    Range = max – min

    • A disadvantage of the range is that a

    single extreme value can make it very

    large and, thus, not representative of the

    data overall.

  • Range

    • The range is affected by outliers (large or small values relative to the rest of the data set).

    • The range does not utilize all the information in the data set, only the largest and smallest values.

    • Thus it is not a very useful measure of spread or variation.

  • Spread: The Interquartile Range

    • A better way to describe the spread of a

    set of data might be to ignore the extremes

    and concentrate on the middle of the data.

    • The interquartile range (IQR) lets us ignore

    extreme data values and concentrate on

    the middle of the data.

    • To find the IQR, we first need to know

    what quartiles are…

  • Spread: The Interquartile Range

    (cont.)

    • Quartiles divide the data into four equal sections.

    • One quarter of the data lies below the lower quartile, Q1.

    • One quarter of the data lies above the upper quartile, Q3.

    • The quartiles border the middle half of the data.

    • The difference between the quartiles is the interquartile range (IQR), so

    IQR = upper quartile – lower quartile

  • Finding Quartiles

    1. Order the Data

    2. Find the median. This divides the data into a lower and upper half (the median itself is in neither half).

    3. Q1 is then the median of the lower half.

    4. Q3 is the median of the upper half.

    5. Example

    Even number of data values: Q1 = 27, M = 39, Q3 = 50.5

    IQR = 50.5 – 27 = 23.5

    Odd number of data values: Q1 = 35, M = 46, Q3 = 54

    IQR = 54 – 35 = 19

  • The Interquartile Range

    • The following depicts the idea of the interquartile range.

  • IQR = Q3 - Q1

  • Spread: The Interquartile Range

    (cont.)

    • The lower and upper quartiles are the 25th and 75th

    percentiles of the data, so…

    • The IQR contains the middle 50% of the values of the

    distribution, as shown in figure:

  • Example IQR

    Ordered data (n = 25): 0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1

    M = median = 3.4 (the 13th observation)

    Q1 = first quartile = 2.2 (median of the 12 observations below M)

    Q3 = third quartile = 4.35 (median of the 12 observations above M)

    The first quartile, Q1, is the value in the sample that has 25% of the data at or below it.

    The third quartile, Q3, is the value in the sample that has 75% of the data at or below it.

    IQR = Q3 - Q1 = 4.35 - 2.2 = 2.15

  • Your Turn:

    • The following scores for a statistics 10-point quiz were reported. What is the value of the interquartile range?

    7 8 9 6 8 0 9 9 9

    0 0 7 10 9 8 5 7 9

    Solution: IQR = 3
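The quiz-score answer can be reproduced with a short Python sketch of the "median of each half" method from the Finding Quartiles steps. The helper names are hypothetical, and other software may use different quartile conventions that give slightly different values.

```python
def median_of_sorted(ordered):
    """Median of a list that is already in increasing order."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method.

    When n is odd, the median itself is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median_of_sorted(lower), median_of_sorted(ordered), median_of_sorted(upper)


scores = [7, 8, 9, 6, 8, 0, 9, 9, 9,
          0, 0, 7, 10, 9, 8, 5, 7, 9]

q1, med, q3 = quartiles(scores)
iqr = q3 - q1
print(f"Q1 = {q1}, median = {med}, Q3 = {q3}, IQR = {iqr}")
```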

  • Calculator - IQR

    • TI-83 Solution: The following shows the descriptive statistics output.

    Interquartile range = Q3 – Q1 = 9 – 6 = 3.

  • 5-Number Summary

    • The 5-number summary of a distribution reports its median,

    quartiles, and extremes (maximum and minimum)

    • The 5-number summary for the recent tsunami earthquake magnitudes looks like this:

    • Obtain 5-number summary

    from 1-Var Stats

  • What About Spread? The

    Standard Deviation

    • A more powerful measure of spread than

    the IQR is the standard deviation, which

    takes into account how far each data value

    is from the mean.

    • A deviation is the distance that a data

    value is from the mean.

    • Since adding all deviations together

    would total zero, we square each

    deviation and find an average of sorts

    for the deviations.

  • What About Spread? The

    Standard Deviation (cont.)

    • The variance, notated by $s^2$, is found by summing the squared deviations and (almost) averaging them:

    $$s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$$

    • Used to calculate Standard Deviation.

    • The variance will play a role later in our study, but it is problematic as a measure of spread: it is measured in squared units, a serious disadvantage!

  • What About Spread? The

    Standard Deviation (cont.)

    • The standard deviation, s, is just the square root of the variance and is measured in the same units as the original data.

    $$s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$$

  • Procedure for Calculating the Standard

    Deviation using Formula

    1. Compute the mean $\bar{x}$.

    2. Subtract the mean from each individual value to get a list of the deviations from the mean, $(x - \bar{x})$.

    3. Square each of the differences to produce the squares of the deviations from the mean, $(x - \bar{x})^2$.

    4. Add all of the squares of the deviations from the mean to get $\sum (x - \bar{x})^2$.

    5. Divide the sum by $(n - 1)$. [variance]

    6. Find the square root of the result.

  • Example:

    • Find the standard deviation of the Mulberry

    Bank customer waiting times. Those times

    (in minutes) are 1, 3, 14.
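One way to work this example is to follow the six-step procedure literally. The Python below is a sketch of that hand calculation for the three waiting times; the variable names are illustrative only.

```python
import math

waiting_times = [1, 3, 14]   # minutes
n = len(waiting_times)

mean = sum(waiting_times) / n                        # step 1: the mean
deviations = [x - mean for x in waiting_times]       # step 2: deviations from the mean
squared = [d ** 2 for d in deviations]               # step 3: squared deviations
total = sum(squared)                                 # step 4: sum of squared deviations
variance = total / (n - 1)                           # step 5: divide by n - 1 (variance)
s = math.sqrt(variance)                              # step 6: square root

print(f"mean = {mean}, deviations = {deviations}")
print(f"sum of squares = {total}, variance = {variance}, s = {s} minutes")
```

Note that the deviations add up to zero, which is why they are squared before being combined.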

  • Calculating Standard Deviation

    on the TI-83/84

    • Use 1-Var Stats

    • Sx is the sample standard deviation

    • σx is the population standard deviation
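The difference between the two calculator outputs is only the divisor: the sample version divides by n − 1 and the population version by n. As a rough parallel, Python's standard `statistics` module offers both; the sketch below uses the three waiting times from the example above.

```python
import math
import statistics

waiting_times = [1, 3, 14]

sample_sd = statistics.stdev(waiting_times)       # divides by n - 1, like Sx
population_sd = statistics.pstdev(waiting_times)  # divides by n, like σx

print(f"sample standard deviation     = {sample_sd:.4f}")
print(f"population standard deviation = {population_sd:.4f}")
```

For the same data the population value is always the smaller of the two, and the gap shrinks as n grows.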

  • Properties of Standard Deviation

    • Measures spread about the mean and should only be used to describe the spread of a distribution when the mean is used to describe the center (i.e., symmetrical distributions).

    • The value of s is never negative. It is zero only when all of the data values are the same number. Larger values of s indicate greater amounts of variation.

    • Nonresistant, s can increase dramatically due to extreme

    values or outliers.

    • The units of s are the same as the units of the original data. This is one reason s is preferred to $s^2$.

  • Thinking About Variation

    • Since Statistics is about variation, spread is an important fundamental concept of Statistics.

    • Measures of spread help us talk about what we don’t know.

    • When the data values are tightly clustered around the center of the distribution, the IQR and standard deviation will be small.

    • When the data values are scattered far from the center, the IQR and standard deviation will be large.

  • Summarizing Symmetric

    Distributions -- The Mean

    • When we have symmetric data, there is an alternative other than the median.

    • If we want to calculate a number, we can average the data.

    • We use the Greek letter sigma to mean “sum” and write:

    $$\bar{y} = \frac{\text{Total}}{n} = \frac{\sum y}{n}$$

    The formula says that to find the mean, we add up all the values of the variable and divide by the number of data values, n.

  • Summarizing Symmetric

    Distributions -- The Mean (cont.)

    • The mean feels like the center because it

    is the point where the histogram balances:
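    One way to see the balance idea numerically: the total distance of the values below the mean equals the total distance of the values above it. The sketch below checks this on the Mulberry Bank waiting times; the names `left_pull` and `right_pull` are made up for this illustration.

```python
values = [1, 3, 14]  # Mulberry Bank waiting times, in minutes
mean = sum(values) / len(values)  # 6

# Total distance from the mean on each side of it.
left_pull = sum(mean - v for v in values if v < mean)    # 5 + 3
right_pull = sum(v - mean for v in values if v > mean)   # 8

print(left_pull, right_pull)  # 8.0 8.0
```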

  • Mean or Median?

    • Because the median considers only the

    order of values, it is resistant to values that

    are extraordinarily large or small; it simply

    notes that they are one of the “big ones” or

    “small ones” and ignores their distance

    from center.

    • To choose between the mean and median,

    start by looking at the data. If the

    histogram is symmetric and there are no

    outliers, use the mean.

    • However, if the histogram is skewed or has outliers, you are better off with the median.
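    A small sketch of resistance, using invented data: adding one extreme value moves the mean a long way but barely moves the median.

```python
import statistics

data = [2, 3, 3, 4, 5]
with_outlier = data + [40]   # one extraordinarily large value added

mean_before = statistics.mean(data)             # 3.4
median_before = statistics.median(data)         # 3
mean_after = statistics.mean(with_outlier)      # 9.5
median_after = statistics.median(with_outlier)  # 3.5

print(mean_before, mean_after)      # 3.4 9.5
print(median_before, median_after)  # 3 3.5
```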

  • Comparing the mean and the median

    [Figure: positions of the mean and median in three distributions: symmetric, left skew, and right skew]

    • The mean and the median are the same when the distribution is exactly symmetric; in a skewed distribution the mean is pulled toward the longer tail.

    • The median is a measure of center that is resistant to skew and outliers. The mean is not.

  • Mean and Median of a Distribution with Outliers

    • The mean is pulled to the right a lot by the outliers (from 3.4 to 4.2).

    • The median, on the other hand, is only slightly pulled to the right by the outliers (from 3.4 to 3.6).

    [Figure: two histograms with vertical axis "Percent of people dying"; without the outliers, $\bar{x} = 3.4$; with the outliers, $\bar{x} = 4.2$]

  • Example

    • Observed mean = 2.28, median = 3, mode = 3.1

    • What is the shape of the distribution and why?

  • Example

    Solution: Skewed left, because mean < median < mode (2.28 < 3 < 3.1).

    Left-Skewed: Mean < Median < Mode

    Symmetric: Mean = Median = Mode

    Right-Skewed: Mode < Median < Mean
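    The comparison behind this solution can be written as a rule of thumb. The helper `shape_from_center` below is hypothetical and only compares the mean with the median; it is a rough guide, not a substitute for looking at the histogram.

```python
def shape_from_center(mean, median):
    """Rule of thumb: the mean is pulled toward the longer tail."""
    if mean < median:
        return "skewed left"
    if mean > median:
        return "skewed right"
    return "roughly symmetric"

# Values from the example: mean = 2.28, median = 3.
answer = shape_from_center(2.28, 3)
print(answer)  # skewed left
```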

  • Conclusion – Mean or

    Median?

    • Mean – use with symmetrical

    distributions (no outliers),

    because it is nonresistant.

    • Median – use with skewed distributions or distributions with outliers, because it is resistant.

  • Tell -- Shape, Center, and Spread

    • Always report the shape of the distribution, along with a center and a spread.

    • If the shape is skewed, report the

    median and IQR.

    • If the shape is symmetric, report the

    mean and standard deviation and

    possibly the median and IQR as well.
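    The reporting rule above can be sketched as a small function. The name `summarize` and its `shape` argument are hypothetical, and the shape must still be judged from a picture of the data. The quartiles use Python's `statistics.quantiles` with the "inclusive" method, which is an assumption; calculators and textbooks may use a slightly different quartile convention.

```python
import statistics

def summarize(values, shape):
    """Return the center and spread suited to the stated shape."""
    if shape == "symmetric":
        return {
            "center": ("mean", statistics.mean(values)),
            "spread": ("standard deviation", statistics.stdev(values)),
        }
    if shape == "skewed":
        # Quartile convention is an assumption; other methods differ slightly.
        q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
        return {
            "center": ("median", statistics.median(values)),
            "spread": ("IQR", q3 - q1),
        }
    raise ValueError("shape must be 'symmetric' or 'skewed'")

skewed_report = summarize([1, 3, 14], "skewed")
symmetric_report = summarize([4, 5, 5, 6], "symmetric")
print(skewed_report)
print(symmetric_report)
```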

  • Tell -- What About Unusual

    Features?

    • If there are multiple modes, try to understand why. If you identify a reason for the separate modes, it may be good to split the data into two groups.

    • If there are any clear outliers and you are reporting the mean and standard deviation, report them with the outliers present and with the outliers removed. The differences may be quite revealing.

    • Note: The median and IQR are not likely to be affected by the outliers.

  • What Can Go Wrong?

    • Don’t make a histogram of a categorical variable—bar

    charts or pie charts should be used for categorical data.

    • Don’t look for shape,

    center, and spread

    of a bar chart.

  • What Can Go Wrong? (cont.)

    • Choose a bin width appropriate to the data.

    • Changing the bin width changes the appearance of the

    histogram:
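    To illustrate, the sketch below counts the 30 student weights from the grouped frequency table example earlier in the chapter using two different bin widths. The helper `bin_counts` and the starting value of 90 are assumptions made for this illustration.

```python
weights = [143, 113, 107, 151, 90, 139, 136, 126, 122, 127,
           123, 137, 132, 121, 112, 132, 133, 121, 126, 104,
           140, 138, 99, 134, 119, 112, 133, 104, 129, 123]

def bin_counts(values, start, width):
    """Count how many values fall in each bin [start + k*width, start + (k+1)*width)."""
    n_bins = int((max(values) - start) // width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // width)] += 1
    return counts

narrow = bin_counts(weights, start=90, width=10)
wide = bin_counts(weights, start=90, width=20)

print(narrow)  # [2, 3, 4, 9, 9, 2, 1]
print(wide)    # [5, 13, 11, 1]
```

    The same 30 values produce seven bars with a width of 10 but only four bars with a width of 20, so details such as the two equally tall middle bins disappear in the wider version.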

  • What Can Go Wrong? (cont.)

    • Don’t forget to do a reality check – don’t let the calculator do the thinking for you.

    • Don’t forget to sort the values before finding the median or percentiles.

    • Don’t worry about small differences when using different methods.

    • Don’t compute numerical summaries of a categorical variable.

    • Don’t report too many decimal places.

    • Don’t round in the middle of a calculation.

    • Watch out for multiple modes

    • Beware of outliers

    • Make a picture … make a picture … make a picture!!!
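    The warning about sorting is easy to demonstrate. The sketch below takes the 16 pulse rates from earlier in the chapter and averages the two middle positions, first in the order given and then after sorting. The helper `middle_of` is hypothetical.

```python
import statistics

pulse = [57, 57, 56, 57, 58, 56, 54, 64, 53, 54, 54, 55, 57, 55, 60, 58]

def middle_of(values):
    """Average the two middle positions of an even-length list, in the order given."""
    mid = len(values) // 2
    return (values[mid - 1] + values[mid]) / 2

wrong = middle_of(pulse)          # unsorted: averages 64 and 53
right = middle_of(sorted(pulse))  # sorted: averages 56 and 57

print(wrong, right)  # 58.5 56.5

# statistics.median sorts internally, so it agrees with the sorted result.
library_median = statistics.median(pulse)
```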

  • What have we learned?

    • We’ve learned how to make a picture for quantitative data to help us see the story the data have to Tell.

    • We can display the distribution of quantitative data with a histogram, stem-and-leaf display, or dotplot.

    • We’ve learned how to summarize distributions of quantitative variables numerically.

    • Measures of center for a distribution include the median and mean.

    • Measures of spread include the range, IQR, and standard deviation.

    • Use the median and IQR when the distribution is skewed. Use the mean and standard deviation if the distribution is symmetric.

  • What have we learned? (cont.)

    • We’ve learned to Think about the type of

    variable we are summarizing.

    • All methods of this chapter assume the

    data are quantitative.

    • The Quantitative Data Condition

    serves as a check that the data are, in

    fact, quantitative.