SDA 3E Chapter 2 (2)

Embed Size (px)

Citation preview

  • 8/4/2019 SDA 3E Chapter 2 (2)

    1/40

    2007 Pearson Education

    Chapter 2: Displaying andSummarizing Data

    Part 2: Descriptive Statistics

  • 8/4/2019 SDA 3E Chapter 2 (2)

    2/40

    Excel SupportExcel statistical functions

    Excel

    Analysis Toolpak toolsExcel

    PHStat tools and proceduresPHStat tools

  • 8/4/2019 SDA 3E Chapter 2 (2)

    3/40

    Descriptive StatisticsFrequency distributions and histograms

    Measures of central tendency

    Measures of dispersion

  • 8/4/2019 SDA 3E Chapter 2 (2)

    4/40

    Terminology and Notation

    Parameter a measurable characteristic of apopulation: mis a parameter, x is not

    xi represents the i th observationindicates the operation of addition

    N is the size of the population; n is the size of

    the samplef i is the number of observations in cell i of afrequency distribution

  • 8/4/2019 SDA 3E Chapter 2 (2)

    5/40

    Frequency DistributionTabular summary showing the frequency of observations in each of several non-

    overlapping (mutually exclusive) classes, orcellsRelative frequency fraction or proportion of observations that fall within a cellCumulative frequency proportion orpercentage of observations that fall below theupper limit of a cell

  • 8/4/2019 SDA 3E Chapter 2 (2)

    6/40

    Home Runs - 2004 Baseball

    Season

  • 8/4/2019 SDA 3E Chapter 2 (2)

    7/40

    HistogramColumn chart representing a frequencydistribution

  • 8/4/2019 SDA 3E Chapter 2 (2)

    8/40

    Excel Tool: HistogramExcel Menu > Tools > Data Analysis > Histogram

    Specify range of data

    Define and specify

    bin range(recommended)

    Select outputoptions (always

    check Chart Output

  • 8/4/2019 SDA 3E Chapter 2 (2)

    9/40

    Good Practice GuidelinesCell intervals should be of equal width.Choose the width using the formula

    (largest value smallest value)/number of cells but round to reasonable values(e.g., 97 to 100)

    Choose somewhere between 5 to 15 cellsto provide a useful picture of the data

  • 8/4/2019 SDA 3E Chapter 2 (2)

    10/40

    Excel Frequency FunctionDefine binsSelect a range of cells adjacent to the bin

    range (if continuous data, add one empty cellbelow this range as an overflow cell)Enter the formula =FREQUENCY( range of data, range of bins ) and press Ctrl-Shift-Enter simultaneously .Construct a histogram using the Chart Wizard for a column chart.

  • 8/4/2019 SDA 3E Chapter 2 (2)

    11/40

    Arithmetic MeanPopulation

    Sample

    Excel function AVERAGE( range )

    N

    x N

    ii

    1m

    x

    x

    n

    i

    i

    n

    1

  • 8/4/2019 SDA 3E Chapter 2 (2)

    12/40

    Example: Team Home Runs

    Mean = 2605/14 = 186.1

  • 8/4/2019 SDA 3E Chapter 2 (2)

    13/40

    Properties of the MeanMeaningful for interval and ratio data

    All data used in the calculation

    Unique for every set of data Affected by unusually large or smallobservations ( outliers )The only measure of central tendency wherethe sum of the deviations of each value fromthe measure is zero; i.e.,

    (xi x ) = 0

  • 8/4/2019 SDA 3E Chapter 2 (2)

    14/40

    MedianMiddle value when data are ordered fromsmallest to largest. This results in an equal

    number of observations above the median asbelow it.Unique for each set of dataNot affected by extremes

    Meaningful for ratio, interval, and ordinal dataExcel function MEDIAN( range )

  • 8/4/2019 SDA 3E Chapter 2 (2)

    15/40

    ModeObservation that occurs most frequently; forgrouped data, the midpoint of the cell withthe largest frequency (approximate value)

    Useful when data consist of a small number of unique values

  • 8/4/2019 SDA 3E Chapter 2 (2)

    16/40

    Bimodal Distribution

  • 8/4/2019 SDA 3E Chapter 2 (2)

    17/40

    Midrange Average of the largest and smallestobservations

    Useful for very small samples, but extremevalues can distort the result

  • 8/4/2019 SDA 3E Chapter 2 (2)

    18/40

    Measures of DispersionDispersion the degree of variation inthe data. E.g., {48, 49, 50, 51, 52} and{10, 30, 50, 70, 90}Range difference between themaximum and minimum observations

    Same issues as with midrange

  • 8/4/2019 SDA 3E Chapter 2 (2)

    19/40

    VariancePopulation

    Sample

    N

    x N

    ii

    1

    2

    2

    m

    1

    1

    2

    2

    n

    x x

    s

    n

    ii

  • 8/4/2019 SDA 3E Chapter 2 (2)

    20/40

    Calculations

    Variance = 17534.93/14 = 1252.49

    Excel functions: VAR, VARP, STDEV, STDEVP

  • 8/4/2019 SDA 3E Chapter 2 (2)

    21/40

    Standard DeviationPopulation

    Sample

    The standard deviation has the same units of measurement as the original data, unlike thevariance

    N

    x N

    ii

    1

    2m

    1

    1

    2

    n

    x xs

    n

    ii

  • 8/4/2019 SDA 3E Chapter 2 (2)

    22/40

    Chebyshevs Theorem For any set of data, the proportion of valuesthat lie within k standard deviations of the

    mean is at least 1 1/k 2, for any k > 1For k = 2, at least of the data lie within 2standard deviations of the meanFor k = 3, at least 8/9, or 89% lie within 3

    standard deviations of the meanFor k = 10, at least 99/100, or 99% of the data liewithin 10 standard deviations of the mean

  • 8/4/2019 SDA 3E Chapter 2 (2)

    23/40

    Example

    Mean = 28.87; standard deviation = 21.92

    3 about the mean: [-36.9, 94.6]

    2 about the mean: [-15.0, 72.7]

  • 8/4/2019 SDA 3E Chapter 2 (2)

    24/40

    Grouped Data: Calculation of

    MeanSample

    Population

    In a frequency distribution, replace x i with a representative value (e.g.,midpoint)

    n

    x f x

    n

    iii

    1

    N

    x f N

    iii

    1m

  • 8/4/2019 SDA 3E Chapter 2 (2)

    25/40

    Grouped Data: Calculation of

    VarianceSample

    Population

  • 8/4/2019 SDA 3E Chapter 2 (2)

    26/40

    Coefficient of VariationCV = Standard Deviation / Mean

    CV is dimensionless, and therefore is useful when

    comparing data sets that are scaled differently.

  • 8/4/2019 SDA 3E Chapter 2 (2)

    27/40

    SkewnessCoefficient of skewness (CS)

    -0.5 < CS < 0.5 indicates relativesymmetry

    Relatively Symmetric Positively skewed

  • 8/4/2019 SDA 3E Chapter 2 (2)

    28/40

    Excel Tool: Descriptive

    StatisticsExcel menu > Tools > Data Analysis > Descriptive Statistics

  • 8/4/2019 SDA 3E Chapter 2 (2)

    29/40

    Data Profiles (Fractiles)Describe the location and spread of data overits range

    Quartiles a division of a data set into four equalparts; shows the points below which 25%, 50%,75% and 100% of the observations lie (25% isthe first quartile, 75% is the third quartile, etc.)Deciles a division of a data set into 10 equal

    parts; shows the points below which 10%, 20%,etc. of the observations liePercentiles a division of a data set into 100equal parts; shows the points below which k percent of the observations lie

  • 8/4/2019 SDA 3E Chapter 2 (2)

    30/40

    ProportionFraction of data that has a certaincharacteristicUse the Excel function COUNTIF( data range, criteria ) to count observationsmeeting a criterion to compute

    proportions.

  • 8/4/2019 SDA 3E Chapter 2 (2)

    31/40

    Box and Whisker PlotsDisplay minimum, first quartile (Q 1), median,third quartile (Q 3), and maximum values

    graphically

    min 1 st quartile median 3 rd quartile max

  • 8/4/2019 SDA 3E Chapter 2 (2)

    32/40

    PHStat Tool: Box and Whisker

    PlotPHStat menu > Descriptive Statistics > Box and Whisker Plot

    Enter data range

    Choose type of data set

    Check box for fivenumber summary

  • 8/4/2019 SDA 3E Chapter 2 (2)

    33/40

    Stem and Leaf DisplayEach number is divided into two parts:x y x = stem, and y = leaf Stem = cell; leaf = value within cell

    11 37

    12 46

    Stem and leaf display aggregates and sortsall leaves within the same stem:

    Number Stem Leaf

    117 11 7

    113 11 3

    124 12 4

    125 12 6

  • 8/4/2019 SDA 3E Chapter 2 (2)

    34/40

    Stem and Leaf Stem unit is a power of 10; the higher thestem unit, the more aggregation of data

  • 8/4/2019 SDA 3E Chapter 2 (2)

    35/40

    PHStat Tool: Stem and Leaf

    DisplayPHStat menu > Descriptive Statistics > Stem and Leaf Display

    Enter data range

    Select stem unit orautocalculation

    Check SummaryStatistics box

  • 8/4/2019 SDA 3E Chapter 2 (2)

    36/40

    Dot Scale DiagramPHStat menu > Descriptive Statistics > Dot Scale Diagram

  • 8/4/2019 SDA 3E Chapter 2 (2)

    37/40

    Statistical RelationshipsCorrelation a measure of strength of linearrelationship between two variablesCorrelation coefficient

    Covariance average of the products of thedeviations of each variable from its mean;

    describes how two variables move together

    Sample correlation coefficient

    y x y x

    Y X

    ),cov(,

    N

    y xY X

    N

    i yi xi

    1),cov(m m

    y x

    n

    iii

    ssn

    y y x xr

    )1(

    1

  • 8/4/2019 SDA 3E Chapter 2 (2)

    38/40

    Examples of Correlation

    Negative correlation

    Positive correlation

    No

    correlation

  • 8/4/2019 SDA 3E Chapter 2 (2)

    39/40

    Excel Tool: CorrelationExcel menu > Tools > Data Analysis >Correlation

  • 8/4/2019 SDA 3E Chapter 2 (2)

    40/40

    Correlation Tool Results