Presentation Fbook Version


  • 8/7/2019 Presentation Fbook Version

    Chapter 2

    Examining Your Data

    LEARNING OBJECTIVES

    Upon completing this chapter, you should be able to do the following:

    Select the appropriate graphical method to examine the characteristics of the data or relationships of interest.
    Assess the type and potential impact of missing data.
    Understand the different types of missing data processes.
    Explain the advantages and disadvantages of the approaches available for dealing with missing data.

    LEARNING OBJECTIVES continued . . .

    Upon completing this chapter, you should be able to do the following:

    Identify univariate, bivariate, and multivariate outliers.
    Test your data for the assumptions underlying most multivariate techniques.
    Understand transformation.

    Examination Phases

    Graphical examination.
    Identify and evaluate missing values.
    Identify and deal with outliers.
    Check whether statistical assumptions are met.
    Develop a preliminary understanding of your data.

    Graphical Examination

    Shape:
      Histogram
      Bar Chart
      Box & Whisker plot
      Stem and Leaf plot

    Relationships:
      Scatterplot
      Outliers
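
Of the displays listed for examining shape, the stem and leaf plot is the easiest to build by hand. The sketch below is only an illustration, not the SPSS procedure used in class: the function name `stem_and_leaf`, the choice of tens as stems and units as leaves, and the sample scores are all assumptions made for this example.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the text rows of a stem-and-leaf display.

    Assumption for this sketch: values are non-negative integers,
    the stem is the tens part and the leaf is the units digit.
    """
    if not values:
        raise ValueError("need at least one value")
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    rows = []
    # Walk every stem in the range so gaps in the distribution stay visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = "".join(str(leaf) for leaf in groups.get(stem, []))
        rows.append(f"{stem:>2} | {leaves}")
    return rows


scores = [12, 15, 21, 23, 23, 38, 40, 67]  # hypothetical data
for row in stem_and_leaf(scores):
    print(row)
```

The empty stem between 40 and 67 makes the isolated high value easy to spot, which is the same job a box and whisker plot does graphically.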

    Missing Data

    Missing Data = information not available for a subject (or case) about whom other information is available. Typically occurs when a respondent fails to answer one or more questions in a survey.

    Systematic?
    Random?

    Researcher's Concern = to identify the patterns and relationships underlying the missing data in order to stay as close as possible to the original distribution of values when any remedy is applied.

    Impact . . .
      Reduces sample size available for analysis.
      Can distort results.

    Four-Step Process for Identifying Missing Data

    Step 1: Determine the Type of Missing Data
    Step 2: Determine the Extent of Missing Data
    Step 3: Diagnose the Randomness of the Missing Data Processes
    Step 4: Select the Imputation Method

    Missing Data

    Strategies for handling missing data . . .
      use observations with complete data only;
      delete case(s) and/or variable(s);
      estimate missing values.

    Rules of Thumb 2-1

    How Much Missing Data Is Too Much?

    • Missing data under 10% for an individual case or observation can generally be ignored, except when the missing data occurs in a specific nonrandom fashion (e.g., concentration in a specific set of questions, attrition at the end of the questionnaire, etc.).
    • The number of cases with no missing data must be sufficient for the selected analysis technique if replacement values will not be substituted (imputed) for the missing data.
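
To apply these rules you first need the extent of missing data per case and per variable, plus the count of complete cases. The following is a minimal sketch under stated assumptions: missing values are coded as `None`, every case has the same number of variables, and the name `missing_summary` and the sample rows are hypothetical. It does not check whether the missingness is random, which the first rule also requires.

```python
def missing_summary(rows, case_threshold=10.0):
    """Summarize the extent of missing data in a list of cases.

    rows: list of equal-length lists, with None marking a missing value
          (an assumption of this sketch).
    case_threshold: percent missing at or above which a case is flagged;
          10.0 follows the "under 10%" rule of thumb.
    """
    n_cases = len(rows)
    n_vars = len(rows[0])
    pct_by_case = [
        100.0 * sum(v is None for v in row) / n_vars for row in rows
    ]
    pct_by_variable = [
        100.0 * sum(row[j] is None for row in rows) / n_cases
        for j in range(n_vars)
    ]
    return {
        "pct_by_case": pct_by_case,
        "pct_by_variable": pct_by_variable,
        # Cases that do not fall under the 10% guideline.
        "flagged_cases": [
            i for i, pct in enumerate(pct_by_case) if pct >= case_threshold
        ],
        # Cases usable by a complete-case analysis.
        "complete_cases": sum(pct == 0.0 for pct in pct_by_case),
    }


# Hypothetical survey: 4 cases, 12 variables.
rows = [
    list(range(12)),                            # complete
    [None] + list(range(11)),                   # 1 of 12 missing
    [0, 1, None, None, None] + list(range(7)),  # 3 of 12 missing
    list(range(12)),                            # complete
]
summary = missing_summary(rows)
print(summary["flagged_cases"], summary["complete_cases"])
```

Whether the number of complete cases is "sufficient" depends on the analysis technique chosen, so the sketch reports the count rather than judging it.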

    Rules of Thumb 2-3

    Imputation of Missing Data

    • Under 10%: Any of the imputation methods can be applied when missing data is this low, although the complete case method has been shown to be the least preferred.
    • 10 to 20%: The increased presence of missing data makes the all-available, hot deck case substitution and regression methods most preferred for MCAR data, while model-based methods are necessary with MAR missing data processes.
    • Over 20%: If it is necessary to impute missing data when the level is over 20%, the preferred methods are:
      ◦ the regression method for MCAR situations, and
      ◦ model-based methods when MAR missing data occurs.
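
The three bands above amount to a small lookup rule. The sketch below encodes them literally; the function name `preferred_imputation` and the handling of the exact boundaries (10% and 20% are both placed in the middle band) are assumptions of this example, since the slide does not say which band the endpoints belong to.

```python
def preferred_imputation(pct_missing, process):
    """Return the preferred imputation approaches for a level of missing data.

    pct_missing: percent of missing data (0 to 100).
    process: "MCAR" or "MAR", the diagnosed missing data process.
    Boundary handling (10 and 20 fall in the 10-to-20% band) is an
    assumption of this sketch.
    """
    process = process.upper()
    if process not in ("MCAR", "MAR"):
        raise ValueError("process must be 'MCAR' or 'MAR'")
    if not 0 <= pct_missing <= 100:
        raise ValueError("pct_missing must be between 0 and 100")

    if pct_missing < 10:
        # Any method is acceptable; complete case is the least preferred.
        return ["any imputation method (complete case least preferred)"]
    if pct_missing <= 20:
        if process == "MCAR":
            return ["all-available", "hot deck case substitution", "regression"]
        return ["model-based"]
    # Over 20%: impute only if necessary.
    if process == "MCAR":
        return ["regression"]
    return ["model-based"]


print(preferred_imputation(15, "MCAR"))
print(preferred_imputation(25, "MAR"))
```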

    Outlier

    Outlier = an observation/response with a unique combination of characteristics identifiable as distinctly different from the other observations/responses.

    Issue: Is the observation/response representative of the population?

    Why Do Outliers Occur?

    Procedural Error.
    Extraordinary Event.
    Extraordinary Observations.
    Observations unique in their combination of values.

    Dealing with Outliers

    Identify outliers.
    Describe outliers.
    Delete or Retain?

    Identifying Outliers

    Standardize data and then identify outliers in terms of number of standard deviations.
    Examine data using Box Plots, Stem & Leaf, and Scatterplots.
    Multivariate detection (D²).
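
The first approach, standardizing and then counting standard deviations, can be sketched in a few lines. This is an illustration under assumptions, not the SPSS output: the function names and data are hypothetical, the sample standard deviation (divisor n - 1) is used, and the default cutoff of 2.5 is the small-sample value from the chapter's outlier rules of thumb.

```python
from statistics import mean, stdev


def standard_scores(values):
    """Convert raw values to z-scores using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    if s == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(v - m) / s for v in values]


def univariate_outliers(values, threshold=2.5):
    """Return indices of cases whose |z| is at or above the threshold.

    threshold=2.5 suits small samples (80 or fewer cases); for larger
    samples a value up to 4 would be passed instead.
    """
    return [
        i for i, z in enumerate(standard_scores(values)) if abs(z) >= threshold
    ]


# Hypothetical variable with one extreme response in the last position.
data = [10, 11, 9, 10, 12, 11, 10, 9, 11, 10, 10, 11, 9, 10, 30]
print(univariate_outliers(data))
```

One caution with this approach: an extreme value inflates the standard deviation it is measured against, so in very small samples even a gross outlier may not reach the cutoff.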

    Rules of Thumb 2-4

    Outlier Detection

    • Univariate methods: examine all metric variables to identify unique or extreme observations.
    • For small samples (80 or fewer observations), outliers typically are defined as cases with standard scores of 2.5 or greater.
    • For larger sample sizes, increase the threshold value of standard scores up to 4.
    • If standard scores are not used, identify cases falling outside the ranges of 2.5 versus 4 standard deviations, depending on the sample size.
    • Bivariate methods: focus their use on specific variable relationships, such as the independent versus dependent variables:
      ◦ use scatterplots with confidence intervals at a specified alpha level.
    • Multivariate methods: best suited for examining a complete variate, such as the independent variables in regression or the variables in factor analysis:
      ◦ threshold levels for the D²/df measure should be very conservative (.005 or .001), resulting in values of 2.5 (small samples) versus 3 or 4 in larger samples.
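
For the multivariate rule, D² is the Mahalanobis distance of a case from the centroid of all cases, and df is the number of variables. The sketch below handles only the two-variable case so that the covariance matrix can be inverted by hand; the function name, the data, and the use of the sample covariance (divisor n - 1) are assumptions of this example.

```python
from statistics import mean


def mahalanobis_d2(points):
    """Return D^2 of each (x, y) point from the sample centroid.

    Two variables only, so the 2x2 covariance matrix is inverted in
    closed form. Uses the sample covariance (divisor n - 1).
    """
    n = len(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    d2 = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) S^-1 (dx, dy)^T written out for a 2x2 matrix S.
        d2.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return d2


# Hypothetical data: y tracks x closely except for the last case, whose
# x and y are each unremarkable but whose combination is unusual.
points = [(1, 1.0), (2, 2.1), (3, 2.9), (4, 4.2), (5, 5.0),
          (6, 5.8), (7, 7.1), (8, 8.0), (9, 9.1), (3, 8.0)]
df = 2  # number of variables
d2 = mahalanobis_d2(points)
flagged = [i for i, v in enumerate(d2) if v / df > 2.5]  # small-sample cutoff
print(flagged)
```

The last case is within the observed range on both variables, so a univariate check would pass it; only the joint measure identifies it, which matches the "unique in their combination of values" source of outliers.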

    Multivariate Assumptions

    Normality
    Linearity
    Homoscedasticity
    Non-correlated Errors
    Data Transformations?

    Testing Assumptions

    Normality assumptions
      Visual check of histogram.
      Kurtosis.
      Normal probability plot.

    Homoscedasticity
      Equal variances across independent variables.
      Levene test (univariate).
      Box's M (multivariate).
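
Two of the checks listed above reduce to short formulas. The sketch below computes excess kurtosis (zero for a normal distribution) and the Levene W statistic for equality of variances across groups. It is a sketch under assumptions: the function names and data are hypothetical, kurtosis uses population moments, the Levene version shown centers on group means, and no p-value is produced (W would be compared against an F distribution with k - 1 and N - k degrees of freedom, which the standard library does not provide).

```python
from statistics import mean


def excess_kurtosis(values):
    """Fourth standardized moment minus 3, using population moments."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


def levene_w(*groups):
    """Levene W statistic (mean-centered version) for k groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of each value from its own group mean.
    z = [[abs(v - mean(g)) for v in g] for g in groups]
    z_means = [mean(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical groups: the first two have equal spread, the third does not.
g1 = [1, 2, 3, 4, 5]
g2 = [11, 12, 13, 14, 15]
g3 = [0, 10, 20, 30, 40]
print(levene_w(g1, g2))  # equal spread
print(levene_w(g1, g3))  # unequal spread
print(excess_kurtosis([-1, 1, -1, 1]))  # flatter than normal
```

A W near zero is consistent with homoscedasticity; a large W signals unequal variances. Box's M, the multivariate counterpart, is not sketched here.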

    Rules of Thumb 2-5

    Testing Statistical Assumptions

    • Departures from normality can have serious effects in small samples (fewer than 50 cases), but the impact effectively diminishes when sample sizes reach 200 cases or more.
    • Most cases of heteroscedasticity are a result of non-normality in one or more variables. Thus, remedying normality may not be needed due to sample size, but may be needed to equalize the variance.
    • Nonlinear relationships can be very well defined, but seriously understated unless the data is transformed to a linear pattern or explicit model components are used to represent the nonlinear portion of the relationship.
    • Correlated errors arise from a process that must be treated much like missing data. That is, the researcher must first define the causes among variables either internal or external to the dataset. If they are not found and remedied, serious biases can occur in the results, many times unknown to the researcher.

    Data Transformations?

    Data transformations . . . provide a means of modifying variables for one of two reasons:

    1. To correct violations of the statistical assumptions underlying the multivariate techniques, or
    2. To improve the relationship (correlation) between the variables.
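
The second reason can be shown with a small made-up example. The sketch below assumes an exponential relationship between two variables and compares the Pearson correlation before and after a log transformation of y; the data, the growth rate of 0.5, and the helper `pearson_r` are all hypothetical choices for illustration, and the log is only one of several possible transformations.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: y grows exponentially with x, so the raw
# relationship is strong but not linear.
x = list(range(1, 11))
y = [math.exp(0.5 * v) for v in x]

r_raw = pearson_r(x, y)                         # correlation on raw values
r_log = pearson_r(x, [math.log(v) for v in y])  # after log-transforming y
print(round(r_raw, 3), round(r_log, 3))
```

The raw correlation understates a relationship that is in fact perfectly determined; after the transformation the pattern is linear and the correlation reflects that.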

    Examining Data

    Learning Checkpoint

    1. Why examine your data?
    2. What are the principal aspects of data that need to be examined?
    3. What approaches would you use?

    SPSS

    CLASS 2: EXPLORING DATA

    Exploring data

    Step 1: Check for Missing Data

    Step 2: Check for Outliers

    Step 3: Check Assumptions