CHAPTER 3A

Data Screening

Once your data—whether derived from survey, experimental, or archival methods—are in hand and have been entered into SPSS, you must resist the temptation to plunge ahead with sophisticated multivariate statistical analyses without first critically examining the quality of the data you have collected. This chapter will focus on some of the most salient issues facing researchers before they embark on their multivariate journey. Just as a summer vacationer proceeds in a purposeful way by stopping the mail and newspaper for a period of time, confirming reservations, and checking to see that sufficient funds are available to pay for all the fun, so too must the researcher take equally crucial precautions before proceeding with the data analysis.

Some of these statistical considerations and precautions take the following form:

- Do the data accurately reflect the responses made by the participants of my study?
- Are all the data in place and accounted for, or are some of the data absent or missing?
- Is there a pattern to the missing data?
- Are there any unusual or extreme responses present in the data set that may distort my understanding of the phenomena under study?
- Do these data meet the statistical assumptions that underlie the multivariate technique I will be using?
- What can I do if some of the statistical assumptions turn out to be violated?


This chapter provides the answers to these questions. Chapter 3B will provide a parallel discussion to show how the procedures discussed here can be performed using SPSS.

Code and Value Cleaning

The data cleaning process ensures that once a given data set is in hand, a verification procedure is followed that checks for the appropriateness of numerical codes for the values of each variable under study. This process can be referred to as code and value cleaning.

The cleaning process begins with a consideration of the research project's unit of analysis. Typically, in behavioral science research the "units of analysis"—that is, the entities to which your data are specifically related—are human respondents (in survey or archival research) and human participants (in experimental research). In such situations, the score for each variable that you have recorded in the data file (e.g., the response to a particular item on one of your inventories) represents the behavior of an individual sampled by your research methodology. Collectively (generally), these units can be referred to as cases. Examples of other kinds of cases that can be the source of research data include individual mental health service providers, school districts, census tracts, and cities. Here, the value recorded for a variable in your data file represents one of these larger entities (e.g., the number of hours of individual psychotherapy provided to all clients on site in the month of April for a particular mental health facility, the average reading comprehension score on the statewide achievement test of all students in a particular district).

The challenge in code cleaning is to determine, for every case, whether each variable contains only legitimate numerical codes or values and, secondarily, whether these legitimate codes seem reasonable. For example, respondent gender (a nominal level variable) can be arbitrarily coded as 0 for males and 1 for females. To the extent that all cases on the gender variable are coded as either 0 or 1, we can say that this variable is "clean." Notice that code cleaning does not address the veracity or correctness of an appropriately coded value, only whether or not the variable's code is within the specified range.

Conversely, suppose we had collected data from a sample of 100 community mental health consumers on their global assessment of functioning (GAF) Axis V rating of the Diagnostic and Statistical Manual of Mental Disorders, IV-TR (DSM-IV-TR; American Psychiatric Association, 2000). GAF scale values can range from 1 (severe impairment) to 100 (good general functioning). Now, further suppose that our experience with these consumers has shown us that the modal (most frequent) GAF score was about 55, with minimum and maximum scores of approximately 35 and 65, respectively. If during the cleaning process we discover a respondent with a GAF score of 2, a logically legitimate but certainly an unusual score, we would probably want to verify its authenticity through other sources. For example, we might want to take a look at the original questionnaire or the actual computer-based archival record for that individual.

Such a cleaning process leads us to several options for future action. Consider the situation in which we find a very low GAF score for an individual. Under one scenario, we may confirm that the value of 2 is correct and leave it alone for the time being. Or after confirming its veridicality, we might consider that data point to be a candidate for elimination on the proposition that it is an outlier or extreme score because we view the case as not being representative of the target population under study. On the other hand, if our investigation shows the recorded value to be incorrect, we would substitute the correct value (e.g., 42) in its stead. Last, if we deem the value to be wrong but do not have an appropriate replacement, we can treat it as a missing value (by either coding directly in SPSS the value of 2 as missing or, as is more frequently done, by replacing it with a value that we have already specified in SPSS to stand for a missing value).
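Readers who manage their data outside SPSS can script the same code and value checks. Below is a minimal sketch in Python with pandas; the file name gaf.csv and the column names gender and gaf are hypothetical stand-ins for your own data.

    import numpy as np
    import pandas as pd

    df = pd.read_csv("gaf.csv")  # hypothetical file with 'gender' and 'gaf' columns

    # Code cleaning: gender must be coded 0 (male) or 1 (female)
    bad_gender = ~df["gender"].isin([0, 1])
    print(df.loc[bad_gender])  # cases to verify against the original records

    # Value cleaning: GAF scores must fall in the legitimate 1-100 range
    out_of_range = ~df["gaf"].between(1, 100)
    print(df.loc[out_of_range])

    # A value judged to be wrong, with no replacement available, becomes missing
    df.loc[out_of_range, "gaf"] = np.nan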

Distribution Diagnosis

With small data sets containing a few cases, data cleaning can be accomplished by a simple visual inspection process. However, with the typically large data sets required for most multivariate analyses, using computerized computational packages such as SPSS before you start your statistical analysis provides a more efficient means for screening data.

These procedures provide output that displays the way in which the data are distributed. We will discuss six types of output commonly used for this purpose: frequency tables, histograms and bar graphs, stem-and-leaf displays, box plots, and scatterplot matrices. In Chapter 3B, we will demonstrate how to use these procedures to conduct these analyses with SPSS.

Frequency Tables

A frequency table is a convenient way to summarize the obtained values for variables that contain a small number of different values or attributes. Demographic variables such as gender (with two codes), ethnicity (with between half a dozen and a dozen codes), and marital status (usually with no more than about six categories), along with other nominal level variables, including questions that require simple, dichotomous "yes"/"no" options, all have a limited number of possible values that are easily summarized in a frequency table.

An example of a demographic variable with five codes is "highest academic degree achieved." Table 3a.1 depicts the raw and unorganized data of 51 community mental health center providers described in a study by Gamst, Dana, Der-Karabetian, and Kramer (2001). Each respondent was assigned an arbitrary number at the time of data entry. To make it fit within a relatively small space, we have structured this table to display the data in two columns.

Table 3a.1 Terminal Degree Status of Fifty-One Community Mental Health Center Providers

Respondent  Degree     Respondent  Degree
 1          MA         27          HS
 2          MA         28          MA
 3          HS         29          MA
 4          MA         30          HS
 5          HS         31          MA
 6          DOC        32          BA
 7          MA         33          MA
 8          DOC        34          BA
 9          MA         35          MA
10          MA         36          BA
11          MA         37          MA
12          HS         38          MA
13          DOC        39          DOC
14          MA         40          MA
15          DOC        41          BA
16                     42          MA
17          MA         43          MA
18          HS         44          DOC
19          DOC        45          OTH
20          MA         46          MA
21          HS         47          OTH
22          HS         48          MA
23          DOC        49          DOC
24          DOC        50          HS
25          MA         51          MA
26          MA

SOURCE: Gamst, Dana, Der-Karabetian, and Kramer (2001).

NOTE: MA = master's degree; HS = high school graduate; DOC = doctorate (PhD or PsyD); BA = bachelor's degree; OTH = other.

A cursory inspection of the information contained in Table 3a.1 suggests a jumble of degree statuses scattered among 50 of the 51 mental health practitioners. A coherent pattern is hard to discriminate.

A better and more readable way to display these data appears as the frequency distribution or frequency table that can be seen in Table 3a.2. Each row of the table is reserved for a particular value of the variable called "terminal degree." It aggregates the number of cases with a given value and their percentage of representation in the data array. A frequency table such as that illustrated in Table 3a.2 enables researchers to quickly decipher the important information contained within a distribution of values. For example, we can see that 50% of the mental health providers had master's degrees and 20% had doctorates. Thus, a simple summary statement of this table might note that "70% of the mental health providers had graduate-level degrees."

Table 3a.2 also shows how useful a frequency table can be in the data cleaning process. Because the researchers were using only code values 1 through 5, the value of 6 (coded for Respondent 16) represents an anomalous code in that it should not exist at all. It is for that reason that the value of 6 has no label in the table. In all likelihood, this represents a data entry error. To discover which case has this anomalous value if the data file was very large, one could have SPSS list the case number and the terminal degree variable for everyone (or, to make the output easier to read, we could first select for codes on this variable that were greater than 5 and then do the listing).
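Outside SPSS, a frequency table of this kind takes only a few lines. A pandas sketch (again with a hypothetical file and column name):

    import pandas as pd

    df = pd.read_csv("providers.csv")  # hypothetical file with a 'degree' column coded 1-5

    # Counts and percentages for each code, in code order
    counts = df["degree"].value_counts().sort_index()
    pcts = (counts / len(df) * 100).round(1)
    print(pd.DataFrame({"n": counts, "percentage": pcts}))

    # List the cases carrying an out-of-range code (legitimate codes are 1 through 5)
    print(df.loc[df["degree"] > 5])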

Histograms and Bar Graphs

Some variables have a large number of possible values (e.g., GAF scores, monthly income in dollars and cents, etc.). Typically, these are variables that are measured on one of the quantitative scales of measurement.

Table 3a.2 Frequency Table Showing One Out-of-Range Value

Code   Terminal Degree    n    Percentage
1      High school         9      18
2      Bachelor            4       8
3      Master's           25      50
4      Doctorate          10      20
5      Other               2       4
6      —                   1       1

Total                     51     100

Such variables can also be screened through the same process we used in the previous example.

Table 3a.3 presents a frequency table of GAF scores for a portion of a sample of Asian American community mental health consumers. Based on this output, we can see two issues of concern. One individual has a score of 2, suggesting very severe impairment of functioning. As mentioned above, such an extreme value warrants checking the raw data and, if the value is valid, possibly considering that person to be an outlier. Another problem is that there are four cases with a value of 0, an out-of-range value on the GAF. There are some possible causes that might explain this latter problem, including (a) a simple misalignment of the data in the data file for those cases and (b) a data entry mistake. In any case, the researchers should be armed with enough clues at this juncture to discover and correct the problem.

Frequency tables summarizing quantitative variables are also useful as a way to gauge the very general shape of the distribution. For the data shown in Table 3a.3, for example, we can see that the largest concentration of scores is in the 40s and 50s. Based on visual inspection, the distribution appears to be relatively close to normal with perhaps a somewhat negative skew. SPSS can produce a graphic representation of the distribution. When the distribution is based on a frequency count of a categorical variable, we should request a bar graph. Conversely, when the distribution is based on a frequency count of a continuous variable, we should request a histogram. Because the distribution in Table 3a.3 involves a continuous variable (GAF scores), we will produce a histogram that provides a visual approximation of the distribution's shape (see Figure 3a.1). One advantage of asking for a histogram is that SPSS can superimpose a drawing of the normal curve on the distribution so that we can visually determine how close our scores are to a normal distribution. We have done this in Figure 3a.1, and we can see that the distribution, although not perfect, is certainly "normal-like." However, the descriptive statistics that can be generated by SPSS, such as skewness and kurtosis, would provide a more precise description of the distribution's shape (values of skewness and kurtosis of 0 indicate a normal distribution).
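The same visual check can be reproduced outside SPSS with matplotlib. The sketch below assumes the GAF scores sit in a pandas Series named gaf (hypothetical file name) and scales a normal curve to the sample so that it can be superimposed on the histogram:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from scipy import stats

    gaf = pd.read_csv("gaf.csv")["gaf"].dropna()  # hypothetical data file

    # Histogram of the observed scores
    counts, bins, _ = plt.hist(gaf, bins=20, edgecolor="black")

    # Normal curve with the sample's mean and SD, scaled to the histogram's
    # area (number of cases times bin width) so the two are comparable
    x = np.linspace(gaf.min(), gaf.max(), 200)
    bin_width = bins[1] - bins[0]
    plt.plot(x, stats.norm.pdf(x, gaf.mean(), gaf.std()) * len(gaf) * bin_width)
    plt.xlabel("GAF score")
    plt.ylabel("Frequency")
    plt.show()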

Descriptive Statistics: Skewness and Kurtosis

A variety of opinions can be found concerning what is an unacceptable level of skewness (the symmetry of a distribution) and kurtosis (the clustering of scores toward the center of a distribution) for a particular variable—that is, how far from zero the value needs to be before it is considered a substantial enough departure from normality to be mentioned (we will discuss those concepts later in this chapter).

Table 3a.3 An Abbreviated Frequency Table of Global Assessment of Function (GAF) Scores for Asian American Community Mental Health Consumers

GAF Score      n     Percentage
 0             4        2.0
 2             1        0.5
15             1        0.5
16             2        1.0
21             3        1.5
25             1        0.5
28             3        1.5
31             4        2.0
33             5        2.5
35             2        1.0
36             6        3.0
37             5        2.5
39             7        3.5
40             6        3.0
41             8        4.0
43            10        5.0
45             9        4.5
46             5        2.5
47            10        5.0
48             8        4.0
49             9        4.5
52            13        6.5
53            12        6.0
54            12        6.0
57            11        5.5
58             8        4.0
59             7        3.5
60             8        4.0
65             5        2.5
66             3        1.5
68             4        2.0
73             3        1.5
75             3        1.5
78             1        0.5
80             1        0.5

Total        200      100.0

SOURCE: Gamst et al. (2003).

Some statisticians are more comfortable with a conservative threshold of ± 0.5 as indicative of departures from normality (e.g., Hair et al., 1998; Runyon, Coleman, & Pittenger, 2000), whereas others prefer a more liberal interpretation of ± 1.00 for skewness, kurtosis, or both (e.g., George & Mallery, 2003; Morgan, Griego, & Gloeckner, 2001). Tabachnick and Fidell (2001a) suggest a more definitive assessment strategy for detecting normality violations by dividing a skewness or kurtosis value by its respective standard error and evaluating this coefficient with a standard normal table of values (z scores). But such an approach has its own pitfalls, as Tabachnick and Fidell (2001a) note:

But if the sample is large, it is better to inspect the shape of the distribution instead of using formal inference because the equations for standard error of both skewness and kurtosis contain N, and normality is likely to be rejected with large samples even when the deviation is slight. (p. 44)
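The divide-by-standard-error strategy is easy to reproduce in code. The sketch below uses the large-sample approximations SE(skewness) ≈ √(6/N) and SE(kurtosis) ≈ √(24/N) rather than the exact small-sample formulas SPSS prints, so treat the result as a rough check:

    import numpy as np
    from scipy import stats

    scores = np.random.default_rng(1).normal(50, 10, 200)  # stand-in for real data
    n = len(scores)

    z_skew = stats.skew(scores) / np.sqrt(6 / n)
    z_kurt = stats.kurtosis(scores) / np.sqrt(24 / n)  # excess kurtosis: 0 if normal

    # |z| beyond roughly 2.58 (p < .01) or 3.29 (p < .001) suggests a violation,
    # but with large N even trivial departures will reach significance
    print(f"skewness z = {z_skew:.2f}, kurtosis z = {z_kurt:.2f}")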

[Figure 3a.1 Histogram Showing the Frequency Count of the GAF Scores. Histogram of the variable gafs (horizontal axis: GAF scores, 0 to 80; vertical axis: frequency, 0 to 40).]


Another helpful heuristic, at least regarding skewness, comes from the SPSS help menu, which suggests that any skewness value more than twice its standard error is taken to indicate a departure from symmetry. Unfortunately, no such heuristics are provided for determining normality violations due to extreme kurtosis. The shape of the distribution becomes of interest when researchers are evaluating their data against the assumptions of the statistical procedures they are considering using.

Stem-and-Leaf Plots

Figure 3a.2 provides the "next of kin" to a histogram display, called a stem-and-leaf plot. This display, introduced by the statistician John Tukey (1977) in Exploratory Data Analysis, represents hypothetical GAF scores that might have been found for a sample of individuals arriving for their first session for mental health counseling. Stem-and-leaf plots provide information about the frequency of a quantitative variable's values by incorporating the actual values of the distribution. These plots are composed of three main components.

On the far left side of Figure 3a.2 is the frequency with which a particular value (the one shown for that row) occurred.

Frequency     Stem &  Leaf

  1.00         1 .  5
  3.00         2 .  001
  3.00         2 .  555
  7.00         3 .  0011222
  9.00         3 .  555555588
 20.00         4 .  00000000000000122344
 26.00         4 .  55555555555555555555555788
 29.00         5 .  00000000000000000000001111113
 28.00         5 .  5555555555555555555555555888
 28.00         6 .  0000000000000000011111222222
 10.00         6 .  5555555558
  9.00         7 .  000000000
  2.00         7 .  55
  3.00         8 .  000

Stem width: 10
Each leaf: 1 case(s)

Figure 3a.2 Hypothetical Stem-and-Leaf Plot of GAF Scores at Intake

In the center of the figure is the "stem," and the far right portion is the "leaf." The stem is the base value that we combine with the leaf portion to derive the full value. For example, for the first row of Figure 3a.2, we note that the lowest GAF score value has a frequency of 1.00. With a stem of 1 and a leaf of 5, we recognize a GAF score value of 15. The next row represents scores in the low 20s. The three scores depicted here are 20, 20, and 21.

Stem-and-leaf plots ordinarily combine a range of individual values under a single stem. In Figure 3a.2, intervals of 5 values are tied to a single stem. Depending on how tightly the scores are grouped, one can have either a finer or more global picture of the distribution.

By observing the distribution of "leaves," researchers can quickly assess the general shape of the distribution; that is, they can form an impression as to whether it is normal, positively skewed (scores are more concentrated toward the low end of the distribution), or negatively skewed (scores are more concentrated toward the high end of the distribution). It is also possible to see in a general way whether its kurtosis is more positive (a peaked distribution among the middle values) or more negative (a relatively flat distribution) than a normal curve. Again, the skewness and kurtosis statistics that can be obtained through SPSS will sharpen the judgment made on the basis of visual inspection of the stem-and-leaf plot.

Box Plots

Box plots or box and whiskers plots were also introduced by Tukey (1977) to help researchers identify extreme scores. Extreme scores can adversely affect many of the statistics one would ordinarily compute in the course of performing routine statistical analyses. For example, the mean of 2, 4, 5, 6, and 64 is 16.2. The presence of the extreme score of 64 has resulted in a measure of central tendency that does not really represent the majority of the scores. In this case, the median value of 5 is a more representative value of the central tendency of this small distribution. As we will discuss at length later, extreme scores, known as outliers, are often removed or somehow replaced by more acceptable values. Box plots convey a considerable amount of information about the distribution in one fairly condensed display, and it is well worth mastering the terminology associated with box plots so that they become a part of your data screening arsenal. An excellent description of this topic is provided by Cohen (1996), and what we present here is heavily drawn from his treatment.


The General Form of the Box Plot

The general form of a box and whiskers plot is shown in Figure 3a.3. According to Cohen (1996), the box plot is based on the median rather than the mean because, as we just saw, the former measure is unaffected by extreme scores in the distribution. The "box" part of the box and whiskers plot is drawn in the middle of Figure 3a.3. The median (which is the 50th percentile or second quartile) is shown by the heavy dark line inside the box. In our drawing, the median is not at the center of the box but a bit toward its lower portion. This indicates that the distribution is somewhat negatively skewed (more scores are toward the low end of the scoring continuum).

The borders of the box are set at the 25th percentile (first quartile) and the 75th percentile (third quartile) for the lower and upper border, respectively, because in our box plot, lower scores are toward the bottom and higher scores are toward the top.

[Figure 3a.3 The General Form of a Box and Whiskers Plot Based on Cohen's (1996) Description. The plot is labeled, from lower scores to higher scores, with: outliers, the lower inner fence, the lower adjacent value, the lower whisker, the 25th percentile (first quartile, also called the lower Tukey's hinge), the median, the 75th percentile (third quartile, also called the upper Tukey's hinge), the upper whisker, the upper adjacent value, the upper inner fence, and outliers. The box spans the interquartile range (IQR), and each inner fence lies 1.5 IQRs beyond its hinge.]

These quartiles are a little less than ± 1 standard deviation unit but nonetheless capture the majority of the cases. As shown in Figure 3a.3, these borders are called Tukey's hinges, and the span of scores between the hinges (the distance between the first and third quartiles) is the interquartile range (IQR).

The two boundary lines appearing above and below the box in Figure 3a.3 are called inner fences. The one toward the top is the upper inner fence, and the one toward the bottom is the lower inner fence. These fences are drawn at the positions corresponding to ± 1.5 IQRs. That is, once we know the value for the interquartile range, we just multiply it by 1.5. Scores inside these fences are considered to be within the bounds of the distribution and are therefore not considered to be extreme.

The "whiskers" of the box and whiskers plot are the vertical lines perpendicular to the orientation of the box. The one at the top of the box is the upper whisker, and the one at the bottom of the box is the lower whisker. These whiskers extend only as far as the smallest and largest values that fall within the upper and lower inner fences. The upper whisker ends at the upper adjacent value, and the lower whisker ends at the lower adjacent value. Because the whiskers can end before they reach the inner fences, we can tell the "compactness" of the distribution.

The regions beyond the inner fences are considered to contain extreme scores by this plotting method. SPSS divides this area into two regions. A data point that is farther than ± 1.5 IQRs but less than ± 3.0 IQRs from the hinges is labeled by SPSS as an outlier and is shown in its output as "O." A data point that exceeds this ± 3.0 IQR distance is considered to be an extreme score and is given the symbol "E" in its output.
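The fence arithmetic is simple enough to verify by hand. A numpy sketch of the classification rule, with distances measured from the hinges as described above:

    import numpy as np

    scores = np.array([15, 20, 20, 21, 43, 45, 47, 50, 52, 55, 55, 57, 60, 62, 80])

    q1, q3 = np.percentile(scores, [25, 75])
    iqr = q3 - q1  # interquartile range

    lower_inner, upper_inner = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # inner fences
    lower_outer, upper_outer = q1 - 3.0 * iqr, q3 + 3.0 * iqr  # outer fences

    # Between the inner and outer fences: an outlier ("O" in SPSS output);
    # beyond the outer fences: an extreme score ("E")
    is_outlier = ((scores < lower_inner) & (scores >= lower_outer)) | \
                 ((scores > upper_inner) & (scores <= upper_outer))
    is_extreme = (scores < lower_outer) | (scores > upper_outer)
    print(scores[is_outlier], scores[is_extreme])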

An Example of a Box Plot

Figure 3a.4 provides an SPSS box and whiskers plot of the previous hypothetical GAF score data. As was true for the above example, the median seems to be a little off center and toward the lower end of the box. This suggests a somewhat negative skew to the distribution. In our example, the whiskers extend all the way to the inner fences. No scores were found in the extreme range, but some scores in the lower portion of the distribution (between ± 1.5 IQRs and ± 3 IQRs) were identified by SPSS as outliers. These are marked as "O"s in Figure 3a.4.

Scatterplot Matrices

We have been discussing various data cleaning and data screening devices that are used to assess one variable at a time; that is, we have assessed variables in a univariate manner. Because of the complex nature of the statistical analyses we will be employing throughout this book, it is also incumbent on us to screen variables in a bivariate and multivariate manner—that is, to examine the interrelationship of two or more variables for unusual patterns of variability in combination with each other. For example, we can ask about the relationship between GAF intake and GAF termination or between years lived in the United States and GAF intake. These sorts of questions can be routinely addressed with a scatterplot matrix of continuous variables.

We show an example of a scatterplot matrix in Figure 3a.5. In this case, we used four variables and obtained scatterplots for each combination. For ease of viewing, we present only the upper half of the matrix. Each entry represents the scatterplot of two variables. For example, the left-most plot on the first row shows the relationship of Variables A and B. It would appear, from the plot, that they might be related in a curvilinear rather than a linear manner. On the other hand, B and C seem to be related linearly. As we will see later in this chapter, these plots are often used to look for multivariate assumption violations of normality and linearity. An alternative approach to addressing linearity with SPSS, which we will not cover here, is to use the regression curve estimation procedure.
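A scatterplot matrix of this kind can be generated with a single call outside SPSS as well. A minimal pandas sketch, with random data standing in for Variables A through D:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(7)
    df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["A", "B", "C", "D"])

    # One scatterplot for every pairwise combination of variables,
    # with a histogram of each variable on the diagonal
    pd.plotting.scatter_matrix(df, figsize=(8, 8), diagonal="hist")
    plt.show()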

[Figure 3a.4 Box Plot of the Hypothetical GAF Scores at Intake. A box plot of the variable hgaf (vertical axis: GAF scores, 10 to 80), with the low-end outlying cases flagged by case number.]

[Figure 3a.5 Scatterplot Matrix of Four Continuous Variables. The upper half of the matrix of pairwise scatterplots among Variables A, B, C, and D.]

Dealing With Missing Values

During the data screening process, we encounter missing data for a variety of reasons. Respondents may refuse to answer personal questions pertaining to their income, sexual orientation, or current illegal drug use. Conversely, some respondents may not be competent to respond because of a lack of knowledge regarding a particular topic. Participants in an experiment may suffer from fatigue or lack of motivation and simply stop responding. Archival data may be missing because of data entry errors or equipment malfunctions. The paramount question concerning the issue of missing data is whether these missing values are a function of a random or a systematic process.


Missing Values Patterns: Random Patterns of Missing Data

Allison (2002) suggests two broad classes of randomly missing data. Observations are said to be missing completely at random (MCAR) if none of the variables in the data set (including all independent and dependent variables) contain missing values related to the values of the variable under scrutiny. For example, missing data for GAF scores at the termination of a treatment program might be considered MCAR if nonresponse (another term for missing data) was no more or less likely across major diagnostic classifications.

Although often claimed or implied, the MCAR assumption is seldom achieved (Allison, 2002). A weaker and potentially more achievable version of this assumption is that the missing data are missing at random (MAR). This assumption suggests that a variable's missing values are said to be random if, after controlling for other variables, the variable cannot predict the distribution of the missing data. Continuing with our previous example, suppose we found that respondents with missing data on GAF scores at the termination of a treatment program were more likely to be classified in the severe as opposed to the moderate diagnostic categories. These missing GAF scores would then be related to diagnostic category, and we would conclude that the GAF score data are not missing at random.

Keeping these missing data assumptions in mind should help guide you in the process of determining the seriousness of your missing data situation. If the data conform to the MCAR or MAR criteria, then perhaps you have what is termed an ignorable missing data situation (you probably do not have a problem with the distribution of the missing data, but you may still have a problem if you have a great deal of missing data). If the missing data distributions do not meet these criteria, then you are faced with a missing data scenario that you probably cannot ignore. Let's look at several ways of assessing this latter possibility.

Looking for Patterns

Consider the hypothetical clinical outcome study for 15 cases represented in Table 3a.4. The left portion of the table shows the data collected by the agency for four variables. A GAF score is obtained at intake (shown as GAF-T1 for "Time 1") and again 6 months later (GAF-T2 for "Time 2"). Also recorded is the age of the client at intake and the number of counseling sessions for which the client was present. The right portion of Table 3a.4 is a tabulation of the missing data situation; it shows the absolute number of missing data points and their percentage with respect to a complete set of four data points. For example, the third case was missing a second GAF score and a record of the number of therapy sessions he or she attended. Thus, these two missing values made up 50% of the total of four data points that should have been there.

A quick inspection of Table 3a.4 indicates that missing data are scattered across all four of the variables under study. We can also note that 9 of the 15 cases (60%) have no missing data. Because almost all the multivariate procedures we talk about in this book need to be run on a complete data set, the defaults set by the statistical program would select only these 9 cases with full data, excluding all cases with a missing value on one or more variables. This issue of sample size reduction as a function of missing data, especially when this reduction appears to be nonrandom, can threaten the external validity of the research.
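Tabulations like those in the right portion of Table 3a.4 take one line in each direction with pandas. A sketch, assuming a hypothetical outcome.csv holding the four variables:

    import pandas as pd

    df = pd.read_csv("outcome.csv")  # hypothetical columns: gaf_t1, gaf_t2, age, sessions

    # Per-case pattern: number and percentage of the four data points missing
    n_missing = df.isna().sum(axis=1)
    pct_missing = (n_missing / df.shape[1] * 100).round(1)

    # Per-variable pattern: counts and percentages missing in each column
    print(df.isna().sum())
    print((df.isna().mean() * 100).round(1))

    # Cases that listwise deletion would retain (complete on every variable)
    print(f"{len(df.dropna())} of {len(df)} cases have complete data")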

Further inspection of Table 3a.4 shows that one third of the sample (5 respondents) had missing data on the GAF-T2 (the 6-month postintake measurement) variable. Normally, such a relatively large proportion of nonresponse for one variable would nominate it as a possible candidate for deletion (by not including that variable in the analysis, more cases would have a complete data set and thus contribute to the analysis).

Table 3a.4 Missing Data Table (Hypothetical Data)

                      Variables                          Pattern
Case   GAF-T1   GAF-T2   Age   Sessions      # Missing   % Missing
 1      51.0     51.0     30      8              0           0
 2      55.0      —       63      4              1          25
 3      40.0      —       57      —              2          50
 4      38.0     50.0     31     10              0           0
 5      80.0     80.0     19     11              0           0
 6      40.0      —       50      2              1          25
 7      55.0     55.0     19      8              0           0
 8      50.0     70.0     20      8              0           0
 9      62.0     70.0     20     10              0           0
10      65.0     75.0     19      7              0           0
11       —        —       38      —              3          75
12      50.0     61.0     65      9              0           0
13      40.0     55.0      —      8              1          25
14      40.0     50.0     46      9              0           0
15      32.0      —       44      3              1          25

# Missing in column      1      5      1      2                 9
% Missing in column    6.7   33.3    6.7   13.3              15.0

NOTE: GAF-T1 = GAF at intake; GAF-T2 = GAF at Time 2; Sessions = number of treatment sessions.

Nonresponse to a postmeasure on a longitudinal (multiply measured) variable is a fairly common occurrence. If this variable is crucial to our analyses, then some form of item imputation procedure (i.e., estimation of what the missing value might have been and replacement of the missing value with that estimate) may be in order.

We should also be attentive to how much missing data are associated with each measure. As a general rule, variables containing missing data on 5% or fewer of the cases can be ignored (Tabachnick & Fidell, 2001b). The second GAF variable, showing a third of its values as missing, has a missing value rate substantially greater than this general rule of thumb; this variable therefore needs a closer look. In the present example, although the remaining three variables exceed this 5% mark as well, the relative frequency of cases with missing data is small enough to ignore.

An important consideration at this juncture, then, is the randomness of the missing data pattern for the GAF-T2 variable. A closer inspection of the missing values for GAF-T2 indicates that they tend to occur among the older respondents. This is probably an indication of a systematic pattern of nonresponse and thus is probably not ignorable.

Finally, we can see from Table 3a.4 that Case 11 is missing data on three of the four variables. Missing such a large proportion of data points would make a strong argument for that case to be deleted from the data analysis.

Methods of Handling Missing Data

Although there are a number of established procedures for dealing with item nonresponse or missing data, experts differ on their personal recommendations for which techniques to use under varying degrees of randomness of the missing data process. For excellent introductory overviews of this topic, see Graham, Cumsille, and Elek-Fisk (2003), Hair et al. (1998), Schafer and Graham (2002), and Tabachnick and Fidell (2001b). More advanced coverage of missing data can be found in Allison (2002), Little and Rubin (2002), and Schafer (1997).

Here are some of the more common approaches used to address missing data situations.

Listwise Deletion

This method involves deleting from the particular statistical analysis all cases that have missing data. We call this method listwise because we are deleting cases with missing data on any variable in our list. In this method, a single missing value on just a single variable in the analysis is cause for a case to be excluded from the statistical analysis. As we mentioned previously, this is a standard practice for most computer statistical packages, including SPSS.

A practical advantage of listwise deletion is that this method can be used in a variety of multivariate techniques (e.g., multiple regression, structural equation modeling), and it ordinarily requires no additional commands or computations. One obvious concern about this approach involves the loss of cases that could have been very difficult and expensive (in time or other resources) to obtain. Another concern is that the sample size reduction may increase the estimate of measurement error (standard errors increase with lower sample sizes). Finally, lowering the sample size may drop it below the relatively large N needed for most multivariate procedures.

Allison (2002) gives the listwise deletion method a very strong endorsement when he notes:

Listwise deletion is not a bad method for handling missing data. Although it does not use all of the available information, at least it gives valid inferences when the data are MCAR. . . . whenever the probability of missing data on a particular independent variable depends on the value of that variable (and not the dependent variable), listwise deletion may do better than maximum likelihood or multiple imputation. (p. 7)

Pairwise Deletion

This approach computes summary statistics (e.g., means, standard deviations, correlations) from all available cases that have valid values (it is the SPSS default method of handling missing values for these computations in most of the procedures designed to produce descriptive statistics). Thus, no cases are necessarily completely excluded from the data analysis. Cases with missing values on certain variables would still be included when other variables (on which they had valid values) were involved in the analysis. If we wanted to compute the mean for variables X, Y, and Z using, for example, the Frequencies procedure in SPSS, all cases with valid values on X would be brought into the calculation of X's mean, all cases with valid values on Y would be used for the calculation of Y's mean, and all cases with valid values of Z would be used for the calculation of Z's mean. It is therefore possible that the three means could very well be based on somewhat different cases and somewhat different Ns.


Correlation presents another instance where pairwise deletion is the default. To be included in computing the correlation of X and Y, cases must have valid values on both variables. Assume that Case 73 is missing a value on the Y variable. That case is therefore excluded in the calculation of the correlations between X and Y and between Y and Z. But that case will be included in computing the correlation between X and Z if it has valid values for those variables. It is therefore not unusual for correlations produced by the Correlations procedure in SPSS to be based on somewhat different cases. Note that when correlations are computed in one of the multivariate procedures where listwise deletion is in effect, that method rather than pairwise deletion is used, so the correlations in the resulting correlation matrix are based on exactly the same cases.
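The listwise/pairwise distinction is easy to see in code: pandas' corr method deletes pairwise by default, whereas dropping incomplete cases first forces listwise deletion. A small sketch:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"X": [1.0, 2.0, 3.0, 4.0, 5.0],
                       "Y": [2.0, 4.0, np.nan, 8.0, 10.0],
                       "Z": [5.0, 3.0, 4.0, np.nan, 1.0]})

    # Pairwise deletion: each correlation uses every case valid on that pair,
    # so different cells of the matrix can rest on different cases and Ns
    print(df.corr())

    # Listwise deletion: only cases complete on X, Y, and Z enter any cell,
    # so every correlation rests on exactly the same cases
    print(df.dropna().corr())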

Although pairwise deletion can be successfully used with linear regression, factor analysis, and structural equation modeling (see Allison, 2002), this method is clearly most reliable when the data are MCAR. Furthermore, statistical software algorithms for computing standard errors with pairwise deletion show a considerable degree of variability and even bias (Allison, 2002). Our recommendation is not to use pairwise deletion when conducting multiple regression, factor analysis, or structural equation modeling.

Imputation Procedures

The next approaches to missing data that we describe are collectively referred to as imputation procedures. These methods attempt to impute or substitute for a missing value some other value that is deemed to be a reasonable guess or estimate. The statistical analysis is then conducted using these imputed values. Although these imputation methods do preserve sample size, we urge caution in the use of these somewhat intoxicating remedies for missing data situations. Permanently altering the raw data in a data set can have potentially catastrophic consequences for the beginning and experienced multivariate researcher alike. For good overviews of these procedures, see Allison (2002), Hair et al. (1998), and Tabachnick and Fidell (2001b).

Mean Substitution

Mean substitution calls for replacing all missing values of a variable with the mean of that variable. The mean to be used as the replacement value, of course, must be based on all the valid cases in the data file. This is both the most common and most conservative of the imputation practices.


The argument for using mean substitution is based on the accepted rubric that the sample mean is the best estimate of the population mean. An analogous argument is used for the mean substitution procedure: The best estimate of what a missing value might be is the mean of the values that we have. Now, we know that not every missing value would fall on this mean—some values would be lower, and some values would be higher. But in the absence of contradictory information, we estimate that the average of these missing values would be equal to the average of the valid values. Based on that reasoning, we then substitute the mean that we have for the values that are missing.

At the same time, it is important to recognize that the true values for these missing cases would almost certainly vary over at least a modest range of scores. By substituting the same single value (the mean of our observed values) for every missing value, even granting that it is a reasonable estimate, we must accept the consequence that this procedure artificially reduces the variability of that variable.

As you might expect from what we have just said, there are at least three drawbacks to the mean substitution strategy. First, the assumption that the missing values are randomly distributed among the cases is not always fully tested. Second, although the mean of a distribution is the best estimate we have of the population parameter, it is still likely to fall only within a certain margin of error of it (e.g., ± 1.96 standard error units). Thus, the sample mean, although our best estimate, may not fall at the true value of the parameter. Third, the variance of the variable having its missing values replaced is necessarily reduced when we remove the missing values and substitute the mean of the valid cases. That narrowing of the variance, in turn, can distort the variable's distribution of values (Hair et al., 1998; Tabachnick & Fidell, 2001b) and can therefore bias the statistical analysis.

An offshoot of this approach is to use a subgroup mean rather than the full sample mean in the substitution process. For example, if we know the ethnicity, diagnostic category, or some other information about the cases that is determined to be useful in facilitating prediction, we could calculate the mean of that subgroup and substitute it for the missing values of those individuals in that subgroup. For example, if we had reason to believe that sex was the most relevant variable with respect to Variable K, then we would obtain separate means on K for women and men and substitute the former for any missing K values associated with women and the latter for any missing K values associated with men. This approach may be more attractive than sample-wise mean substitution because it narrows the configuration of cases on which the imputation is based, but it does require that the researchers articulate their reasoning for selecting the subgroup that they did.
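Both flavors of mean substitution are one-liners in pandas. A sketch with hypothetical column names, where k is the variable being imputed and sex is the grouping variable:

    import pandas as pd

    df = pd.read_csv("survey.csv")  # hypothetical columns: k (with gaps), sex

    # Sample-wise mean substitution: every missing k gets the overall valid-case mean
    df["k_mean"] = df["k"].fillna(df["k"].mean())

    # Subgroup mean substitution: a missing k gets the mean of that case's sex subgroup
    df["k_group_mean"] = df["k"].fillna(df.groupby("sex")["k"].transform("mean"))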


Multiple Regression Imputation

Multiple regression will be discussed in Chapters 5A and 5B. Basically, this approach to item nonresponse builds a multiple regression equation to predict missing values. With multiple regression, we use several independent variables to build a model (i.e., generate a regression equation) that allows us to predict a dependent variable value. When we wish to replace missing values on a particular variable, we use that variable as the dependent variable in a multiple regression procedure. A prediction equation is thus produced based on the cases with complete data. With this equation, we predict (i.e., impute, figure out the particular values to substitute for) the missing values on that variable.
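A bare-bones version of the procedure can be sketched with scikit-learn (hypothetical column names; note that this simple form assumes the predictors themselves are complete, a limitation discussed next):

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.read_csv("outcome.csv")  # hypothetical columns: gaf_t2 (with gaps), gaf_t1, age

    predictors = ["gaf_t1", "age"]
    known = df[df["gaf_t2"].notna()]
    unknown = df[df["gaf_t2"].isna()]

    # Build the prediction equation from the cases with complete data...
    model = LinearRegression().fit(known[predictors], known["gaf_t2"])

    # ...and impute the missing values from that equation
    df.loc[df["gaf_t2"].isna(), "gaf_t2"] = model.predict(unknown[predictors])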

This regression method is a better, more sophisticated approach than many of the previous methods we have reviewed. However, problems can arise when missing values occur on multiple independent variables or on the dependent variable (Allison, 2002). There is also a tendency to "overfit" the missing values because they are predicted from other independent variables (Tabachnick & Fidell, 2001b). Such overfitting produces samples that may not reflect or generalize to the population from which they were drawn. This same theme is echoed by Allison (2002), who notes, "Analyzing imputed data as though it were complete data produces standard errors that are underestimated and test statistics that are overestimated" (p. 12).

Combining Imputation Procedures

Some researchers prefer using combinations of the approaches we have thus far described to address the specific issue that they face in their own data set. Such an approach sometimes overcomes the limitations found in using any one of the imputation methods in isolation. Another way to consolidate the approaches is to make use of expectation maximization.

Expectation Maximization Imputation

Recent software innovations, such as SPSS's optional Missing Value Analysis module, have made expectation maximization (EM) imputation an attractive synthesis and extension of some of the approaches that we have described (other statistical software package modules as well as stand-alone programs are also available). The EM imputation approach used by the SPSS Missing Value Analysis module uses a maximum likelihood approach for estimating missing values (Little & Rubin, 2002). As we will see in Chapter 14A, maximum likelihood results are very similar to those obtained through least squares linear regression (Allison, 2002).


The EM algorithm is a two-step iterative process. During the E step, regression analyses are used to estimate the missing values. Using maximum likelihood procedures, the M step makes estimates of parameters (e.g., correlations) using the missing data replacements. The SPSS program iterates through the E and M steps until convergence, that is, until no change occurs between the steps (Allison, 2002; Hair et al., 1998; Tabachnick & Fidell, 2001b).

When comparing EM with regression imputation procedures, Allison (2002) notes some important advantages:

The EM algorithm avoids one of the difficulties with conventional regression imputations—deciding which variables to use as predictors and coping with the fact that different missing data patterns have different sets of available predictors. Because EM always starts with the full covariance matrix, it is possible to get regression estimates for any set of predictors, no matter how few cases there may be in a particular missing data pattern. Hence, EM always uses all the available variables as predictors for imputing the missing data. (p. 20)
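Readers without the SPSS module can approximate this workflow with scikit-learn's IterativeImputer, which likewise regresses each incomplete variable on all the others and iterates to convergence. Note that this is chained-equations imputation rather than EM proper, so treat it as a related technique, not a reimplementation:

    import pandas as pd
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    df = pd.read_csv("outcome.csv")  # hypothetical all-numeric data with scattered gaps

    # Each variable with missing values is predicted from all the others,
    # cycling until the imputed values stabilize or max_iter is reached
    imputer = IterativeImputer(max_iter=10, random_state=0)
    completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)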

Beyond EM Imputation

A more recently developed approach to estimating missing values is a multiple imputation (MI) procedure (see Allison, 2002; Graham et al., 2003; Schafer & Graham, 2002; West, 2005). Although at the time we are writing our book this MI procedure has been implemented by SAS, SPSS has not yet incorporated it. MI continues where EM stops. Very briefly stated, EM generates an estimate of the missing value. Recognizing that this is not a precise estimate but, rather, has some estimation error associated with it, MI adds to or subtracts from the EM estimate the value of a randomly chosen residual from the regression analysis, thus building some error variance into the estimated value. MI cycles through the prediction and residual selection process in an iterative manner until convergence on the estimated values of the model's parameters is reached.

This whole process is usually performed between 5 and 10 times (West, 2005), generating many separate parameter estimates. The final values of the estimated parameters of interest are then computed based on the information from these various estimates. Because we are attempting to best predict the missing values, this approach allows us to include in this effort variables that are not necessarily related to the research question but that might enhance our prediction of the missing values themselves, provided that such variables were thoughtfully included in the original data collection process.


Recommendations

We agree with both the sage and tongue-in-cheek advice of Allison (2002) that "the only really good solution to the missing data problem is not to have any" (p. 2). Because the likelihood of such an eventuality is low, we encourage you to explore your missing values situation. As a first step, you could compare cases with and without missing values on variables of interest using independent samples t tests. For example, cases with missing gender data could be coded as 1 and cases with complete gender data could be coded as 0. Then you could check to see if any statistically significant differences emerge for this "dummy" coded independent variable on a dependent variable of choice, such as respondent's GAF score. Such an analysis may give you confidence that the missing values are or are not related to a given variable under study.
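That first step translates directly into a few lines of scipy (hypothetical column names again):

    import pandas as pd
    from scipy import stats

    df = pd.read_csv("survey.csv")  # hypothetical columns: gender (with gaps), gaf

    # Dummy-code missingness: 1 = gender missing, 0 = gender present
    df["gender_missing"] = df["gender"].isna().astype(int)

    # Independent samples t test on GAF across the two missingness groups
    missing = df.loc[df["gender_missing"] == 1, "gaf"].dropna()
    present = df.loc[df["gender_missing"] == 0, "gaf"].dropna()
    t, p = stats.ttest_ind(missing, present)
    print(f"t = {t:.2f}, p = {p:.3f}")  # a significant p hints at nonrandom missingness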

If you elect to use some form of missing value imputation process, it is worthwhile to compare your statistical analysis using imputed values with one using only cases with complete data (Tabachnick & Fidell, 2001b). If no differences emerge between the "complete" and "imputed" data sets, then you can have confidence that your missing value interventions reflect statistical reality. If they are different, then further exploration is in order.

We recommend the use of listwise case deletion when you have small numbers of missing values that are MCAR or MAR. Deleting variables that contain high proportions of missing data can also be desirable if those variables are not crucial to your study. Mean substitution and regression imputation procedures can also be profitably employed when missing values are proportionately small (Tabachnick & Fidell, 2001b), but we recommend careful pre-post data set appraisal as outlined above as well as consultation with a statistician as you feel necessary. If you have the SPSS Missing Values Analysis module, then we suggest using that module to handle your missing values; it will do an adequate job of estimating replacement values. Finally, if you have access to an MI procedure (e.g., SAS) and feel comfortable enough working with it and subsequently importing your data to SPSS to perform the statistical analyses, then we certainly recommend using that procedure.

Outliers

Cases with extreme or unusual values on a single variable (univariate) or on a combination of variables (multivariate) are called outliers. These outliers provide researchers with a mixed opportunity in that their existence may signal a serendipitous presence of new and exciting patterns within a data set, yet they may also signal anomalies within the data that may need to be addressed before proceeding with the statistical analyses. However, extreme splits on dichotomous variables are more the norm than the exception in clinical and applied research. Accordingly, if the sample size is sufficiently large, these extreme bifurcations should not pose a great problem.

Causes of Outliers

Hair et al. (1998) identify four reasons for outliers in a data set.

1. Outliers can be caused by data entry errors or improper attribute coding. These errors are normally caught in the data cleaning stage.

2. Some outliers may be a function of extraordinary events or unusual circumstances. For example, in a human memory experiment, a participant may recall all 80 of the stimulus items correctly, or illness may strike a participant during the middle of a clinical interview, changing the nature of her responses when she returns the following week to finish the interview. Most of the time, the safest course is to eliminate outliers produced by these circumstances—but not always. The fundamental question you should ask yourself is, "Does this outlier represent my sample?" If "yes," then you should include it.

3. There are some outliers for which we have no explanation. These unexplainable outliers are good candidates for deletion.

4. There are multivariate outliers whose uniqueness occurs in their pattern of combination of values on several variables—for example, unusual combined patterns of age, gender, and number of arrests.

Detection of Univariate Outliers

Univariate outliers can be identified by an inspection of the frequency distribution or box plot for each variable. Dichotomous variables (e.g., "yes," "no") with extreme splits (e.g., 90%–10%) between response options should be deleted (Tabachnick & Fidell, 2001b).

For continuous variables, several options exist for determining a threshold for outlier designation. Hair et al. (1998) recommend converting the values of each variable to standard (i.e., z) scores with a mean of 0 and a standard deviation of 1. This can be accomplished easily with SPSS's Explore or Descriptives programs, where z scores can be computed and saved in the data file for later profiling. As a general heuristic, Hair et al. (1998) recommend considering cases with z scores exceeding ± 2.5 to be outliers. These should be carefully considered for possible deletion. Conversely, Cohen et al. (2003) provide this tip on outliers, stating that "if outliers are few (less than 1% or 2% of n) and not very extreme, they are probably best left alone" (p. 128).
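The z-score screen itself is a two-liner. A pandas sketch using the ± 2.5 heuristic from Hair et al. (1998):

    import pandas as pd

    df = pd.read_csv("gaf.csv")  # hypothetical data with a 'gaf' column

    # Standardize to z scores (mean 0, SD 1) and flag cases beyond +/- 2.5
    z = (df["gaf"] - df["gaf"].mean()) / df["gaf"].std()
    print(df[z.abs() > 2.5])  # candidate univariate outliers for closer review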

An alternative approach to univariate detection of outliers involves inspecting histograms, box plots, and normal probability plots (Tabachnick & Fidell, 2001b). Univariate outliers reveal themselves through their visible separation from the bulk of the cases on a particular variable when profiled with these graphical techniques.

Detection of Multivariate Outliers

After inspecting the data set for univariate outliers, an assessment for multivariate outliers is in order. As a first step in looking for outliers on a combination of variables, we recommend running bivariate (i.e., two-variable) scatterplots for combinations of key variables. In these plots (such as we showed earlier in Figure 3a.5), each case is represented as a point defined by the X and Y axes. Most cases fall within the elliptical (oval-shaped) swarm or pattern mass. Outliers are those cases that tend to lie outside the oval.

A more objective way of assessing the presence of multivariate outliers is to compute each case's Mahalanobis distance. The Mahalanobis distance statistic D² measures the multivariate "distance" between each case and the group multivariate mean (known as a centroid). Each case is evaluated using the chi-square distribution with a stringent alpha level of .001. Cases that reach this significance threshold can be considered multivariate outliers and possible candidates for elimination.
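A sketch of the Mahalanobis screen in numpy and scipy, computing D² for each case against the centroid and referring it to a chi-square distribution with degrees of freedom equal to the number of variables:

    import numpy as np
    from scipy import stats

    X = np.random.default_rng(3).normal(size=(200, 4))  # stand-in for four variables

    centroid = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

    # Squared Mahalanobis distance of each case from the centroid
    diff = X - centroid
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

    # Evaluate against chi-square with df = number of variables, alpha = .001
    critical = stats.chi2.ppf(1 - 0.001, df=X.shape[1])
    print(np.where(d2 > critical)[0])  # candidate multivariate outliers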

Multivariate Statistical Assumptions

Statistical assumptions underlie most univariate and multivariate statistical tests. Of special significance to multivariate analyses are the assumptions of normality, linearity, and homoscedasticity. Should one or more of these assumptions be violated, then the statistical results may become biased or distorted (Hair et al., 1998; Keppel, 1991; Tabachnick & Fidell, 2001b).

Normality

The shape of a distribution of continuous variables in a multivariate analysis should correspond to a (univariate) normal distribution. That is, the variable's frequency distribution of values should roughly approximate a bell-shaped curve. Both Stevens (2002) and Tabachnick and Fidell (2001b) indicate that univariate normality violations can be assessed with statistical or graphical approaches.


Statistical Approaches

Statistical approaches that assess univariate normality often begin with measures of skewness and kurtosis. Skewness is a measure of the symmetry of a distribution; positive skewness indicates that a distribution's mean lies on the right side of the distribution, and negative skewness indicates that a distribution's mean lies on the left side of the distribution. Kurtosis is a measure of the general peakedness of a distribution. Positive kurtosis, also called leptokurtosis, indicates an extreme peak in the center of the distribution; negative kurtosis, also called platykurtosis, suggests an extremely flat distribution. A normally distributed variable (one exhibiting mesokurtosis) will generate skewness and kurtosis values that hover around zero. These values can be obtained with SPSS through its Frequencies, Descriptives, and Explore procedures; the latter two procedures also produce significance tests, which are typically evaluated at a stringent alpha level of .01 or .001 (Tabachnick & Fidell, 2001b).

Additional statistical tests include the Kolmogorov-Smirnov test and the Shapiro-Wilk test. Although both tests can be effectively employed, Stevens (2002) recommends the use of the Shapiro-Wilk test because it appears to be “the most powerful in detecting departures from normality” (p. 264). Both tests can be obtained through the SPSS Explore procedure. Statistical significance on these tests, ideally at a stringent alpha level (p < .001), indicates a possible univariate normality violation.
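A comparable check in Python (a sketch with simulated data, not the SPSS Explore output) uses scipy.stats.shapiro together with the stringent criterion noted above:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(1)
x = rng.normal(loc=50, scale=10, size=200)   # simulated, approximately normal data

w, p = shapiro(x)
if p < 0.001:                                # stringent alpha, per the text
    print(f"W = {w:.3f}, p = {p:.4f}: possible normality violation")
else:
    print(f"W = {w:.3f}, p = {p:.4f}: no evidence against normality")
```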

Graphical Approaches

Graphical approaches that assess univariate normality typically begin with an inspection of histograms or stem-and-leaf plots for each variable. However, such cursory depictions do not provide a definitive indication of a normality violation. A more precise graphical method is to use a normal probability plot, where the values of a variable are rank ordered and plotted against expected normal distribution values (Stevens, 2002). In these plots, a normal distribution produces a straight diagonal line, and the plotted data values are compared with this diagonal. Normality is assumed if the data values follow the diagonal line.
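Such a plot can be generated outside SPSS as well; for instance, a brief Python sketch using scipy.stats.probplot (the simulated variable is hypothetical):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=150)                 # substitute the variable being screened

# Points hugging the diagonal reference line are consistent with normality.
stats.probplot(x, dist="norm", plot=plt)
plt.title("Normal probability plot")
plt.show()
```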

Multivariate Approaches

So far we have been discussing the assumption of univariate normality. The assumption of multivariate normality, although somewhat more complicated, is intimately related to its univariate counterpart. Stevens (2002) cautions researchers that demonstrating univariate normality on each variable in a data set does not guarantee that multivariate normality—the condition in which observations among all combinations of variables are normally distributed—is satisfied. As Stevens (2002) notes, “Although it is difficult to completely characterize multivariate normality, normality on each of the variables separately is a necessary, but not sufficient, condition for multivariate normality to hold” (p. 262). Thus, although univariate normality is an essential ingredient of multivariate normality, Stevens argues that two other conditions must also be met: (a) linear combinations of the variables (e.g., variates) should be normally distributed, and (b) all pairwise combinations of variables should also be normally distributed.

As we noted previously, SPSS offers a procedure to easily examine whether or not univariate normality is present among the variables with various statistical tests and graphical options, but it does not offer a statistical test for multivariate normality. We therefore recommend a thorough univariate normality examination coupled with a bivariate scatterplot examination of key pairs of variables. If the normality assumption appears to be violated, it may be possible to “repair” this problem through a data transformation process.

Linearity

Many of the multivariate techniques we cover in this text (e.g., multiple regression, multivariate analysis of variance [MANOVA], factor analysis) assume that the variables in the analysis are related to each other in a linear manner; that is, they assume that the best-fitting function representing the scatterplot is a straight line. Based on this assumption, these procedures often compute the Pearson correlation coefficient (or a variant of it) as part of the calculations needed for the multivariate statistical analysis. As we will discuss in Chapter 4A, the Pearson r assesses the degree of linear relationship observed between two variables; nonlinear relationships between two variables cannot be assessed by it. To the extent that such nonlinearity is present, the observed Pearson r will understate the strength of the association between the two variables, because it can capture only the linear component of the relationship.
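This limitation is easy to demonstrate. In the hypothetical Python sketch below, a strong but purely quadratic relationship yields a Pearson r near zero:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=500)
y = x**2 + rng.normal(scale=0.5, size=500)   # strong, but entirely nonlinear, relation

r, p = pearsonr(x, y)
print(f"Pearson r = {r:.3f}")                # close to 0 despite the strong relation
```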

The use of bivariate scatterplots is the most typical way of assessing linearity between two variables. Variables that are both normally distributed and linearly related to each other will produce scatterplots that are oval shaped or elliptical. If one of the variables is not normally distributed, linearity will not be achieved; we can recognize this situation because the resulting scatterplot will be nonelliptical (Tabachnick & Fidell, 2001b). However, there is a downside to running a plethora of bivariate scatterplots, as Tabachnick and Fidell (2001b) aptly note: “Assessing linearity through bivariate scatterplots is reminiscent of reading tea leaves, especially with small samples. And there are many cups of tea if there are several variables and all possible pairs are examined” (p. 78).

Another approach (often used in the context of multiple regression) is to run a regression analysis and examine the residuals plot. Residuals depict the portion (or “left over” part) of the dependent variable’s variance that was not explained by the regression analysis (i.e., the error component). We will see this in Chapter 3B. The “cure” for nonlinearity lies in data transformation.
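As a hypothetical sketch of this residuals check in Python (the variables are simulated): a patternless horizontal band of residuals is consistent with linearity, whereas a curved band suggests nonlinearity.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)          # substitute your own variables

slope, intercept = np.polyfit(x, y, deg=1)  # simple least-squares fit
predicted = slope * x + intercept
residuals = y - predicted                   # the unexplained (error) component

plt.scatter(predicted, residuals, s=10)
plt.axhline(0.0, linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()
```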

Homoscedasticity

The assumption of homoscedasticity suggests that quantitative dependent variables have equal levels of variability across a range of (either continuous or categorical) independent variables (Hair et al., 1998). Violation of this assumption results in heteroscedasticity. Heteroscedasticity typically occurs when a variable is not distributed in a normal manner or when a data transformation procedure has produced an unanticipated distribution for a variable (Tabachnick & Fidell, 2001b).

In the univariate analysis of variance (ANOVA) context (with one quantitative dependent variable and one or more categorical independent variables), this homoscedasticity assumption is referred to as homogeneity of variance, in which it is assumed that equal variances of the dependent measure are observed across the levels of the independent variables (Keppel, 1991; Keppel, Saufley, & Tokunaga, 1992).

Several statistical tests can be used to detect homogeneity of variance violations, including Fmax and Levene’s test. The Fmax statistic is computed by dividing the largest group variance by the smallest group variance. Keppel et al. (1992) note that any Fmax value of 3.0 or greater is indicative of an assumption violation, and they recommend the use of the more stringent alpha level of p < .025 when evaluating the F ratio. Alternatively, Levene’s test assesses the statistical hypothesis of equal variances across the levels of the independent variable; rejection of the null hypothesis (at p < .05) indicates an assumption violation, that is, unequal variability. Stevens (2002) cautions about the use of the Fmax test because of its extreme sensitivity to violations of normality. The Fmax statistic can be produced through the SPSS ANOVA procedure, and the Levene test can be produced with the SPSS Explore and Oneway procedures.
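Both checks are straightforward to sketch in Python as well (three hypothetical groups are simulated here; scipy.stats.levene implements Levene’s test, and Fmax is simply a variance ratio):

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(5)
groups = [rng.normal(loc=50, scale=s, size=30) for s in (8, 10, 12)]  # hypothetical groups

variances = [np.var(g, ddof=1) for g in groups]  # sample variance of each group
f_max = max(variances) / min(variances)
print(f"F-max = {f_max:.2f}")                    # 3.0 or greater suggests a violation

w, p = levene(*groups)
print(f"Levene W = {w:.2f}, p = {p:.4f}")        # p < .05 suggests unequal variances
```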


When more than one quantitative dependent variable is being assessed (as in the case of MANOVA), Box’s M test for equality of variance-covariance matrices is used to test for homoscedasticity. Akin to its univariate counterpart, Levene’s test, Box’s M tests the statistical hypothesis that the variance-covariance matrices are equal across groups. A statistically significant (p < .05) Box’s M test indicates a homoscedasticity assumption violation, but the test is very sensitive to any departures from normality among the variables under scrutiny (Stevens, 2002).
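SciPy has no built-in Box’s M, so the sketch below implements the standard chi-square approximation to the test from its textbook formula; treat it as an illustration under that assumption rather than a validated routine:

```python
import numpy as np
from scipy.stats import chi2

def box_m(groups):
    """Box's M test of equal variance-covariance matrices (chi-square approximation).

    groups: list of (n_i, p) arrays, one per group.
    """
    k = len(groups)
    p = groups[0].shape[1]
    ns = np.array([g.shape[0] for g in groups])
    covs = [np.cov(g, rowvar=False) for g in groups]          # per-group S_i
    pooled = sum((n - 1) * s for n, s in zip(ns, covs)) / (ns.sum() - k)

    # M contrasts the pooled log-determinant with the group log-determinants.
    m = (ns.sum() - k) * np.linalg.slogdet(pooled)[1] - sum(
        (n - 1) * np.linalg.slogdet(s)[1] for n, s in zip(ns, covs))

    # Box's scaling constant and degrees of freedom for the approximation.
    c = ((2 * p**2 + 3 * p - 1) / (6 * (p + 1) * (k - 1))) * (
        np.sum(1.0 / (ns - 1)) - 1.0 / (ns.sum() - k))
    df = p * (p + 1) * (k - 1) / 2
    chi_sq = m * (1 - c)
    return chi_sq, df, chi2.sf(chi_sq, df)
```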

Typically, problems related to homoscedasticity violations can be attributed to issues of normality violations for one or more of the variables under scrutiny. Hence, it is probably best to first assess and possibly remediate normality violations before addressing the issue of equal variances or variance-covariance matrices (Hair et al., 1998; Tabachnick & Fidell, 2001b). If heteroscedasticity is present, this too can be remedied by means of data transformations. We address this topic next.

Data Transformations

Data transformations are mathematical procedures that can be used to modify variables that violate the statistical assumptions of normality, linearity, and homoscedasticity, or that have unusual outlier patterns (Hair et al., 1998; Tabachnick & Fidell, 2001b). First, you determine the extent to which one or more of these assumptions are violated. Then you decide whether or not the situation calls for a data transformation to correct this matter. If so, you instruct SPSS to change every value of the variable or variables you wish to transform. Once the numbers have been changed in this manner, you would then perform the statistical analysis on these changed or transformed data values.

Much of our current understanding of data transformations has been informed by the earlier seminal work of Box and Cox (1964) and Mosteller and Tukey (1977). These data transformations can be easily achieved with SPSS through its Compute procedure (see Chapter 3B).
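The SPSS Compute step amounts to creating a new transformed variable and re-screening it; a hypothetical Python equivalent of that workflow might look like this:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(6)
sessions = rng.lognormal(mean=2.0, sigma=0.8, size=300)  # positively skewed counts

transformed = np.log(sessions)       # the analysis would then use these values

print("skewness before:", skew(sessions))
print("skewness after: ", skew(transformed))
```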

A note of caution should be expressed here. Data transformations are somewhat of a “double-edged sword.” On the one hand, their use can significantly improve the precision of a multivariate analysis. At the same time, however, using a transformation can pose a formidable data interpretation problem. For example, a logarithmic transformation of a mental health consumer’s GAF score or number of mental health service sessions will produce numbers quite different from the ordinary raw values we are used to seeing and may therefore pose quite a challenge for the average journal reader to interpret properly (Tabachnick & Fidell, 2001b). Because of this apparent dialectical quandary, we wish to point out at the start of our discussion that we recommend judicious use of data transformations.

A variety of data transformations are available. In many fields of study, certain data transformations (e.g., log transformations) are well accepted because of the distribution of the dependent variables (e.g., reaction time studies in psychology or personal income studies in economics). Some of the more popular transformations are the square root, logarithm, inverse, square of X, reflect and square root, reflect and logarithm, and reflect and inverse. Table 3a.5 provides illustrations of these various transformations for some hypothetical GAF score data. The main purpose of Table 3a.5 is to remind the reader that although all these transformations were based on the same original set of five GAF scores, the resulting data values can appear quite strange at first glance. For example, a GAF score of 50 has a square root of 7.07, a log of 3.91, an inverse of .02, and so on. Journal readers familiar with the GAF measure may be quite uncertain about the meaning of group or variable means reported in terms of these transformations.

Table 3a.5 underscores concretely the potential interpretation difficulties with which researchers are faced when they attempt to discuss even simple descriptive statistics (e.g., means and standard deviations) that are based on transformed data. One way to avoid the possibility of making confusing or misleading statements pertaining to transformed data is to provide the reader with the original variable’s statistical context (e.g., minimum and maximum values or means and standard deviations reported in raw score values).

Table 3a.5   Comparison of Common Data Transformations With Hypothetical GAF Scores

        Original   Square                                     Reflect &     Reflect &    Reflect &
Case    Value      Root     Logarithm   Inverse   Square      Square Root   Logarithm    Inverse
1         1.00      1.00      0.00       1.00         1.00      10.00         2.00         .01
2         5.00      2.24      1.61        .20        25.00       9.80         1.98         .01
3        25.00      5.00      3.22        .04       625.00       8.72         1.88         .01
4        50.00      7.07      3.91        .02     2,500.00       7.14         1.71         .02
5       100.00     10.00      4.61        .01    10,000.00       1.00         0.00        1.00

Statisticians appear to be divided on their recommendations as to which transformation to use for a particular circumstance (e.g., compare Hair et al., 1998, with Tabachnick & Fidell, 2001b). Nevertheless, a basic strategy in using transformations can be outlined in which a progression (escalation) of transformation strategies is employed depending on the perceived severity of the statistical assumption violation. For example, Tabachnick and Fidell (2001b) and Mertler and Vannatta (2001) lobby for a data transformation progression from square root (to correct a moderate violation), to logarithm (for a more substantial violation), and then to inverse square root (to handle a severe violation). In addition, arc sine transformations can be profitably employed with proportional data, and squaring one variable in a nonlinear bivariate relationship can effectively alleviate a nonlinearity problem (Hair et al., 1998).
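For concreteness, the transformations in Table 3a.5 can be reproduced in a few lines of Python. Note that, by our reading of the table (not a statement from the text), its plain logarithm column appears to use the natural log, its reflected logarithm column appears to use log base 10, and the reflection constant is the maximum score plus 1:

```python
import numpy as np

gaf = np.array([1.0, 5.0, 25.0, 50.0, 100.0])  # the original values in Table 3a.5
k = gaf.max() + 1                               # reflection constant: max + 1

sqrt_t    = np.sqrt(gaf)       # square root (moderate violation)
log_t     = np.log(gaf)        # natural log (more substantial violation)
inverse_t = 1.0 / gaf          # inverse (severe violation)
square_t  = gaf ** 2           # square of X (certain nonlinear relations)

# "Reflect" versions reverse the scale first, for negatively skewed variables.
reflect_sqrt = np.sqrt(k - gaf)
reflect_log  = np.log10(k - gaf)
reflect_inv  = 1.0 / (k - gaf)
```

Running this reproduces the tabled values (e.g., sqrt_t[3] is 7.07 and reflect_sqrt[3] is 7.14 for the GAF score of 50).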

Recommended Readings

Allison, P. D. (2002). Missing data. Thousand Oaks, CA: Sage.
Barnett, V., & Lewis, T. (1978). Outliers in statistical data. New York: Wiley.
Berry, W. D. (1993). Understanding regression assumptions. Newbury Park, CA: Sage.
Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society, 26(Series B), 211–243.
Duncan, T. E., Duncan, S. C., & Li, F. (1998). A comparison of model- and multiple imputation-based approaches to longitudinal analyses with partial missingness. Structural Equation Modeling, 5, 1–21.
Enders, C. K. (2001). The impact of nonnormality on full information maximum-likelihood estimation for structural equation models with missing data. Psychological Methods, 6, 352–370.
Enders, C. K. (2001). A primer on maximum likelihood algorithms available for use with missing data. Structural Equation Modeling, 8, 128–141.
Fox, J. (1991). Regression diagnostics. Thousand Oaks, CA: Sage.
Gold, M. S., & Bentler, P. M. (2000). Treatments of missing data: A Monte Carlo comparison of RBHDI, iterative stochastic regression imputation, and expectation-maximization. Structural Equation Modeling, 7, 319–355.
Roth, P. L. (1994). Missing data: A conceptual review from applied psychologists. Personnel Psychology, 47, 537–560.
Rousseeuw, P. J., & van Zomeren, B. C. (1990). Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association, 85, 633–639.
Rubin, D. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91, 473–489.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147–177.
Stevens, J. P. (1984). Outliers and influential data points in regression analysis. Psychological Bulletin, 95, 334–344.
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
