
High-Stakes Testing in Employment, Credentialing, and Higher Education

Prospects in a Post-Affirmative-Action World

Paul R. Sackett, University of Minnesota, Twin Cities Campus
Neal Schmitt, Michigan State University
Jill E. Ellingson, The Ohio State University
Melissa B. Kabin, Michigan State University

Cognitively loaded tests of knowledge, skill, and ability often contribute to decisions regarding education, jobs, licensure, or certification. Users of such tests often face difficult choices when trying to optimize both the performance and ethnic diversity of chosen individuals. The authors describe the nature of this quandary, review research on different strategies to address it, and recommend using selection materials that assess the full range of relevant attributes using a format that minimizes verbal content as much as is consistent with the outcome one is trying to achieve. They also recommend the use of test preparation, face-valid assessments, and the consideration of relevant job or life experiences. Regardless of the strategy adopted, it is unreasonable to expect that one can maximize both the performance and ethnic diversity of selected individuals.

Cognitively loaded tests of knowledge, skill, and ability are commonly used to help make employment, academic admission, licensure, and certification decisions (D'Costa, 1993; Dwyer & Ramsey, 1995; Frierson, 1986; Mehrens, 1989). Law school applicants submit scores on the Law School Admission Test (LSAT) for consideration when making admission decisions. Upon graduation, the same individuals must pass a state-administered bar exam to receive licensure to practice. Organizations commonly rely on cognitive ability tests when making entry-level selection decisions and tests of knowledge and skill when conducting advanced-level selection. High-school seniors take the Scholastic Assessment Test (SAT) for use when determining college admissions and the distribution of scholarship funds. Testing in these settings is termed high stakes, given the central role played by such tests in determining who will and who will not gain access to employment, education, and licensure or certification (jointly referred to as credentialing) opportunities. The use of standardized tests in the knowledge, skill, ability, and achievement domains for the purpose of facilitating high-stakes decision making has a history characterized by three dominant features. First, extensive research has demonstrated that well-developed tests in these

domains are valid for their intended purpose. They are useful, albeit imperfect, descriptors of the current level of knowledge, skill, ability, or achievement. Thus, they are meaningful contributors to credentialing decisions and useful predictors of future performance in employment and academic settings (Mehrens, 1999; Neisser et al., 1996; Schmidt & Hunter, 1998; Wightman, 1997; Wilson, 1981). Second, racial group differences are repeatedly observed in scores on standardized knowledge, skill, ability, and achievement tests. In education, employment, and credentialing contexts, test score distributions consistently reveal significant mean differences by race (e.g., Bobko, Roth, & Potosky, 1999; Hartigan & Wigdor, 1989; Jensen, 1980; Lynn, 1996; Neisser et al., 1996; Scarr, 1981; Schmidt, 1988; N. Schmitt, Clause, & Pulakos, 1996; Wightman, 1997; Wilson, 1981). Blacks tend to score approximately one standard deviation lower than Whites, and Hispanics score approximately two thirds of a standard deviation lower than Whites. Asians typically score higher than Whites on measures of mathematical-quantitative ability and lower than Whites on measures of verbal ability and comprehension. These mean differences in test scores can translate into large adverse impact against protected groups when test scores are used in selection and credentialing decision making. As subgroup mean differences in test scores increase, it becomes more likely that a smaller proportion of the lower scoring subgroup will be selected or granted a credential (Sackett & Wilk, 1994).
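To make the link between a subgroup mean difference and selection rates concrete, the following sketch (not from the article; the d value, selection rate, and normality assumption are illustrative) computes the selection rates that result when a single cut score is applied to two normally distributed groups whose means differ by d = 1.0:

```python
# Illustrative sketch: how a subgroup mean difference (d) translates into
# different selection rates under a common cut score. Assumes normal score
# distributions with equal variances; all numeric values are hypothetical.
from scipy.stats import norm

d = 1.0                        # mean difference between groups, in SD units
higher_group_rate = 0.20       # suppose 20% of the higher-scoring group is selected
cut = norm.ppf(1 - higher_group_rate)    # implied cut score, in SD units above the higher group's mean
lower_group_rate = norm.sf(cut, loc=-d)  # proportion of the lower-scoring group above the same cut

print(f"higher-scoring group selected: {higher_group_rate:.2f}")
print(f"lower-scoring group selected:  {lower_group_rate:.2f}")                       # ~0.03
print(f"selection-rate ratio:          {lower_group_rate / higher_group_rate:.2f}")   # ~0.16
```

Under these assumed numbers the selection-rate ratio is roughly 0.16, far below the four-fifths (0.80) ratio commonly used as a rule of thumb for adverse impact; the disparity shrinks as the cut score is lowered or as d decreases.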

Third, the presence of subgroup differences leads to questions regarding whether the differences observed bias

Editor's note. Sheldon Zedeck served as action editor for this article.

Author's note. Paul R. Sackett, Department of Psychology, University of Minnesota, Twin Cities Campus; Neal Schmitt and Melissa B. Kabin, Department of Psychology, Michigan State University; Jill E. Ellingson, Department of Management and Human Resources, The Ohio State University. Authorship order for Paul R. Sackett and Neal Schmitt was determined by coin toss.

Correspondence concerning this article should be addressed to Paul R. Sackett, Department of Psychology, University of Minnesota, N475 Elliott Hall, 75 East River Road, Minneapolis, MN 55455. Electronic mail may be sent to [email protected].

American Psychologist, April 2001, Vol. 56, No. 4, 302-318. Copyright 2001 by the American Psychological Association, Inc. 0003-066X/01/$5.00. DOI: 10.1037/0003-066X.56.4.302


resulting decisions. An extensive body of research in both the employment and education literatures has demonstrated that these tests generally do not exhibit predictive bias. In other words, standardized tests do not underpredict the performance of minority group members (e.g., American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999; Cole, 1981; Jensen, 1980; Neisser et al., 1996; O'Connor, 1989; Sackett & Wilk, 1994; Wightman, 1997; Wilson, 1981). These features of traditional tests cause considerable

tension for many organizations and institutions of higher learning. Most value that which is gained through the use of tests valid for their intended purpose (e.g., a higher performing workforce, a higher achieving student body, a cadre of credentialed teachers who meet knowledge, skill, and achievement standards). Yet, most also value racial and ethnic diversity in the workforce or student body, with rationales ranging from a desire to mirror the composition of the community to a belief that academic experiences or workplace effectiveness are enhanced by exposure to diverse perspectives. What quickly becomes clear is that these two values, performance and diversity, come into conflict. Increasing emphasis on the use of tests in the interest of gaining enhanced performance has predictable negative consequences for the selection of Blacks and Hispanics. Conversely, decreasing emphasis on the use of tests in the interest of achieving a diverse group of selectees often results in a substantial reduction in the performance gains that can be recognized through test use (e.g., Schmidt, Mack, & Hunter, 1984; N. Schmitt et al., 1996).

This dilemma is well-known, and a variety of resolution strategies have been proposed. One class of strategies involves some form of minority group preference; these strategies were the subject of an American Psychologist

article by Sackett and Wilk (1994) that detailed the history, rationale, consequences, and legal status of such strategies. However, a variety of recent developments indicate a growing trend toward bans on preference-based forms of affirmative action. The passage of the Civil Rights Act of 1991 made it unlawful for employers to adjust test scores as a function of applicants' membership in a protected group. The U.S. Supreme Court's decisions in City of Richmond v. J. A. Croson Co. (1989) and Adarand Constructors, Inc. v. Pena (1995) to overturn set-aside programs that reserved a percentage of contract work for minority-owned businesses signaled the Court's stance toward preference-based affirmative action (Mishkin, 1996). The U.S. Fifth Circuit Court of Appeals ruled in Hopwood v. State of Texas (1996) that race could not be used as a factor in university admissions decisions (Kier & Davenport, 1997; Mishkin, 1996). In 1996, the state of California passed Proposition 209 prohibiting the use of group membership as a basis for any selection decisions made by the state, thus affecting public sector employment and California college admissions (Pear, 1996). Similarly, state of Washington voters approved Initiative 200, which bars the use of race in state hiring, contracting, and college admissions (Verhovek & Ayres, 1998).

Strategies for Achieving Diversity Without Minority Preference

In light of this legal trend toward restrictions on preference-based routes to diversity, a key question emerges: What are the prospects for achieving diversity without minority preference and without sacrificing the predictive accuracy and content relevancy present in knowledge, skill, ability, and achievement tests? Implicit in this question is the premise that one values both diversity and the performance outcomes that an organization or educational institution may realize through the use of tests. If one is willing to sacrifice quality of measurement and predictive accuracy, there are many routes to achieving diversity, including random selection, setting a low cut score, or the use of a low-impact predictor even though it may possess little to no predictive power. On the other hand, if one values performance outcomes but does not value diversity, maximizing predictive accuracy can be the sole focus. We suggest that most organizations and educational institutions espouse neither of these extreme views and instead seek a balance between diversity concerns and performance outcomes. Clearly, the use of traditional tests without race-based score adjustment fails to achieve such a balance. However, what alternatives are available for use in high-stakes, large-scale assessment contexts? In this article, we review various alternative strategies that have been put forth in the employment, education, and credentialing literatures. We note that some strategies have been examined more carefully in some domains than in others, and thus the attention we devote to employment, education, and credentialing varies across these alternatives.

The first strategy involves the measurement of constructs with little or no adverse impact along with traditional cognitively loaded knowledge, skill, ability, and


achievement measures. The notion is that if we consider other relevant constructs along with knowledge, skill, ability, and achievement measures when making high-stakes decisions, subgroup differences should be lessened because alternatives such as measures of interpersonal skills or personality usually exhibit smaller differences between ethnic and racial subgroups. A second strategy investigates test items in an effort to identify and remove those items that are culturally laden. It is generally believed that because those items likely reflect irrelevant, culture-bound factors, their removal will improve minority passing rates. The use of computer or video technology to present test stimuli and collect examinee responses constitutes a third strategy. Using these technologies usually serves to minimize the reading and writing requirements of a test. Reduction of adverse impact may be possible when the reading or writing requirements are inappropriately high. Also, video technology may permit the presentation of stimulus materials in a fashion that more closely matches the performance situation of interest. Attempts to modify how examinees approach the test-taking experience constitute a fourth strategy. To the extent that individuals of varying ethnic and racial groups exhibit different levels of test-taking motivation, attempts to enhance examinee motivation levels may reduce subgroup differences. Furthermore, changing the way in which the test and its questions are presented may affect how examinees respond, a result that could also facilitate minority test performance. A fifth strategy has been to document relevant knowledge, accomplishments, or achievements via portfolios, performance assessments, or accomplishment records. Proponents of this strategy maintain that this approach is directly relevant to desired outcomes and hence should constitute a more fair assessment of the knowledge, skill, ability, or achievement

domain of interest for members of all subgroups. Finally, we also review the use of coaching or orientation programs that provide examinees with information about the test and study materials or aids to facilitate optimal performance. In addition, we consider whether modifying the time limits prescribed for testing helps reduce subgroup differences. In the following sections, we review the literature relevant to each of these efforts in order to understand the nature of subgroup differences on knowledge, skill, ability, and achievement tests and to ascertain the degree to which these efforts have been effective in reducing these differences.

Use of Measures of Additional Relevant Constructs

Cognitively loaded knowledge, skill, ability, and achievement tests are among the most valid predictors available when selecting individuals across a wide variety of educational and employment situations (Schmidt & Hunter, 1981, 1998). Therefore, a strategy for resolving the dilemma that allows for the use of such tests is readily appealing. To that end, previous research has identified a number of noncognitive predictors that are also valid when making selection decisions in most educational and employment contexts. Measures of personality and interpersonal skills generally exhibit smaller mean differences by ethnicity and race and also are related to performance on the job or in school (e.g., Barrick & Mount, 1991; Bobko et al., 1999; Mount & Barrick, 1995; Sackett & Wilk, 1994; Salgado, 1997; N. Schmitt et al., 1996; Wolfe & Johnson, 1995). The use of valid, noncognitive predictors, in combination with cognitive predictors, serves as a very desirable strategy in that it offers the possibility of simultaneously meeting multiple objectives. If additional constructs, beyond those measured by the traditional test, are relevant for the job or educational outcomes of interest, supplementing cognitive tests offers the prospect of increased validity when predicting those outcomes. If those additional constructs are ones on which subgroup differences are smaller, a composite of the traditional test and the additional measures will often exhibit smaller subgroup differences than the traditional test alone. The prospect of simultaneously increasing validity and reducing subgroup differences makes this a strategy worthy of careful study. Several different approaches have been followed when examining this strategy. On the basis of the psychometric theory of composites, Sackett and Ellingson (1997) developed a set of implications helpful in estimating the effect of a supplemental strategy on adverse impact. First, consider a composite of two uncorrelated measures, where d1 = 1.0 and d2 = 0.0. Although intuition may suggest that a composite of the two will split the difference (i.e., result in a d of 0.5), the computed value is 0.71. Thus, whereas supplementing a cognitively loaded test with an uncorrelated measure exhibiting no subgroup differences will reduce the composite subgroup difference, this reduction will be less than some might expect. Second, a composite may result in a d larger than either of the components making up the composite if the two measures are moderately correlated
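The 0.71 figure can be reproduced from the standardized mean difference of a unit-weighted composite of two standardized measures; a minimal worked version, writing r_12 for the correlation between the two measures, is:

\[
d_{\text{composite}} \;=\; \frac{d_1 + d_2}{\sqrt{2\,(1 + r_{12})}},
\qquad\text{so}\qquad
\frac{1.0 + 0.0}{\sqrt{2\,(1 + 0)}} \;=\; \frac{1}{\sqrt{2}} \;\approx\; 0.71 .
\]

The same expression also illustrates the second implication noted above: with d_1 = d_2 = 0.5 and r_12 = 0.2, for example, the composite d is about 0.65, larger than either component alone (these particular values are illustrative, not taken from the article).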



weight of task versus contextual performance. De Corte reached similar conclusions. The Hattrup et al. (1997) and De Corte (1999) analyses were formulated in employment terms, yet the same principles hold for educational and credentialing tests as well. For example, when licensing lawyers, the licensing body is concerned about both technical competence and professional ethics. The general principle relevant across settings is that when multiple criterion dimensions are of interest, the weights given to the criterion dimensions can have important effects on the relationship between the predictors and the overall criterion. The higher the weight given to cognitively loaded criterion dimensions, the higher the resulting weight given to cognitively loaded predictors. The higher the weight given to cognitively loaded predictors, the greater the resulting subgroup difference. In response, one may be tempted to simply choose criterion dimension weights on the basis of their potential for reducing subgroup differences. Such a strategy would be errant, however, as criterion weights should be determined primarily on the basis of an analysis of the performance domain of interest and the values that an institution places on the various criterion dimensions.

Summary. Research on the strategy of measuring different relevant constructs illustrates that it does matter which individual differences are assessed in high-stakes decision making if one is concerned about maximizing minority subgroup opportunity. Minority pass rates can be improved by including noncognitive predictors in a test battery. However, adding predictors with little or no impact will not eliminate adverse impact from a battery of tests that includes cognitively loaded knowledge, skill, ability, and achievement measures. Reduction in adverse impact results from a complex interaction between the validity of

the individual predictors, their intercorrelation, the size of subgroup differences on the combination of tests used, the selection ratio, and the manner in which the tests are used. In fact, in most situations wherein a variety of knowledge, skills, and abilities are considered when making selection decisions, adverse impact will remain at legally unacceptable levels and subgroup mean differences on the predictor battery will not be a great deal lower than the differences observed for cognitive ability alone. The composition of the test battery should reflect the individual differences required to perform in the domain of interest. If institutions focus mainly or solely on task performance, then cognitive ability will likely be the most important predictor and adverse impact will be great. If, however, they focus on a broader domain that involves motivational, interpersonal, or personality dimensions as well as cognitive ability, then adverse impact may be reduced.

Identification and Removal of Culturally Biased Test Items

A second strategy pursued in an attempt to resolve the performance versus diversity dilemma involves investigating the possibility that certain types of test items are biased. The traditional focus of studies examining differential item functioning (DIF; Berk, 1982) has been on the identification of items that function differently for minority versus majority test takers. Conceivably, such items would contribute to misleading test scores for members of a particular subgroup. Statistically, DIF seeks items that vary in difficulty for members of subgroups that are actually evenly matched on the measured construct. That is, an attempt is made to identify characteristics of items that lead to poorer performance for minority-group test takers than for equally able majority-group test takers. Assuming such a subset of items or item characteristics exists, they must define a race-related construct that is distinguishable from the construct the test is intended to measure (McCauley & Mendoza, 1985). Perhaps because of the availability of larger sample sizes in large-scale testing programs, much of the DIF research conducted to date has been done using educational and credentialing tests.

Initial evidence for DIF was provided by Medley and Quirk (1974), who found relatively large group by item interactions in a study of the performance of Black and White examinees on National Teacher Examination items reflecting African American art, music, and literature. One should note, however, that these results are based on examinee performance on a test constructed using culturally specific content. Ironson and Subkoviak (1979) also found evidence for DIF when they evaluated five cognitive subtests administered as part of the National Longitudinal Study of 1972. The verbal subtest, measuring vocabulary and reading comprehension, contained the largest number of items flagged as biased. Items at the end of each of the subtests also tended to be biased against Black examinees, presumably because of lower levels of reading ability or speed in completing these tests. More recently, R. Freedle and Kostin (1990; R. Freedle & Kostin, 1997) showed that Black examinees were more likely to get difficult verbal



items on the Graduate Record Examination (GRE) and the SAT correct when compared with equally able White examinees. However, the Black examinees were less likely to get the easy items right. In explanation, they suggested that the easier items possessed multiple meanings more familiar to White examinees, whose culture was most dominant in the test items and the educational system.

Whitney and Schmitt (1997) investigated the extent to which DIF may be present in biographical data measures developed for use in an employment context. In an effort to further DIF research, Whitney and Schmitt focused not only on identifying whether biodata items may exhibit DIF, but also on whether the presence of DIF can be traced to differences in cultural values between racial subgroups. More than one fourth of the biodata items exhibited DIF between Black and White examinees. Moreover, the Black and White examinees differentially endorsed item response options designed to reflect differing cultural notions of human nature, the environment, time orientation, and interpersonal relations. However, there was only limited evidence that the differences observed between subgroups in cultural values were actually associated with DIF. After removal of all DIF items, the observed disparity in test scores between Blacks and Whites was eliminated, a disparity that, incidentally, favored Blacks over Whites.

Other studies evaluating the presence of DIF have proved less interpretable. Scheuneman (1987) developed item pairs that reflected seven major hypotheses about potential sources of item bias in the experimental portion of the GRE. The results were inconsistent in showing that these manipulations produced a more difficult test for minority versus majority examinees. In some instances, the manipulations produced a greater impact on Whites than Blacks. In other instances, a three-way interaction between group, test version, and items indicated that some uncontrolled factor (e.g., content of a passage or item) was responsible for the subgroup difference. Scheuneman and Gerritz (1990) examined verbal items from the GRE and the SAT that consisted of short passages followed by questions. Although they did identify several item features possibly linked to subgroup differences (e.g., content dealing with science, requiring that examinees identify the major thesis in a paragraph), the results, as a whole, yielded no clear-cut explanations. Scheuneman and Gerritz concluded that DIF may result from a combination of item features, the most important of which seems to be the content of the items. Similarly inconclusive results were reported by A. P. Schmitt and Dorans (1990) in a series of studies on SAT-Verbal performance. Items that involved the use of homographs (i.e., words that are spelled like other words with a different meaning) were more difficult for otherwise equally able racial and ethnic group members. Yet, when nonnative English speakers were removed from the analyses, there were few remaining DIF items.

Schmeiser and Ferguson (1978) examined the English usage and social studies reading tests of the American College Test (ACT) and found little support for DIF. Two English tests and three social studies tests were developed to contain different content, while targeting the same

cognitive skills. None of the interactions between test content and racial and ethnic group were statistically significant. Similarly, Scheuneman and Grima (1997) reported that the verbal characteristics of word problems (e.g., readability indexes, the nature of the arguments, and propositions) in the quantitative section of the GRE were not related to DIF indexes.

These results indicate that although DIF may be detected for a variety of test items, it is often the case that the magnitude of the DIF effect is very small. Furthermore, there does not appear to be a consistent pattern of items favoring one group versus another. Results do not indicate that removing these items would have a large impact on overall test scores. In addition, we know little about how DIF item removal will affect test validity. However, certain themes across these studies suggest the potential for some DIF considerations. Familiarity with the content of items appears to be important. The verbal complexity of the items is also implicated, yet it is not clear what constitutes verbal complexity. Differences in culture are often cited as important determinants of DIF, but beyond the influence of having English as one's primary language, we know little about how cultural differences play a role in test item performance.

Use of Alternate Modes of Presenting Test Stimuli

A third strategy to reduce subgroup differences in tests of knowledge, skill, ability, and achievement has been to change the mode in which test items or stimulus materials are presented. Most often this involves using video or auditory presentation of test items, as opposed to presenting test items in the normal paper-and-pencil mode. Implicit in this strategy is the assumption that reducing irrelevant written or verbal requirements will reduce subgroup differences. In support of this premise, Sacco et al. (2000) demonstrated the relationship between reading level and subgroup differences by assessing the degree to which the readability level of situational judgment tests (SJTs) was correlated with the size of subgroup differences in SJT performance. They estimated that 10th-, 12th-, and 14th-grade reading levels would be associated with Black-White ds of 0.51, 0.62, and 0.74, respectively; reading levels at the 8th, 10th, and 13th grades would be associated with Hispanic-White ds of 0.38, 0.48, and 0.58, respectively. This would suggest that reducing the readability level of a test should in turn reduce subgroup differences. One must be careful, however, to remove only verbal requirements irrelevant to the criterion of interest, as Sacco et al. also demonstrated that verbal ability may partially account for SJT validities. In reviewing the research on this strategy, we focused on three key studies investigating video as an alternative format to a traditional test. The three selected studies illustrate issues central in examining the effects of format changes on subgroup differences. For other useful studies that also touch on this issue, see Weekley and Jones (1997) and N. Schmitt and Mills (in press).

Pulakos and Schmitt (1996) examined three measures of verbal ability. A paper-and-pencil measure testing verbal


analogies, vocabulary, and reading comprehension had a validity of .19 when predicting job performance and a Black-White d = 1.03. A measure that required examinees to evaluate written material and write a persuasive essay on that material had a validity of .22 and a d = 0.91. The third measure of verbal ability required examinees to draft a description of what transpired in a short video. The validity of this measure was .19 with d = 0.45. All three measures had comparable reliabilities (i.e., .85 to .92). Comparing the three measures, there was some reduction in d when written materials involving a realistic reproduction of tasks required on the job were used rather than a multiple-choice paper-and-pencil test. When the stimulus material was visual rather than written, there was a much greater reduction in subgroup differences (i.e., Black-White d values dropped from 1.03 to 0.45, with parallel findings reported for Hispanics). In terms of validity, the traditional verbal ability test was the most predictive (r = .39), whereas the video test was less so (r = .29, corrected for range restriction and criterion unreliability).

Chan and Schmitt (1997) evaluated subgroup differences in performance on a video-based SJT and on a written SJT identical to the script used to produce the video enactment. The written version of the test displayed a subgroup difference of 0.95 favoring Whites over Blacks, whereas the video-based version of the test produced a subgroup difference of only 0.21. Corrections for unreliability produced differences of 1.19 and 0.28. These differences in d were matched by subgroup differences in perceptions of the two tests. Both groups were more favorably disposed (as indicated by perceptions of the tests' face validity) to the video version of the test, but Blacks significantly more so than Whites.

Sackett (1998) summarized research on the use of the Multistate Bar Examination (MBE), a multiple-choice test of legal knowledge and reasoning, and research conducted by Klein (1983) examining a video-based alternative to the MBE. The video test presented vignettes of lawyers taking action in various settings. After each vignette, examinees were asked to evaluate the actions taken. Millman, Mehrens, and Sackett (1993) reported a Black-White d of 0.89 for the MBE; an identical value (d = 0.89) was estimated by Sackett for the video test.

Taken at face value, the Sackett (1998), Chan and Schmitt (1997), and Pulakos and Schmitt (1996) results appear contradictory regarding the impact of changing from a paper-and-pencil format to a video format. These contradictions are more apparent than real. First, it is important to consider the nature of the focal construct. In the Pulakos and Schmitt study of verbal ability and the Sackett study of legal knowledge and reasoning, the focal construct was clearly cognitive in nature, falling squarely into the domain of traditional knowledge, skill, ability, and achievement tests. In Chan and Schmitt, however, the focal construct (i.e., interpersonal skill) was, as the name implies, not heavily cognitively loaded. In fact, the SJT used was of the type often suggested as a potential additional measure that might supplement a traditional test. Chan and Schmitt provided data supporting the hypothesis that the relatively

large drop in d was due in part to the removal of an implicit reading comprehension component present in the written version of the SJT. When that component was removed through video presentation, the measure became less cognitively loaded and d was reduced.

But what of the differences between the verbal skill (Pulakos & Schmitt, 1996) and legal knowledge and reasoning studies (Sackett, 1998)? We use this comparison to highlight the importance of examining the correlation between the traditional written test and the alternative video test. In the legal knowledge and reasoning study, in which the video did not result in reduced subgroup differences, the correlation between the traditional test and the video test was .89, corrected for unreliability. This suggests that the change from paper-and-pencil to video was essentially a format change only, with the two tests measuring the same constructs. In the verbal skills study, in which the video resulted in markedly smaller subgroup differences, the correlation between the two, corrected for unreliability in the two measures, was .31. This suggests that the video-based test in the verbal skills study reflected not only a change in format, but a change in the constructs measured as well. An examination of the scoring procedures for the verbal skills video test supports this conclusion. Examinee essays describing video content were rated on features of verbal ability (e.g., sentence structure, spelling) and on completeness of details reported. We suggest that scoring for completeness introduced into the construct domain personality characteristics such as conscientiousness and detail orientation, both traits exhibiting smaller subgroup differences. Consistent with the arguments made above with regard to the ability of composites to reduce subgroup differences, the reduction in d observed for the verbal skills video test was due not to the change in format, but rather to the introduction of additional constructs that ameliorated the influence of verbal ability when determining d. Thus, the research to date indicates that changing to a video format does not per se lead to a reduction in subgroup differences. Future research into the influence of alternative modes of testing should take steps to control for the unintended introduction of additional constructs beyond those being evaluated. Failure to separate test content from test mode will confound results, blurring our ability to understand the actual mechanism responsible for reducing subgroup differences.

Last, we wish to highlight the role of reliability when comparing traditional tests with alternatives. Focusing on the legal knowledge and reasoning study (Sackett, 1998), recall that the Black-White difference was identical (d = 0.89) for both tests. We now add reliability data to our discussion. Internal consistency reliability for the MBE was .91; correcting the subgroup difference for unreliability results in a corrected value of d = 0.93. Internal consistency reliability for the video test was .64; correcting the subgroup difference for unreliability results in a corrected value of d = 1.11. In other words, after taking differences in reliability into account, the alternative video test results in a larger subgroup difference than the traditional paper-and-pencil test. Such an increase is possible in



that a traditional test may tap declarative knowledge, whereas a video alternative may require the application of that knowledge, an activity that will likely draw on higher order cognitive skills, resulting in higher levels of d. Clearly, any conclusion about the effects of changing test format on subgroup differences must take into account reliability of measurement. What appears to be a format effect may simply be a reliability effect. Because different reliability estimation methods focus on different sources of measurement error (e.g., inconsistency across scorers, content sampling), taking reliability into account will require considering the most likely sources of measurement error in a particular setting.
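The corrected values quoted above are consistent with the standard correction of a standardized mean difference for unreliability of the measure (a worked restatement of those figures, not an additional result from the article):

\[
d_{\text{corrected}} \;=\; \frac{d_{\text{observed}}}{\sqrt{r_{xx}}},
\qquad
\frac{0.89}{\sqrt{.91}} \approx 0.93 \;\;\text{(MBE)},
\qquad
\frac{0.89}{\sqrt{.64}} \approx 1.11 \;\;\text{(video test)} .
\]

The Chan and Schmitt (1997) corrections reported earlier (0.95 to 1.19 and 0.21 to 0.28) appear to follow the same formula.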

The results reported in this section document the large differences in d that are typically observed when comparisons are made between different test formats. Examinations of reasons for these differences are less conclusive. It is not clear that the lower values of d are a function of test format. Cognitive ability requirements and the reading requirements of these tests seem to be a major explanation. Clearly more research would be helpful. Separating the method of assessment from the content measured is a major challenge when designing studies to evaluate the impact of test format. However, it is a challenge that must be surmounted in order to understand whether group differences can be reduced by altering test format. Future research should also consider investigating how a change in format influences validity. Only one of the three studies reviewed here evaluated validity effects; clearly, that issue warrants additional attention.

Use of Motivation and Instructional Sets

A fourth strategy places the focus not on the test itself, but on the mental state adopted by examinees and its role in determining test performance. An individual's motivation to complete a test has the potential to influence test performance. To the extent that there are racial group differences in test-taking motivation, energizing an individual to persevere when completing a test may reduce subgroup differences. At a more general level, a test taker's attitude and approach toward testing may also influence test performance. Such test-taking impressions could be partially determined by the instructional sets provided when completing tests and the context in which actual test questions are derived. Manipulating instructional sets and item contexts has the potential to alter observed group differences if these aspects of the testing experience allow individuals of varying racial groups to draw on culture-specific cognitive processes.

A number of studies have demonstrated racial group differences in test-taking motivation. O'Neil and Brown (1997) found that eighth-grade Black students reported exerting the least amount of effort when completing a math exam. Hispanic students reported exerting slightly more, whereas White students reported exerting the most effort when completing the exam. Chan, Schmitt, DeShon, Clause, and Delbridge (1997) demonstrated that the relationship between race and test performance was partially mediated by test-taking motivation, although the mediating

effect accounted for a very small portion of the variance in test performance. In a follow-up study, Chan, Schmitt, Sacco, and DeShon (1998) found that pretest reactions affected test performance and mediated the relationship between belief in tests and test performance. Their subgroup samples were not large, but motivational effects operated similarly for Whites and Blacks.

As a test of the influence of item context, a unique study by DeShon, Smith, Chan, and Schmitt (1998) investigated whether presenting test questions in a certain way would reduce subgroup differences. They tested the hypothesis proposed by Helms (1992) that cognitive ability tests fail to adequately assess Black intelligence because they do not account for the emphasis in Black culture on social relations and social context, an observation offered at a more general level by others as well (e.g., Miller-Jones, 1989; O'Connor, 1989). Contrary to Helms's argument, racial subgroup performance differences on a set of Wason conditional reasoning problems were not reduced by presenting the problems in a social relationship form.

Another working hypothesis is that the mere knowledge of cultural stereotypes may affect test performance. In other words, making salient to test takers their ethnic and racial or their gender identity may alter both women's and minorities' test-taking motivation, self-concept, effort level, and expectation of successful performance. Steele and colleagues (Steele, 1997; Steele & Aronson, 1995) proposed a provocative theory of stereotype threat that suggests that the way in which a test is presented to examinees can affect examinee performance. The theory hypothesizes that when a person enters a situation wherein a stereotype of the group to which that person belongs becomes salient, concerns about being judged according to that stereotype arise and inhibit performance. When members of racial minority groups encounter high-stakes tests, their awareness of commonly reported group differences leads to concerns that they may do poorly on the test and thus confirm the stereotype. This concern detracts from their ability to focus all of their attention on the test, resulting in poorer test performance. Steele hypothesized a similar effect for gender in the domain of mathematics. A boundary condition for the theory is that individuals must identify with the domain in question. If the domain is not relevant to the individual's self-image, the testing situation will not elicit stereotype threat.

Steele and Aronson (1995) found support for the theory in a series of laboratory experiments. The basic paradigm used was to induce stereotype threat in a sample of high-achieving majority and minority students statistically equated in terms of their prior performance on the SAT. One mechanism for inducing threat is via instructional set. In the stereotype threat condition, participants were told that they would be given a test of intelligence; in the nonthreat condition, they were told they would be given a problem-solving task. In fact, all of the participants received the same test. Steele and Aronson found a larger majority-minority difference in the threat condition than in the nonthreat condition, a finding supportive of the idea



that the presence of stereotype threat inhibits minority group performance.

These findings are well replicated (Steele, 1997) but commonly misinterpreted. For example, in the fall of 1999, the PBS show "Frontline" broadcast a one-hour special entitled "Secrets of the SAT," in which Steele's research was featured. The program's narrator noted the large Black-White gap on standardized tests, described the stereotype threat manipulation, and concluded, "Blacks who believed the test was merely a research tool did the same as Whites. But Blacks who believed the test measured their abilities did half as well." The critical fact excluded was that whereas a large score gap exists in the population in general, Steele studied samples of Black and White students who had been statistically equated on the basis of SAT scores. Thus, rather than eliminating the large score gap, the research actually showed something very different. Absent stereotype threat, the Black-White difference was just what one would expect (i.e., zero), as the two groups had been equated on the basis of SAT scores. However, in the presence of stereotype threat, the Black-White difference was larger than would be expected, given that the two groups were equated.

There are a variety of additional issues that cloud interpretation and application of Steele's (1997) findings. One critical issue is whether the SAT scores used to equate the Black and White students are themselves influenced by stereotype threat, thus confounding interpretation of study findings. A second issue involves questions as to the populations to which these findings generalize (e.g., Whaley, 1998). The work of Steele and coworkers focused on high-ability college students; Steele (1999) noted that the effect is not replicable in the broader population. A third issue is the conflict between a stereotype threat effect and the large literature cited earlier indicating a lack of predictive bias in test use. If stereotype threat results in observed scores for minority group members that are systematically lower than true scores, one would expect underprediction of minority group performance, an expectation not supported in the predictive bias literature. An additional pragmatic issue is the question of how one might reduce stereotype threat in high-stakes testing settings when the purpose of testing is clear.

These issues aside, Steele's (1997, 1999) research is important in that it clearly demonstrates that the instructional set under which examinees approach a test can affect test results. However, research has yet to demonstrate whether and to what degree this effect generalizes beyond the laboratory. Thus, we caution against overinterpreting the findings to date, as they do not warrant the conclusion that subgroup differences can be explained in whole or in large part by stereotype threat.

The research on test-taker motivation and instructional sets has been conducted primarily in laboratory settings. The effects observed on subgroup differences are not large. Future research should attempt to replicate these findings in a field context so we may better understand the extent to which group differences can be reduced using this alternative. Given the relatively small effects obtained in

controlled environments, it seems doubtful that motivational and social effects will account for much of the subgroup differences observed. Nonetheless, it may make sense for test users to institute mechanisms for enhancing motivation, such as the use of more realistic test stimuli clearly applicable to school or job requirements, for the purpose of motivating all examinees.

Use of Portfolios, Accomplishment Records, and Performance Assessments

Researchers have experimented with methods that directly measure an individual's ability to perform aspects of the job or educational domain of interest as a fifth alternative to using paper-and-pencil measures of knowledge, skill, ability, and achievement. Portfolios, accomplishment records, and performance assessments have each been investigated as potential alternatives to traditional tests. Performance assessments (sometimes referred to in the employment domain as job or work samples) require an examinee to complete a set of tasks that sample the performance domain of interest. The intent is to obtain and then evaluate a realistic behavior sample in an environment that closely simulates the work or educational setting in question. Performance assessments may be comprehensive and broad-based, designed to obtain a wide-ranging behavior sample reflecting many aspects of the performance domain in question, or narrow, with the intent of sampling a single aspect of the domain in question. Accomplishment records and portfolios differ from performance assessments in that they require examinees to recount past endeavors or produce work products illustrative of an examinee's ability to perform across a variety of contexts. Often examinees provide examples demonstrative of their progress toward skill mastery and knowledge acquisition.

Performance assessments, as a potential solution for resolving subgroup differences, were examined in the employment domain by Schmidt, Greenthal, Hunter, Berner, and Seaton (1977), who reported that performance assessments corresponded to substantially smaller Black-White subgroup differences when compared with a written trades test (d = 0.81 vs. 1.44). N. Schmitt et al. (1996) updated this estimate to d = 0.38 on the basis of a meta-analytic review of the literature, although they combined tests of job knowledge and job samples in their review. The use of performance assessments in the context of reducing subgroup differences has been extended to multiple high-stakes situations in the credentialing, educational, and employment arena. We outline three such efforts here.

Legal skills assessment center. Klein and Bolus (1982; described in Sackett, 1998) examined an assessment center developed as a potential alternative to the traditional bar examination. Each day of the two-day center involved a separate trial, with a candidate representing the plaintiff on one day and the defendant on the second. The center consisted of 11 exercises, such as conducting a client interview, delivering an opening argument, conducting a cross-examination, and preparing a settlement plan. Exercises were scored by trained attorneys. Sackett reported a Black-White d = 0.76 and an internal



consistency reliability estimate of .67, resulting in a d corrected for unreliability of 0.93.

Accomplished teacher assessment. Jaeger (1996a, 1996b) examined a complex performance assessment process developed to identify and certify highly accomplished teachers under the auspices of the National Board for Professional Teaching Standards. Different assessment packages are developed for different teaching specialty areas. Jaeger examined assessments for Early Childhood Generalists and Middle Childhood Generalists. The assessment process required candidates to complete an assessment center and prepare in advance a portfolio that included videotaped samples of their performance. From a frequency count of scores, we computed Black-White


raters. The differences between subgroups were almost identical across the two types of tests. Mean scores for White students were about one standard deviation higher than those of Hispanic and Black students. Furthermore, changing test type or question type had no effect on the score differences between the groups.

Reviews conducted in the employment domain suggest that performance assessments are among the most valid predictors of performance (Asher & Sciarrino, 1974; Hunter & Hunter, 1984; Robertson & Kandola, 1982; Schmidt & Hunter, 1998; N. Schmitt, Gooding, Noe, & Kirsch, 1984; Smith, 1991). In addition, examinees, particularly minority individuals, have reported more favorable impressions of performance assessments than of more traditional cognitive ability or achievement tests (Schmidt et al., 1977). Given the motivational implications associated with positive applicant reactions, the use of performance assessments wherein test content and format replicate the performance domain as closely as possible may be advantageous regardless of the extent to which subgroup differences are reduced. However, performance assessments tend to be costly to develop, administer, and score reliably. Work stations can be expensive to design, and assessment center exercises can be expensive to deliver. Stecher and Klein (1997) indicated that it is often difficult and expensive to achieve reliable scoring of performance assessments in a large-scale testing context. Furthermore, obtaining a performance assessment that is both reliable and generalizable requires that examinees complete a number of tasks, a requirement that can triple the amount of testing time necessary compared with traditional tests (Dunbar, Koretz, & Hoover, 1991; Linn, 1993).
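The tradeoff between the number of tasks and score reliability can be made concrete with the Spearman-Brown prophecy formula (an illustrative restatement, not a calculation reported in the studies cited above): lengthening an assessment by a factor k raises its reliability from r to

\[
r_{k} \;=\; \frac{k\,r}{1 + (k - 1)\,r},
\qquad\text{e.g.,}\qquad
r = .50,\; k = 3 \;\Rightarrow\; r_{3} = \frac{3(.50)}{1 + 2(.50)} = .75 .
\]

Under these hypothetical numbers, tripling the number of comparable tasks moves a single-task reliability of .50 only to .75, one way to see why reliable, generalizable performance assessments demand so much more testing time than traditional tests.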

The accomplishment record (Hough, Keyes, & Dunnette, 1983) was developed in part to surmount some of the development and administration cost issues characteristic of performance assessments. Accomplishment records ask examinees to describe major past accomplishments that are illustrative of competence on multiple performance dimensions. These accomplishments are then scored using behaviorally defined scales. Accomplishment records can be used to assess competence in a variety of work and nonwork contexts. We do note that a very similar approach was developed by Schmidt et al. (1979) under the label of the "behavioral consistency method"; a meta-analysis by McDaniel, Schmidt, and Hunter (1988) reported useful levels of predictive validity across 15 studies using this approach.

Hough et al. (1983) used accomplishment records to evaluate attorneys and validated these instruments against performance ratings. The accomplishment records were scored with a high degree of interrater reliability (.75 to .85 across the different performance dimensions and the total score). These scores were then correlated with attorney experience (average r = .24) in order to partial out experience from the relationship between accomplishment record scores and performance ratings. These partialed validity coefficients ranged from .17 to .25 across the dimensions. Validities for a small group of minority attorneys were larger than those for the majority group. Hough (1984), describing the same data, reported Black-White

subgroup differences of d = 0.33 for the accomplishment records. The performance ratings exhibited almost exactly the same difference (i.e., d = 0.35). It is interesting that the accomplishment records correlated near zero with the LSAT, scores on the bar exam, and grades in law school. These more traditional measures of ability would most likely have exhibited greater d when compared with the accomplishment records, although Hough did not present the relevant subgroup means and standard deviations. Because the accomplishment records likely measured constructs in addition to ability (e.g., motivation and personality), it is perhaps not surprising that d was lower than that found for more traditional cognitively oriented tests.

Similar to accomplishment records, portfolios represent examinees' past achievements through a collection of work samples indicative of one's progress and ability. Although it is applicable to adult assessment, much of the research involving portfolios has taken place in schools. LeMahieu, Gitomer, and Eresh (1995) reported on a project in the Pittsburgh schools in which portfolios were used to assess students' writing ability in Grades 6-12. Their experience indicated that portfolios could be rated with substantial interrater agreement. They also reported that Black examinees' scores were significantly lower than those of White examinees, but did not report subgroup means, precluding the estimation of an effect size. Supovitz and Brennan (1997) reported on an analysis of writing portfolios assembled by first and second graders in the Rochester, New York, schools. Scores on two standardized tests were compared with scores based on their portfolios. Interrater reliability of the scoring of the language arts portfolios was .73 and .78 for first and second graders, respectively, whereas the reliability of the two standardized tests was .92 and .91. Differences between Black and White students were about twice as large on the standardized tests as they were on the writing samples. On both tests, the differences between subgroups were smaller (0.25 to 0.50 in standard deviation units depending on the test type) than is usually reported.

Although accomplishment records and portfolios likely have lower development costs when compared with performance assessments, the cost of scoring, especially if multiple raters are used, may be high. Another issue present with accomplishment records specifically is the reliance on self-report, although an attempt to secure verification of the role of the examinee in each accomplishment may diminish the tendency to overreport or enhance one's role. There may also be differing levels of opportunity to engage in activities appropriate for portfolios or accomplishment records. To the extent that examinees feel that they do not have the resources available to assemble these extensive documents, they may find the experience demotivating and frustrating. This concern is important inasmuch as there is some evidence (Ryan, Ployhart, Greguras, & Schmit, 1998; Schmit & Ryan, 1997) that a greater proportion of minority than majority individuals withdraw during the various hurdles in a selection system.

Summary. Use of more realistic or authentic assessments does not eliminate or even diminish subgroup


differences in many of the educational studies. Also, all of the studies report that the reliable scoring of these tests is difficult and expensive to achieve in any large-scale testing application. Problems with the standardization of the material placed in portfolios and the directions and opportunities afforded students are also cited in studies of student reactions to the use of these tests (Dutt-Doner & Gilman, 1998) as well as by professionals. Accomplishment records and job samples used in employment contexts show smaller subgroup differences in some studies than do cognitively loaded tests. The attribution that these smaller subgroup differences are due to test type is probably unwarranted, as scores on most job samples and accomplishment records most likely reflect a mix of constructs that go beyond those measured by traditional knowledge, skill, ability, and achievement tests.

Use of Coaching or Orientation Programs

Another strategy for reducing subgroup differences is the use of coaching or orientation programs. The purpose of these programs is to inform examinees about test content, provide study materials, and recommend test-taking strategies, with the ultimate goal of enabling optimal examinee performance. The term coaching is at times used to refer to both orientation programs that focus on general test-taking strategies and programs featuring intensive drill on sample test items. We use the term orientation programs to refer to short-duration programs, dealing with broad test-taking strategies, that introduce examinees to the types of items they will encounter. We use the term coaching to refer to more extensive programs, commonly involving practice and feedback, in addition to the material included in orientation programs. A review by Sackett, Burris, and Ryan (1989) indicated that coaching programs involving drill and practice do show evidence of modest score gains above those expected due simply to retesting. Although there is little literature on the differential effectiveness of coaching and orientation programs by subgroup, a plausible hypothesis is that subgroups differ in their familiarity with test content and test-taking skills. This difference in familiarity may contribute to observed subgroup differences in test scores. Conceivably, coaching or orientation programs would reduce error variance in test scores due to test anxiety, unfamiliar test formats, and poor test-taking skills (Frierson, 1986; Ryan et al., 1998), which would in turn reduce the extent of subgroup differences. However, there is evidence suggesting the presence of a larger coaching effect for individuals with higher precoaching test scores, a finding that argues against the likelihood that coaching will narrow the gap between a lower scoring subgroup and a higher scoring subgroup. With that caveat, we discuss below the coaching and orientation literature investigating the influence of this strategy on subgroup differences.

Ryan et al. (1998) studied an optional orientation program that familiarized firefighter job applicants with test format and types of test questions. The findings indicated that Blacks, women, and more anxious examinees were more likely to attend the orientation sessions, but attending the orientation program was unrelated to test performance or motivation. Ryer, Schmidt, and Schmitt (1999) studied a mandatory orientation program for entry-level jobs in a manufacturing organization at two locations, with each location having a control group and a test orientation group. The results showed a small positive impact of orientation on the test scores of minority examinees, approximately 0.15 in standard deviation units, and applicants indicated that they view organizations that provide these programs favorably. However, the orientation program had greater benefits for nonminority members than minority members at one of the two locations. Schmit (1994) studied a voluntary orientation program for police officers that consisted of a review of test items and content, recommendations for test-taking strategies, practice on sample test items, and suggestions on material to study. Attendance at the program was unrelated to race, and although everyone who attended the program scored higher on the examination than did nonattenders, Black gains were twice as large as those of Whites. No standard deviation was provided for the test performance variable, so d could not be estimated.
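As a reference point for the effect sizes reported in this section, d denotes the standardized mean difference between two groups. A common pooled-standard-deviation form is shown below; the exact estimator varies across the studies reviewed, so this is offered only as an illustration rather than the formula used in any particular study. It also makes clear why the missing standard deviation in Schmit (1994) prevented d from being estimated.

    d = \frac{\bar{X}_{1} - \bar{X}_{2}}{s_{\text{pooled}}},
    \qquad
    s_{\text{pooled}} = \sqrt{\frac{(n_{1} - 1)s_{1}^{2} + (n_{2} - 1)s_{2}^{2}}{n_{1} + n_{2} - 2}}

On this metric, the 0.15 gain reported by Ryer et al. (1999), for example, corresponds to a shift of about 15% of a within-group standard deviation.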

The educational literature includes a relatively large number of efforts to evaluate coaching initiatives. At least three reviews of this literature have been conducted. Messick and Jungeblut (1981) reported that the average difference between coached examinees and noncoached examinees taking the SAT was about 0.15 standard deviation units. The length of the coaching program and the amount of score gains realized were positively correlated. Messick and Jungeblut estimated that a gain of close to 0.25 standard deviation units could be achieved with a program that would approach regular schooling. DerSimonian and Laird (1983) reported an average effect size of 0.10 standard deviation units for coaching programs directed at the SAT, an aptitude test. In an analysis of coaching programs directed at achievement tests, Bangert-Downs, Kulik, and Kulik (1983) reported gains of about 0.25 standard deviation units as a function of coaching. Thus, the effects of coaching on performance on traditional paper-and-pencil tests of aptitude and achievement appear to be small but replicable.

Frierson (1986) outlined results from a series of four studies investigating the effects of test-taking interventions designed to enhance minority examinee test performance on various standardized medical examinations (e.g., Medical College Admissions Test [MCAT], Nursing State Board Examination). The programs taught examinees test-taking strategies and facilitated the formation of learning-support groups. Those minorities who experienced the interventions showed increased test scores. However, the samples used in these studies included very few White examinees, making it difficult to discern whether coaching produced a differential effect on test scores in favor of minorities. Powers (1987) reexamined data from a study on the effects of test preparation involving practice, feedback on results, and test-taking strategies using the initial version of the GRE analytical ability test (Powers & Swinton, 1982, 1984). The findings indicated that when supplied with the same test preparation materials, no particular


measures cognitively loaded constructs. If such differences are not observed, the reduction can often be traced to an alternative that exhibits low levels of reliability or introduces noncognitive constructs. Indeed, the most definitive conclusion one can reach from this review is that adverse impact is unlikely to be eliminated as long as one assesses domain-relevant constructs that are cognitively loaded. This conclusion is no surprise to anyone who has read the literature in this area over the past three or more decades. Subgroup differences on cognitively loaded tests of knowledge, skill, ability, and achievement simply document persistent inequities. Complicating matters further, attempts to overcome issues associated with reliable measurement often result in a testing procedure that is cost-prohibitive when conducted on a large scale. In spite of these findings, there are a number of actions that can be taken by employers, academic admissions officers, or other decision makers who are faced with the conflict between diversity goals and a demand that only those who are most able should be given desirable educational and employment opportunities. Although elimination of subgroup differences via the methods reviewed in this article is not feasible, reduction in subgroup differences, if it can be achieved without loss of validity, would be of considerable value.
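The role of reliability noted above follows from a standard psychometric attenuation result. Under classical test theory, and assuming measurement error of similar magnitude in both groups, error leaves the raw mean difference intact but inflates the observed standard deviation, so the observed standardized difference shrinks by the square root of the reliability:

    d_{\text{observed}} = d_{\text{true}} \sqrt{r_{xx}}

As a purely hypothetical illustration, an alternative assessment scored with reliability .50 would show an observed d of roughly 0.71 even if the true difference on the underlying construct were a full standard deviation; such an apparent reduction would be an artifact of unreliable scoring rather than evidence of a smaller true difference.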

First, in constructing test batteries, the full range of performance goals and organizational interests should be considered. In the employment arena, researchers have tended to focus on measures of maximum performance (i.e., ability), rather than on measures of typical performance (perhaps most related to motivational factors), when considering what knowledge, skills, and abilities to measure. These maximum performance constructs were easy to measure using highly reliable and valid instruments. With the recent literature espousing the value of personality, improvements in interviews, and better methods for documenting job-related experience, valid methods for measuring less cognitively oriented constructs are becoming available. When these constructs are included in test batteries, there is often less adverse impact. We must also emphasize the importance of clearly identifying the performance construct one is hoping to predict. The weighting of different aspects of performance and organizational goals should determine the nature of the constructs measured in a high-stakes testing situation. It is important to measure what is relevant, not what is convenient, easy, or cheap.
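The test battery point can be made concrete with a simple calculation. For a unit-weighted composite of standardized predictors with a common intercorrelation and equal within-group variances (a simplification treated more fully by Sackett & Ellingson, 1997), the composite difference follows directly from the component ds. The sketch below is illustrative only; the function name and the values of d and the intercorrelation are hypothetical rather than estimates from any study reviewed here.

    # Illustrative sketch: subgroup difference on a unit-weighted composite
    # of standardized predictors that share a common intercorrelation r_bar.
    import math

    def composite_d(ds, r_bar):
        # Mean difference on the sum of the predictors, divided by the
        # standard deviation of that sum (classical composite-variance result).
        k = len(ds)
        mean_difference = sum(ds)
        composite_variance = k + k * (k - 1) * r_bar
        return mean_difference / math.sqrt(composite_variance)

    # Hypothetical values: a cognitive measure with d = 1.0 combined with a
    # noncognitive measure with d = 0.2, intercorrelated about .20.
    print(round(composite_d([1.0, 0.2], 0.20), 2))  # prints 0.77

Adding a relevant predictor with a smaller subgroup difference lowers the composite d (here from 1.0 to roughly 0.77) but does not eliminate it, which is consistent with the broader conclusion of this review.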

Second, research on the identification and removal of items that may be unfairly biased against one group or another does not indicate that any practically significant reductions in d can be achieved in this fashion. Studies of DIF are characterized by small effects, with items not consistently favoring one group versus another. The effects of removing biased items on overall test characteristics are usually minimal. It does seem apparent that one should write items as simply as possible, consistent with the construct one is hoping to measure, and that content that is obviously cultural should be removed.

Research on the mode of presenting test stimuli suggests that video-based procedures, which broaden the range of constructs assessed, or reducing the verbal component (or reading level) of tests may have a positive effect on subgroup differences, although d is often still large enough to produce adverse impact, particularly when the selection ratio is low. Results are not consistent across studies, and clearly more research would be helpful. Such studies are particularly difficult to conduct: separation of the mode of testing from the construct tested is a challenge, and conflicting results across studies may be due to an inability to differentiate between constructs and methods. With improvements in technology, alternatives to traditional paper-and-pencil tests are clearly feasible and worthy of exploration. It is also important to note that verbal ability may be a skill related to important outcomes and hence considered a desirable component of test performance. In these cases, it would be best to include a measure that specifically assesses verbal ability so that one may remove its influence when measuring other job-related constructs. The entire test battery can then be constructed to reflect an appropriate weighting and combination of relevant attributes given the relative importance of the various performance outcomes.
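The interplay between d and the selection ratio can also be made concrete. Assuming normally distributed scores with equal variances in both groups and a single top-down cutoff, each group's selection rate, and therefore the ratio of selection rates often compared against a four-fifths rule of thumb, follows directly from d and the cutoff. The sketch below is a back-of-the-envelope illustration under those assumptions; the function name and the numerical inputs are hypothetical.

    # Illustrative sketch: selection rates for two groups whose score
    # distributions differ by d, given a single top-down cutoff.
    from statistics import NormalDist

    def selection_rates(d, majority_rate):
        # Cutoff (in majority-group standard deviation units) that selects
        # the requested proportion of the majority group; the minority
        # distribution is assumed normal with a mean d units lower.
        z = NormalDist()
        cutoff = z.inv_cdf(1 - majority_rate)
        minority_rate = 1 - z.cdf(cutoff + d)
        return majority_rate, minority_rate, minority_rate / majority_rate

    # Hypothetical example: even a reduced difference of d = 0.5 produces a
    # selection-rate ratio near .37 when only 10% of majority applicants pass.
    print(selection_rates(0.5, 0.10))

With a more lenient cutoff (a 50% majority selection rate), the same d of 0.5 yields a minority selection rate of about 31% and a ratio of about .62; the impact is still present but far less severe, which is the sense in which low selection ratios exacerbate adverse impact.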

Whenever possible, it seems desirable to measure experiences that reflect the knowledge, skills, and abilities required in the target situation. Accomplishment records reduced differences relative to the usual d obtained with cognitively loaded tests in a manner that was practically important as well. This is very likely because additional constructs are targeted and assessed in the accomplishment record. Results for portfolio and performance assessments in the educational arena have been mixed. Some studies indicate lower levels of d, whereas other studies indicate no difference or even greater differences on portfolio or performance assessments when compared with the typical multiple-choice measure of achievement. Differences in the level of d across studies may be due partly to the degree to which test scores are a function of ability and motivation. If scores are partly a function of motivation, we would expect d to be smaller. Again, if relatively complex and realistic performance assessments involve cognitive skills as opposed to interpersonal skills, the level of d will likely be the same as that of a traditional cognitively loaded measure. In addition, problems in attaining reliable scores at reasonable expense call into question the feasibility of this strategy.

It seems reasonable to recommend that some form of test preparation or orientation course be provided to examinees. The effects of coaching appear to be minimally positive over all groups, even though coaching does not seem to reduce d. Reactions to test preparation and coaching efforts among job applicants have been universally positive. Insofar as some candidates do not have access to informal networks that provide information on the nature of exams, these programs could serve to place all examinees on the same playing field. At the very least, it would seem that such positive reactions would lead to fewer complaints about the test and probably less litigation, although we have little research documenting the relationship between reactions and organizational outcomes.


Finally, we recommend that test constructors pay attention to face validity. When tests look appropriate for the performance situation in which examinees will be expected to perform, examinees tend to react positively. Such positive reactions seem to produce a small reduction in the size of d. Equally important, perhaps, may be the perception that one is fairly treated. This is the same rationale underlying our recommendation that test preparation programs be used.

In sum, subgroup differences can be expected on cognitively loaded tests of knowledge, skill, ability, and achievement. We can, however, take some useful actions to reduce such differences and to create the perception that one's attributes are being fairly and appropriately assessed. We note that in this article we have focused on describing subgroup differences resulting from different measurement approaches. We cannot in the space available here address crucial questions of interventions to remedy subgroup differences in the life opportunities that affect the development of the knowledge, skill, ability, and achievement domains that are the focus of this article. The research discussed in this article, suggesting that subgroup differences are not simply artifacts of paper-and-pencil testing technologies, highlights the need to consider those larger questions.

REFERENCES

Adarand Constructors, Inc. v. Pena, 115 S. Ct. 2097, 2113 (1995).
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Applebee, A. N., Langer, J. A., Jenkins, L. B., Mullis, I. V. S., & Foertsch, M. A. (1990). Learning to write in our nation's schools: Instruction and achievement in 1988 at grades 4, 8, and 12 (NAEP Rep. No. 19-W-02). Princeton, NJ: Educational Testing Service.
Asher, J. J., & Sciarrino, J. A. (1974). Realistic work sample tests: A review. Personnel Psychology, 27, 519-533.
Bangert-Downs, R. L., Kulik, J. A., & Kulik, C.-L. C. (1983). Effects of coaching programs on achievement test scores. Review of Educational Research, 53, 571-585.
Barrick, M. R., & Mount, M. K. (1991). The Big-Five personality dimensions in job performance: A meta-analysis. Personnel Psychology, 44, 1-26.
Berk, R. A. (1982). Handbook of methods for detecting test bias. Baltimore: Johns Hopkins University Press.
Bobko, P., Roth, P. L., & Potosky, D. (1999). Derivation and implications of a meta-analytic matrix incorporating cognitive ability, alternative predictors, and job performance. Personnel Psychology, 52, 561-590.
Bond, L. (1995). Unintended consequences of performance assessment: Issues of bias and fairness. Educational Measurement: Issues and Practice, 14, 21-24.
Chan, D., & Schmitt, N. (1997). Video-based versus paper-and-pencil method of assessment in situational judgment tests: Subgroup differences in test performance and face validity perceptions. Journal of Applied Psychology, 82, 143-159.
Chan, D., Schmitt, N., DeShon, R. P., Clause, C. C., & Delbridge, K. (1997). Reactions to cognitive ability tests: The relationships between race, test performance, face validity perceptions, and test-taking motivation. Journal of Applied Psychology, 82, 300-310.
Chan, D., Schmitt, N., Sacco, J. M., & DeShon, R. P. (1998). Understanding pretest and posttest reactions to cognitive ability and personality measures. Journal of Applied Psychology, 83, 471-485.
City of Richmond v. J. A. Croson Co., 488 U.S. 469 (1989).
Cole, N. S. (1981). Bias in testing. American Psychologist, 36, 1067-1077.
D'Costa, A. G. (1993). The impact of courts on teacher competence testing. Theory into Practice: Assessing Tomorrow's Teachers, 32, 104-112.
De Corte, W. (1999). Weighing job performance predictors to both maximize the quality of the selected workforce and control the level of adverse impact. Journal of Applied Psychology, 84, 695-702.
DerSimonian, R., & Laird, N. (1983). Evaluating the effect of coaching on SAT scores: A meta-analysis. Harvard Educational Review, 18, 694-734.
DeShon, R. P., Smith, M., Chan, D., & Schmitt, N. (1998). Can adverse impact on cognitive ability and personality tests be reduced by presenting problems in a social context? Journal of Applied Psychology, 83, 438-451.
Dunbar, S. B., Koretz, D. M., & Hoover, H. D. (1991). Quality control in the development and use of performance assessments. Applied Measurement in Education, 4, 289-303.
Dutt-Doner, K., & Gilman, D. A. (1998). Students react to portfolio assessment. Contemporary Education, 69, 159-165.
Dwyer, C. A., & Ramsey, P. A. (1995). Equity issues in teacher assessment. In M. T. Nettles & A. L. Nettles (Eds.), Equity and excellence in educational testing and assessment (pp. 327-342). Boston: Kluwer Academic.
Evans, F. R., & Reilly, R. R. (1973). A study of test speededness as a potential source of bias in the quantitative score of the admission test for graduate study in business. Research in Higher Education, 1, 173-183.
Ford, J. K., Kraiger, K., & Schechtman, S. L. (1986). Study of race effects in objective indices and subjective evaluations of performance: A meta-analysis of performance criteria. Psychological Bulletin, 99, 330-337.
Freedle, R., & Kostin, I. (1990). Item difficulty of four verbal item types and an index of differential item functioning for Black and White examinees. Journal of Educational Measurement, 27, 329-343.
Freedle, R., & Kostin, I. (1997). Predicting Black and White differential item functioning in verbal analogy performance. Intelligence, 24, 417-444.
Frierson, H. T. (1986). Enhancing minority college students' performance on educational tests. Journal of Negro Education, 55, 38-45.
Goldstein, H. W., Yusko, K. P., Braverman, E. P., Smith, D. B., & Chung, B. (1998). The role of cognitive ability in the subgroup differences and incremental validity of assessment center exercises. Personnel Psychology, 51, 357-374.
Harmon, M. (1991). Fairness in testing: Are science education assessments biased? In G. Kulm & S. M. Malcom (Eds.), Science assessment in the service of reform (pp. 31-54). Washington, DC: American Association for the Advancement of Science.
Hartigan, J. A., & Wigdor, A. K. (1989). Fairness in employment testing. Washington, DC: National Academy Press.
Hattrup, K., Rock, J., & Scalia, C. (1997). The effects of varying conceptualizations of job performance on adverse impact, minority hiring, and predicted performance. Journal of Applied Psychology, 82, 656-664.
Helms, J. E. (1992). Why is there no study of cultural equivalence in standardized cognitive ability testing? American Psychologist, 47, 1083-1101.
Hopwood v. State of Texas, 78 F. 3d 932, 948 (5th Cir. 1996).
Hough, L. M. (1984). Development and evaluation of the "accomplishment record" method of selecting and promoting professionals. Journal of Applied Psychology, 69, 135-146.
Hough, L. M., Keyes, M. A., & Dunnette, M. D. (1983). An evaluation of three "alternative" selection procedures. Personnel Psychology, 36, 261-276.
Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72-88.
Ironson, G. H., & Subkoviak, M. J. (1979). A comparison of several methods of assessing item bias. Journal of Educational Measurement, 16, 209-225.
Jaeger, R. M. (1996a). Conclusions on the technical measurement quality of the 1995-1996 operational version of the National Board for Professional Teaching Standards' Early Childhood Generalist Assessment. Center for Educational Research and Evaluation, University of North Carolina at Greensboro.


Jaeger, R. M. (1996b). Conclusions on the technical measurement quality of the 1995-1996 operational version of the National Board for Professional Teaching Standards' Middle Childhood Generalist Assessment. Center for Educational Research and Evaluation, University of North Carolina at Greensboro.
Jensen, A. R. (1980). Bias in mental testing. New York: Free Press.
Kier, F. J., & Davenport, D. S. (1997). Ramifications of Hopwood v. Texas on the process of applicant selection in APA-accredited professional psychology programs. Professional Psychology: Research and Practice, 28, 486-491.
Klein, S. P. (1983). An analysis of the relationship between trial practice skills and bar examination results. Unpublished manuscript.
Klein, S. P., & Bolus, R. E. (1982). An analysis of the relationship between clinical legal skills and bar examination results. Unpublished manuscript.
Klein, S. P., Jovanovic, J., Stecher, B. M., McCaffrey, D., Shavelson, R. J., Haertel, E., Solano-Flores, G., & Comfort, K. (1997). Gender and racial/ethnic differences on performance assessments in science. Educational Evaluation and Policy Analysis, 19, 83-97.
Koenig, J. A., & Leger, K. F. (1997). A comparison of retest performances and test-preparation methods for MCAT examinees grouped by gender and race-ethnicity. Academic Medicine, 72, S100-S102.
Koenig, J. A., & Mitchell, K. J. (1988). An interim report on the MCAT essay pilot project. Journal of Medical Education, 63, 21-29.
Lee, O. (1999). Equity implications based on the conceptions of science achievement in major reform documents. Review of Educational Research, 69, 83-115.
Linn, R. L. (1993). Educational assessment: Expanded expectations and challenges. Educational Evaluation and Policy Analysis, 15, 1-16.
LeMahieu, P. G., Gitomer, D. H., & Eresh, J. T. (1995). Portfolios in large-scale assessment: Difficult but not impossible. Educational Measurement: Issues and Practice, 14, 11-16, 25-28.
Lynn, R. (1996). Racial and ethnic differences in intelligence in the U.S. on the Differential Ability Scale. Personality and Individual Differences, 20, 271-273.
McCauley, C. D., & Mendoza, J. (1985). A simulation study of item bias using a two-parameter item response model. Applied Psychological Measurement, 9, 389-400.
McDaniel, M. A., Schmidt, F. L., & Hunter, J. E. (1988). A meta-analysis of the validity of methods for rating training and experience in personnel selection. Personnel Psychology, 41, 283-309.
Medley, D. M., & Quirk, T. J. (1974). The application of a factorial design to the study of cultural bias in general culture items on the National Teacher Examination. Journal of Educational Measurement, 11, 235-245.
Mehrens, W. A. (1989). Using test scores for decision making. In B. R. Gifford (Ed.), Test policy and test performance: Education, language, and culture (pp. 93-99). Boston: Kluwer Academic.
Mehrens, W. A. (1999). The CBEST saga: Implications for licensure and employment testing. The Bar Examiner, 68, 23-32.
Messick, S. M., & Jungeblut, A. (1981). Time and method in coaching for the SAT. Psychological Bulletin, 89, 191-216.
Miller-Jones, D. (1989). Culture and testing. American Psychologist, 44, 360-366.
Millman, J., Mehrens, W. A., & Sackett, P. R. (1993). An evaluation of the New York State Bar Examination. Unpublished manuscript.
Mishkin, P. J. (1996). Foreword: The making of a turning point – Metro and Adarand. California Law Review, 84, 875-886.
Motowidlo, S. J., Dunnette, M. D., & Carter, G. W. (1990). An alternative selection procedure: The low-fidelity simulation. Journal of Applied Psychology, 75, 640-647.
Mount, M. K., & Barrick, M. R. (1995). The Big Five personality dimensions: Implications for research and practice in human resources management. In G. Ferris (Ed.), Research in personnel and human resources management (Vol. 13, pp. 153-200). Greenwich, CT: JAI Press.
National Academy of Sciences. (1982). Ability testing: Uses, consequences, and controversies (Vol. 1). Washington, DC: National Academy Press.
Neill, M. (1995). Some prerequisites for the establishment of equitable, inclusive multicultural assessment systems. In M. T. Nettles & A. L. Nettles (Eds.), Equity and excellence in educational testing and assessment (pp. 115-157). Boston: Kluwer Academic.
Neisser, U., Boodoo, G., Bouchard, T. J., Jr., Boykin, A. W., Brody, N., Ceci, S. J., Halpern, D. F., Loehlin, J. C., Perloff, R., Sternberg, R. J., & Urbina, S. (1996). Intelligence: Knowns and unknowns. American Psychologist, 51, 77-101.
O'Connor, M. C. (1989). Aspects of differential performance by minorities on standardized tests: Linguistic and sociocultural factors. In B. R. Gifford (Ed.), Test policy and test performance: Education, language, and culture (pp. 129-181). Boston: Kluwer Academic.
O'Neil, H. F., & Brown, R. S. (1997). Differential effects of question formats in math assessment on metacognition and effect (Tech. Rep. No. 449). Los Angeles: University of California, National Center for Research on Evaluation, Standards, and Student Testing.
Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (1993). Comprehensive meta-analysis of integrity test validities: Findings and implications for personnel selection and theories of job performance. Journal of Applied Psychology, 78, 656-664.
Pear, R. (1996, November 6). The 1996 elections: The nation – the states. The New York Times, p. B7.
Powers, D. E. (1987). Who benefits most from preparing for a "coachable" admissions test? Journal of Educational Measurement, 24, 247-262.
Powers, D. E., & Swinton, S. S. (1982). The effects of self-study of test familiarization materials for the analytical section of the GRE Aptitude Test (GRE Board Research Report GREB No. 79-9). Princeton, NJ: Educational Testing Service.
Powers, D. E., & Swinton, S. S. (1984). Effects of self-study for coachable test item types. Journal of Educational Psychology, 76, 266-278.
Pulakos, E. D., & Schmitt, N. (1996). An evaluation of two strategies for reducing adverse impact and their effects on criterion-related validity. Human Performance, 9, 241-258.
Robertson, I. T., & Kandola, R. S. (1982). Work sample tests: Validity, adverse impact, and applicant reaction. Journal of Occupational Psychology, 55, 171-183.
Ryan, A. M., Ployhart, R. E., & Friedel, L. A. (1998). Using personality testing to reduce adverse impact: A cautionary note. Journal of Applied Psychology, 83, 298-307.
Ryan, A. M., Ployhart, R. E., Greguras, G. J., & Schmit, M. J. (1998). Test preparation programs in selection contexts: Self-selection and program effectiveness. Personnel Psychology, 51, 599-622.
Ryer, J. A., Schmidt, D. B., & Schmitt, N. (1999, April). Candidate orientation programs: Effects on test scores and adverse impact. Paper presented at the annual conference of the Society for Industrial and Organizational Psychology, Atlanta, GA.
Sacco, J. M., Scheu, C. R., Ryan, A. M., Schmitt, N., Schmidt, D. B., & Rogg, K. L. (2000). Reading level and verbal test scores as predictors of subgroup differences and validities of situational judgment tests. Unpublished manuscript.
Sackett, P. R. (1998). Performance assessment in education and professional certification: Lessons for personnel selection. In M. D. Hakel (Ed.), Beyond multiple-choice: Evaluating alternatives to traditional testing for selection (pp. 113-129). Mahwah, NJ: Erlbaum.
Sackett, P. R., Burris, L. R., & Ryan, A. M. (1989). Coaching and practice effects in personnel selection. In C. L. Cooper & I. Robertson (Eds.), International review of industrial and organizational psychology 1989. London: Wiley.
Sackett, P. R., & Ellingson, J. E. (1997). The effects of forming multi-predictor composites on group differences and adverse impact. Personnel Psychology, 50, 707-722.
Sackett, P. R., & Wilk, S. L. (1994). Within-group norming and other forms of score adjustment in preemployment testing. American Psychologist, 49, 929-954.
Salgado, J. F. (1997). The five factor model of personality and job performance in the European community. Journal of Applied Psychology, 82, 30-43.
Scarr, S. (1981). Race, social class, and individual differences in I.Q. Hillsdale, NJ: Erlbaum.
Scheuneman, J. (1987). An experimental, exploratory study of causes of bias in test items. Journal of Educational Measurement, 24, 97-118.
Scheuneman, J., & Gerritz, K. (1990). Using differential item functioning procedures to explore sources of item difficulty and group performance characteristics. Journal of Educational Measurement, 27, 109-131.


Scheuneman, J., & Grima, A. (1997). Characteristics of quantitative word items associated with differential performance for female and Black examinees. Applied Measurement in Education, 10, 299-319.
Schmeiser, C. B., & Ferguson, R. L. (1978). Performance of Black and White students on test materials containing content based on Black and White cultures. Journal of Educational Measurement, 15, 193-200.
Schmidt, F. L. (1988). The problem of group differences in ability test scores in employment selection. Journal of Vocational Behavior, 33, 272-292.
Schmidt, F. L., Greenthal, A. L., Hunter, J. E., Berner, J. G., & Seaton, F. W. (1977). Job sample vs. paper-and-pencil trades and technical tests: Adverse impact and examinee attitudes. Personnel Psychology, 30, 187-196.
Schmidt, F. L., & Hunter, J. E. (1981). Employment testing: Old theories and new research findings. American Psychologist, 36, 1128-1137.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262-274.
Schmidt, F. L., Kaplan, J. R., Bemis, S. E., Decuir, R., Dunn, L., & Antone, L. (1979). The behavioral consistency method of unassembled examining (TM-79-21). Washington, DC: U.S. Office of Personnel Management, Personnel Research and Development Center.
Schmidt, F. L., Mack, M. J., & Hunter, J. E. (1984). Selection utility in the occupation of U.S. park ranger for three modes of test use. Journal of Applied Psychology, 69, 490-497.
Schmit, M. J. (1994). Pre-employment processes and outcomes, applicant belief systems, and minority-majority group differences. Unpublished doctoral dissertation, Bowling Green State University.
Schmit, M. J., & Ryan, A. M. (1997). Applicant withdrawal: The role of test-taking attitudes and racial differences. Personnel Psychology, 50, 855-876.
Schmitt, A. P., & Dorans, N. J. (1990). Differential item functioning for minority examinees on the SAT. Journal of Educational Measurement, 27, 67-81.
Schmitt, N., Clause, C. S., & Pulakos, E. D. (1996). Subgroup differences associated with different measures of some job-relevant constructs. In C. R. Cooper & I. T. Robertson (Eds.), International review of industrial and organizational psychology (Vol. 11, pp. 115-140). New York: Wiley.
Schmitt, N., Gooding, R. Z., Noe, R. A., & Kirsch, M. P. (1984). Meta-analyses of validity studies published between 1964 and 1982 and the investigation of study characteristics. Personnel Psychology, 37, 407-422.
Schmitt, N., & Mills, A. E. (in press). Traditional tests and simulations: Minority and majority performance and test validities. Journal of Applied Psychology.
Schmitt, N., Rogers, W., Chan, D., Sheppard, L., & Jennings, D. (1997). Adverse impact and predictive efficiency of various predictor combinations. Journal of Applied Psychology, 82, 719-730.
Smith, F. D. (1991). Work samples as measures of performance. In A. K. Wigdor & B. G. Green Jr. (Eds.), Performance assessment for the workplace (pp. 27-52). Washington, DC: