High-Stakes Testing in Employment, Credentialing, and Higher Education

Prospects in a Post-Affirmative-Action World

Paul R. Sackett, University of Minnesota, Twin Cities Campus
Neal Schmitt, Michigan State University
Jill E. Ellingson, The Ohio State University
Melissa B. Kabin, Michigan State University

Cognitively loaded tests of knowledge, skill, and ability often contribute to decisions regarding education, jobs, licensure, or certification. Users of such tests often face difficult choices when trying to optimize both the performance and ethnic diversity of chosen individuals. The authors describe the nature of this quandary, review research on different strategies to address it, and recommend using selection materials that assess the full range of relevant attributes using a format that minimizes verbal content as much as is consistent with the outcome one is trying to achieve. They also recommend the use of test preparation, face-valid assessments, and the consideration of relevant job or life experiences. Regardless of the strategy adopted, it is unreasonable to expect that one can maximize both the performance and ethnic diversity of selected individuals.

Cognitively loaded tests of knowledge, skill, and ability are commonly used to help make employment, academic admission, licensure, and certification decisions (D'Costa, 1993; Dwyer & Ramsey, 1995; Frierson, 1986; Mehrens, 1989). Law school applicants submit scores on the Law School Admission Test (LSAT) for consideration when making admission decisions. Upon graduation, the same individuals must pass a state-administered bar exam to receive licensure to practice. Organizations commonly rely on cognitive ability tests when making entry-level selection decisions and tests of knowledge and skill when conducting advanced-level selection. High-school seniors take the Scholastic Assessment Test (SAT) for use when determining college admissions and the distribution of scholarship funds. Testing in these settings is termed high stakes, given the central role played by such tests in determining who will and who will not gain access to employment, education, and licensure or certification (jointly referred to as credentialing) opportunities.

The use of standardized tests in the knowledge, skill, ability, and achievement domains for the purpose of facilitating high-stakes decision making has a history characterized by three dominant features. First, extensive research has demonstrated that well-developed tests in these domains are valid for their intended purpose. They are useful, albeit imperfect, descriptors of the current level of knowledge, skill, ability, or achievement. Thus, they are meaningful contributors to credentialing decisions and useful predictors of future performance in employment and academic settings (Mehrens, 1999; Neisser et al., 1996; Schmidt & Hunter, 1998; Wightman, 1997; Wilson, 1981).

Second, racial group differences are repeatedly observed in scores on standardized knowledge, skill, ability, and achievement tests. In education, employment, and credentialing contexts, test score distributions consistently reveal significant mean differences by race (e.g., Bobko, Roth, & Potosky, 1999; Hartigan & Wigdor, 1989; Jensen, 1980; Lynn, 1996; Neisser et al., 1996; Scarr, 1981; Schmidt, 1988; N. Schmitt, Clause, & Pulakos, 1996; Wightman, 1997; Wilson, 1981). Blacks tend to score approximately one standard deviation lower than Whites, and Hispanics score approximately two thirds of a standard deviation lower than Whites. Asians typically score higher than Whites on measures of mathematical-quantitative ability and lower than Whites on measures of verbal ability and comprehension. These mean differences in test scores can translate into large adverse impact against protected groups when test scores are used in selection and credentialing decision making. As subgroup mean differences in test scores increase, it becomes more likely that a smaller proportion of the lower scoring subgroup will be selected or granted a credential (Sackett & Wilk, 1994).

Third, the presence of subgroup differences leads to questions regarding whether the differences observed bias resulting decisions. An extensive body of research in both the employment and education literatures has demonstrated that these tests generally do not exhibit predictive bias. In other words, standardized tests do not underpredict the performance of minority group members (e.g., American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999; Cole, 1981; Jensen, 1980; Neisser et al., 1996; O'Connor, 1989; Sackett & Wilk, 1994; Wightman, 1997; Wilson, 1981).

Editor's note. Sheldon Zedeck served as action editor for this article.

Author's note. Paul R. Sackett, Department of Psychology, University of Minnesota, Twin Cities Campus; Neal Schmitt and Melissa B. Kabin, Department of Psychology, Michigan State University; Jill E. Ellingson, Department of Management and Human Resources, The Ohio State University. Authorship order for Paul R. Sackett and Neal Schmitt was determined by coin toss.

Correspondence concerning this article should be addressed to Paul R. Sackett, Department of Psychology, University of Minnesota, N475 Elliott Hall, 75 East River Road, Minneapolis, MN 55455. Electronic mail may be sent to [email protected].

American Psychologist, April 2001, Vol. 56, No. 4, 302-318. Copyright 2001 by the American Psychological Association, Inc. 0003-066X/01/$5.00. DOI: 10.1037/0003-066X.56.4.302

These features of traditional tests cause considerable tension for many organizations and institutions of higher learning. Most value that which is gained through the use of tests valid for their intended purpose (e.g., a higher performing workforce, a higher achieving student body, a cadre of credentialed teachers who meet knowledge, skill, and achievement standards). Yet, most also value racial and ethnic diversity in the workforce or student body, with rationales ranging from a desire to mirror the composition of the community to a belief that academic experiences or workplace effectiveness are enhanced by exposure to diverse perspectives. What quickly becomes clear is that these two values—performance and diversity—come into conflict. Increasing emphasis on the use of tests in the interest of gaining enhanced performance has predictable negative consequences for the selection of Blacks and Hispanics. Conversely, decreasing emphasis on the use of tests in the interest of achieving a diverse group of selectees often results in a substantial reduction in the performance gains that can be recognized through test use (e.g., Schmidt, Mack, & Hunter, 1984; N. Schmitt et al., 1996).

This dilemma is well-known, and a variety of resolution strategies have been proposed. One class of strategies involves some form of minority group preference; these strategies were the subject of an American Psychologist article by Sackett and Wilk (1994) that detailed the history, rationale, consequences, and legal status of such strategies. However, a variety of recent developments indicate a growing trend toward bans on preference-based forms of affirmative action. The passage of the Civil Rights Act of 1991 made it unlawful for employers to adjust test scores as a function of applicants' membership in a protected group. The U.S. Supreme Court's decisions in City of Richmond v. J. A. Croson Co. (1989) and Adarand Constructors, Inc. v. Pena (1995) to overturn set-aside programs that reserved a percentage of contract work for minority-owned businesses signaled the Court's stance toward preference-based affirmative action (Mishkin, 1996). The U.S. Fifth Circuit Court of Appeals ruled in Hopwood v. State of Texas (1996) that race could not be used as a factor in university admissions decisions (Kier & Davenport, 1997; Mishkin, 1996). In 1996, the state of California passed Proposition 209, prohibiting the use of group membership as a basis for any selection decisions made by the state, thus affecting public sector employment and California college admissions (Pear, 1996). Similarly, state of Washington voters approved Initiative 200, which bars the use of race in state hiring, contracting, and college admissions (Verhovek & Ayres, 1998).

Strategies for Achieving Diversity Without Minority Preference

In light of this legal trend toward restrictions on preference-based routes to diversity, a key question emerges: What are the prospects for achieving diversity without minority preference and without sacrificing the predictive accuracy and content relevancy present in knowledge, skill, ability, and achievement tests? Implicit in this question is the premise that one values both diversity and the performance outcomes that an organization or educational institution may realize through the use of tests. If one is willing to sacrifice quality of measurement and predictive accuracy, there are many routes to achieving diversity, including random selection, setting a low cut score, or the use of a low-impact predictor even though it may possess little to no predictive power. On the other hand, if one values performance outcomes but does not value diversity, maximizing predictive accuracy can be the sole focus. We suggest that most organizations and educational institutions espouse neither of these extreme views and instead seek a balance between diversity concerns and performance outcomes. Clearly, the use of traditional tests without race-based score adjustment fails to achieve such a balance. However, what alternatives are available for use in high-stakes, large-scale assessment contexts? In this article, we review various alternative strategies that have been put forth in the employment, education, and credentialing literatures. We note that some strategies have been examined more carefully in some domains than in others, and thus the attention we devote to employment, education, and credentialing varies across these alternatives.

The first strategy involves the measurement of constructs with little or no adverse impact along with traditional cognitively loaded knowledge, skill, ability, and achievement measures. The notion is that if we consider other relevant constructs along with knowledge, skill, ability, and achievement measures when making high-stakes decisions, subgroup differences should be lessened because alternatives such as measures of interpersonal skills or personality usually exhibit smaller differences between ethnic and racial subgroups. A second strategy investigates test items in an effort to identify and remove those items that are culturally laden. It is generally believed that because those items likely reflect irrelevant, culture-bound factors, their removal will improve minority passing rates. The use of computer or video technology to present test stimuli and collect examinee responses constitutes a third strategy. Using these technologies usually serves to minimize the reading and writing requirements of a test. Reduction of adverse impact may be possible when the reading or writing requirements are inappropriately high. Also, video technology may permit the presentation of stimulus materials in a fashion that more closely matches the performance situation of interest. Attempts to modify how examinees approach the test-taking experience constitute a fourth strategy. To the extent that individuals of varying ethnic and racial groups exhibit different levels of test-taking motivation, attempts to enhance examinee motivation levels may reduce subgroup differences. Furthermore, changing the way in which the test and its questions are presented may affect how examinees respond, a result that could also facilitate minority test performance. A fifth strategy has been to document relevant knowledge, accomplishments, or achievements via portfolios, performance assessments, or accomplishment records. Proponents of this strategy maintain that this approach is directly relevant to desired outcomes and hence should constitute a more fair assessment of the knowledge, skill, ability, or achievement domain of interest for members of all subgroups. Finally, we also review the use of coaching or orientation programs that provide examinees with information about the test and study materials or aids to facilitate optimal performance. In addition, we consider whether modifying the time limits prescribed for testing helps reduce subgroup differences. In the following sections, we review the literature relevant to each of these efforts in order to understand the nature of subgroup differences on knowledge, skill, ability, and achievement tests and to ascertain the degree to which these efforts have been effective in reducing these differences.

Use of Measures of Additional Relevant Constructs

Cognitively loaded knowledge, skill, ability, and achievement tests are among the most valid predictors available when selecting individuals across a wide variety of educational and employment situations (Schmidt & Hunter, 1981, 1998). Therefore, a strategy for resolving the dilemma that allows for the use of such tests is readily appealing. To that end, previous research has identified a number of noncognitive predictors that are also valid when making selection decisions in most educational and employment contexts. Measures of personality and interpersonal skills generally exhibit smaller mean differences by ethnicity and race and also are related to performance on the job or in school (e.g., Barrick & Mount, 1991; Bobko et al., 1999; Mount & Barrick, 1995; Sackett & Wilk, 1994; Salgado, 1997; N. Schmitt et al., 1996; Wolfe & Johnson, 1995). The use of valid, noncognitive predictors, in combination with cognitive predictors, serves as a very desirable strategy in that it offers the possibility of simultaneously meeting multiple objectives. If additional constructs, beyond those measured by the traditional test, are relevant for the job or educational outcomes of interest, supplementing cognitive tests offers the prospect of increased validity when predicting those outcomes. If those additional constructs are ones on which subgroup differences are smaller, a composite of the traditional test and the additional measures will often exhibit smaller subgroup differences than the traditional test alone. The prospect of simultaneously increasing validity and reducing subgroup differences makes this a strategy worthy of careful study.

Several different approaches have been followed when examining this strategy. On the basis of the psychometric theory of composites, Sackett and Ellingson (1997) developed a set of implications helpful in estimating the effect of a supplemental strategy on adverse impact. First, consider a composite of two uncorrelated measures, where d1 = 1.0 and d2 = 0.0. Although intuition may suggest that a composite of the two will split the difference (i.e., result in a d of 0.5), the computed value is 0.71. Thus, whereas supplementing a cognitively loaded test with an uncorrelated measure exhibiting no subgroup differences will reduce the composite subgroup difference, this reduction will be less than some might expect. Second, a composite may result in a d larger than either of the components making up the composite if the two measures are moderately correlated; in essence, the composite reflects a more reliable measure of the underlying characteristic reflected in both variables. Third, adding additional supplemental measures has diminishing returns. For example, when d1 = 1.0 and each additional measure is uncorrelated with the original measure and has d = 0.0, the composite ds adding a second, third, fourth, and fifth measure are 0.71, 0.58, 0.50, and 0.45, respectively. (Note that adding additional predictors exhibiting lower d values would not be done unless those additional predictors are themselves related to the outcome one hopes to predict; see N. Schmitt, Rogers, Chan, Sheppard, & Jennings, 1997, for additional analytic work on the effects of combining predictors.)
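
To make the arithmetic behind these implications concrete, the following sketch computes the standardized difference of a unit-weighted composite of standardized predictors. The function name and illustrative calls are ours, but the formula is the standard composite result underlying the values quoted above: the composite d equals the sum of the predictor ds divided by the square root of k + k(k - 1) times the average predictor intercorrelation, for k predictors.

import math

def composite_d(ds, mean_r):
    """Standardized subgroup difference of a unit-weighted composite.

    ds: subgroup differences (in SD units) for the k standardized predictors.
    mean_r: average intercorrelation among the predictors.
    """
    k = len(ds)
    # Numerator: group mean difference of the summed composite.
    # Denominator: SD of a sum of k standardized, equally weighted measures.
    return sum(ds) / math.sqrt(k + k * (k - 1) * mean_r)

# Two uncorrelated measures with d = 1.0 and d = 0.0 yield 0.71, not 0.5.
print(round(composite_d([1.0, 0.0], 0.0), 2))
# Adding further uncorrelated zero-d measures shows the diminishing returns
# noted above: 0.71, 0.58, 0.50, 0.45 for two through five measures.
for k in range(2, 6):
    print(k, round(composite_d([1.0] + [0.0] * (k - 1), 0.0), 2))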

A second approach followed when examining this strategy has been the empirical tryout of different composite alternatives. For example, in an employment context, Pulakos and Schmitt (1996) compared a traditional verbal ability measure with three alternative predictors: a biographical data measure (Stokes, Mumford, & Owens, 1994), a situational judgment test (Motowidlo, Dunnette, & Carter, 1990), and a structured interview. Whereas the Black-White d for the verbal ability measure was 1.03, a composite of all four predictors produced a d of 0.63. A composite of the three alternative predictors produced a d of only 0.23. The inclusion of the verbal ability measure in the composite did increase the multiple correlation between the predictors and a performance criterion from .41 to .43. However, it did so at the cost of a considerable increase in subgroup differences (0.23 vs. 0.63). A similar pattern of findings was observed when comparing Hispanics and Whites. Similar results were found by Ryan, Ployhart, and Friedel (1998).

A third approach for evaluating the legitimacy of the composite strategy has relied on meta-analytic findings. Building on estimates reported in Ford, Kraiger, and Schechtman (1986) and N. Schmitt et al. (1997), Bobko et al. (1999) refined previously cumulated validities, intercorrelations, meta-analytically derived estimates of subgroup differences in performance, and ds associated with four predictors commonly used in the employment domain (cognitive ability, biodata, interviews, and conscientiousness). Using these refined estimates, Bobko et al. compared the validity implications of various composites with the effect on subgroup differences. The four-predictor composite yielded a multiple correlation of .43 when predicting performance, with d estimated at 0.76. The validity of the cognitive ability predictor alone was estimated at .30, with d estimated at 1.00. Thus, using all four predictors reduced subgroup differences by 0.24 standard deviation units and increased the multiple correlation by .13. These results mirror earlier research conducted by Ones, Viswesvaran, and Schmidt (1993), in which integrity test validities were estimated and discussed with respect to their influence on adverse impact via composite formation.
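
The logic of such meta-analytic comparisons can be illustrated with standard composite formulas. In the sketch below, the intercorrelation matrix, validities, and d values are hypothetical placeholders rather than the actual Bobko et al. (1999) estimates; the code simply shows how a composite's multiple correlation and subgroup difference follow mechanically from those inputs.

import numpy as np

# Hypothetical inputs for four predictors (NOT the Bobko et al. values).
R = np.array([              # predictor intercorrelations
    [1.00, 0.20, 0.25, 0.00],
    [0.20, 1.00, 0.15, 0.10],
    [0.25, 0.15, 1.00, 0.05],
    [0.00, 0.10, 0.05, 1.00],
])
validity = np.array([0.30, 0.28, 0.25, 0.20])  # predictor-criterion correlations
d = np.array([1.00, 0.30, 0.25, 0.05])         # subgroup differences in SD units

w = np.ones(len(d))  # unit weights on standardized predictors

# Validity and subgroup difference of the unit-weighted composite.
composite_validity = (w @ validity) / np.sqrt(w @ R @ w)
composite_d = (w @ d) / np.sqrt(w @ R @ w)

# Regression-weighted (optimal) composite: the multiple correlation.
multiple_R = np.sqrt(validity @ np.linalg.solve(R, validity))

print(round(composite_validity, 2), round(composite_d, 2), round(multiple_R, 2))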

Weighting of criterion components. Each of these studies is predicated on the assumption that the outcome measure against which the predictors are being validated is appropriately constituted. Educational performance, job performance, and credentialing criteria, however, may be defined in terms of a single dimension or multiple dimensions. For example, if the outcome of interest involves only the design of a new product or performance on an academic test, performance may be unidimensional. If, however, the outcome of interest requires citizenship behavior or effective coordination as well, then performance will be more broadly defined. When performance is multidimensional, institutions may choose to assign different weights to those various dimensions in an effort to reflect that they vary in importance.

De Corte (1999) and Hattrup, Rock, and Scalia (1997) have shown how the weighting of different elements of the criterion space can affect the regression weights assigned to different predictors, the level of adverse impact, and predicted performance. Hattrup et al. used cumulated correlations from three different studies to estimate the relationship between contextual performance (behaviors supporting an organization's climate and culture), task performance (behaviors that support the delivery of goods or services), cognitive ability, and work orientation. Using regression analyses, cognitive ability and work orientation were used to predict criterion composites in which contextual performance and task performance were assigned varying weights. The regression weight for cognitive ability was highest when task performance was weighted heavily in the criterion composite, whereas work orientation received the largest regression weight when contextual performance was the more important part of the criterion. As expected, adverse impact was the greatest when task performance was weighted more heavily, and it decreased as contextual performance received more weight. Relative to a composite wherein task and contextual performance were weighted equally, the percentage of minorities that would be selected varied considerably depending on the relative weight of task versus contextual performance. De Corte reached similar conclusions.

The Hattrup et al. (1997) and De Corte (1999) analyses were formulated in employment terms, yet the same principles hold for educational and credentialing tests as well. For example, when licensing lawyers, the licensing body is concerned about both technical competence and professional ethics. The general principle relevant across settings is that when multiple criterion dimensions are of interest, the weights given to the criterion dimensions can have important effects on the relationship between the predictors and the overall criterion. The higher the weight given to cognitively loaded criterion dimensions, the higher the resulting weight given to cognitively loaded predictors. The higher the weight given to cognitively loaded predictors, the greater the resulting subgroup difference. In response, one may be tempted to simply choose criterion dimension weights on the basis of their potential for reducing subgroup differences. Such a strategy would be errant, however, as criterion weights should be determined primarily on the basis of an analysis of the performance domain of interest and the values that an institution places on the various criterion dimensions.
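
The mechanism described here can be sketched with ordinary regression algebra. The correlations below are hypothetical illustrations, not the cumulated estimates used by Hattrup et al. (1997) or De Corte (1999); the sketch only shows how the standardized regression weights for two predictors shift as the criterion weight moves from task toward contextual performance.

import numpy as np

# Hypothetical standardized correlations (NOT the Hattrup et al. estimates).
r_pred = 0.10                          # cognitive ability with work orientation
r_cog_task, r_cog_ctx = 0.50, 0.15     # cognitive ability with task / contextual performance
r_wo_task, r_wo_ctx = 0.15, 0.40       # work orientation with task / contextual performance
r_task_ctx = 0.30                      # task with contextual performance

R = np.array([[1.0, r_pred], [r_pred, 1.0]])  # predictor intercorrelation matrix

for w_task in (1.0, 0.5, 0.0):         # weight on task performance in the criterion composite
    w_ctx = 1.0 - w_task
    # Standardize the weighted criterion composite so v holds true correlations.
    sd_crit = np.sqrt(w_task**2 + w_ctx**2 + 2 * w_task * w_ctx * r_task_ctx)
    v = np.array([
        w_task * r_cog_task + w_ctx * r_cog_ctx,
        w_task * r_wo_task + w_ctx * r_wo_ctx,
    ]) / sd_crit
    betas = np.linalg.solve(R, v)      # standardized regression weights
    print(f"task weight {w_task:.1f}: beta_cognitive={betas[0]:.2f}, "
          f"beta_work_orientation={betas[1]:.2f}")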

Summary. Research on the strategy of measuring different relevant constructs illustrates that it does matter which individual differences are assessed in high-stakes decision making if one is concerned about maximizing minority subgroup opportunity. Minority pass rates can be improved by including noncognitive predictors in a test battery. However, adding predictors with little or no impact will not eliminate adverse impact from a battery of tests that includes cognitively loaded knowledge, skill, ability, and achievement measures. Reduction in adverse impact results from a complex interaction between the validity of the individual predictors, their intercorrelation, the size of subgroup differences on the combination of tests used, the selection ratio, and the manner in which the tests are used. In fact, in most situations wherein a variety of knowledge, skills, and abilities are considered when making selection decisions, adverse impact will remain at legally unacceptable levels, and subgroup mean differences on the predictor battery will not be a great deal lower than the differences observed for cognitive ability alone. The composition of the test battery should reflect the individual differences required to perform in the domain of interest. If institutions focus mainly or solely on task performance, then cognitive ability will likely be the most important predictor and adverse impact will be great. If, however, they focus on a broader domain that involves motivational, interpersonal, or personality dimensions as well as cognitive ability, then adverse impact may be reduced.

Identification and Removal of Culturally Biased Test Items

A second strategy pursued in an attempt to resolve the performance versus diversity dilemma involves investigating the possibility that certain types of test items are biased. The traditional focus of studies examining differential item functioning (DIF; Berk, 1982) has been on the identification of items that function differently for minority versus majority test takers. Conceivably, such items would contribute to misleading test scores for members of a particular subgroup. Statistically, DIF analysis seeks items that vary in difficulty for members of subgroups that are actually evenly matched on the measured construct. That is, an attempt is made to identify characteristics of items that lead to poorer performance for minority-group test takers than for equally able majority-group test takers. Assuming such a subset of items or item characteristics exists, they must define a race-related construct that is distinguishable from the construct the test is intended to measure (McCauley & Mendoza, 1985). Perhaps because of the availability of larger sample sizes in large-scale testing programs, much of the DIF research conducted to date has been done using educational and credentialing tests.
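
One widely used way to operationalize this matching logic (not necessarily the procedure used in the studies reviewed below) is the Mantel-Haenszel approach: examinees are stratified by total test score, and the odds of answering a given item correctly are compared for reference- and focal-group members within each stratum. A minimal sketch with hypothetical count data follows; the function name and counts are ours.

import math

def mantel_haenszel_dif(strata):
    """Common odds ratio and ETS delta metric for a single item.

    strata: list of (ref_correct, ref_incorrect, focal_correct, focal_incorrect)
    tuples, one per total-score level used to match examinees on ability.
    """
    num = den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        num += a * d / n   # reference correct * focal incorrect
        den += b * c / n   # reference incorrect * focal correct
    alpha = num / den                 # common odds ratio across strata
    delta = -2.35 * math.log(alpha)   # ETS delta; negative values favor the reference group
    return alpha, delta

# Hypothetical counts at three matched score levels for one item.
item_counts = [(40, 10, 30, 20), (60, 15, 45, 25), (50, 5, 40, 10)]
print(mantel_haenszel_dif(item_counts))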

Initial evidence for DIF was provided by Medley and Quirk (1974), who found relatively large group-by-item interactions in a study of the performance of Black and White examinees on National Teacher Examination items reflecting African American art, music, and literature. One should note, however, that these results are based on examinee performance on a test constructed using culturally specific content. Ironson and Subkoviak (1979) also found evidence for DIF when they evaluated five cognitive subtests administered as part of the National Longitudinal Study of 1972. The verbal subtest, measuring vocabulary and reading comprehension, contained the largest number of items flagged as biased. Items at the end of each of the subtests also tended to be biased against Black examinees, presumably because of lower levels of reading ability or speed in completing these tests. More recently, R. Freedle and Kostin (1990; R. Freedle & Kostin, 1997) showed that Black examinees were more likely to get difficult verbal items on the Graduate Record Examination (GRE) and the SAT correct when compared with equally able White examinees. However, the Black examinees were less likely to get the easy items right. In explanation, they suggested that the easier items possessed multiple meanings more familiar to White examinees, whose culture was most dominant in the test items and the educational system.

Whitney and Schmitt (1997) investigated the extent to which DIF may be present in biographical data measures developed for use in an employment context. In an effort to further DIF research, Whitney and Schmitt focused not only on identifying whether biodata items may exhibit DIF, but also on whether the presence of DIF can be traced to differences in cultural values between racial subgroups. More than one fourth of the biodata items exhibited DIF between Black and White examinees. Moreover, the Black and White examinees differentially endorsed item response options designed to reflect differing cultural notions of human nature, the environment, time orientation, and interpersonal relations. However, there was only limited evidence that the differences observed between subgroups in cultural values were actually associated with DIF. After removal of all DIF items, the observed disparity in test scores between Blacks and Whites was eliminated, a disparity that, incidentally, favored Blacks over Whites.

Other studies evaluating the presence of DIF have proved less interpretable. Scheuneman (1987) developed item pairs that reflected seven major hypotheses about potential sources of item bias in the experimental portion of the GRE. The results were inconsistent in showing that these manipulations produced a more difficult test for minority versus majority examinees. In some instances, the manipulations produced a greater impact on Whites than Blacks. In other instances, a three-way interaction between group, test version, and items indicated that some uncontrolled factor (e.g., content of a passage or item) was responsible for the subgroup difference. Scheuneman and Gerritz (1990) examined verbal items from the GRE and the SAT that consisted of short passages followed by questions. Although they did identify several item features possibly linked to subgroup differences (e.g., content dealing with science, requiring that examinees identify the major thesis in a paragraph), the results, as a whole, yielded no clear-cut explanations. Scheuneman and Gerritz concluded that DIF may result from a combination of item features, the most important of which seems to be the content of the items. Similarly inconclusive results were reported by A. P. Schmitt and Dorans (1990) in a series of studies on SAT-Verbal performance. Items that involved the use of homographs (i.e., words that are spelled like other words with a different meaning) were more difficult for otherwise equally able racial and ethnic group members. Yet, when nonnative English speakers were removed from the analyses, there were few remaining DIF items.

Schmeiser and Ferguson (1978) examined the English usage and social studies reading tests of the American College Test (ACT) and found little support for DIF. Two English tests and three social studies tests were developed to contain different content while targeting the same cognitive skills. None of the interactions between test content and racial and ethnic group were statistically significant. Similarly, Scheuneman and Grima (1997) reported that the verbal characteristics of word problems (e.g., readability indexes, the nature of the arguments, and propositions) in the quantitative section of the GRE were not related to DIF indexes.

These results indicate that although DIF may be detected for a variety of test items, it is often the case that the magnitude of the DIF effect is very small. Furthermore, there does not appear to be a consistent pattern of items favoring one group versus another. Results do not indicate that removing these items would have a large impact on overall test scores. In addition, we know little about how DIF item removal will affect test validity. However, certain themes across these studies suggest the potential for some DIF considerations. Familiarity with the content of items appears to be important. The verbal complexity of the items is also implicated, yet it is not clear what constitutes verbal complexity. Differences in culture are often cited as important determinants of DIF, but beyond the influence of having English as one's primary language, we know little about how cultural differences play a role in test item performance.

Use of Alternate Modes of Presenting Test Stimuli

A third strategy to reduce subgroup differences in tests of knowledge, skill, ability, and achievement has been to change the mode in which test items or stimulus materials are presented. Most often this involves using video or auditory presentation of test items, as opposed to presenting test items in the normal paper-and-pencil mode. Implicit in this strategy is the assumption that reducing irrelevant written or verbal requirements will reduce subgroup differences. In support of this premise, Sacco et al. (2000) demonstrated the relationship between reading level and subgroup differences by assessing the degree to which the readability level of situational judgment tests (SJTs) was correlated with the size of subgroup differences in SJT performance. They estimated that 10th-, 12th-, and 14th-grade reading levels would be associated with Black-White ds of 0.51, 0.62, and 0.74, respectively; reading levels at the 8th, 10th, and 13th grades would be associated with Hispanic-White ds of 0.38, 0.48, and 0.58, respectively. This would suggest that reducing the readability level of a test should in turn reduce subgroup differences. One must be careful, however, to remove only verbal requirements irrelevant to the criterion of interest, as Sacco et al. also demonstrated that verbal ability may partially account for SJT validities. In reviewing the research on this strategy, we focused on three key studies investigating video as an alternative format to a traditional test. The three selected studies illustrate issues central in examining the effects of format changes on subgroup differences. For other useful studies that also touch on this issue, see Weekley and Jones (1997) and N. Schmitt and Mills (in press).

Pulakos and Schmitt (1996) examined three measures of verbal ability. A paper-and-pencil measure testing verbal analogies, vocabulary, and reading comprehension had a validity of .19 when predicting job performance and a Black-White d = 1.03. A measure that required examinees to evaluate written material and write a persuasive essay on that material had a validity of .22 and a d = 0.91. The third measure of verbal ability required examinees to draft a description of what transpired in a short video. The validity of this measure was .19 with d = 0.45. All three measures had comparable reliabilities (i.e., .85 to .92). Comparing the three measures, there was some reduction in d when written materials involving a realistic reproduction of tasks required on the job were used rather than a multiple-choice paper-and-pencil test. When the stimulus material was visual rather than written, there was a much greater reduction in subgroup differences (i.e., Black-White d values dropped from 1.03 to 0.45, with parallel findings reported for Hispanics). In terms of validity, the traditional verbal ability test was the most predictive (r = .39), whereas the video test was less so (r = .29, corrected for range restriction and criterion unreliability).

Chan and Schmitt (1997) evaluated subgroup differences in performance on a video-based SJT and on a written SJT identical to the script used to produce the video enactment. The written version of the test displayed a subgroup difference of 0.95 favoring Whites over Blacks, whereas the video-based version of the test produced a subgroup difference of only 0.21. Corrections for unreliability produced differences of 1.19 and 0.28, respectively. These differences in d were matched by subgroup differences in perceptions of the two tests. Both groups were more favorably disposed (as indicated by perceptions of the tests' face validity) to the video version of the test, but Blacks significantly more so than Whites.

Sackett (1998) summarized research on the use of the Multistate Bar Examination (MBE), a multiple-choice test of legal knowledge and reasoning, and research conducted by Klein (1983) examining a video-based alternative to the MBE. The video test presented vignettes of lawyers taking action in various settings. After each vignette, examinees were asked to evaluate the actions taken. Millman, Mehrens, and Sackett (1993) reported a Black-White d of 0.89 for the MBE; an identical value (d = 0.89) was estimated by Sackett for the video test.

Taken at face value, the Sackett (1998), Chan and Schmitt (1997), and Pulakos and Schmitt (1996) results appear contradictory regarding the impact of changing from a paper-and-pencil format to a video format. These contradictions are more apparent than real. First, it is important to consider the nature of the focal construct. In the Pulakos and Schmitt study of verbal ability and the Sackett study of legal knowledge and reasoning, the focal construct was clearly cognitive in nature, falling squarely into the domain of traditional knowledge, skill, ability, and achievement tests. In Chan and Schmitt, however, the focal construct (i.e., interpersonal skill) was, as the name implies, not heavily cognitively loaded. In fact, the SJT used was of the type often suggested as a potential additional measure that might supplement a traditional test. Chan and Schmitt provided data supporting the hypothesis that the relatively large drop in d was due in part to the removal of an implicit reading comprehension component present in the written version of the SJT. When that component was removed through video presentation, the measure became less cognitively loaded and d was reduced.

But what of the differences between the verbal skill (Pulakos & Schmitt, 1996) and legal knowledge and reasoning studies (Sackett, 1998)? We use this comparison to highlight the importance of examining the correlation between the traditional written test and the alternative video test. In the legal knowledge and reasoning study, in which the video did not result in reduced subgroup differences, the correlation between the traditional test and the video test was .89, corrected for unreliability. This suggests that the change from paper-and-pencil to video was essentially a format change only, with the two tests measuring the same constructs. In the verbal skills study, in which the video resulted in markedly smaller subgroup differences, the correlation between the two, corrected for unreliability in the two measures, was .31. This suggests that the video-based test in the verbal skills study reflected not only a change in format, but a change in the constructs measured as well. An examination of the scoring procedures for the verbal skills video test supports this conclusion. Examinee essays describing video content were rated on features of verbal ability (e.g., sentence structure, spelling) and on completeness of details reported. We suggest that scoring for completeness introduced into the construct domain personality characteristics such as conscientiousness and detail orientation, both traits exhibiting smaller subgroup differences. Consistent with the arguments made above with regard to the ability of composites to reduce subgroup differences, the reduction in d observed for the verbal skills video test was due not to the change in format, but rather to the introduction of additional constructs that ameliorated the influence of verbal ability when determining d. Thus, the research to date indicates that changing to a video format does not per se lead to a reduction in subgroup differences. Future research into the influence of alternative modes of testing should take steps to control for the unintended introduction of additional constructs beyond those being evaluated. Failure to separate test content from test mode will confound results, blurring our ability to understand the actual mechanism responsible for reducing subgroup differences.

Last, we wish to highlight the role of reliability when comparing traditional tests with alternatives. Focusing on the legal knowledge and reasoning study (Sackett, 1998), recall that the Black-White difference was identical (d = 0.89) for both tests. We now add reliability data to our discussion. Internal consistency reliability for the MBE was .91; correcting the subgroup difference for unreliability results in a corrected value of d = 0.93. Internal consistency reliability for the video test was .64; correcting the subgroup difference for unreliability results in a corrected value of d = 1.11. In other words, after taking differences in reliability into account, the alternative video test results in a larger subgroup difference than the traditional paper-and-pencil test. Such an increase is possible in that a traditional test may tap declarative knowledge, whereas a video alternative may require the application of that knowledge, an activity that will likely draw on higher order cognitive skills resulting in higher levels of d. Clearly, any conclusion about the effects of changing test format on subgroup differences must take into account reliability of measurement. What appears to be a format effect may simply be a reliability effect. Because different reliability estimation methods focus on different sources of measurement error (e.g., inconsistency across scorers, content sampling), taking reliability into account will require considering the most likely sources of measurement error in a particular setting.
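
The correction applied here is the standard attenuation formula: the observed standardized difference is divided by the square root of the measure's reliability. A brief sketch (the function name is ours) reproduces the values reported above.

import math

def correct_d_for_unreliability(observed_d, reliability):
    """Disattenuate a standardized group difference for measurement error."""
    return observed_d / math.sqrt(reliability)

print(round(correct_d_for_unreliability(0.89, 0.91), 2))  # MBE: 0.93
print(round(correct_d_for_unreliability(0.89, 0.64), 2))  # video test: 1.11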

The results reported in this section document the large differences in d that are typically observed when comparisons are made between different test formats. Examinations of reasons for these differences are less conclusive. It is not clear that the lower values of d are a function of test format. Cognitive ability requirements and the reading requirements of these tests seem to be a major explanation. Clearly, more research would be helpful. Separating the method of assessment from the content measured is a major challenge when designing studies to evaluate the impact of test format. However, it is a challenge that must be surmounted in order to understand whether group differences can be reduced by altering test format. Future research should also consider investigating how a change in format influences validity. Only one of the three studies reviewed here evaluated validity effects; clearly, that issue warrants additional attention.

Use of Motivation and Instructional Sets

A fourth strategy places the focus not on the test itself, but on the mental state adopted by examinees and its role in determining test performance. An individual's motivation to complete a test has the potential to influence test performance. To the extent that there are racial group differences in test-taking motivation, energizing an individual to persevere when completing a test may reduce subgroup differences. At a more general level, a test taker's attitude and approach toward testing may also influence test performance. Such test-taking impressions could be partially determined by the instructional sets provided when completing tests and the context in which actual test questions are derived. Manipulating instructional sets and item contexts has the potential to alter observed group differences if these aspects of the testing experience allow individuals of varying racial groups to draw on culture-specific cognitive processes.

A number of studies have demonstrated racial group differences in test-taking motivation. O'Neil and Brown (1997) found that eighth-grade Black students reported exerting the least amount of effort when completing a math exam. Hispanic students reported exerting slightly more, whereas White students reported exerting the most effort when completing the exam. Chan, Schmitt, DeShon, Clause, and Delbridge (1997) demonstrated that the relationship between race and test performance was partially mediated by test-taking motivation, although the mediating effect accounted for a very small portion of the variance in test performance. In a follow-up study, Chan, Schmitt, Sacco, and DeShon (1998) found that pretest reactions affected test performance and mediated the relationship between belief in tests and test performance. Their subgroup samples were not large, but motivational effects operated similarly for Whites and Blacks.

As a test of the influence of item context, a unique study by DeShon, Smith, Chan, and Schmitt (1998) investigated whether presenting test questions in a certain way would reduce subgroup differences. They tested the hypothesis proposed by Helms (1992) that cognitive ability tests fail to adequately assess Black intelligence because they do not account for the emphasis in Black culture on social relations and social context, an observation offered at a more general level by others as well (e.g., Miller-Jones, 1989; O'Connor, 1989). Contrary to Helms's argument, racial subgroup performance differences on a set of Wason conditional reasoning problems were not reduced by presenting the problems in a social relationship form.

Another working hypothesis is that the mere knowledge of cultural stereotypes may affect test performance. In other words, making salient to test takers their ethnic and racial or their gender identity may alter both women's and minorities' test-taking motivation, self-concept, effort level, and expectation of successful performance. Steele and colleagues (Steele, 1997; Steele & Aronson, 1995) proposed a provocative theory of stereotype threat that suggests that the way in which a test is presented to examinees can affect examinee performance. The theory hypothesizes that when a person enters a situation wherein a stereotype of the group to which that person belongs becomes salient, concerns about being judged according to that stereotype arise and inhibit performance. When members of racial minority groups encounter high-stakes tests, their awareness of commonly reported group differences leads to concerns that they may do poorly on the test and thus confirm the stereotype. This concern detracts from their ability to focus all of their attention on the test, resulting in poorer test performance. Steele hypothesized a similar effect for gender in the domain of mathematics. A boundary condition for the theory is that individuals must identify with the domain in question. If the domain is not relevant to the individual's self-image, the testing situation will not elicit stereotype threat.

Steele and Aronson (1995) found support for the theory in a series of laboratory experiments. The basic paradigm used was to induce stereotype threat in a sample of high-achieving majority and minority students statistically equated in terms of their prior performance on the SAT. One mechanism for inducing threat is via instructional set. In the stereotype threat condition, participants were told that they would be given a test of intelligence; in the nonthreat condition, they were told they would be given a problem-solving task. In fact, all of the participants received the same test. Steele and Aronson found a larger majority-minority difference in the threat condition than in the nonthreat condition, a finding supportive of the idea that the presence of stereotype threat inhibits minority group performance.

These findings are well replicated (Steele, 1997) but commonly misinterpreted. For example, in the fall of 1999, the PBS show "Frontline" broadcast a one-hour special entitled "Secrets of the SAT," in which Steele's research was featured. The program's narrator noted the large Black-White gap on standardized tests, described the stereotype threat manipulation, and concluded, "Blacks who believed the test was merely a research tool did the same as Whites. But Blacks who believed the test measured their abilities did half as well." The critical fact excluded was that whereas a large score gap exists in the population in general, Steele studied samples of Black and White students who had been statistically equated on the basis of SAT scores. Thus, rather than eliminating the large score gap, the research actually showed something very different. Absent stereotype threat, the Black-White difference was just what one would expect (i.e., zero), as the two groups had been equated on the basis of SAT scores. However, in the presence of stereotype threat, the Black-White difference was larger than would be expected, given that the two groups were equated.

There are a variety of additional issues that cloud interpretation and application of Steele's (1997) findings. One critical issue is whether the SAT scores used to equate the Black and White students are themselves influenced by stereotype threat, thus confounding interpretation of study findings. A second issue involves questions as to the populations to which these findings generalize (e.g., Whaley, 1998). The work of Steele and coworkers focused on high-ability college students; Steele (1999) noted that the effect is not replicable in the broader population. A third issue is the conflict between a stereotype threat effect and the large literature cited earlier indicating a lack of predictive bias in test use. If stereotype threat results in observed scores for minority group members that are systematically lower than true scores, one would expect underprediction of minority group performance, an expectation not supported in the predictive bias literature. An additional pragmatic issue is the question of how one might reduce stereotype threat in high-stakes testing settings when the purpose of testing is clear.

These issues aside, Steele's (1997, 1999) research is important in that it clearly demonstrates that the instructional set under which examinees approach a test can affect test results. However, research has yet to demonstrate whether and to what degree this effect generalizes beyond the laboratory. Thus, we caution against overinterpreting the findings to date, as they do not warrant the conclusion that subgroup differences can be explained in whole or in large part by stereotype threat.

The research on test-taker motivation and instructional sets has been conducted primarily in laboratory settings. The effects observed on subgroup differences are not large. Future research should attempt to replicate these findings in a field context so we may better understand the extent to which group differences can be reduced using this alternative. Given the relatively small effects obtained in controlled environments, it seems doubtful that motivational and social effects will account for much of the subgroup differences observed. Nonetheless, it may make sense for test users to institute mechanisms for enhancing motivation, such as the use of more realistic test stimuli clearly applicable to school or job requirements, for the purpose of motivating all examinees.

Use of Portfolios, Accomplishment Records, and Performance Assessments

Researchers have experimented with methods that directly measure an individual's ability to perform aspects of the job or educational domain of interest as a fifth alternative to using paper-and-pencil measures of knowledge, skill, ability, and achievement. Portfolios, accomplishment records, and performance assessments have each been investigated as potential alternatives to traditional tests. Performance assessments (sometimes referred to in the employment domain as job or work samples) require an examinee to complete a set of tasks that sample the performance domain of interest. The intent is to obtain and then evaluate a realistic behavior sample in an environment that closely simulates the work or educational setting in question. Performance assessments may be comprehensive and broad-based, designed to obtain a wide-ranging behavior sample reflecting many aspects of the performance domain in question, or narrow, with the intent of sampling a single aspect of the domain in question. Accomplishment records and portfolios differ from performance assessments in that they require examinees to recount past endeavors or produce work products illustrative of an examinee's ability to perform across a variety of contexts. Often examinees provide examples demonstrative of their progress toward skill mastery and knowledge acquisition.

Performance assessments, as a potential solution for resolving subgroup differences, were examined in the employment domain by Schmidt, Greenthal, Hunter, Berner, and Seaton (1977), who reported that performance assessments corresponded to substantially smaller Black-White subgroup differences when compared with a written trades test (d = 0.81 vs. 1.44). N. Schmitt et al. (1996) updated this estimate to d = 0.38 on the basis of a meta-analytic review of the literature, although they combined tests of job knowledge and job samples in their review. The use of performance assessments in the context of reducing subgroup differences has been extended to multiple high-stakes situations in the credentialing, educational, and employment arenas. We outline three such efforts here.

Legal skills assessment center. Klein and Bolus (1982; described in Sackett, 1998) examined an assessment center developed as a potential alternative to the traditional bar examination. Each day of the two-day center involved a separate trial, with a candidate representing the plaintiff on one day and the defendant on the second. The center consisted of 11 exercises, such as conducting a client interview, delivering an opening argument, conducting a cross-examination, and preparing a settlement plan. Exercises were scored by trained attorneys. Sackett reported a Black-White d = 0.76 and an internal consistency reliability estimate of .67, resulting in a d corrected for unreliability of 0.93.
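
The correction for unreliability reported here is consistent with the standard attenuation formula, in which the observed difference is divided by the square root of the reliability:

\[
d_{\text{corrected}} = \frac{d_{\text{observed}}}{\sqrt{r_{xx}}} = \frac{0.76}{\sqrt{.67}} \approx 0.93.
\]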

Accomplished teacher assessment. Jaeger (1996a, 1996b) examined a complex performance assessment process developed to identify and certify highly accomplished teachers under the auspices of the National Board for Professional Teaching Standards. Different assessment packages are developed for different teaching specialty areas. Jaeger examined assessments for Early Childhood Generalists and Middle Childhood Generalists. The assessment process required candidates to complete an assessment center and prepare in advance a portfolio that included videotaped samples of their performance. From a frequency count of scores, we computed Black-White ds of 1.06 and 0.97 for preoperational field trials, and 0.88 and 1.24 for operational use of the Early and Middle Childhood assessments, respectively.
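
As a reminder, the d statistic reported throughout is the standardized mean difference between subgroups; computed from a frequency count of scores, it requires only the group means and a pooled standard deviation (the conventional pooled form is assumed here):

\[
d = \frac{\bar{X}_{W} - \bar{X}_{B}}{SD_{\text{pooled}}}, \qquad
SD_{\text{pooled}} = \sqrt{\frac{(n_{W}-1)SD_{W}^{2} + (n_{B}-1)SD_{B}^{2}}{n_{W}+n_{B}-2}}.
\]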

Management assessment center. Goldstein, Yusko, Braverman, Smith, and Chung (1998) examined an assessment center designed for management development purposes. Candidate performance on multiple dimensions was evaluated by trained assessors who observed performance on seven exercises, including an in-basket, leaderless group discussions, and a one-on-one role play. Scores on each exercise corresponded to Black-White ds ranging from 0.03 to 0.40 for the individual exercises, and a d of 0.40 for a composite across all exercises.

One clear message emerging from these studies is that it is not the case that an assessment involving a complex, realistic performance sample can always be expected to result in a smaller subgroup difference than differences commonly observed with traditional tests. The legal skills assessment center and the accomplished teacher assessment produced subgroup differences comparable in magnitude with those commonly reported for traditional knowledge, ability, and achievement tests. The management assessment center did result in smaller subgroup differences, a finding commonly reported for this type of performance assessment (Thornton & Byham, 1982). We offer several observations as to what might account for the differences in findings across the three performance assessments. First, in the legal skills and accomplished teacher assessments, the assessment exercises required performances that build on an existing declarative knowledge base developed in part through formal instruction. Effectiveness in delivering an opening argument, for example, builds on a foundation of legal knowledge and reasoning. In contrast, typical practice in management assessment centers is to design exercises that do not build on a formal knowledge base, but rather are designed to tap characteristics such as communication skills, organizing and planning skills, initiative, effectiveness under stress, and personal adjustment. Not only do these not build on a formal knowledge base, but they also in many cases reflect dimensions outside the cognitive domain. As a reflection of this difference, we note that the legal skills assessment center correlates .72 with the MBE, whereas the seven exercises in the management assessment center produce correlations corrected for unreliability ranging from .00 to .31 with a written cognitive ability measure.
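
The correction for unreliability in these exercise-test correlations is, we assume, the familiar disattenuation formula, in which the observed correlation is divided by the square root of the product of the reliabilities of the two measures:

\[
r_{\text{corrected}} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}.
\]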

Thus, although each of these performance assessments reflected highly realistic samples or simulations of the behavior domain in question, the reductions in subgroup differences were, we posit, a function of the degree to which the assessment broadened the set of constructs assessed to include characteristics relevant to the job or educational setting in question, but having little or no relationship with the focal constructs tapped by the traditional test.

Researchers and practitioners in the educational arena have been particularly vocal about the need to evaluate student ability and teacher ability using more than standardized, multiple-choice tests (e.g., D'Costa, 1993; Dwyer & Ramsey, 1995; Harmon, 1991; Lee, 1999; Neill, 1995). Whether termed authentic assessments or constructed-response tests, these performance assessments have many advocates in and out of the educational community. Until recently, very few attempts had been made to determine whether their use decreases the difference between the measured progress of minority and majority students. Given the high degree of fidelity to actual performance, many consider this alternative a worthwhile approach.

Under particular focus in the education literature is the use of traditional, multiple-choice tests to measure writing skill. It is commonly argued that a constructed-response format that requires examinees to write an essay would serve as a more appropriate measure of writing ability. To that end, a number of studies have compared subgroup differences on multiple-choice tests of writing ability with differences observed on essay tests. Welch, Doolittle, and McLarty (1989) reported that Black-White differences on the traditional ACT writing skills test and an ACT essay test were the same, with d = 1.41 in both cases. Bond (1995) reported that differences between Blacks and Whites on the extended-essay portion of the National Assessment of Educational Progress (NAEP) were actually greater after correcting for unreliability than those found on the multiple-choice reading portion. In contrast, White and Thomas (1981) reported a decrease in subgroup differences when writing ability was assessed using an essay test versus a multiple-choice test. Using the descriptive statistics reported, we established that the Black-White d decreased from 1.39 for the traditional test to 0.81 for the essay test, with similar decreases observed for Hispanics and Asians. White and Thomas did not present reliability information, making it impossible to correct these differences for unreliability. However, other studies permitting this correction have reported means and standard deviations that, when translated into ds corrected for unreliability, suggest similar subgroup differences on essay tests for Blacks and Hispanics (Applebee, Langer, Jenkins, Mullis, & Foertsch, 1990; Koenig & Mitchell, 1988).

Klein et al. (1997) compared subgroup mean differences between racial and ethnic groups on science performance assessments and on the Iowa Tests of Basic Skills, a traditional multiple-choice test. The examinees were fifth, sixth, and ninth graders participating in a statewide performance assessment effort in California. A high level of score reliability was achieved with the use of multiple raters. The differences between subgroups were almost identical across the two types of tests. Mean scores for White students were about one standard deviation higher than those of Hispanic and Black students. Furthermore, changing test type or question type had no effect on the score differences between the groups.

Reviews conducted in the employment domain suggest that performance assessments are among the most valid predictors of performance (Asher & Sciarrino, 1974; Hunter & Hunter, 1984; Robertson & Kandola, 1982; Schmidt & Hunter, 1998; N. Schmitt, Gooding, Noe, & Kirsch, 1984; Smith, 1991). In addition, examinees, particularly minority individuals, have reported more favorable impressions of performance assessments than of more traditional cognitive ability or achievement tests (Schmidt et al., 1977). Given the motivational implications associated with positive applicant reactions, the use of performance assessments wherein test content and format replicate the performance domain as closely as possible may be advantageous regardless of the extent to which subgroup differences are reduced. However, performance assessments tend to be costly to develop, administer, and score reliably. Work stations can be expensive to design, and assessment center exercises can be expensive to deliver. Stecher and Klein (1997) indicated that it is often difficult and expensive to achieve reliable scoring of performance assessments in a large-scale testing context. Furthermore, obtaining a performance assessment that is both reliable and generalizable requires that examinees complete a number of tasks, a requirement that can triple the amount of testing time necessary compared with traditional tests (Dunbar, Koretz, & Hoover, 1991; Linn, 1993).
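
The reliability side of this requirement follows from the Spearman-Brown relationship: if a single task is scored with reliability r11, a composite of k parallel tasks has reliability

\[
r_{kk} = \frac{k\, r_{11}}{1 + (k-1)\, r_{11}},
\]

so moving a modestly reliable single exercise into an acceptable range can easily require several tasks, and hence several times the testing time.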

The accomplishment record (Hough, Keyes, & Dunnette, 1983) was developed in part to surmount some of the development and administration cost issues characteristic of performance assessments. Accomplishment records ask examinees to describe major past accomplishments that are illustrative of competence on multiple performance dimensions. These accomplishments are then scored using behaviorally defined scales. Accomplishment records can be used to assess competence in a variety of work and nonwork contexts. We do note that a very similar approach was developed by Schmidt et al. (1979) under the label of the "behavioral consistency method"; a meta-analysis by McDaniel, Schmidt, and Hunter (1988) reported useful levels of predictive validity across 15 studies using this approach.

Hough et al. (1983) used accomplishment records to evaluate attorneys and validated these instruments against performance ratings. The accomplishment records were scored with a high degree of interrater reliability (.75 to .85 across the different performance dimensions and the total score). These scores were then correlated with attorney experience (average r = .24) in order to partial out experience from the relationship between accomplishment record scores and performance ratings. These partialed validity coefficients ranged from .17 to .25 across the dimensions. Validities for a small group of minority attorneys were larger than those for the majority group. Hough (1984), describing the same data, reported Black-White subgroup differences of d = 0.33 for the accomplishment records. The performance ratings exhibited almost exactly the same difference (i.e., d = 0.35). It is interesting that the accomplishment records correlated near zero with the LSAT, scores on the bar exam, and grades in law school. These more traditional measures of ability would most likely have exhibited greater d when compared with the accomplishment records, although Hough did not present the relevant subgroup means and standard deviations. Because the accomplishment records likely measured constructs in addition to ability (e.g., motivation and personality), it is perhaps not surprising that d was lower than that found for more traditional cognitively oriented tests.
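
For clarity, the partialing described above is presumably the standard first-order partial correlation. With A denoting accomplishment record scores, P performance ratings, and E experience,

\[
r_{AP \cdot E} = \frac{r_{AP} - r_{AE}\, r_{PE}}{\sqrt{(1 - r_{AE}^{2})(1 - r_{PE}^{2})}},
\]

which removes the portion of the accomplishment record-performance relationship that is attributable to shared variance with experience.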

Similar to accomplishment records, portfolios represent examinees' past achievements through a collection of work samples indicative of one's progress and ability. Although the technique is applicable to adult assessment, much of the research involving portfolios has taken place in schools. LeMahieu, Gitomer, and Eresh (1995) reported on a project in the Pittsburgh schools in which portfolios were used to assess students' writing ability in Grades 6-12. Their experience indicated that portfolios could be rated with substantial interrater agreement. They also reported that Black examinees' scores were significantly lower than those of White examinees, but did not report subgroup means, precluding the estimation of an effect size. Supovitz and Brennan (1997) reported on an analysis of writing portfolios assembled by first and second graders in the Rochester, New York, schools. Scores on two standardized tests were compared with scores based on their portfolios. Interrater reliability of the scoring of the language arts portfolios was .73 and .78 for first and second graders, respectively, whereas the reliability of the two standardized tests was .92 and .91. Differences between Black and White students were about twice as large on the standardized tests as they were on the writing samples. On both tests, the differences between subgroups were smaller (0.25 to 0.50 in standard deviation units, depending on the test type) than is usually reported.

Although accomplishment records and portfolios likely have lower development costs when compared with performance assessments, the cost of scoring, especially if multiple raters are used, may be high. Another issue present with accomplishment records specifically is the reliance on self-report, although an attempt to secure verification of the role of the examinee in each accomplishment may diminish the tendency to overreport or enhance one's role. There may also be differing levels of opportunity to engage in activities appropriate for portfolios or accomplishment records. To the extent that examinees feel that they do not have the resources available to assemble these extensive documents, they may find the experience demotivating and frustrating. This concern is important inasmuch as there is some evidence (Ryan, Ployhart, Greguras, & Schmit, 1998; Schmit & Ryan, 1997) that a greater proportion of minority than majority individuals withdraw during the various hurdles in a selection system.

Summary. Use of more realistic or authentic assessments does not eliminate or even diminish subgroup differences in many of the educational studies. Also, all of the studies report that the reliable scoring of these tests is difficult and expensive to achieve in any large-scale testing application. Problems with the standardization of the material placed in portfolios and the directions and opportunities afforded students are also cited in studies of student reactions to the use of these tests (Dutt-Doner & Gilman, 1998) as well as by professionals. Accomplishment records and job samples used in employment contexts show smaller subgroup differences in some studies than do cognitively loaded tests. The attribution of these smaller subgroup differences to test type is probably unwarranted, however, as scores on most job samples and accomplishment records most likely reflect a mix of constructs that go beyond those measured by traditional knowledge, skill, ability, and achievement tests.

Use of Coaching or Orientation Programs

Another strategy for reducing subgroup differences is the use of coaching or orientation programs. The purpose of these programs is to inform examinees about test content, provide study materials, and recommend test-taking strategies, with the ultimate goal of enabling optimal examinee performance. The term coaching is at times used to refer to both orientation programs that focus on general test-taking strategies and programs featuring intensive drill on sample test items. We use the term orientation programs to refer to short-duration programs, dealing with broad test-taking strategies, that introduce examinees to the types of items they will encounter. We use the term coaching to refer to more extensive programs, commonly involving practice and feedback, in addition to the material included in orientation programs. A review by Sackett, Burris, and Ryan (1989) indicated that coaching programs involving drill and practice do show evidence of modest score gains above those expected due simply to retesting. Although there is little literature on the differential effectiveness of coaching and orientation programs by subgroup, a plausible hypothesis is that subgroups differ in their familiarity with test content and test-taking skills. This difference in familiarity may contribute to observed subgroup differences in test scores. Conceivably, coaching or orientation programs would reduce error variance in test scores due to test anxiety, unfamiliar test formats, and poor test-taking skills (Frierson, 1986; Ryan et al., 1998), which would in turn reduce the extent of subgroup differences. However, there is evidence suggesting the presence of a larger coaching effect for individuals with higher precoaching test scores, a finding that argues against the likelihood that coaching will narrow the gap between a lower scoring subgroup and a higher scoring subgroup. With that caveat, we discuss below the coaching and orientation literature investigating the influence of this strategy on subgroup differences.

Ryan et al. (1998) studied an optional orientation program that familiarized firefighter job applicants with test format and types of test questions. The findings indicated that Blacks, women, and more anxious examinees were more likely to attend the orientation sessions, but attending the orientation program was unrelated to test performance or motivation. Ryer, Schmidt, and Schmitt (1999) studied a mandatory orientation program for entry-level jobs in a manufacturing organization at two locations, with each location having a control group and a test orientation group. The results showed a small positive impact of orientation on the test scores of minority examinees, approximately 0.15 in standard deviation units, and the applicants did indicate that they viewed organizations that provide these programs favorably. However, the orientation program had greater benefits for nonminority members than minority members at one of the two locations. Schmit (1994) studied a voluntary orientation program for police officers that consisted of a review of test items and content, recommendations for test-taking strategies, practice on sample test items, and suggestions on material to study. Attendance at the program was unrelated to race, and whereas everyone who attended the program scored higher on the examination than did nonattenders, Black gains were twice as large as those of Whites. No standard deviation was provided for the test performance variable, so d could not be estimated.

The educational literature includes a relatively large number of efforts to evaluate coaching initiatives. At least three reviews of this literature have been conducted. Messick and Jungeblut (1981) reported that the average difference between coached examinees and noncoached examinees taking the SAT was about 0.15 standard deviation units. The length of the coaching program and the amount of score gain realized were positively correlated. Messick and Jungeblut estimated that a gain of close to 0.25 standard deviation units could be achieved with a program whose duration would approach regular schooling. DerSimonian and Laird (1983) reported an average effect size of 0.10 standard deviation units for coaching programs directed at the SAT, an aptitude test. In an analysis of coaching programs directed at achievement tests, Bangert-Drowns, Kulik, and Kulik (1983) reported gains of about 0.25 standard deviation units as a function of coaching. Thus, the effects of coaching on performance on traditional paper-and-pencil tests of aptitude and achievement appear to be small, but replicable.

Frierson (1986) outlined results from a series of four studies investigating the effects of test-taking interventions designed to enhance minority examinee test performance on various standardized medical examinations (e.g., Medical College Admissions Test [MCAT], Nursing State Board Examination). The programs taught examinees test-taking strategies and facilitated the formation of learning-support groups. Those minorities who experienced the interventions showed increased test scores. However, the samples used in these studies included very few White examinees, making it difficult to discern whether coaching produced a differential effect on test scores in favor of minorities. Powers (1987) reexamined data from a study on the effects of test preparation involving practice, feedback on results, and test-taking strategies using the initial version of the GRE analytical ability test (Powers & Swinton, 1982, 1984). The findings indicated that when supplied with the same test preparation materials, no particular subgroup appeared to gain more or less than any other subgroup, although all of the examinees showed statistically significant score gains. Koenig and Leger (1997) evaluated the impact of test preparation activities undertaken by individuals who had previously taken the MCAT but had not passed. Black examinee scores improved less across two administrations of the MCAT than did those of White examinees.

Overall, the majority of studies on coaching and orientation programs indicate that these programs have little positive impact on the size of subgroup differences. These programs do benefit minority and nonminority examinees slightly, but they do not appear to reduce subgroup differences. Note that these programs would affect passing rates in settings such as licensure, in which performance relative to a standard is at issue rather than performance relative to other examinees. On the positive side, these programs are well received by examinees, who report favorable impressions of those institutions that offer these types of programs. Additional questions of interest focus on examinee reactions to these programs, the effectiveness of various types of programs, and ways to make these programs readily available to all examinees.

Use of More Generous Time Limits

A final option available for addressing group differences in test scores is the strategy of increasing the amount of time allotted to complete a test. Unless speed of work is part of the construct in question, it can be argued that time limits may bias test scores. Tests that limit administration time may be biased against minority groups, in that certain groups may be provided too little time to complete the test. It has been reported that attitudes toward speededness are culture-bound, with observed differences by race and ethnicity (O'Connor, 1989). This suggests that providing examinees with more time to complete a test may facilitate minority test performance.

Research to date, however, does not support the notion that relaxed time limits reduce subgroup differences. Evans and Reilly (1973) increased the time that examinees were allotted to complete the Admission Test for Graduate Study in Business from 69 seconds per item to 80 seconds per item and 96 seconds per item. The result was a corresponding increase in subgroup differences from a Black-White d of 0.83 to Black-White ds of 1.12 and 1.38, respectively. Wild, Durso, and Rubin (1982) investigated whether reducing the speededness of the GRE verbal and quantitative tests by extending the time allowed from 20 minutes to 30 minutes reduced subgroup differences. Results using both operational and experimental versions of the tests indicated that whereas increasing the time allotted benefited all examinees, it did not produce differential score gains favoring minorities and often exacerbated the extent of subgroup differences already present. Applebee et al. (1990) reported that doubling the time that 4th, 8th, and 12th graders were given when completing NAEP essay tests led to inconsistent changes in the subgroup differences observed for Blacks and Hispanics. Through the use of odds ratios computed on the basis of the percentage of examinees who received an adequate or better passing score, we ascertained that the Black-White d increased on six of the nine essay tests and the Hispanic-White d increased on five of the nine essay tests when the tests were administered with extended time. Thus, the time extension increased the differences between Black and White essay scores and between Hispanic and White essay scores in the majority of test administrations.
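
One common way to place such percentage-based comparisons on the d metric is the logit conversion of an odds ratio, offered here only as an approximate guide to the scale rather than as the exact transformation applied to the NAEP data:

\[
d \approx \frac{\sqrt{3}}{\pi}\,\ln(\mathrm{OR}) \approx \frac{\ln(\mathrm{OR})}{1.81},
\]

where OR is the odds of an adequate or better score in one group relative to the other.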

Although relaxed time limits will likely result in higher test scores for all examinees, there does not appear to be a differential benefit favoring minority subgroups. In fact, it is more common that extending the time provided to examinees when completing a test increases subgroup differences, sometimes substantially. Time limits add error to test scores by confounding differences in pace of work with differences in quality of work; removing those limits thus tends to enhance the reliability of the test, and more reliable measurement of constructs on which subgroups differ will, if anything, yield larger observed differences. Thus, time extensions are unlikely to serve as an effective strategy for reducing subgroup differences.

Discussion and Conclusions

Our goal in this article was to consider a variety of strategies that have been proposed as potential methods for reducing the subgroup differences regularly observed on traditional tests of knowledge, skill, ability, and achievement. Such tests are commonly used in the contexts of selection for employment, educational admissions, and licensure and certification. We posit that there is extensive evidence supporting the validity of well-developed traditional tests for their intended purposes, and that institutions relying on traditional tests value the positive outcomes resulting from test use. Thus, one set of critical constraints is that any proposed alternative must include the constructs underlying traditional tests and any proposed alternative cannot result in an appreciable decrement in validity. Put another way, we focus here on the situation in which the institution is not willing to sacrifice validity in the interest of reducing subgroup differences, as it is in that situation that there is tension between pursuing a validity-maximization strategy and a diversity-maximization strategy. A second set of critical constraints is that, in the face of growing legal obstacles to preference-based forms of affirmative action, any proposed alternative cannot include any form of preferential treatment by subgroup. Within these constraints, we considered two general types of strategies: modifications to procedural aspects of testing and the creation of alternative testing instruments.

In evaluating the various strategies that have been promulgated as potential solutions to the performance versus diversity dilemma, our review echoes the sentiments of the 1982 National Academy of Sciences panel that concluded, "the Committee has seen no evidence of alternatives to testing that are equally informative, equally adequate technically, and also economically and politically viable" (p. 144). Although our observations are offered almost 20 years later, the story remains much the same. Alternatives to traditional tests tend to produce equivalent subgroup differences in test scores when the alternative test measures cognitively loaded constructs. If such differences are not observed, the reduction can often be traced to an alternative that exhibits low levels of reliability or introduces noncognitive constructs. In fact, certainly the most definitive conclusion one can reach from this review is that adverse impact is unlikely to be eliminated as long as one assesses domain-relevant constructs that are cognitively loaded. This conclusion is no surprise to anyone who has read the literature in this area over the past three or more decades. Subgroup differences on cognitively loaded tests of knowledge, skill, ability, and achievement simply document persistent inequities. Complicating matters further, attempts to overcome issues associated with reliable measurement often result in a testing procedure that is cost-prohibitive when conducted on a large scale. In spite of these statements, there are a number of actions that can be taken by employers, academic admissions officers, or other decision makers who are faced with the conflict between diversity goals and a demand that only those who are most able should be given desirable educational and employment opportunities. Although elimination of subgroup differences via the methods reviewed in this article is not feasible, reduction in subgroup differences, if it can be achieved without loss of validity, would be of considerable value.

First, in constructing test batteries, the full range of performance goals and organizational interests should be considered. In the employment arena, researchers have tended to focus on measures of maximum performance (i.e., ability), rather than on measures of typical performance (perhaps most related to motivation factors), when considering what knowledge, skills, and abilities to measure. These maximum performance constructs were easy to measure using highly reliable and valid instruments. With the recent literature espousing the value of personality, improvements in interviews, and better methods for documenting job-related experience, valid methods for measuring less cognitively oriented constructs are becoming available. When these constructs are included in test batteries, there is often less adverse impact. We must also emphasize the importance of clearly identifying the performance construct one is hoping to predict. The weighting of different aspects of performance and organizational goals should determine the nature of the constructs measured in a high-stakes testing situation. It is important to measure what is relevant, not what is convenient, easy, or cheap.
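
The arithmetic behind this recommendation is instructive. Under the usual simplifying assumptions, a unit-weighted composite of k standardized predictors, each with subgroup difference d_i and an average predictor intercorrelation of r̄, has a composite difference of

\[
d_{\text{composite}} = \frac{\sum_{i=1}^{k} d_i}{\sqrt{k + k(k-1)\,\bar{r}}}
\]

(cf. Sackett & Ellingson, 1997). Adding a valid predictor with a near-zero d lowers the composite difference, but as long as cognitively loaded components remain in the battery, the composite d remains well above zero.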

Second, research on the identification and removal of items that may be unfairly biased against one group or another does not indicate that any practically significant reductions in d can be achieved in this fashion. Studies of DIF are characterized by small effects, with items not consistently favoring one group versus another. The effects of removing biased items on overall test characteristics are usually minimal. It does seem apparent that one should write items as simply as possible, consistent with the construct one is hoping to measure, and that content that is obviously cultural should be removed.

Research on the mode of presenting test stimuli suggests that video-based procedures, which broaden the range of constructs assessed, or reducing the verbal component (or reading level) of tests may have a positive effect on subgroup differences, although d is often still large enough to produce adverse impact, particularly when the selection ratio is low. Results are not consistent across studies, and clearly more research would be helpful. Such studies are particularly difficult to conduct. Separation of the mode of testing and the construct tested is a challenge; conflicting results across studies may be due to an inability to differentiate between constructs and methods. With improvements in technology, alternatives to traditional paper-and-pencil tests are clearly feasible and worthy of exploration. It is also important to note that verbal ability may be a skill related to important outcomes and hence considered a desirable component of test performance. In these cases, it would be best to include a measure that specifically assesses verbal ability so that one may remove its influence when measuring other job-related constructs. The entire test battery can then be constructed to reflect an appropriate weighting and combination of relevant attributes given the relative importance of the various performance outcomes.

Whenever possible, it seems desirable to measure experiences that reflect necessary knowledge, skills, and abilities required in the target situation. Accomplishment records produced subgroup differences that were smaller than the d usually obtained with cognitively loaded tests, and the reduction was practically important as well. This is very likely because additional constructs are targeted and assessed in the accomplishment record. Results for portfolio and performance assessments in the educational arena have been mixed. Some studies indicate lower levels of d, whereas other studies indicate no difference or even greater differences on portfolio or performance assessments when compared with the typical multiple-choice measure of achievement. Differences in the level of d across studies may be due partly to the degree to which test scores are a function of ability and motivation. If scores are partly a function of motivation, we would expect d to be smaller. Again, if relatively complex and realistic performance assessments involve cognitive skills as opposed to interpersonal skills, the level of d will likely be the same as for a traditional cognitively loaded measure. In addition, problems in attaining reliable scores at reasonable expense call into question the feasibility of this strategy.

It seems reasonable to recommend that some form of test preparation or orientation course be provided to examinees. The effects of coaching appear to be minimally positive across all groups, even though coaching does not seem to reduce d. Reactions to test preparation and coaching efforts among job applicants have been universally positive. Insofar as some candidates do not have access to informal networks that provide information on the nature of exams, these programs could serve to place all examinees on a more level playing field. At the very least, it would seem that such positive reactions would lead to fewer complaints about the test and probably less litigation, although we have little research documenting the relationship between reactions and organizational outcomes.

Finally, we recommend that test constructors pay attention to face validity. When tests look appropriate for the performance situation in which examinees will be expected to perform, examinees tend to react positively. Such positive reactions seem to produce a small reduction in the size of d. Equally important, perhaps, may be the perception that one is fairly treated. This is the same rationale underlying our recommendation that test preparation programs be used.

In sum, subgroup differences can be expected on cognitively loaded tests of knowledge, skill, ability, and achievement. We can, however, take some useful actions to reduce such differences and to create the perception that one's attributes are being fairly and appropriately assessed. We note that in this article we have focused on describing subgroup differences resulting from different measurement approaches. We cannot in the space available here address crucial questions of interventions to remedy subgroup differences in the life opportunities that affect the development of the knowledge, skill, ability, and achievement domains that are the focus of this article. The research discussed in this article, suggesting that subgroup differences are not simply artifacts of paper-and-pencil testing technologies, highlights the need to consider those larger questions.

REFERENCES

Adarand Constructors, Inc. v. Pena, 115 S. Ct. 2097, 2113 (1995).
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Applebee, A. N., Langer, J. A., Jenkins, L. B., Mullis, I. V. S., & Foertsch, M. A. (1990). Learning to write in our nation's schools: Instruction and achievement in 1988 at grades 4, 8, and 12 (NAEP Rep. No. 19-W-02). Princeton, NJ: Educational Testing Service.
Asher, J. J., & Sciarrino, J. A. (1974). Realistic work sample tests: A review. Personnel Psychology, 27, 519-533.
Bangert-Drowns, R. L., Kulik, J. A., & Kulik, C.-L. C. (1983). Effects of coaching programs on achievement test scores. Review of Educational Research, 53, 571-585.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44, 1-26.
Berk, R. A. (1982). Handbook of methods for detecting test bias. Baltimore: Johns Hopkins University Press.
Bobko, P., Roth, P. L., & Potosky, D. (1999). Derivation and implications of a meta-analytic matrix incorporating cognitive ability, alternative predictors, and job performance. Personnel Psychology, 52, 561-590.
Bond, L. (1995). Unintended consequences of performance assessment: Issues of bias and fairness. Educational Measurement: Issues and Practice, 14, 21-24.
Chan, D., & Schmitt, N. (1997). Video-based versus paper-and-pencil method of assessment in situational judgment tests: Subgroup differences in test performance and face validity perceptions. Journal of Applied Psychology, 82, 143-159.
Chan, D., Schmitt, N., DeShon, R. P., Clause, C. C., & Delbridge, K. (1997). Reactions to cognitive ability tests: The relationships between race, test performance, face validity perceptions, and test-taking motivation. Journal of Applied Psychology, 82, 300-310.
Chan, D., Schmitt, N., Sacco, J. M., & DeShon, R. P. (1998). Understanding pretest and posttest reactions to cognitive ability and personality measures. Journal of Applied Psychology, 83, 471-485.
City of Richmond v. J. A. Croson Co., 488 U.S. 469 (1989).
Cole, N. S. (1981). Bias in testing. American Psychologist, 36, 1067-1077.
D'Costa, A. G. (1993). The impact of courts on teacher competence testing. Theory into Practice: Assessing Tomorrow's Teachers, 32, 104-112.
De Corte, W. (1999). Weighing job performance predictors to both maximize the quality of the selected workforce and control the level of adverse impact. Journal of Applied Psychology, 84, 695-702.
DerSimonian, R., & Laird, N. (1983). Evaluating the effect of coaching on SAT scores: A meta-analysis. Harvard Educational Review, 18, 694-734.
DeShon, R. P., Smith, M., Chan, D., & Schmitt, N. (1998). Can adverse impact on cognitive ability and personality tests be reduced by presenting problems in a social context? Journal of Applied Psychology, 83, 438-451.
Dunbar, S. B., Koretz, D. M., & Hoover, H. D. (1991). Quality control in the development and use of performance assessments. Applied Measurement in Education, 4, 289-303.
Dutt-Doner, K., & Gilman, D. A. (1998). Students react to portfolio assessment. Contemporary Education, 69, 159-165.
Dwyer, C. A., & Ramsey, P. A. (1995). Equity issues in teacher assessment. In M. T. Nettles & A. L. Nettles (Eds.), Equity and excellence in educational testing and assessment (pp. 327-342). Boston: Kluwer Academic.
Evans, F. R., & Reilly, R. R. (1973). A study of test speededness as a potential source of bias in the quantitative score of the admission test for graduate study in business. Research in Higher Education, 1, 173-183.
Ford, J. K., Kraiger, K., & Schechtman, S. L. (1986). Study of race effects in objective indices and subjective evaluations of performance: A meta-analysis of performance criteria. Psychological Bulletin, 99, 330-337.
Freedle, R., & Kostin, I. (1990). Item difficulty of four verbal item types and an index of differential item functioning for Black and White examinees. Journal of Educational Measurement, 27, 329-343.
Freedle, R., & Kostin, I. (1997). Predicting Black and White differential item functioning in verbal analogy performance. Intelligence, 24, 417-444.
Frierson, H. T. (1986). Enhancing minority college students' performance on educational tests. Journal of Negro Education, 55, 38-45.
Goldstein, H. W., Yusko, K. P., Braverman, E. P., Smith, D. B., & Chung, B. (1998). The role of cognitive ability in the subgroup differences and incremental validity of assessment center exercises. Personnel Psychology, 51, 357-374.
Harmon, M. (1991). Fairness in testing: Are science education assessments biased? In G. Kulm & S. M. Malcom (Eds.), Science assessment in the service of reform (pp. 31-54). Washington, DC: American Association for the Advancement of Science.
Hartigan, J. A., & Wigdor, A. K. (1989). Fairness in employment testing. Washington, DC: National Academy Press.
Hattrup, K., Rock, J., & Scalia, C. (1997). The effects of varying conceptualizations of job performance on adverse impact, minority hiring, and predicted performance. Journal of Applied Psychology, 82, 656-664.
Helms, J. E. (1992). Why is there no study of cultural equivalence in standardized cognitive ability testing? American Psychologist, 47, 1083-1101.
Hopwood v. State of Texas, 78 F.3d 932, 948 (5th Cir. 1996).
Hough, L. M. (1984). Development and evaluation of the "accomplishment record" method of selecting and promoting professionals. Journal of Applied Psychology, 69, 135-146.
Hough, L. M., Keyes, M. A., & Dunnette, M. D. (1983). An evaluation of three "alternative" selection procedures. Personnel Psychology, 36, 261-276.
Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72-88.
Ironson, G. H., & Subkoviak, M. J. (1979). A comparison of several methods of assessing item bias. Journal of Educational Measurement, 16, 209-225.
Jaeger, R. M. (1996a). Conclusions on the technical measurement quality of the 1995-1996 operational version of the National Board for Professional Teaching Standards' Early Childhood Generalist Assessment. Center for Educational Research and Evaluation, University of North Carolina at Greensboro.
Jaeger, R. M. (1996b). Conclusions on the technical measurement quality of the 1995-1996 operational version of the National Board for Professional Teaching Standards' Middle Childhood Generalist Assessment. Center for Educational Research and Evaluation, University of North Carolina at Greensboro.
Jensen, A. R. (1980). Bias in mental testing. New York: Free Press.
Kier, F. J., & Davenport, D. S. (1997). Ramifications of Hopwood v. Texas on the process of applicant selection in APA-accredited professional psychology programs. Professional Psychology: Research and Practice, 28, 486-491.
Klein, S. P. (1983). An analysis of the relationship between trial practice skills and bar examination results. Unpublished manuscript.
Klein, S. P., & Bolus, R. E. (1982). An analysis of the relationship between clinical legal skills and bar examination results. Unpublished manuscript.
Klein, S. P., Jovanovic, J., Stecher, B. M., McCaffrey, D., Shavelson, R. J., Haertel, E., Solano-Flores, G., & Comfort, K. (1997). Gender and racial/ethnic differences on performance assessments in science. Educational Evaluation and Policy Analysis, 19, 83-97.
Koenig, J. A., & Leger, K. F. (1997). A comparison of retest performances and test-preparation methods for MCAT examinees grouped by gender and race-ethnicity. Academic Medicine, 72, S100-S102.
Koenig, J. A., & Mitchell, K. J. (1988). An interim report on the MCAT essay pilot project. Journal of Medical Education, 63, 21-29.
Lee, O. (1999). Equity implications based on the conceptions of science achievement in major reform documents. Review of Educational Research, 69, 83-115.
Linn, R. L. (1993). Educational assessment: Expanded expectations and challenges. Educational Evaluation and Policy Analysis, 15, 1-16.
LeMahieu, P. G., Gitomer, D. H., & Eresh, J. T. (1995). Portfolios in large-scale assessment: Difficult but not impossible. Educational Measurement: Issues and Practice, 14, 11-16, 25-28.
Lynn, R. (1996). Racial and ethnic differences in intelligence in the U.S. on the Differential Ability Scale. Personality and Individual Differences, 20, 271-273.
McCauley, C. D., & Mendoza, J. (1985). A simulation study of item bias using a two-parameter item response model. Applied Psychological Measurement, 9, 389-400.
McDaniel, M. A., Schmidt, F. L., & Hunter, J. E. (1988). A meta-analysis of the validity of methods for rating training and experience in personnel selection. Personnel Psychology, 41, 283-309.
Medley, D. M., & Quirk, T. J. (1974). The application of a factorial design to the study of cultural bias in general culture items on the National Teacher Examination. Journal of Educational Measurement, 11, 235-245.
Mehrens, W. A. (1989). Using test scores for decision making. In B. R. Gifford (Ed.), Test policy and test performance: Education, language, and culture (pp. 93-99). Boston: Kluwer Academic.
Mehrens, W. A. (1999). The CBEST saga: Implications for licensure and employment testing. The Bar Examiner, 68, 23-32.
Messick, S. M., & Jungeblut, A. (1981). Time and method in coaching for the SAT. Psychological Bulletin, 89, 191-216.
Miller-Jones, D. (1989). Culture and testing. American Psychologist, 44, 360-366.
Millman, J., Mehrens, W. A., & Sackett, P. R. (1993). An evaluation of the New York State Bar Examination. Unpublished manuscript.
Mishkin, P. J. (1996). Foreword: The making of a turning point—Metro and Adarand. California Law Review, 84, 875-886.
Motowidlo, S. J., Dunnette, M. D., & Carter, G. W. (1990). An alternative selection procedure: The low-fidelity simulation. Journal of Applied Psychology, 75, 640-647.
Mount, M. K., & Barrick, M. R. (1995). The Big Five personality dimensions: Implications for research and practice in human resources management. In G. Ferris (Ed.), Research in personnel and human resources management (Vol. 13, pp. 153-200). Greenwich, CT: JAI Press.
National Academy of Sciences. (1982). Ability testing: Uses, consequences, and controversies (Vol. 1). Washington, DC: National Academy Press.
Neill, M. (1995). Some prerequisites for the establishment of equitable, inclusive multicultural assessment systems. In M. T. Nettles & A. L. Nettles (Eds.), Equity and excellence in educational testing and assessment (pp. 115-157). Boston: Kluwer Academic.
Neisser, U., Boodoo, G., Bouchard, T. J., Jr., Boykin, A. W., Brody, N., Ceci, S. J., Halpern, D. F., Loehlin, J. C., Perloff, R., Sternberg, R. J., & Urbina, S. (1996). Intelligence: Knowns and unknowns. American Psychologist, 51, 77-101.
O'Connor, M. C. (1989). Aspects of differential performance by minorities on standardized tests: Linguistic and sociocultural factors. In B. R. Gifford (Ed.), Test policy and test performance: Education, language, and culture (pp. 129-181). Boston: Kluwer Academic.
O'Neil, H. F., & Brown, R. S. (1997). Differential effects of question formats in math assessment on metacognition and affect (Tech. Rep. No. 449). Los Angeles: University of California, National Center for Research on Evaluation, Standards, and Student Testing.
Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (1993). Comprehensive meta-analysis of integrity test validities: Findings and implications for personnel selection and theories of job performance. Journal of Applied Psychology, 78, 656-664.
Pear, R. (1996, November 6). The 1996 elections: The nation—the states. The New York Times, p. B7.
Powers, D. E. (1987). Who benefits most from preparing for a "coachable" admissions test? Journal of Educational Measurement, 24, 247-262.
Powers, D. E., & Swinton, S. S. (1982). The effects of self-study of test familiarization materials for the analytical section of the GRE Aptitude Test (GRE Board Research Report GREB No. 79-9). Princeton, NJ: Educational Testing Service.
Powers, D. E., & Swinton, S. S. (1984). Effects of self-study for coachable test item types. Journal of Educational Psychology, 76, 266-278.
Pulakos, E. D., & Schmitt, N. (1996). An evaluation of two strategies for reducing adverse impact and their effects on criterion-related validity. Human Performance, 9, 241-258.
Robertson, I. T., & Kandola, R. S. (1982). Work sample tests: Validity, adverse impact, and applicant reaction. Journal of Occupational Psychology, 55, 171-183.
Ryan, A. M., Ployhart, R. E., & Friedel, L. A. (1998). Using personality testing to reduce adverse impact: A cautionary note. Journal of Applied Psychology, 83, 298-307.
Ryan, A. M., Ployhart, R. E., Greguras, G. J., & Schmit, M. J. (1998). Test preparation programs in selection contexts: Self-selection and program effectiveness. Personnel Psychology, 51, 599-622.
Ryer, J. A., Schmidt, D. B., & Schmitt, N. (1999, April). Candidate orientation programs: Effects on test scores and adverse impact. Paper presented at the annual conference of the Society for Industrial and Organizational Psychology, Atlanta, GA.
Sacco, J. M., Scheu, C. R., Ryan, A. M., Schmitt, N., Schmidt, D. B., & Rogg, K. L. (2000). Reading level and verbal test scores as predictors of subgroup differences and validities of situational judgment tests. Unpublished manuscript.
Sackett, P. R. (1998). Performance assessment in education and professional certification: Lessons for personnel selection. In M. D. Hakel (Ed.), Beyond multiple-choice: Evaluating alternatives to traditional testing for selection (pp. 113-129). Mahwah, NJ: Erlbaum.
Sackett, P. R., Burris, L. R., & Ryan, A. M. (1989). Coaching and practice effects in personnel selection. In C. L. Cooper & I. Robertson (Eds.), International review of industrial and organizational psychology 1989. London: Wiley.
Sackett, P. R., & Ellingson, J. E. (1997). The effects of forming multi-predictor composites on group differences and adverse impact. Personnel Psychology, 50, 707-722.
Sackett, P. R., & Wilk, S. L. (1994). Within-group norming and other forms of score adjustment in preemployment testing. American Psychologist, 49, 929-954.
Salgado, J. F. (1997). The five factor model of personality and job performance in the European community. Journal of Applied Psychology, 82, 30-43.
Scarr, S. (1981). Race, social class, and individual differences in I.Q. Hillsdale, NJ: Erlbaum.
Scheuneman, J. (1987). An experimental, exploratory study of causes of bias in test items. Journal of Educational Measurement, 24, 97-118.
Scheuneman, J., & Gerritz, K. (1990). Using differential item functioning procedures to explore sources of item difficulty and group performance characteristics. Journal of Educational Measurement, 27, 109-131.
Scheuneman, J., & Grima, A. (1997). Characteristics of quantitative word items associated with differential performance for female and Black examinees. Applied Measurement in Education, 10, 299-319.
Schmeiser, C. B., & Ferguson, R. L. (1978). Performance of Black and White students on test materials containing content based on Black and White cultures. Journal of Educational Measurement, 15, 193-200.
Schmidt, F. L. (1988). The problem of group differences in ability test scores in employment selection. Journal of Vocational Behavior, 33, 272-292.
Schmidt, F. L., Greenthal, A. L., Hunter, J. E., Berner, J. G., & Seaton, F. W. (1977). Job sample vs. paper-and-pencil trades and technical tests: Adverse impact and examinee attitudes. Personnel Psychology, 30, 187-196.
Schmidt, F. L., & Hunter, J. E. (1981). Employment testing: Old theories and new research findings. American Psychologist, 36, 1128-1137.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262-274.
Schmidt, F. L., Kaplan, J. R., Bemis, S. E., Decuir, R., Dunn, L., & Antone, L. (1979). The behavioral consistency method of unassembled examining (TM-79-21). Washington, DC: U.S. Office of Personnel Management, Personnel Research and Development Center.
Schmidt, F. L., Mack, M. J., & Hunter, J. E. (1984). Selection utility in the occupation of U.S. park ranger for three modes of test use. Journal of Applied Psychology, 69, 490-497.
Schmit, M. J. (1994). Pre-employment processes and outcomes, applicant belief systems, and minority-majority group differences. Unpublished doctoral dissertation, Bowling Green State University.
Schmit, M. J., & Ryan, A. M. (1997). Applicant withdrawal: The role of test-taking attitudes and racial differences. Personnel Psychology, 50, 855-876.
Schmitt, A. P., & Dorans, N. J. (1990). Differential item functioning for minority examinees on the SAT. Journal of Educational Measurement, 27, 67-81.
Schmitt, N., Clause, C. S., & Pulakos, E. D. (1996). Subgroup differences associated with different measures of some job-relevant constructs. In C. L. Cooper & I. T. Robertson (Eds.), International review of industrial and organizational psychology (Vol. 11, pp. 115-140). New York: Wiley.
Schmitt, N., Gooding, R. Z., Noe, R. A., & Kirsch, M. P. (1984). Meta-analyses of validity studies published between 1964 and 1982 and the investigation of study characteristics. Personnel Psychology, 37, 407-422.
Schmitt, N., & Mills, A. E. (in press). Traditional tests and simulations: Minority and majority performance and test validities. Journal of Applied Psychology.
Schmitt, N., Rogers, W., Chan, D., Sheppard, L., & Jennings, D. (1997). Adverse impact and predictive efficiency of various predictor combinations. Journal of Applied Psychology, 82, 719-730.
Smith, F. D. (1991). Work samples as measures of performance. In A. K. Wigdor & B. G. Green Jr. (Eds.), Performance assessment for the workplace (pp. 27-52). Washington, DC: National Academy Press.
Stecher, B. M., & Klein, S. P. (1997). The cost of science performance assessments in large-scale testing programs. Educational Evaluation and Policy Analysis, 19, 1-14.
Steele, C. M. (1997). A threat in the air: How stereotypes shape intellectual identity and performance. American Psychologist, 52, 613-629.
Steele, C. M. (1999, August). Thin ice: "Stereotype threat" and Black college students. Atlantic Monthly, 284(3), 44-54.
Steele, C. M., & Aronson, J. (1995). Stereotype threat and the intellectual test performance of African Americans. Journal of Personality and Social Psychology, 69, 797-811.
Stokes, G. S., Mumford, M. D., & Owens, W. A. (1994). Biodata handbook. Palo Alto, CA: Consulting Psychologists Press.
Supovitz, J. A., & Brennan, R. T. (1997). Mirror, mirror on the wall, which is the fairest test of all? An examination of the equitability of portfolio assessment relative to standardized tests. Harvard Educational Review, 67, 472-506.
Thornton, G. C., & Byham, W. C. (1982). Assessment centers and managerial performance. New York: Academic Press.
Verhovek, S. H., & Ayres, B. D., Jr. (1998, November 4). The 1998 elections: The nation—referendums. The New York Times, p. B2.
Weekley, J. A., & Jones, C. (1997). Video-based situational testing. Personnel Psychology, 50, 25-49.
Welch, C., Doolittle, A., & McLarty, J. (1989). Differential performance on a direct measure of writing skills for Black and White college freshmen (ACT Research Reports, Tech. Rep. No. 89-8). Iowa City, IA: ACT.
Whaley, A. L. (1998). Issues of validity in empirical tests of stereotype threat theory. American Psychologist, 53, 679-680.
White, E. M., & Thomas, L. L. (1981). Racial minorities and writing skills assessment in the California State University and colleges. College English, 43, 276-283.
Whitney, D. J., & Schmitt, N. (1997). Relationship between culture and responses to biodata employment items. Journal of Applied Psychology, 82, 113-129.
Wightman, L. F. (1997). The threat to diversity in legal education: An empirical analysis of the consequences of abandoning race as a factor in law school admission decisions. New York University Law Review, 72, 1-53.
Wild, C. L., Durso, R., & Rubin, D. B. (1982). Effects of increased test-taking time on test scores by ethnic group, years out of school, and sex. Journal of Educational Measurement, 19, 19-28.
Wilson, K. M. (1981). Analyzing the long-term performance of minority and nonminority students: A tale of two studies. Research in Higher Education, 15, 351-375.
Wolfe, R. N., & Johnson, S. D. (1995). Personality as a predictor of college performance. Educational and Psychological Measurement, 55, 177-185.
