
Preserving the Integrity of Online Testing

EUGENE BURKE
SHL Group Ltd.

Industrial and Organizational Psychology, 2 (2009), 35–38. Copyright © 2009 Society for Industrial and Organizational Psychology. 1754-9426/09

Correspondence concerning this article should be addressed to Eugene Burke, Director of Science and Innovation, SHL Group Ltd., The Pavilion, 1 Atwell Place, Thames Ditton, Surrey KT7 0NE, United Kingdom. E-mail: eugene.burke@shlgroup.com

A key issue that has emerged is the security of Internet testing (Tippins et al., 2006). This can be readily understood in the context of statistics such as those offered by the background checking company Automatic Data Processing Inc. (2008), which reports that 45% of U.S. job applicants falsify work histories. To that can be added popular books such as Freakonomics (Levitt & Dubner, 2005), which asks in its second chapter: What do schoolteachers and sumo wrestlers have in common? The reader finds the answer is cheating. So, in an era of rapid change, new technologies, uncertainty, and concerns about trust, one can understand that the security of Internet testing is a legitimate concern.

But, as Tippins (2008) stated at a recent Society for Industrial and Organizational Psychology symposium on online testing, "The Internet testing train has left the station!" This reinforces the message that one of the key challenges to our science and practice is how we can effectively and practically defend the security and fairness of Internet testing. I will offer a few thoughts based on my work with colleagues on developing Internet testing solutions for clients in Asia, North America, and Europe (including the UK), and on experience gained from over 1.6 million ability tests delivered online in 19 languages over the past 7 years (Burke, 2008).

Managing Security Threats

First, let me make the assumption that the technology used to support Internet testing is itself secure in protecting the data that it manages and captures. That is, the information technology architecture and infrastructure are secure and robust against attempts to access test content, scoring keys, and test score data. The focus on security then moves to the content of the tests themselves as revealed through their deployment online and, most importantly, to the scoring keys used to turn responses to the test items into scores that will be used in decision making. The key risk is whether that content can be fraudulently accessed, which inevitably leads us to the issues of cheating and the actions of pirates in facilitating cheating on a test. It also leads us to focus on ability tests, or any test for which there is only one correct answer to each item, or where an instrument contains discrete answer patterns that can be associated with a higher score and where those patterns can be discovered and either memorized, recognized, or learned to produce a falsely inflated score for a candidate.

Procedures developed to defend against fraudulent access have to operate at two levels. The first is within the live assessment itself and includes clear rules for the test taker about what is expected and what is appropriate (and what is not), designing the test to preserve the security of the scoring key, and incorporating in the overall testing or assessment process checks on the validity of scores and the authenticity of candidates, all of which act as a strong incentive to subscribe to honest test taking. Let me offer two quotes to support this first level of test security, the first of which emphasizes the need to set clear ground rules and is taken from Cizek (1999), to wit (and to paraphrase): cheating is "depriv[ing] of something valuable by the use of deceit or fraud. . . . In testing specifically, cheating is violating the rules." Honesty contracts are familiar instruments to our field as used with assessments such as biographical data questionnaires and, within a framework such as that provided by the Guidelines on the Rights and Responsibilities of Test Takers (American Psychological Association, 1996), do offer a clear basis for communicating the "rules" as Cizek refers to them.

However, as recent research in social psychology indicates (e.g., Vargas, von Hippel, & Petty, 2004; von Hippel, Lakin, & Shakarchi, 2005), people will often rationalize what would generally be construed as inappropriate behaviors, and therefore, additional incentives to follow the rules will strengthen security. Stronger measures are encouraged by the second quote I offer in relation to security measures internal to the assessment, taken from Tippins et al. (2006): "Any Internet test that administers the same set of items to all examinees is asking to be compromised. At the very least, the items should be administered in a randomized order. It would be better yet to sample items from a reasonably large item pool." The Internet tests that we deploy in our systems use a randomized testing model through which equivalent but different tests are constructed from item response theory (IRT) calibrated item pools and delivered client side (via the test taker's PC) through encrypted software that is returned server side with item responses at the end of the testing session. The software used client side does not contain any scoring key information, which is held securely server side. Each test has a unique scoring key.
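To make the randomized testing model concrete, here is a minimal sketch of how equivalent-but-different forms might be assembled from a calibrated pool while the scoring key stays server side. The names (Item, build_randomized_form), the difficulty bands, and the band-sampling rule are illustrative assumptions only; the encryption layer described above is omitted, and nothing here should be read as SHL's actual implementation.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    item_id: str
    stem: str
    options: tuple      # answer options shown to the test taker
    key: str            # correct option; never sent client side
    difficulty: float   # IRT b-parameter from pool calibration

def build_randomized_form(pool, n_per_band, bands, rng):
    """Assemble an equivalent-but-different form by sampling each
    difficulty band of the calibrated pool, so forms hit similar
    psychometric targets while rarely sharing items."""
    form = []
    for lo, hi in bands:
        band_items = [i for i in pool if lo <= i.difficulty < hi]
        form.extend(rng.sample(band_items, n_per_band))
    rng.shuffle(form)
    return form

def split_for_delivery(form):
    """Separate what is sent to the client from what stays server side:
    the client payload carries no key material."""
    client_items = [{"item_id": i.item_id, "stem": i.stem,
                     "options": list(i.options)} for i in form]
    server_key = {i.item_id: i.key for i in form}
    return client_items, server_key

# Example: 5 items from each of three difficulty bands.
# form = build_randomized_form(pool, 5, [(-3, -1), (-1, 1), (1, 3)],
#                              random.Random())
```

Sampling within difficulty bands is one simple way to keep forms comparable; an operational system would more likely constrain forms on full IRT information targets rather than the b-parameter alone.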

To this requirement for the design of Internet tests, we have added verification of scores in the Internet testing models that colleagues and I have developed (Burke, van Someren, & Tatham, 2006; Burke & Wright, 2008). These verification tests are short tests administered in a proctored setting, at a later stage when authentication of the test taker can be undertaken, and they provide data used to check the consistency of scores from the first stage of testing. I should emphasize that this is not a second or additional score: the score of record in our systems is the first score, as this has been used to effect decisions on the full pool of test takers. The data from the verification test operate much like a faking-good check would on personality instruments. These verification tests serve to provide yet another defense of the validity of the first Internet test and can be deployed at any point in the process, such as the last or penultimate stage in which the candidate pool has been reduced to the final short list. This requirement was established through client interviews conducted in the Asia Pacific region, North America, Europe, and the UK. These clients asked that the score of record be the first test, that this test be of sound psychometric quality so that accurate and valid decisions could be carried forward through the remaining stages of any process, and that the verification tests be long enough to provide an accurate check on the first score but short enough to be practical to include in later assessments such as assessment centers and/or interviews.
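The consistency check itself can be sketched as a simple significance test. Assuming, for illustration only, that both tests return IRT ability estimates with standard errors (the paper does not specify the statistic SHL uses), a verification score would be flagged when it falls implausibly far below the unproctored score of record:

```python
import math

def verify_consistent(theta_first, se_first, theta_verify, se_verify,
                      z_critical=2.33):
    """One-tailed check: flag when the proctored verification estimate
    is significantly LOWER than the unproctored estimate, the pattern
    expected if the first score was inflated.

    Returns (flagged, z). z_critical = 2.33 corresponds to roughly a
    1% false-flag rate under honest test taking."""
    diff = theta_first - theta_verify
    se_diff = math.sqrt(se_first**2 + se_verify**2)
    z = diff / se_diff
    return z > z_critical, z

# Example with invented numbers: an unproctored theta of 1.4 (SE 0.30)
# against a verification theta of 0.2 (SE 0.45) gives z of about 2.2,
# just below the flagging threshold.
flagged, z = verify_consistent(1.4, 0.30, 0.2, 0.45)
```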

Thus far, I have mentioned security features operating within the testing process itself. The second level of security involves frequent external checks of the process to determine whether procedures to manage test security are actually working. In the case of the systems that colleagues and I have developed, we deploy what are referred to as Web patrols to search the Internet for potential breaches of security, such as pirate sites, brain dump sites, or forums through which access to content is being inappropriately offered. The reason that such searches work is quite simple: Pirates and colluders can only achieve their goals if they are able to transact with others, and the medium that best facilitates such transactions is the Internet. So, ironically, the medium for which security measures are deemed important is also a key medium that enables those measures to be monitored for effectiveness.
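As a crude illustration of one piece a Web patrol might automate, the following sketch flags pages whose text contains near-verbatim copies of item stems. The function name and the 0.8 overlap threshold are invented for this example; real patrols also involve search, crawling, and human review, none of which is shown here.

```python
from difflib import SequenceMatcher

def flag_leaked_stems(page_text, item_stems, min_overlap=0.8):
    """Flag item stems that appear (near-)verbatim in a scraped page.

    A stem is flagged when the longest common substring between the
    stem and the page covers at least min_overlap of the stem."""
    page = page_text.lower()
    flagged = []
    for stem in item_stems:
        s = stem.lower()
        m = SequenceMatcher(None, s, page, autojunk=False)
        match = m.find_longest_match(0, len(s), 0, len(page))
        if match.size >= min_overlap * len(s):
            flagged.append(stem)
    return flagged
```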

In our systems, Web patrols are supported by data forensic audits, for which space in this article prohibits a detailed explanation; the reader is directed to http://caveon.com/df_blog/, which is maintained by Dennis Maynes, Chief Scientist at Caveon, and to Burke (2006), where readers will find a reproduction of an article by Dennis that provides an executive summary of data forensics. Data forensic algorithms look for evidence of aberrant scores, that is, scores that, from knowledge of the psychometric properties of the test and how the items and test scores should behave under honest test-taking conditions, are highly unlikely. One example would be a pattern of fast response times coupled with high correct-answer rates, which may suggest someone has had access to the scoring key. Another example would be long response latencies coupled with few or no right answers, which may suggest that the test taker is not a real candidate but someone harvesting the test content for possible piracy or collusion later. A third example would be algorithms that look for matches in correct and incorrect answer strings across tests that would suggest collusion among test takers.
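These three screens are easy to express in code. The sketch below uses invented, uncalibrated thresholds purely to illustrate the logic; operational data forensics would derive its flagging rules from the psychometric model of the test rather than fixed cutoffs.

```python
def screen_session(responses, fast_ms=4_000, slow_ms=60_000,
                   high_acc=0.95, low_acc=0.20):
    """Crude screens mirroring the patterns described above.

    `responses` is a list of (latency_ms, is_correct) pairs for one
    testing session. All thresholds are illustrative placeholders."""
    n = len(responses)
    mean_latency = sum(t for t, _ in responses) / n
    accuracy = sum(1 for _, ok in responses if ok) / n

    flags = []
    if mean_latency < fast_ms and accuracy > high_acc:
        flags.append("possible key access: very fast, very accurate")
    if mean_latency > slow_ms and accuracy < low_acc:
        flags.append("possible harvesting: very slow, few correct")
    return flags

def answer_match_rate(answers_a, answers_b):
    """Share of identical option choices (right or wrong) by two test
    takers on common items; unusually high values suggest collusion."""
    pairs = list(zip(answers_a, answers_b))
    return sum(a == b for a, b in pairs) / len(pairs)
```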

Taken together, these measures operate to maintain the integrity of Internet tests. They enforce test security by actively managing breaches and monitoring test-taker behaviors: identifying potential test fraud through Web patrols, applying critical incident procedures for dealing with security breaches, running data forensics audits to detect levels of item exposure, and designing cheat resistance into the testing procedures through equivalent randomized tests deployed via the Internet and proctored verification tests at later stages of the process.

Are These Measures Necessary and Do They Work?

In the past 3 years, as we have evolved the security measures supporting our Internet testing systems, we have conducted Web patrols in English and Chinese (and more recently have begun to conduct patrols in French and Dutch under a rolling language program). Through these patrols, we have identified 30 sites over the past 18 months that have been qualified at three levels of risk using our critical incident procedures. Of those sites detected, 18 (i.e., 60%) have been classified as high risk, with 4 of those sites operating in England and 14 operating in China. Most of these sites, including ones on eBay, have been removed through simple letters or through invoking the policies of the site operator or the Internet provider, but the 18 detected were offering content that would have constituted a clear threat had the test been a fixed set of items. These Web patrols have also shown that test security is an international issue, with Chinese sites offering content in English. This is not surprising when one considers that many companies headquartered in Europe or North America also operate in China and have imported testing procedures from their home countries. We have conducted regular cycles of data forensics checks on tests and content threatened by such breaches, and the data show that the randomized testing model very clearly defends against collusion. These data forensics checks have also shown that there has been minimal impact from the exposure of item content. This can be seen as a result of the test design principles used, because the items exposed do not have their scoring key available with them at the time of testing, and because items are configured in different permutations in the randomized testing model and are therefore unlikely to appear together in any one test.
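The "unlikely to appear together" point is a straightforward combinatorial one. With invented numbers (the pool size and form length below are assumptions, not SHL's figures): if two forms each draw items at random from the same pool, their expected overlap is small and large overlaps are rare.

```python
from math import comb

def expected_overlap(pool_size, form_length):
    """Each pool item lands on a random form with probability
    form_length/pool_size, so two independently sampled forms share
    form_length**2 / pool_size items in expectation."""
    return form_length ** 2 / pool_size

def prob_overlap_at_least(pool_size, form_length, k):
    """Exact hypergeometric tail: P(two random forms share >= k items)."""
    total = comb(pool_size, form_length)
    return sum(comb(form_length, j)
               * comb(pool_size - form_length, form_length - j)
               for j in range(k, form_length + 1)) / total

# Illustration: two 30-item forms drawn from a 300-item pool share
# 3 items on average, and sharing 10 or more is well under a 1% event.
print(expected_overlap(300, 30))           # 3.0
print(prob_overlap_at_least(300, 30, 10))  # small tail probability
```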

I include here a case that I hope conveys the positive impact of the security measures that I have described on the sense of fair play among test takers. The following is a verbatim extract from a UK site tracked through Web patrols on January 20, 2007 (all times are GMT).

Posted at 12:09: I got 20 real SHL numerical test questions. If someone needed, send me e-mail <e-mail address given>. Only 20 pounds. If you want to pay me a little money, then push you to the assessment stage, just e-mail me. Worth or not, you decide.

Posted at 16:01: There are a number of reasons why everyone should ignore the original poster. Firstly I have taken a number of SHL numerical tests and by and large the questions are of a similar level but different. Yes, there may be some that are repeated but out of 20 questions there is not much. More saliently, lets assume that by some miracle all 20 questions come up. Yes, you will pass but will be retested at the next round . . . so a waste of money and time. I really hate it when cretins like <Web name cited> take advantage of people. If she really wanted to help, she would have offered them for free. You have been warned . . .

Posted at 17:23: I agree, what a stupid thing to advertise. 1. It is not probable that you will get these 20 questions when you take the test (this person then goes on to list another nine reasons why the first posting is unhelpful, becoming increasingly emotional about the posting and the person who posted it, hence the termination of the account of this thread at this point).

In conclusion, I do not claim that the procedures we have developed represent the only solution to the security of Internet testing or that our work is by any means complete. New challenges require new thinking that will continue to evolve. Indeed, these new developments often show how we can strengthen the security and fairness of testing, whether testing is proctored or not, and have frequently shown us how insecure traditional proctored testing, the supposed benchmark for secure testing, is.

References

American Psychological Association. (1996). The rights and responsibilities of test takers: Guidelines and expectations. Washington, DC: Author. Retrieved July 28, 2008, from www.apa.org/science/ttrr.html

Automatic Data Processing Inc. (2008). 2008 ADP screening index. Retrieved July 28, 2008, from www.adp.com/media/press-releases/2008-news-releases/adp-annual-pre-employment-screening-index.aspx

Burke, E. (2006). Better practice for online assessment. Thames Ditton, UK: SHL. Retrieved July 28, 2008, from www.shl.com/SHL/en-int/Thought_Leadership/White_Papers/White-Papers.aspx

Burke, E. (2008, April). Preserving the integrity of online testing. In N. T. Tippins (Chair), Internet testing: Current issues, research solutions, guidelines, and concerns. Symposium conducted at the 23rd Annual Conference of the Society for Industrial and Organizational Psychology, San Francisco, CA.

Burke, E., van Someren, G., & Tatham, N. (2006). Verify range of ability tests: Technical manual. Thames Ditton, UK: SHL. Retrieved July 28, 2008, from www.shl.com/SHL/en-int/Products/Access_Ability/Access_Ability_List/verify.aspx

Burke, E., & Wright, D. (2008, January). Defending the validity of online tests in an era of cheating and piracy. Paper presented at the Annual Conference of the Division of Occupational Psychology of the British Psychological Society, Stratford-upon-Avon, UK.

Cizek, G. J. (1999). Cheating on tests: How to do it, detect it, and prevent it. Mahwah, NJ: Lawrence Erlbaum Associates.

Levitt, S. D., & Dubner, S. J. (2005). Freakonomics. New York: Penguin.

Tippins, N. T. (2008, April). Internet testing: Current issues, research solutions, guidelines, and concerns. Symposium presented at the 23rd Annual Conference of the Society for Industrial and Organizational Psychology, San Francisco, CA.

Tippins, N. T., Beaty, J., Drasgow, F., Gibson, W. M., Pearlman, K., Segall, D. O., et al. (2006). Unproctored Internet testing in employment settings. Personnel Psychology, 59, 189–225.

Vargas, P. T., von Hippel, W., & Petty, R. E. (2004). Using partially structured measures to enhance the attitude-behavior relationship. Personality and Social Psychology Bulletin, 30, 197–211.

von Hippel, W., Lakin, J. L., & Shakarchi, R. J. (2005). Individual differences in motivated social cognition: The case of self-serving information processing. Personality and Social Psychology Bulletin, 31, 1347–1357.
