
A Journal

CLEAR Exam Review

Volume XXIV, Number 2, Fall 2014


CLEAR Exam Review is a journal, published twice a year, reviewing issues affecting testing and credentialing. CER is published by the Council on Licensure, Enforcement, and Regulation, 403 Marquis Ave., Suite 200, Lexington, KY 40502.

Design and composition of this journal have been underwritten by Prometric, which specializes in the design, development, and full-service operation of high-quality licensing, certification and other adult examination programs.

Subscriptions to CER are sent free of charge to all CLEAR members and are available for $30 per year to others. Contact Stephanie Thompson at (859) 269-1802, or at her e-mail address, [email protected], for membership and subscription information.

Advertisements and classified listings (e.g., position vacancies) for CER may be reserved by contacting Janet Horne at the address or phone number noted above. Ads are limited in size to 1/4 or 1/2 page, and cost $100 or $200, respectively, per issue.

Editorial Board
Steven Nettles, Applied Measurement Professionals
Jim Zukowski, 360training

Coeditor
Elizabeth Witt, Ph.D.
Witt Measurement Consulting
Laingsburg, MI
[email protected]

Coeditor
Sandra Greenberg, Ph.D.
Professional Examination Service
New York, NY
[email protected]

CLEAR Exam Review

Contents

From the editors ................................................................ 1

Sandra Greenberg, Ph.D.

Elizabeth Witt, Ph.D.

Columns

Abstracts and Updates .......................................................... 3
George T. Gray, Ed.D.

Technology and Testing ........................................................ 9
Brian D. Bontempo, Ph.D.

Legal Beat ........................................................................... 15
Dale J. Atkinson, Esq.

Perspectives on Testing ...................................................... 19
Chuck Friedman, Ph.D.

Articles

2014 Standards for Educational and Psychological Testing Released ...... 22
Ron Rodgers, Ph.D.

Incorporating Continuing Competency into Certification Maintenance with Attention to Certificant Concerns ...... 23
Fran Byrd

Volume XXIV, Number 2, Fall 2014

Copyright ©2014 Council on Licensure, Enforcement, and Regulation. All rights reserved. ISSN 1076-8025


From the Editors

We are delighted to present the Fall 2014 issue of the CLEAR Exam Review, volume XXIV, number 2. This issue includes four columns, one full-length article, and one brief article that we believe you will find interesting and informative. We welcome Chuck Friedman as the coordinator of the new column, Perspectives on Testing: Responses to Your Questions.

George Gray reviews a multitude of publications in the Abstracts and Updates column, including articles focusing on objective structured clinical examinations; barriers to professional mobility for internationally educated professionals; a role delineation study in orthopaedic nursing; a wide variety of psychometric issues; test format and administration issues; and guides to useful credentialing-related software. There is something for everyone in this extensive review of recently published material.

In Technology and Testing, Brian Bontempo continues his series on data visualization. This article, the second in the series, presents a step-by-step process for designing informative data visualizations, along with Dr. Bontempo’s insights and guidance on successfully executing each step.

Dale Atkinson’s Legal Beat describes a recent case in which a licensure candidate filed suit against a state regulatory board over issues related to disability accommodations. Claims were also filed against the national association of which the state board is a member. The case illustrates the complexity of issues that are addressed by the courts in evaluating such litigation.

A new column, Perspectives on Testing: Responses to Your Questions, makes its debut with this issue. Chuck Friedman coordinates this column, which spotlights questions posed by CLEAR members at the annual conference. Answers are provided by experts in measurement and licensure/certification testing. Readers are encouraged to submit questions for future columns and conferences to [email protected]. Please enter “CER Perspectives on Testing” in the subject line.

Readers will be pleased to learn that the long-awaited updated edition of the Standards for Educational and Psychological Testing, jointly published by the American Psychological Association, the American Educational Research Association, and the National Council on Measurement in Education, is now available. Ron Rodgers presents a brief overview of the updated Standards, with a more detailed review to come in future CLEAR publications.


This issue also includes an article by Fran Byrd of the National Certification Corporation (NCC) describing the NCC’s recent modification of its approach to maintenance of certification, in which online specialty assessments are used to determine the areas in which continuing education is needed for individual certificants. This paper describes the issues the NCC felt were important to address in making this change and illustrates an approach to certification maintenance that departs from the one-size-fits-all continuing education requirements that were common in the past, focusing instead on personalized, ongoing professional development.

Read on, and enjoy . . .


Abstracts and Updates
George T. Gray, Ed.D.
Director of Research and Development, Schroeder Measurement Technologies, Inc.

This issue’s abstracts and updates cover a variety of topics related to examination construction and licensure and certification issues in general. Methodology is highlighted when it may be of particular interest to readers of CLEAR Exam Review. In all, there are three articles on objective structured clinical examinations (OSCEs), an article on barriers to licensure for internationally educated professionals, a report on a role delineation study, eight papers covering diverse measurement topics, and two articles related to examination format and administration. Finally, two white papers are included related to testing and credentialing software programs.

Objective Structured Clinical Examination (OSCE)

Hastie, M.J., Spellman, J.L., Pagano, P.P., Hastie, J., and Egan, B.J. (2014) Designing and implementing the objective structured clinical examination in anesthesiology. Anesthesiology 120(1), 196-203.

This review article includes a thorough discussion of the benefits and challenges of the objective structured clinical examination (OSCE) format. The issues discussed apply to OSCEs regardless of discipline, but the examples are all taken from anesthesiology. The Accreditation Council for Graduate Medical Education has designated developmental milestones that are being implemented this year. Assessment of residents’ progress may include examinations (written or oral), clinical observation, or OSCEs. An OSCE component is also being introduced into its examination, “including use of standardized patients, mannequins, or computer-based assessments” (p. 196). The article includes a brief summary of the use of the OSCE in medical education and anesthesiology. Examples from anesthesiology include a 17-station examination used by the Royal College of Anesthesiologists in the U.K. and a group of simulation-based stations for an OSCE that has been part of the Israeli National Board Examination in Anesthesiology. Examples of the subject matter of OSCE stations in anesthesiology in the U.K. included resuscitation, troubleshooting anesthesiology equipment, data interpretation, and history taking and communication, among others.

The authors take a close look at the program needs that would require an OSCE, its design and feasibility, and its financial practicality. Their review includes measurement considerations such as inter-rater reliability, internal reliability (internal consistency), norm-referenced versus criterion-referenced passing score determination, and perspectives on validity. OSCE stations, concepts, and tasks in anesthesia are listed along with the accompanying blueprint design.

The authors conclude that “when well designed, OSCE is a reliable tool with only ‘modest’ validity and should accordingly be viewed as a valuable, although insufficient, addition to residency assessment. OSCE allows for a flexible yet structured examination characterized by objective evaluation of trainees, preparing them for the board examination, as well as providing programs with means of regular trainee and program assessment. However, programs should not rely solely on OSCE to provide by itself a comprehensive assessment of a trainee’s competence, but rather view it as complementary to existing exam modalities” (p. 202).

Pell, G., Fuller, R., Homer, M., and Roberts, T. (2013) Advancing the objective structured clinical examination: sequential testing in theory and practice. Medical Education 47(6), 569-577.

This study illustrates the use of the OSCE to obtain greater decision accuracy at the University of Leeds, U.K. The OSCE is administered, and some individuals are required to come back to take another test based on their original scores. The authors state that two standard errors of measurement are added to the original passing score to eliminate false-positive results, so candidates who pass on the first round are well above the cut score. Over the course of three years, the number of stations in the OSCEs was manipulated. In the original model, there were 18 stations for the OSCE and 18 stations for failing candidates taking a retest five months later. Over three annual iterations, the number of stations was changed so that both the original testing and the retest were based on only twelve stations each.
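As a rough illustration of the decision rule described above, the first-round pass threshold is the original cut score plus two standard errors of measurement, and candidates below that threshold proceed to the sequential retest. The following is a minimal sketch with hypothetical numbers (the cut score and SEM are invented, not taken from the Leeds programme):

```python
# Illustrative two-stage (sequential) pass decision; all values hypothetical.

def first_round_decision(score: float, cut_score: float, sem: float) -> str:
    """Classify a first-round OSCE result.

    Candidates at or above cut_score + 2 * sem pass outright (false positives
    are unlikely at that level); everyone else is routed to the retest.
    """
    confident_pass_threshold = cut_score + 2 * sem
    return "pass" if score >= confident_pass_threshold else "retest"

cut, sem = 60.0, 3.5  # hypothetical percent-scale cut score and SEM
for candidate_score in (72.0, 65.0, 58.0):
    print(candidate_score, first_round_decision(candidate_score, cut, sem))
```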

The authors conclude that “sequential testing brings a number of benefits. For the institution, avoiding the need to make a significant structural change (e.g., altering assessment timetables to facilitate retests within an academic year) is of value and allows the whole sequence of testing to be undertaken within a single planning activity and …facilitates...cost savings….” (p. 575).

Kaliyadan, F., Khan, A., Kuruvilla, J., and Feroze, K. (2014) Validation of a computer based objective structured clinical examination in the assessment of undergraduate dermatology courses. Indian Journal of Dermatology, Venereology, and Leprology 80(2), 134-136.

This paper reports a study of a computerized version of an objective structured clinical examination (OSCE) administered to medical students in dermatology. The purpose of the project was to capture the essence of assessment for a three-week compulsory rotation in dermatology. The main objectives of the course were description of skin lesions and the diagnosis and treatment of common skin diseases as well as investigation and treatment (p. 134). The authors sought to use a computerized format that was similar to OSCE stations by displaying sixteen images of common cases and asking four questions about each of the images.

Strong correlations between scores on the computerized format and students’ clinical presentation scores, as well as their overall scores, led the authors to conclude that “this is a reliable method for assessment in dermatology.”

Immigration and Professional Practice in Canada

Cheng, L., Spaling, M., and Song, X. (2013) Barriers and facilitators to professional licensure and certification testing in Canada: perspectives of internationally educated professionals. Journal of International Migration and Integration 14(4), 733-750.

The study, based on eighteen interviews, “examines the role that testing plays in professional licensure and certification from the perspectives of newly arrived internationally educated professionals (IEPs) in four professions: teachers, engineers, nurses, and medical doctors” (p. 733). The sample included two engineers, four nurses, five teachers and seven physicians. Interviews were conducted in Kingston and Windsor, Ontario. The authors cite a number of publications supporting the perspective that “although recent immigrants are more likely to hold higher degrees than native-born Canadians after adjusting for age and experience, studies have found low success rates for IEPs in obtaining successful and suitable employment in their professional fields” (p. 734). Tests required may involve both subject matter knowledge tests and tests of English language proficiency. The authors indicate that “in this study, the IEPs expressed very little concern about the certification tests themselves. The participants regarded English language proficiency tests, such as TOEFL, as a standard, valid criterion to indicate one’s communication capabilities” (p. 738). Content-based testing was regarded as similar to testing in candidates’ home countries, but several health professionals noted gaps in their knowledge of psychiatry, the Canadian perspective on health care practice, and the Canadian health care system. In contrast to the generally positive remarks about the testing process, there were a number of complaints concerning the process of becoming certified and the associated financial costs.

The authors acknowledge that their study is “small and uneven in scale and the choice of employing IEPs from four diversified professions…limits the study findings” (p. 747). They recommend that “IEPs should be prepared for certification as early and comprehensively as possible, before they migrate to or as soon as they arrive in Canada….Additionally, financial security can be addressed if IEPs can continue to work in their home country while addressing the requirements for Canadian certification” (p. 747).


Role Delineation Study/Job Analysis

Roberts, D. and Hughes, M. (2013) What do orthopaedic nurses do? Implications of the role delineation study for certification. Orthopaedic Nursing 32(4), 198-206.

The Orthopaedic Nurses Certification Board conducts a role delineation study every five years. This article reports the results of the most recent study. An online survey of task and knowledge statements for three certification programs was constructed based on a review of the previous role delineation study (RDS). Over five thousand email invitations to complete the survey were sent, and 1,194 usable responses were received, a return rate of 22.7%.

Because RDS survey methodology varies, the methodology is an interesting aspect of this report. The survey included 62 task statements categorized into 11 domains and 157 knowledge statements linked to the task statements. Ratings were obtained on a “significance” scale: zero was used for “not necessary for my job,” and points 1-5 were linked to ratings from “minimally significant” to “extremely significant.”

The Orthopaedic Nurse Certification (ONC) program was included in the study as well as the Orthopaedic Nurse Practitioner (ONP-C) and Orthopaedic Clinical Nurse Specialist (OCNS-C) certifications. Demographic data and summary findings are included in the article. Data analysis included reliability of ratings for survey sections and a comparison of 2005 baseline content weightings with derived 2010 recommended percentages for each content area of each of three examinations.

Measurement Topics

He, W., and Reckase, M. (2014) Item pool design for an operational variable-length computerized adaptive test. Educational and Psychological Measurement 74(3), 473-494.

Based on extensive simulations, the authors present step-by-step guidance for developing an item pool for a variable-length computer adaptive test that has “a decision stopping rule, content balancing, and exposure control” (p. 473). The approach taken is the bin-and-union approach. As the authors describe this method, “a set of ‘bins’ are defined on the ability scale used for reporting results of the CAT and for calibrating the items. These bins, with a specified width on the ability scale, are used to tally the number of administered items needed for that range of the scale” (p. 475).

An example uses the Rasch model with two candidates having theta ability of -.0953 and .0264. Bins are set up with a bandwidth of .40 logit. The thresholds for dividing the two bins of items that are most appropriate for administering to these candidates are -.48 to -.08 and from -.08 to .32. As ability is estimated, an item is selected for administration that corresponds to the current ability estimate. In a simulation of abilities, examinee one received 250 items and examinee two received 115 items before the stopping rule determined that the test was complete. The demand on the item pool was 255 items rather than 365 items (250+115) because the candidate abilities were close enough that a number of the same items could be common to both candidates, as they were drawn from the same bins. When 1,500 candidates were considered, the size of the item pool required in the simulation was 1,086 items (p. 476).
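A minimal sketch of the bin-and-union bookkeeping described above: each simulated examinee’s administered items are tallied by ability bin, and the pool requirement is taken here as the bin-wise maximum of those tallies across examinees (one way to realize the “union” the authors describe). The bin width follows the example; the ability trajectories and item counts are invented and far shorter than the 250- and 115-item tests in the article.

```python
import math
from collections import Counter

BIN_WIDTH = 0.40  # logits, as in the example above

def bin_index(theta: float) -> int:
    """Map an interim ability estimate to a bin on the theta scale."""
    return math.floor(theta / BIN_WIDTH)

def tally_items(theta_trajectory):
    """Count how many items one examinee draws from each bin.

    Each entry is the interim ability estimate at the moment the next item
    is selected (one item administered per entry).
    """
    return Counter(bin_index(theta) for theta in theta_trajectory)

def pool_requirement(per_examinee_tallies):
    """Union of tallies: each bin must hold as many items as the most
    demanding single examinee drew from that bin."""
    pool = Counter()
    for tally in per_examinee_tallies:
        for b, n in tally.items():
            pool[b] = max(pool[b], n)
    return pool

# Hypothetical interim-theta trajectories for two examinees.
examinee_1 = [0.00, -0.10, -0.05, -0.12, -0.09, -0.11]
examinee_2 = [0.00, 0.05, 0.02, 0.03, -0.01]

pool = pool_requirement([tally_items(examinee_1), tally_items(examinee_2)])
print("items required per bin:", dict(pool))
print("total pool size:", sum(pool.values()))
```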

The authors describe their example in detail:

• test length of 60-250 items;
• a Bayesian estimation procedure used at the beginning of the test, switching to maximum likelihood once the candidate had both correct and incorrect responses available;
• content balancing; and
• exposure control based on random selection from the group of fifteen items closest to the current ability estimate.

In all, seven item pools were designed, representing three bin widths with and without use of the exposure control procedure. Average test length, percentage of accurate classifications, bias, mean squared error, and correlation were similar across pools. The major differences among the pools were the item overlap rate across candidates, the percentage of overexposed items (0-51.9% of items with an exposure rate above .20 across the seven pools), and the percentage of underexposed items (11.9-47% of items with an exposure rate below .02). The largest item pools (over 2,000 items) were required for the three conditions that had no items with a high exposure rate (p. 486).

Kolen, M.J., Wang, T., and Lee, W. (2012) Conditional standard errors of measurement for composite scores using IRT. International Journal of Testing 12(1), 1-20.

This research addresses measurement error when scores on different tests are combined. A number of examples are provided by the authors. Conceptually, these include combinations of test scores from different content areas on achievement tests, tests that contain multiple item types, and tests that are designed to survey different groups of students.


The ACT Assessment total score for college admissions is one example of the first type of test, which provides a combined score. Another is an achievement test used at the junior high school level in Taiwan that has tests and scores in five different areas. Some high school Advanced Placement tests for college credit are mixed-format tests that contain different item types, such as multiple choice and constructed response items. Finally, the National Assessment of Educational Progress is an example of the last type of composite, which compares demographic and state results.

The authors note that the Standards (1) recommend that the conditional standard error of measurement be reported to test users for both raw and scaled scores. Using a multi-dimensional item response theory (IRT) model, the authors present a procedure for estimating conditional standard errors of measurement and reliability for composite scores.

Zhang, X., and Roberts, W.L. (2013) Investigation of standardized patient ratings of humanistic competence on a medical licensure examination using Many-Facet Rasch Measurement and generalizability theory. Advances in Health Science Education 18(5), 929-944.

This is a study reporting on the Global Patient Assessment (GPA) instrument used to measure humanistic doctor-patient interactions in the clinical skills component of the osteopathic medical licensure examination. “This study examined the pattern in which SPs (standardized patients) utilize the GPA rating through MFRM (many facet Rasch measurement); in particular, the magnitude of potential rater effects corresponding to measurement error was evaluated. Three types of rater effects were examined: rater stringency/leniency (SPs consistently rate lower or higher than expected); halo effect (SPs attribute the same rating to all aspects of the GPA tool); and restriction of rating range (the extent to which obtained ratings discriminated among different examinees with their respective performance levels…” (p. 931). Data included 50,090 GPA scores (4,564 examinees, 12 standardized patient stations).

The authors conclude that “although SP raters varied in leniency/stringency of rating, SPs differentiated the six GPA aspects in difficulty and utilized a reasonable range of the 9-point scale. Reliability indices resulted in sufficient examinee separation, 0.94, from the Rasch model and sufficient dependability from the generalizability analysis for raw scores, 0.83, and transformed Rasch scores, 0.97. Results indicate that medical students’ humanistic competence can be reliably measured through means of observation and quality control with valuable information about the psychometric quality of ratings of humanistic competence” (p. 929).

Ferrando, P.J. (2014) A general approach for assessing person fit and person reliability in typical-response measurement. Applied Psychological Measurement 38(2), 166-183.

The author begins by explaining that the best known IRT models consider the item as the sole source of measurement error (p. 166). An alternative approach is described that “focuses on the individual as the main source of error” (p. 167). The former approach is a constant theta model, and the latter is a variable theta model. The purpose of the study is to investigate person fit and person reliability and to “propose indices and procedures for assessing person fit when a variable theta model is used” (p. 167). “In the variable theta approach, person reliability is no longer viewed as a source of individual misfit but rather as a relevant person characteristic that is modeled as a parameter and that partly explains the behavior of the individual when answering the test” (ibid.). For binary, graded, and continuous response models, three approaches are discussed: graphical procedures, global person-fit indices, and residual indices at the item level (p. 166).

Popham, W.J. (2014) Criterion referenced measurement: half a century wasted? Educational Leadership 71(6), 62-68.

The focus of this article is on the use of criterion-referenced measurement. The distinction between norm-referenced or person-referenced measurement and criterion-referenced measurement was made many years ago. Popham addresses “areas of confusion” that have arisen in this arena. He states that there are not criterion-referenced tests vs. norm-referenced tests. These terms refer to inferences about the test taker’s score (p. 64). Also, a criterion is a domain of behavior or knowledge rather than a level of performance. He does acknowledge that some of the literature has adopted the position that a criterion represents a level of performance, but he indicates that this emphasis is misplaced. The context of the discussion is on evaluation of instructional outcomes in schools. The implications for the use of the term “criterion-referenced measurement” in certification and licensure testing are not clear. Popham’s historical review is quite explicit on the focus concerning a domain of knowledge. It appears that in the passion for passing score studies (Angoff and other methods), the certification and licensure field has adopted the term criterion-referenced measurement and used it in a way that is different from what the original authors in the field intended.


Clauser, J.C., Margolis, M.J., and Clauser, B.E. (2014) An examination of the replicability of Angoff standard setting results within a generalizability theory framework. Journal of Educational Measurement 51(2), 127-140.

The authors compared the results of Angoff standard-setting studies when multiple panels of judges were used. Data were obtained from the three steps of the United States Medical Licensing Examination (USMLE). When the USMLE conducts passing score studies, there are three replications using three independent panels. The study included six standard-setting exercises. The authors concluded that “the results show that although in some cases the panel effect is negligible, for four of the six data sets, the panel facet represented a large portion of the overall variance” (p. 127).

Resources and timing for certification and licensure programs frequently do not allow for replication of passing score studies but the assumption is made that the results would be similar if the passing score study were to be replicated. This is not necessarily the case, as is clearly illustrated by this research. The authors quote Standard 4.19 in the Standards (1): “Where applicable, variability over judges should be reported. Whenever feasible, an estimate should be provided of the amount of variation in cut scores that might be expected if the standard setting procedures were replicated” (p. 60).

The authors conclude that “ignoring the often hidden panel/occasion facet can result in artificially optimistic estimates of the cut score stability. Results based on a single panel should not be viewed as a reasonable estimate of the results that would be found over multiple panels. Instead, the variability seen in a single panel can best be viewed as a lower bound of the expected variability when the exercise is replicated” (p. 127).

Margolis, M.J. and Clauser, B.E. (2014) The impact of examinee performance information on judges’ cut scores in modified Angoff standard-setting exercises. Educational Measurement: Issues and Practice 33(1), 15-22.

The context of this study was similar to the one described immediately above, but a different variable was investigated: sharing performance data on items with the raters, usually on the second round of item ratings. This is in fact the procedure that typically makes the Angoff study “modified,” as Angoff’s original description of the method consisted of three sentences and did not address the conceptual approach at this level of detail. There has been debate over many years about the impact of providing performance data to raters in Angoff studies. On the one hand, the obvious benefit is a “reality check” on the ratings. If a rater thinks that a high percentage of minimally competent candidates should answer an item correctly but the item was actually difficult for the total group of candidates, this information is important for ensuring realistic ratings. On the negative side of the issue, there is some risk that raters will neglect the task of judging minimum competency for examination performance and will target their ratings to some reference point associated with the difficulty levels of the test items.

In the authors’ study, data from six standard setting panels on each of the three steps of the USMLE were utilized. The authors indicate that after item performance information was provided, “results varied by panel but in general indicated that both the variability among the panelists and the resulting cut scores were affected by the data. In addition, for all panels and examinations pre- and post-data cut scores were significantly different. Investigation of the practical significance of the findings indicated that nontrivial fail rate changes were associated with the cut score changes for a majority of standard-setting exercises” (p. 15).

“Results of the present research provided no support for a general decrease in cut scores after judges reviewed performance data. Although there was no significant general trend across the multiple data sets, there was clear evidence that increases in cut scores are not uncommon after judges review performance data. Specifically, the results indicate that both panelist variability and resulting cut scores were affected by the data. In general, panelist variability decreased after judges reviewed performance data. In addition, post-data cut scores were significantly different from those in the pre-data condition; these differences were observed for each of the three examinations for both years. It should be noted that, although the changes that result from providing judges with performance data were statistically significant and non-trivial in terms of impact, the differences in cut scores before and after provision of performance data were not particularly large when compared to other sources of variability in estimating cut scores” (p. 20). The authors refer to the variability in ratings across panels as reported in the Journal of Educational Measurement article above as one major source of variation.

Sinharay, S. (2014) Analysis of added value of subscores with respect to classification. Journal of Educational Measurement 51(2), 212-222.

A number of articles about the reliability of subscores, primarily by Sinharay and colleagues, have appeared in this column in the past. This paper is somewhat different in that it focuses not on the internal consistency of the subscores but on a situation where the pass/fail criterion includes the requirement of passing a subtest in addition to obtaining a passing score on the overall assessment. The author suggests a method “to assess whether classification based on a subscore is in better agreement than classification based on the total score, with classification based on the corresponding subscore on a parallel test” (p. 212).

Test Format and Administration

Sheaffer, E.A. and Addo, R.T. (2013) Pharmacy student performance on constructed-response versus selected-response calculations questions. American Journal of Pharmaceutical Education 71(1), 1-7.

This study compared performance on fifteen pairs of calculation questions that were offered via computer-based testing in either a multiple-choice or a constructed-response format. Respondents were students in a PharmD program. Results of the study were inconclusive: the class scored higher on the constructed-response format for four items and higher on the multiple-choice format for eleven questions. In response to a survey, students indicated a preference for the multiple-choice format but indicated that the constructed-response format “better prepared them for a career in health care.”

Dosch, M.P. (2012) Practice in computer-based testing improves scores on the National Certification Examination for Nurse Anesthetists. AANA Journal 80(4), S60-S66.

This retrospective study in a single university compared the performance on the National Certification Examination for Nurse Anesthetists for a group of students who had extensive experience taking computer-based tests with a group having limited experience. The major difference in the groups was a program transition to computerized tests in courses. The two groups were matched on age, grade point average and gender. The group having more course-related computer assessments performed better on the certification examination.

Analysis of Software Programs

Gander, S.L. (2014) Taming test engine chaos. Institute for Performance Improvement, 15 pp. www.tifpi.org

Gander, S.L. (2014) Selecting a credential verification and portfolio management engine. Institute for Performance Improvement, 22 pp. www.tifpi.org

These two papers, completed by Gander in March 2014 for the Institute for Performance Improvement, are available on the I.P.I. website. “Taming test engine chaos” presents an approach to evaluating the functionality and ease of use of a delivery engine for computer-based testing. As a framework for analysis, four skill levels required for use are identified (“minimalist, functionalist, confident, and super user”) as well as five audience types (candidates, event administrator, process administrator, systems administrator, and item writer). This framework is used to create thumbnail comparisons of a sample of eight test engines. The list of test engines is by no means exhaustive, but the analytic framework is certainly of heuristic interest and should be useful for organizations seeking to implement capability for item banking, test form development, and computer-based testing. A particularly useful adjunct to the paper is an appendix listing specific aspects of functionality that the author considered in her review.

The author takes a similar approach in “Selecting a credential verification and portfolio management engine.” She distinguishes between these two types of software and notes that in today’s market they are separate software solutions, but “as the market matures, this may change. Today, it usually requires planning to implement two software solutions in tandem” (p. 2). The focus of the paper is to provide a framework for reviewing products on the market, starting with four “decision-making lenses”: portfolio models, audience skill level, audience types, and installation and annual fees.

The models are classified as credential verification only model, career professional model, interactive coaching/mentoring model, and custom designed model. Skill level of the user is characterized using the same grouping as the paper on test engines: minimalist, functionalist, confident, and super-user. Audience types are public/private assessor, candidate, process administrator, systems administrator, and reviewer. The author proceeds to summarize the capabilities of eight software products using this model for analysis.

Both of these papers provide useful frameworks to gain greater understanding of the capabilities of software products and product comparisons.

References

American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (1999) Standards for educational and psychological testing. Washington, DC: American Educational Research Association.


Technology and Testing
Designing Effective Data Visualizations for Testing

Brian D. Bontempo, Ph.D.
Principal Consultant, Mountain Measurement, Inc.

Introduction

This article aims to synthesize the literature on data visualization design with my own personal experience to provide readers with some guidance when creating data visualizations for testing. Since data visualization design is a large topic, this article serves as a beginning primer aimed at individuals designing visualizations for the testing industry. Future editions may apply these design principles to examinee score reports and testing dashboards.

This is the second article in Technology and Testing dedicated to the topic of data visualization. Readers not yet familiar with data visualization are encouraged to read Bontempo (2014) to gain a basic understanding.

In April 2014, I had the chance to attend the OpenViz conference in Boston, MA. Many of the conference presenters, who included such respected data visualization designers as Mike Bostock, John Resig, and Robert Simmon [1], explicitly described their basic data visualization design processes. Although these processes varied, a common theme emerged. Data visualization design is an iterative process in which the product emerges and improves as ideas are explored through trial and error. These designers believe that design is an art rather than a science. Even so, I believe that the basic process for designing explanatory data visualizations can be divided into the following steps:

1. Identify the purpose

2. Choose a chart type

3. Populate with data

4. Add attributes

5. Optimize the layout

Each of these will be explored.

Identify the Purpose

The design of a visualization can be greatly impacted by the purpose, so it is wise to identify the purpose and the intended audience before beginning the design process.

Generally speaking, there are three purposes that a data visualization may have: to inform the user, to persuade the user, and to inspire the user to ask more questions of the data.

Often in testing, the purpose of a visualization is to inform the user, such as the visualizations used in examinee score reports. For these, the data are just as important as the visualization. Therefore, the data must be presented from a neutral perspective (Steele and Iliinsky, 2011).

In testing, persuasive visualizations are often used in handcrafted technical or program management reports. The conclusion is more important than the data for persuasive visualizations, so attributes are often used to illuminate the designer’s interpretation of the data.

Both informative and persuasive reports can be provided to users who are interested in taking action on the information provided without having the desire to engage with the data any further. On the other hand, there are many users that are interested in diving deeper into the data (Bontempo, 2012). For these power users, visualizations that inspire and permit the user to interact with the data are required. In licensure testing, educators and regulators are an emerging class of power users whose desire for data may now match their desire for findings. Designers that create interactive data visualizations are likely to succeed with these types of users.

Although interactive design is a very important aspect of data visualization, it is an entirely separate science. Readers interested in learning more about this topic are encouraged to start with Shneiderman and Plaisant’s work.

[1] Mike Bostock, a data visualization expert for the New York Times, is the author of D3.js (http://d3js.org/), the JavaScript library regarded as the leading open-source data visualization language. John Resig is a software engineer and the author of the jQuery JavaScript library (http://jquery.com/). Robert Simmon is a senior visualization engineer for Planet Labs and formerly a NASA designer. See the reference section, below, for the URLs of blogs by Resig and Simmon.

Choose a Chart Type

Once the purpose is clear, the next step is to choose which type of chart to construct. Loosely speaking, there are four types of charts: tables, graphs, maps, and a catch-all category which I’ll call innovative visualizations. We’ll explore some reasons to use each type.

Tables are effective at providing numbers in a logical, organized manner. As a result, they are the most useful chart type when the numeric values themselves are important. Therefore, tables are well suited to visualizations that have an informative purpose and are often useful for examinee score reports.

Tables are also a useful way of conveying multiple constructs simultaneously, especially when those constructs have different units of measurement (Few, 2004). This makes tables a useful way of conveying information in technical reports where different item performance or test performance metrics are displayed concurrently.

In contrast, graphs are pictures in which the placement of objects in at least one dimension of space spatially conveys quantitative data. Graphs are useful for conveying the relationship between variables or displaying trends. For graphs, the pictures and the conclusions one may draw from them are more important than the numeric details. There are many types of graphs, so once a designer has decided to use a graph, the next step is identifying which type of graph to use.

• Scatterplots are a useful way of displaying the relationship between two equal-interval or ratio variables. In testing, scatterplots effectively convey the relationship between two subscores of a test.


Figure 1. Examples, from left to right, of a scatter plot, line chart, bar chart, histogram, and box plot.


• Line charts are an effective way of displaying the changes in one or more quantitative variables over time. In testing, line charts are a good way of expressing the test volume or passing rate over time.

• Bar charts are a valuable way of displaying the differences between cross-sectional quantitative values associated with different entities of a categorical variable. In testing, bar charts can successfully communicate the cross-sectional differences in the test volume of different groups (a minimal plotting sketch follows this list).

• Histograms are the gold standard when it comes to visually displaying the distribution of values for one quantitative variable. In testing, the distribution of test scores is validly illustrated as a histogram.

• Box plots are a helpful way of visually summarizing the distributions of multiple quantitative variables. In testing, box plots are a useful way of displaying the cross-sectional performance of different groups of examinees such as those from different jurisdictions or different educational programs.

• Although pie charts are commonly used, they are not typically an effective way of communicating data since the eye has a difficult time comparing the size of objects when they are placed in a circle. Designers are generally encouraged to use bar charts instead of pie charts. However, pie charts can successfully convey valuable information when space is limited. For example, pie charts are effective at conveying a percentage associated with each state on a map such as the passing rate by state.
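The sketch below illustrates two of the chart types above with matplotlib: a line chart of test volume over time and a bar chart of volume by candidate group. All of the data and group names are invented for the example.

```python
import matplotlib.pyplot as plt

# Hypothetical data: annual test volume and volume by educational program.
years = [2010, 2011, 2012, 2013, 2014]
volume = [1180, 1245, 1302, 1290, 1356]
groups = ["Program A", "Program B", "Program C"]
group_volume = [512, 433, 411]

fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(9, 3.5))

# Line chart: a quantitative variable changing over time.
ax_line.plot(years, volume, marker="o")
ax_line.set_title("Test volume by year")
ax_line.set_xlabel("Year")
ax_line.set_ylabel("Examinees")

# Bar chart: cross-sectional differences between categories.
ax_bar.bar(groups, group_volume)
ax_bar.set_title("Test volume by program")
ax_bar.set_ylabel("Examinees")

fig.tight_layout()
plt.show()
```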

It should be obvious that maps are used to convey geospatial information. Historically, data visualization maps have been drawn to scale and populated with numeric values or icon-sized visualizations. Some, called choropleths (Dupin, 1826; Wright, 1938), have added complexity by shading various regions to represent a quantitative variable associated with the region. In recent years, designers have created maps which morph the size of the geographic elements (e.g., states) to match the size of one or more quantitative variables. These maps are called cartograms (Gillard, 1979). Although cartogram maps may be fun to create and read, their real utility may be limited for testing.

There are a number of innovative visualizations that are being used frequently enough that one may question whether they should be considered innovative anymore. These include 3D charts, heat maps, bubble charts, tree visualizations, word clouds, and many other less common types. Since each has a specific application, designers not satisfied with the more common chart types are encouraged to explore these alternative chart types before creating a new chart type from scratch. This raises an important point about choosing a chart type: users who are familiar with a chart type read, understand, and interpret visualizations of that type more quickly and accurately than those unfamiliar with the type. Therefore, designers are encouraged to curb their enthusiasm for innovation and select the common chart types whenever they are appropriate. Keep in mind that most data visualizations in testing are successfully created with traditional chart types.

Figure 2. A bubble chart and word cloud displaying the performance by topic for a failing examinee, where the large circles and text represent areas where the examinee should focus their study efforts.


Populate with Data

Although this step should be self-explanatory, it is worth mentioning that innovative visualizations often require several rounds of selecting a chart type, populating the chart with data, and reflecting upon the creation. It is through this trial-and-error process that, upon reflection, designers may opt for an alternative data set or a different chart type. Although this is the most common spot for iteration, designers may also find it necessary to repeat these steps after they have begun to add attributes or optimize the layout. Keep in mind that the majority of testing visualizations do not require iteration since traditional chart types succeed in fulfilling their purpose.

At some point, data visualization designers may ask, “How much data should I provide?” The answer to this question varies widely. On one side, in his review of Napoleon’s March to Moscow, Tufte advocates for visualizations that bring together a number of different variables (Tufte, 1983). On the other side, in his advocacy of identity plots and visualizations that are now called Wright Maps (Wilson & Draney, 2000), Benjamin D. Wright has suggested that simpler visualizations that present only two variables are more effective. Either way, it is important for a designer to make conscious decisions about the data being provided, noting that too much data can muddy the power of the important data while failing to provide enough data can leave the user with unanswered questions.

Figure 3. The distributions of the p-values of the items of four hypothetical tests, displayed in a trellised manner.

A visualization is only as good as the data being visualized. Therefore, it is imperative that the accuracy of the data be validated and the efficiency with which those data are processed verified. With testing visualizations, I have also found that it is necessary for the designer to populate visualizations with the minimum and maximum values that are possible. By doing so, the designer can verify that the visualization is rendering properly and providing visible, useful information for the most extreme users.
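One way to act on the advice about extreme values is to render the visualization at the minimum and the maximum score the program can actually report and inspect the output for clipping or illegibility. A minimal sketch, assuming a hypothetical 200-800 score scale and a simple score-bar graphic:

```python
import matplotlib.pyplot as plt

# Hypothetical score scale for a score-report graphic.
MIN_SCORE, MAX_SCORE = 200, 800

def render_score_bar(score: int, filename: str) -> None:
    """Draw a simple score bar; the axis always spans the full score scale."""
    fig, ax = plt.subplots(figsize=(5, 1.2))
    ax.barh(["Score"], [score])
    ax.set_xlim(0, MAX_SCORE)
    ax.axvline(MIN_SCORE, linestyle="--", linewidth=1)  # lowest reportable score
    ax.set_xlabel("Scaled score")
    fig.tight_layout()
    fig.savefig(filename)
    plt.close(fig)

# Render the report at both extremes before putting it into production.
render_score_bar(MIN_SCORE, "score_report_min.png")
render_score_bar(MAX_SCORE, "score_report_max.png")
```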

Add Attributes

From a data visualization perspective, an attribute is any modification to a chart that helps to illuminate the similarities or differences amongst the objects in the visualization. Visualization attributes can be classified into two categories, those that illuminate similarities and differences amongst categorical variables and those that illuminate quantitative variables.

The following attributes are useful with categorical variables: position, color (hue), shape, fill pattern, line style, font, and sort ordering. Designers are encouraged to use these attributes to group all of the data points from one category together or to help the user to group the data points from similar categories together.

With quantitative variables, position, size (length, width, and area), color (intensity), and sort ordering can be used to help the user quickly perceive similarities or differences amongst the data. Designers are encouraged to exercise caution when using color intensity to express differences in a quantitative variable: differences in intensity are not perceived linearly and can be affected by adjacent colors. Designers are encouraged to review Robert Simmon’s work (2014) on the Subtleties of Color for a more thorough guide to using color effectively. In addition, designers must provide visualizations that are friendly to colorblind users by choosing colorblind-safe palettes or by duplicating/replacing color attributes with another attribute.
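A minimal sketch of these attribute choices: position, marker shape, and a colorblind-safe palette (values from the widely used Okabe-Ito set) distinguish three hypothetical candidate groups in a scatter plot of two subscores, with shape duplicating the color cue so the chart still works for colorblind readers.

```python
import matplotlib.pyplot as plt

# Hypothetical subscores for three candidate groups.
data = {
    "First-time":    ([52, 61, 70, 66], [55, 63, 72, 68]),
    "Repeat":        ([45, 50, 58, 62], [48, 47, 60, 59]),
    "International": ([57, 64, 49, 71], [60, 66, 51, 70]),
}

# Okabe-Ito colors remain distinguishable for most viewers with color vision
# deficiencies; marker shape duplicates the color cue as a second attribute.
palette = ["#E69F00", "#56B4E9", "#009E73"]
markers = ["o", "s", "^"]

fig, ax = plt.subplots()
for (label, (x, y)), color, marker in zip(data.items(), palette, markers):
    ax.scatter(x, y, color=color, marker=marker, label=label)

ax.set_xlabel("Subscore 1")
ax.set_ylabel("Subscore 2")
ax.legend(title="Candidate group")
plt.show()
```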

Trellising, also known as latticing to users of the R statistical software, is a graphing technique that is quite effective at illuminating similarities and differences. When a designer produces a trellis chart, (s)he creates similar charts in a grid-like fashion where each chart displays the data associated with a particular category. Each chart maintains the same set of axes, which makes it easy to visually compare the relationships across charts. Figure 3 shows a set of trellised histograms for the p-values of the items of four different hypothetical tests. By trellising these histograms, it is easy for the user to compare and contrast the distributions of these four assessments.
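In the spirit of Figure 3, the sketch below trellises four histograms of item p-values, one per hypothetical test form, on shared axes so the distributions can be compared directly. The p-values are randomly generated for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)

# Hypothetical item p-values (proportion correct) for four test forms.
forms = {
    "Test A": rng.beta(6, 3, size=120),
    "Test B": rng.beta(5, 4, size=120),
    "Test C": rng.beta(4, 4, size=120),
    "Test D": rng.beta(7, 2, size=120),
}

# Trellis: a grid of small, identical charts, one per category, sharing axes.
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(7, 5))
bins = np.linspace(0, 1, 21)

for ax, (name, p_values) in zip(axes.flat, forms.items()):
    ax.hist(p_values, bins=bins)
    ax.set_title(name)

for ax in axes[-1, :]:
    ax.set_xlabel("Item p-value")
for ax in axes[:, 0]:
    ax.set_ylabel("Number of items")

fig.tight_layout()
plt.show()
```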

As stated earlier, the purpose of the visualization may largely impact the extent to which attributes should be used. In my experience, persuasive data visualizations in testing do not use attributes enough. Many of the hand-made technical, financial, and program management reports provided by testing professionals could be improved greatly by adding attributes to their visualizations.

Optimize the Layout

When optimizing the layout of a data visualization, it is important to keep this phrase in mind: “Above all else, show the data” (Tufte, 1983). This helps the user efficiently process the visualization without being distracted by the non-essential elements. One strategy that Tufte advocates is maximizing the data-ink ratio, which can be achieved by minimizing the extent to which chart junk is used. In other words, designers are encouraged to minimize the extent to which gridlines, legends, tick marks, and labels interfere with the data being presented.

Another important consideration in optimizing the layout is to ensure that the visualization has integrity, meaning that the size of the effect in the data should match the size of the effect in the graphic (Tufte, 1983). Although it is tempting to display only the portions of an axis that contain visually perceptible data points, this practice is discouraged since it visually exaggerates any differences in the data. In testing, this is applicable to the reporting of percent scores, where it may be tempting but unwise to exclude values less than 50% from the axis of a visualization.
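A small sketch of both layout points, assuming a hypothetical passing-rate series: the percent axis spans the full 0-100 range rather than being truncated to the visible data, and non-essential frame lines and gridlines are stripped to raise the data-ink ratio.

```python
import matplotlib.pyplot as plt

# Hypothetical passing rates by year (percent).
years = [2010, 2011, 2012, 2013, 2014]
pass_rate = [78, 81, 79, 83, 82]

fig, ax = plt.subplots()
ax.plot(years, pass_rate, marker="o")

# Integrity: show the full percent scale so small differences are not
# visually exaggerated by a truncated axis.
ax.set_ylim(0, 100)
ax.set_ylabel("Passing rate (%)")
ax.set_xticks(years)

# Data-ink: strip non-essential frame lines and gridlines.
for spine in ("top", "right"):
    ax.spines[spine].set_visible(False)
ax.grid(False)

plt.show()
```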

In addition, traditional graphic design layout principles such as the orientation of labels, text alignment, and number and date formats should be evaluated and adjusted. Designers are encouraged to right align numbers and to maintain the same number of digits following the decimal point. This maximizes the efficiency in which users can perceive these data. Two important graphic design elements that are worthy of attention are column width and row height. These are particularly important to tables but are also applicable to the height or width of the bars found in bar charts. Used effectively, row height and column width can help the user to quickly perceive the direction, horizontal or vertical, in which the data with the most pertinent comparisons should be read.
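A minimal sketch of the number-formatting advice: right-aligned columns with a fixed number of decimal places let the digits line up so values can be compared at a glance. The item statistics shown are invented.

```python
# Hypothetical item statistics for a small report table.
items = [
    ("ITM-001", 0.8214, 0.31),
    ("ITM-002", 0.455, 0.22),
    ("ITM-003", 0.6, 0.275),
]

# Right-align the numbers and keep the same number of decimal places so the
# digits line up and can be scanned quickly.
print(f"{'Item':<10}{'p-value':>10}{'Point-biserial':>16}")
for item_id, p_value, point_biserial in items:
    print(f"{item_id:<10}{p_value:>10.2f}{point_biserial:>16.2f}")
```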


Concluding Remarks

The five-step process of identifying the purpose, choosing a chart type, populating with data, adding attributes, and optimizing the layout has worked well for me in creating data visualizations. Sometimes I am provided with a dataset or select the data before choosing a chart type. Since this re-ordering of the steps mimics the iterative process mentioned earlier, designers may wish to consider this ordering as a viable alternative.

Once the visualization is complete, a designer may want to consider adding annotation to further spotlight aspects of the data to the reader. Designers are encouraged to exercise some restraint in their annotations. Subtle annotations say much more than a page filled with non-essential details. If a designer believes that more annotation is required, then I recommend placing explanatory text in proximity to the visualization or embedding the visualization in a written report.

Annotations may be challenging for visualizations that are systematically produced, such as examinee score reports. However, creative algorithms can be created that may provide useful insight into the visualization for novice users.

There is so much more to data visualization design than the basic information provided within this article. Those interested in more information are encouraged to use the references below as a starting point. More importantly, the best way to improve data visualization design skills is through practice.

References

Bontempo, B. (2012). Data visualization. A presentation at the Annual Meeting of the Association of Test Publishers in Palm Springs, CA.

Bontempo, B. (2014). “Testing in technology: An introduction to data visualization for testing.” CLEAR Exam Review, 24 (1): 8-13.

Dupin, C. (1826). Carte figurative de l’instruction populaire de la France.

Gillard, Q. (1979). “Places in the news: The use of cartograms in introductory geography courses.” Journal of Geography. 78: 114-115.

Iliinsky, N. & Steele, J. (2011). Designing data visualizations. Sebastopol, CA: O’Reilly Media, Inc.

Resig, John (2014). http://ejohn.org/blog/processingjs/.

Shneiderman, B. & Plaisant, C. (2009). Designing the user interface: Strategies for effective human-computer interaction (5th ed.). Reading, MA: Addison-Wesley.

Simmon, Robert (2014). http://blog.visual.ly/subtleties-of-color-references-and-resources-for-visualization-professionals/.

Tufte, E. (2001). The visual display of quantitative information (2nd ed.). Cheshire, CT: Graphics Press.

Wilson, M., & Draney, K. (2000). Standard Mapping: A technique for setting standards and maintaining them over time. In Models and analyses for combining and calibrating items of different types over time. Invited symposium at the International Conference on Measurement and Multivariate Analysis, Banff, Canada.

Wright, J. (1938). Problems in population mapping. In J.K. Wright (ed.), Notes on statistical mapping, with reference to the mapping of population phenomena (Mimeographed Publication no. 1, pp. 1–18). New York: AGS and Population Association of America.


Legal Beat
CAT Got Your Test Score?

Dale J. Atkinson, Esq.
Dale Atkinson is a partner in the law firm of Atkinson & Atkinson. http://www.lawyers.com/atkinson&atkinson/


Regulatory boards are created and empowered to protect the public through the enforcement of the respective practice act. In addition, numerous other laws affect the board and its regulatory obligations, including statutes addressing open meetings, open records, ethics laws, and administrative procedures. In particular, federal and state laws protecting the interests of persons with disabilities, as defined, can create interesting and challenging issues for the regulatory boards. Recognizing otherwise qualified disabled applicants for licensure while interpreting the eligibility criteria creates the need for a balancing of interests. Consider the following.

An applicant applied for licensure as a nurse in the State of Kansas. The applicant had been diagnosed with dyslexia at a young age and had been receiving accommodations throughout his educational experiences, including extra time, a private room, and someone to read the questions on examinations. In addition to his dyslexia and as a common side effect, the applicant also suffers from test-taking anxiety. In April 2008, after receiving his nursing degree, the applicant contacted the Kansas State Board of Nursing (Board) to inquire about taking the National Council Licensure Examination for Registered Nurses (NCLEX-RN), specifically seeking information about accommodations similar to those received throughout his academic experiences. The applicant was told by the Board staff that he would need the following to substantiate the requested accommodations:

1. proof through school records that he suffered from dyslexia;

2. confirmation from his college that it had given him the same exam accommodations as currently requested; and

3. a letter stating the specific accommodations.

Apparently, the applicant was also told by the Board staff member that he would be advised when to send the requested documents.

In November 2008, the applicant applied to the Board to take the NCLEX-RN. The application form did not provide an opportunity to indicate the need for or request actual accommodations when sitting for the exam. In February 2009, the applicant again contacted the Board staff regarding the submission of documents necessary to substantiate the qualification for accommodations. During that call, the Board staff informed the applicant that if he took the exam with accommodations and passed it he would be given a "restricted and limited" license. On a future call where he was trying to verify accommodations, the applicant was told that the Board staff with whom he had the previous communications was no longer an employee and that the Board did not have any documentation requesting accommodations.



The applicant did not submit any further requests for accommodations and took and failed the exam in May 2009. The NCLEX-RN is a computer adaptive test (CAT), meaning that the computer selects exam questions based upon answers provided by the individual examinee to previous questions. As examinees demonstrate knowledge, skills, and abilities in content areas, the computer selects the next question(s) based upon difficulty levels. To demonstrate competence, examinees must answer at least seventy-five questions. As alleged, the applicant was only allowed to answer fifty-seven questions before the computer program "inexplicably shut down." The National Council of State Boards of Nursing (NCSBN) records of test results erroneously indicated that the applicant had answered eighty-four questions. The applicant sought to "appeal" his failing result to the Board and/or NCSBN but was told there was "no point" as a test result had never been changed.
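For readers unfamiliar with how a CAT selects items, the sketch below illustrates the general idea in simplified form: the next item is chosen near the current ability estimate, and testing continues until a minimum number of items has been administered. The item pool, the ability-update rule, and the stopping rule shown here are illustrative assumptions only and do not represent the NCLEX-RN's actual selection or scoring algorithm.

```python
import math
import random

# Toy adaptive-testing loop: administer the unused item whose difficulty is
# closest to the current ability estimate, nudge the estimate after each
# response, and stop once a minimum number of items has been given.
def run_cat(item_difficulties, answer_fn, min_items=75, step=0.3):
    ability, used, administered = 0.0, set(), 0
    while administered < min_items and len(used) < len(item_difficulties):
        item = min((i for i in range(len(item_difficulties)) if i not in used),
                   key=lambda i: abs(item_difficulties[i] - ability))
        used.add(item)
        correct = answer_fn(item_difficulties[item])  # True if answered correctly
        ability += step if correct else -step
        administered += 1
    return ability, administered

# Example: a random 200-item pool and a simulated examinee of true ability 0.5.
pool = [random.uniform(-3, 3) for _ in range(200)]
est, n = run_cat(pool, lambda b: random.random() < 1 / (1 + math.exp(b - 0.5)))
print(f"Estimated ability {est:.2f} after {n} items")
```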

The applicant filed litigation against the Board asserting numerous claims under the Americans with Disabilities Act (ADA), including that the Board failed to provide him with an opportunity on the application to describe the necessary accommodations; denied him reasonable accommodations; threatened to restrict his license if granted (a threat that, he claimed, deterred him from requesting accommodations); failed to provide an appeal procedure; and failed to provide an exam in a format other than the CAT.

The Board, as a state entity, filed a motion citing the 11th Amendment and sovereign immunity and arguing that claims for damages are subject to dismissal. The 11th Amendment provides a state and arms of the state (such as regulatory boards) with immunity from suit in federal court. However, sovereign immunity does not prevent claims for damages where Congress has abrogated the states' immunity or when a state waives its immunity rights. The legal issue at stake in the current litigation was whether Congress unequivocally intended to abrogate the states' immunity through the enactment of the ADA. In dismissing the case, the lower court held that Congress did not intend to abrogate state immunity under the ADA and thus the federal court could not entertain jurisdiction over the dispute. The applicant appealed the matter to the 10th Circuit Court of Appeals.

The 10th Circuit reviewed the findings of the district court and noted that its appeal standard is de novo, meaning it reviews the claim as if there were no lower court ruling. The applicant first argued that his claims alleged a violation of his constitutional rights, thus giving rise to a heightened scrutiny analysis. In short, the applicant appeared to argue a violation of his substantive due process rights as a result of a "denial of access to the courts." In addition, he argued a violation of his rights to equal protection as a result of the Board's alleged threat to limit his license. Both such rights, he argued, are guaranteed under the 14th Amendment of the Constitution.

Regarding access to the courts, the applicant argued that a denial by the state of access to the licensing exam was akin to a denial of access to the courts. He argued that the right to access to the courts under the 14th Amendment is guaranteed and that his right to access to the exam and, ultimately, a professional license is tantamount to access to the courts. Because the applicant cited no authority in support of such an argument, the court was not inclined to recognize its merits and thus rejected the analogy.

The applicant next argued that the threat to issue a limited license if he took the examination with accommodations violated his right to equal protection. As noted by the court with citations to previous jurisprudence, states are not required by the 14th Amendment to make special accommodations for the disabled, so long as the state's actions are rational. This rational basis test rests on case law holding that neither eligibility for a professional license nor disability status is a suspect class that would subject the state's actions to heightened scrutiny. Under a rational basis analysis, "an equal protection claim will fail if there is any reasonably conceivable state of facts that could provide a rational basis for the classification." The district court found that restrictions placed upon a nursing license earned via testing accommodations could meet a legitimate public safety concern.

The applicant argued that his education with accommodations in a nursing program provided a basis for a finding that his dyslexia did not adversely affect his ability to practice nursing. Thus, he argued that the Board was not justified in limiting his license. The court noted that the applicant "misperceives" the nature of a rational basis inquiry. Under a rational basis analysis, the court independently considers whether there is any conceivable rational basis for the classification. Such a determination is a legal conclusion and need not be based upon empirical data or any particular evidence. The district court found a rational basis for a state restricting the license of an applicant tested with accommodations. Thus, it was incumbent on the applicant to negate such a finding. The appellate court found that the applicant did not refute the findings and, accordingly, could not substantiate an equal protection claim.

The applicant also included claims against the NCSBN under Title III of the ADA. Title III of the ADA applies to public accommodations and services supplied by private entities.



The NCLEX-RN is owned by the NCSBN, which develops, administers, scores, and maintains the examination on behalf of its membership, the state boards of nursing. The NCSBN sought dismissal of the litigation based upon a lack of constitutional standing or, alternatively, failure to state a viable claim. The district court found the applicant lacked standing because he failed to allege a causal link between the NCSBN's conduct and any injury suffered.

The applicant argued that the NCSBN violated the ADA by not providing the NCLEX in a format other than the CAT version and that it failed to provide him with an opportunity to appeal his score. The 10th Circuit noted that the party asserting federal jurisdiction has the burden of substantiating the authority of the court to determine a matter. In order to establish standing, the applicant must meet three requirements: first, an injury in fact must have occurred; second, there must be a causal connection between the injury and the action of the defendant; and, third, it must be likely that a favorable judgment will redress the plaintiff's injury.

Although the applicant alleged there were unspecified flaws in the CAT examination format, he failed to allege that the inability to take the exam in another format or his test-taking anxiety caused by his dyslexia contributed to his failing result. Finally, while the applicant did allege an adequate causal connection between the denial of an appeal of his test score and the denial of licensure, the 10th Circuit nonetheless held that his allegations were insufficient to establish liability under the ADA.

Thus, while the district court should not have dismissed the NCSBN based upon standing, the 10th Circuit upheld the dismissal on other grounds. In short, the court held that without some connection between the computer glitch and the applicant's dyslexia, the failure of the NCSBN to provide an appeal of his test score is not actionable under the ADA. Accordingly, the dismissal of the case by the district court was affirmed by the 10th Circuit Court of Appeals.

This opinion illustrates the complexities of applying the ADA to the licensure and testing environments. Interestingly, the court recognized the authority of a state to issue a limited license based upon accommodations granted in the examination process. Such a ruling will surely have consequences for future applicants for licensure.

Turner v. National Council of State Boards of Nursing, 2014 U.S. App. LEXIS 6086 (10th Cir. 2014)


Perspectives on Testing: Responses to Your Questions

Chuck Friedman, Ph.D.
Program Director, Professional Examination Service


Introduction

CLEAR members, jurisdictions, boards, and other stakeholders are continually faced with new questions and practical issues about their examination programs. Numerous resources—including Resource Briefs, Frequently Asked Questions, and discussion forums—are provided on the CLEAR website to assist members in tackling such issues. At the annual conference, new information is shared through sessions and networking opportunities.

This column presents practical issues and topics from recent Ask the Experts conference sessions, where audience participants pose questions to a panel of testing experts. Here, panelists offer their perspectives on specific questions or issues raised at the annual CLEAR conference.

Each response represents the views of the individual contributor, is specific to the situation posed, and is not to be considered an endorsement by CLEAR. Psychometrics is a blend of science and art, and each situation is unique. The responses provide general background and guidance that, combined with additional input from psychometricians, can inform decisions on your specific issue.

Readers are encouraged to submit questions for future columns and conferences to [email protected]. Please enter “CER Perspectives on Testing” in the subject line.

Is there a role for recall/recognition items in certification examinations?

Response provided by Steven S. Nettles, Ed.D., Program Director, Psychometrics Division, Applied Measurement Professionals, Inc.

Recall items or questions involve remembering and understanding previously learned material. These questions generally ask the candidate to define, describe, identify, recognize, or remember a term or activity.

Before a definitive answer can be given regarding the effectiveness of recall items as screens for competence, we must first determine whether they should be included in a credentialing exam. The process of developing an examination program includes several steps. The initial step, and the foundation of any defensible credentialing exam, is the job or practice analysis. If it is not done properly, examinations built on its results are not defensible. While there are many methods for performing a job analysis, they all have four common elements:

1. Subject Matter Experts (SMEs) are used. An Advisory Committee (AC) is assembled from a diverse and representative group of SMEs. This is done to ensure that varying perspectives of the profession are considered. It should be noted that there is no specific minimum or maximum number of SMEs serving on a job analysis committee. What is important, though, is that committee membership is diverse and representative and that the subject matter expertise is appropriate to meet the needs of the job analysis.

2. Evidence is collected about job activities. While the AC can be used to gather evidence about the job (e.g., as in a committee-based job analysis), it is desirable to include a larger group of SMEs in the process. The preferred model involves soliciting opinions about job activities from a larger group using survey methodology. And while the “required” sample size for a job analysis survey study is not specified, the respondent sample should be selected to appropriately sample the population of the profession.

3. Data are evaluated by the Advisory Committee. This ensures objectivity in deciding the important or significant job activities that should be included in the content domain eligible for the examination. Thus, the final content domain is data-driven and not just opinion.

4. Test specifications are established based on the job analysis results. The results are reviewed by the AC to develop the test specifications. The specifications should clearly delineate the areas to be assessed and the number of items (or points) that are assigned to each major and minor examination section.

Besides delineating the number of items on the examination for major and minor areas of the content domain, it is recommended that the distribution of items across cognitive levels also be specified, reflecting the cognitive demands of the job. While several paradigms have been proposed to this end, I have found a simplified version of Bloom's Taxonomy of Educational Objectives (1956)* to be useful. Once the content domain has been determined, the AC evaluates each job activity in terms of the cognitive demands typically required for competent performance, using one of the following categories: recall, application, and analysis/evaluation. Once this process is completed, the test specifications are two-dimensional: one dimension for content and one for cognitive level. It has been my experience that about 15-25% of the job activities for most jobs are performed at the recall level.

Thus, the resulting examination should specify an approximately equal proportion of items at the recall level. Using this model, the examination will have substantial evidence of validity and, by definition, provide an effective screen for competence.
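To illustrate how such a two-dimensional specification might be tallied, the brief sketch below distributes a hypothetical 150-item examination across invented content areas and the three cognitive levels. The weights, total, and rounding rule are examples only, not values from any actual job analysis.

```python
# Hypothetical blueprint: distribute a 150-item exam across content areas and
# cognitive levels (recall / application / analysis). All weights are invented
# for illustration; real specifications come from the job analysis results.
# Simple rounding may leave a small remainder to reconcile by hand.

total_items = 150
content_weights = {"Domain A": 0.40, "Domain B": 0.35, "Domain C": 0.25}
cognitive_mix = {"recall": 0.20, "application": 0.50, "analysis": 0.30}

blueprint = {
    area: {level: round(total_items * w_area * w_level)
           for level, w_level in cognitive_mix.items()}
    for area, w_area in content_weights.items()
}

for area, cells in blueprint.items():
    print(area, cells, "subtotal:", sum(cells.values()))
```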

* Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R. (1956). Taxonomy of educational objectives: The classification of educational goals. Handbook I: Cognitive domain. New York: David McKay Company.

What alternatives are suggested for testing English as a Second Language (ESL) candidates?

Response provided by Heidi Lincer-Hill, Ph.D., Chief of the Office of Professional Examination Services, California Department of Consumer Affairs.

Four options are described here for testing candidates whose primary language is not English. Additional approaches to this complex issue may also be available. Each of the four alternatives has advantages and disadvantages. Option 1 is to translate/adapt the examination into one or more languages. Option 2 is to allow interpreters to assist candidates during the examination. Option 3 is to use a performance-based or practical examination. Option 4 is to require all candidates to take the examination in English.

Option 1: Translate/Adapt the Examination
The advantage of translating or adapting the examination is that all candidates have the opportunity to test in their native language. Assuming that English is not essential for competent performance of the skills or abilities being measured, this option seems fair and reasonable to applicants. For example, translation is a good option for a driver's license examination. However, translating examinations is not as straightforward as it might seem. For a licensure examination that includes many technical terms, it may be difficult to achieve an accurate translation, and the process is expensive and time-consuming. There may also be concomitant issues such as the existence of different dialects of the target language. Even if there is not a dialect selection issue, there will most likely be increased applicant complaints due to claims that the examination was not translated properly.

If it is decided that an examination needs to be translated, test developers should carefully review the relevant literature and guidelines. The term "adapting" is more appropriate than "translating," given the inherent difficulty of producing a second language examination that is equivalent to the original examination. According to the International Test Commission Guidelines for Translating and Adapting Tests (2010), several types of evidence (linguistic, psychological, and statistical) must be provided in order to ensure the equivalence of the English and the adapted examination. The Standards for Educational and Psychological Testing (1999) also address the importance of evaluating linguistic and cultural differences.

The typical adaptation process involves first retaining bilingual subject matter experts to translate the examination. Next, additional subject matter experts should retranslate the examination back into English to verify the accuracy of the translation. Any differences should be adjudicated. Bilingual subject matter experts should also identify any cultural differences that may affect the translated test questions and, ideally, participate in the passing-score workshop to ensure any final changes are incorporated into the adapted examination. Following administration, item-level and examination-level statistical analyses should be calculated and compared for the different candidate linguistic populations. If possible, validity for the different candidate groups should also be established. Although Option 1 requires much effort and resources, many test developers continue to successfully meet this challenge.
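As a simple illustration of the kind of item-level comparison described above, the sketch below contrasts classical item difficulty (proportion correct) for two language groups and flags large discrepancies for review. The response data and the flagging threshold are invented; an operational study would use formal differential item functioning procedures and adequate sample sizes.

```python
# Minimal sketch: compare item difficulty (proportion correct) between the
# English-form group and the adapted-form group, flagging large gaps for
# review. The tiny response matrices and the 0.15 threshold are invented.

def item_p_values(responses):
    """responses: list of per-examinee lists of 0/1 item scores."""
    n = len(responses)
    return [sum(person[i] for person in responses) / n
            for i in range(len(responses[0]))]

english = [[1, 1, 0, 1], [1, 0, 1, 1], [1, 1, 1, 0], [0, 1, 1, 1]]
adapted = [[1, 0, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 1]]

for i, (p_en, p_ad) in enumerate(zip(item_p_values(english),
                                     item_p_values(adapted)), start=1):
    flag = "REVIEW" if abs(p_en - p_ad) > 0.15 else "ok"
    print(f"Item {i}: English p={p_en:.2f}, adapted p={p_ad:.2f} -> {flag}")
```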

Option 2: Allow Interpreters to Assist Candidates
This solution works well when an organization faces the daunting prospect of translating many examinations or offering an examination in multiple languages. The main advantage of this option is flexibility, as organizations can solicit interpreters to orally translate the examination as needed. It also avoids the time and expense of more formally translating examinations. Another advantage of using interpreters is that technical terms can be presented in English. The interpretation process typically allows the candidate to view the examination in English or allows the interpreter to use English at his/her discretion.

The main disadvantage of Option 2 is that there is little control over the quality of the interpretation. Unless the same interpreter can be used multiple times (which raises security concerns), different interpreters will vary in their ability to provide an accurate translation. Organizations should evaluate the possibility of retaining certified interpreters who may translate with a greater degree of accuracy and consistency. If possible, interpreters should be vetted to help ensure they are not invested in helping applicants pass the examination.

It should be noted that both Options 1 and 2 create increased risk of misconduct and examination compromise. Giving additional individuals access to the examination raises the risk that they will use their role to assist applicants to pass or to disseminate confidential information for illicit purposes.

Steps can be taken in the administration environment to reduce the likelihood of candidates and interpreters colluding to cheat. To prevent nonverbal communication, interpreters should not be directly facing candidates during the examination. One method to accomplish this is to require communication through earphones and a headset. In addition, the interpreter and the candidate should be actively monitored by a proctor, and their communication should be audio- and/or video-recorded. Proctors can be trained to listen for telltale signals of cheating, and the mere presence of the proctor serves as a deterrent. If misconduct is suspected, an independent evaluation by a third-party interpreter can be conducted.

Option 3: Use a Performance-Based or Practical Examination
Option 3 is a solution for organizations that are not committed to a multiple-choice examination, provided that the focus of the examination is on skills and abilities rather than knowledge. Creating a performance-based assessment or practical examination avoids the language issue altogether. One example would be a practical examination to assess cosmetology skills and abilities. However, using performance-based or practical examinations still involves the practical consideration of ensuring that examination instructions are clearly understood by ESL candidates.

Option 4: Require All Candidates to Take the Examination in English
The obvious advantage of Option 4, in which the examination is identical for all candidates, is that accommodating ESL candidates requires no additional effort on the part of test developers. However, given the ever-increasing diversity of many candidate populations, this option may not be a practical or politically sensitive solution. If Option 4 is selected for a licensure examination, data should be gathered to document that English is an essential aspect of the profession. It would also be expedient to incorporate regulatory language to this effect, thereby preventing legal challenges.

References

International Test Commission (2010). International Test Commission Guidelines for Translating and Adapting Tests. [http://www.intestcom.org]

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington DC: American Education Research Association.



2014 Standards for Educational and Psychological Testing Released

Ron Rodgers, Ph.D.
Director of Measurement, Continental Testing Services, Inc. (CTS), and President, Employment Research & Development Institute (ERDI)

A nine-year joint effort concluded in July with the publication of the 2014 Standards for Educational and Psychological Testing by the American Educational Research Association (AERA). A Joint Committee with representatives from the American Psychological Association (APA), the National Council on Measurement in Education (NCME), and AERA collaborated in preparations that started in 2005. The boards of all three associations adopted the new Standards, which replace the 1999 edition and can be ordered from AERA (http://www.aera.net/).

The Joint Committee was charged with five objectives for the 2014 Standards:

• Considering accountability issues for use of tests in educational policy
• Broadening accessibility of tests for all examinees
• Representing more comprehensively the role of testing in the workplace
• Addressing the role of technology in testing
• Improving communication of the Standards

Organizations involved in licensure and certification should pay special attention to Chapter 11, "Workplace Testing and Credentialing," in the 2014 Standards. The chapter "was reorganized to more clearly identify when a standard is relevant to employment and/or credentialing" (2014 Standards, p. 4). Standards 11.1 to 11.4 apply to both employment testing and credentialing, Standards 11.5 to 11.12 refer only to employment testing, and Standards 11.13 to 11.16 address issues for credentialing.

The 2014 Standards document is reorganized to differentiate more clearly among sections that apply to all forms of testing and those that are specific to particular applications. Part I, Foundations, includes chapters on validity, reliability/precision and errors of measurement, and fairness. Part II, Operations, addresses test design and development, scores and scales, test administration and reporting, and rights and responsibilities of test takers and users. Part III, Testing Applications, focuses on psychological testing and assessment, workplace testing and credentialing, educational testing and assessment, and program evaluation, policy and accountability.

A more detailed assessment of how the 2014 Standards may impact credentialing programs will follow in future CLEAR publications.




Incorporating Continuing Competency into Certification Maintenance with Attention to Certificant Concerns

Fran Byrd
Director, Strategic Initiatives, National Certification Corporation

In 2010, following over a decade of interest in the role of continuing competency for health care providers as it related to patient safety and quality of care, the National Certification Corporation (NCC), a voluntary, non-profit, national certification organization, began a transition in its long-standing certification maintenance program. The NCC Board of Directors supported this change backed by a deep awareness of the findings and recommendations of numerous widely respected and discussed reports, including the foundational 1999 Institute of Medicine Consensus Report To Err is Human: Building A Safer Health System (IOM, 1999). The decision to move toward a Professional Development Maintenance Program model as part of an overall Continuing Competency Initiative (CCI) was based on emerging evidence that third-party validation of certification maintenance needs was warranted (Davis et al., 2006). Weighing heavily in this decision were NCC's own 2007 pilot study findings, which reflected the conclusions of a number of related studies indicating that the sole use of self-assessment to determine continuing education needs for health care professionals was not an adequate mechanism (Byrd, Burns, & Grossklags, 2013). Further, the NCC pilot study conclusions included the psychometrically based recommendation that some form of third-party validation was indicated as part of a continuing competency process (Engelhart, 2008).

Following NCC's decision to pursue a certification maintenance process focused on continuing competency and life-long learning, key factors were identified as significant to the success of the initiative from the perspective of the NCC-certified population. Among those factors were travel and downtime to complete any validation tool developed; additional costs to the individual certificant as a result of having to complete the assessment; concern over increased CE hour requirements; fear of a pass/fail grading system which might jeopardize certification status; and fear of losing certification status by not participating in the assessment process. Additionally, it was felt that an opportunity to experience the new process in a non-binding phase would be helpful to allay some of the stress brought about by changes to a certification maintenance approach which had been in place since 1983.

The Development Process

NCC staff, consultants, and outside vendors spent a great deal of time and effort developing a secure assessment platform which could be accessed by individuals using virtually hundreds of personal computer devices and setups aligned with a variety of internet browser and provider interfaces. The ultimate result was 24/7 access to an assessment tool, which removed any requirement for travel to a proctored site and allowed each individual to determine the best time to complete the assessment based on their own personal schedule. Certificants access the assessments via a personal, password-protected website account and, by participation in the process, indicate their assent to the confidentiality/conduct agreement presented prior to assessment launch.

Initial response to the announcement of the Continuing Competency Initiative and a new certification maintenance process included a flood of “push back” from certificants who perceived that certification maintenance fees would rise with the addition of the specialty assessment component. The NCC Board had anticipated such a response, and a proactively authorized budget supported the development and implementation of the assessment process without incurring additional certification maintenance fees for the individual certificant.

For each of NCC’s currently administered core certification examinations, a separate 125-item assessment tool was developed based on content aligned with the knowledge competencies of the specific certification examination. Each assessment was created to cover 50 total hours of CE across the core content areas, with item groupings distributed and weighted to reflect the specialty’s current certification exam content outline. The decision was made by NCC to avoid any pass/fail grade status for assessment results. This approach was chosen to reduce the perception of the assessment as an examination or test. It was also intended to remove any sense of “failure,” as the assessment is used solely as an evaluation tool for appropriately directing continuing education used for certification maintenance on an individualized basis. This departure from the traditional “one size fits all” approach to certification maintenance addressed NCC’s goal of promoting professional development personalized for the NCC-certified nurse or nurse practitioner at a specific period in their professional career.

Based on mathematical calculations, a specialty index rating range from 1 to 10 was established to address the content, weighting, and item distribution for each specialty's core competency areas. For purposes of determining continuing education needs based on assessment results, NCC chose to consider a specialty index rating of 7.5 as having "met standard." The level of performance corresponding to a rating of 7.5 was seen as reflecting a mastery level above that attached to the passing score for the certification examination. Therefore, to maintain an NCC credential for any given maintenance cycle, no continuing education would be indicated in any core competency category with an index rating of 7.5 or higher. (For more information regarding the specialty index rating, see the Continuing Competency Initiative at the NCC website: http://www.nccwebsite.org/ContinuingCompetency/.) The personalized Education Plan was developed by combining the individual's specialty index ratings and a baseline CE hour requirement of 0-15 CE hours. These baseline hours are accruable in any content areas within the specialty, and the baseline requirement itself fluctuates depending on core areas for which a 7.5 index rating is achieved and/or the weighting of any such areas in which the standard is met.1

In recognition of the time required, NCC made the decision to award 5 hours of CE for completion of the assessment tool. These CE hours are designated strictly for NCC certification maintenance applied to the cycle in which the assessment is completed, and the CE credit is applied to the learning plan area selected by the certificant. While an individual’s assessment results can impact the Education Plan with a lower overall CE hour requirement, the five-hour NCC CE credit maintained the maximum CE hour requirement at 45 hours, which had been the standard for many years. This action addressed the NCC-certified population’s concern that an increased continuing education hour requirement would naturally accompany changes in the overall certification maintenance program.

Following completion of the assessment-based approach to certification maintenance, an Alternative Maintenance option was developed. This action was taken to address concerns about loss of certification status in the event that an individual, for whatever reason, did not participate in the recommended assessment process for a given maintenance cycle. Without the results of the third-party administered assessment tool to identify and direct certification maintenance needs, the Alternative Maintenance approach requires completion of all 50 hours of the CE allocations across the core competency areas of an individual's specialty. Certificants who use this approach earn CE to address all the core competency areas of their certification specialty to provide the validation needed; however, they miss the professional development aspect which accompanies the assessment approach, and this opt-out process is more costly and complex as an outlier.

1 The framework of each Education Plan begins with a baseline requirement of 15 hours of CE, which can be accrued in any content areas related to an individual's certification specialty. The CE as indicated by an individual's specialty index ratings is combined with the original baseline CE requirement to develop the final CE total for the individual Education Plan. This baseline CE hour requirement will vary depending on the number and weighting of any core competency areas for which the standard of 7.5 or greater is achieved. If an individual had specialty index ratings of 7.5 or higher in all the core competency areas of their certification specialty, the Education Plan would have a requirement for 15 hours of CE total (reflecting the baseline hours). If an individual has no core competency areas with specialty index ratings of 7.5 or above, CE is required in all the core competency areas; therefore, the final Education Plan has no additional baseline hour requirement.
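To make the Education Plan arithmetic concrete, the sketch below computes a hypothetical plan from illustrative specialty index ratings. The CE allocations, the ratings, and the assumption that the baseline scales with the share of the allocation meeting the 7.5 standard represent one plausible reading of the framework described above, not NCC's published calculation.

```python
# Illustrative sketch of the Education Plan arithmetic, under stated
# assumptions: each core area has a CE allocation (summing to 50), areas with
# a specialty index rating below 7.5 require their allocated hours, and the
# 15-hour baseline is assumed to scale with the share of the allocation "met"
# at 7.5 or higher. Allocations, ratings, and the scaling rule are hypothetical.

MET_STANDARD = 7.5
BASELINE = 15

def education_plan(allocations, ratings):
    total_alloc = sum(allocations.values())
    directed = {a: h for a, h in allocations.items() if ratings[a] < MET_STANDARD}
    met_share = 1 - sum(directed.values()) / total_alloc
    baseline = round(BASELINE * met_share)
    return directed, baseline, sum(directed.values()) + baseline

allocations = {"Area 1": 20, "Area 2": 15, "Area 3": 15}   # hypothetical hours
ratings = {"Area 1": 8.1, "Area 2": 6.9, "Area 3": 7.7}    # hypothetical ratings

directed, baseline, total = education_plan(allocations, ratings)
print("Directed CE:", directed, "| baseline:", baseline, "| total CE:", total)
```

Under these assumptions, an individual meeting the standard in every area would need only the 15 baseline hours, while an individual meeting it in none would need the full 50 hours, consistent with the endpoints described in the footnote above.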

Initial Implementation

Beginning in April of 2010, access to a Stage 1 orientation phase assessment was made available to all certificants holding credentials in active core certification specialties. Stage 1 was intended to provide an opportunity for certificants to experience the whole assessment process in “real time.” It also offered a chance to troubleshoot any technical issues that might be inherent between an individual’s computer, browser, internet provider and the secure assessment platform. NCC certificants impacted by the upcoming changes were encouraged to use this Stage 1 opportunity in maintenance cycles occurring from June 2010 through December 2013. Despite the carefully worded information distributed in multiple formats and in ongoing fashion differentiating the Stage 1 orientation phase from the Stage 2 binding phase, many individuals viewed Stage 1 as a “one and done” process and failed to recognize the need for completion of an assessment beginning with maintenance cycle deadlines in 2014. Hindsight would suggest that providing a reasonable time to communicate the certification maintenance program changes and instituting them without the optional orientation process may have been less confusing overall. Having noted that consideration, it is important to recognize that 42,865 NCC-certified individuals did complete a Stage 1 assessment, and it is likely that many benefited from this orientation opportunity.

Current Status and Conclusions

As of July 15, 2014, more than 85,000 assessment tools have been completed as part of the CCI process. Over thirty-seven percent of the personalized learning plans developed from the results of those completed assessments indicated a total CE need less than the previous standard of 45 CE hours. At this time, only 0.4% of individuals maintaining an active core credential have chosen to follow the Alternative Maintenance approach for certification maintenance.

From NCC’s perspective, the greatest challenge remains effective communication of the need to complete an assessment tool prior to beginning to earn CE for a given maintenance cycle. Many individuals in current Stage 2 cycles have yet to complete an assessment.

The major goals for NCC’s new Professional Development Maintenance Program were to establish a certification maintenance model that incorporated continuing competency and ongoing professional development as core elements. In addition, NCC recognized the importance of considering certificant concerns most likely to impact a successful transition. The NCC Board and staff members remain very aware that issues addressed during the early planning and development stages of its new certification maintenance program continue to cause concern and major opposition to continuing competency efforts throughout other health professions. Most notable in recent professional literature have been the ongoing expressions of discontent with the Maintenance of Certification (MOC) model presently under implementation for a number of physician specialties (Inglehart & Baron, 2012).

While opposition to change can never be completely eliminated, NCC is hopeful that the attention given to potential certificant concerns during planning will ultimately result in a fully successful transition to the new Professional Development Maintenance Program by the end of 2017. Perhaps of more importance is the goal of having the NCC-certified population embrace continuing competency and life-long learning as integral values of the nationally certified nurse and nurse practitioner.

References

Byrd FB, Burns B, Grossklags BL. (2013). An empirical evaluation of the adequacy of self-assessed knowledge competency in a certified population of women's health care nurse practitioners. Journal of Nursing Education and Practice, 3, 11-20. doi: 10.5430/jnep.v3n6p11.

Davis DA, Mazmanian PE, Fordis M, Van Harrison R, Thorpe KE, Math M. (2006). Accuracy of physician self-assessment compared with observed measures of competence. JAMA, 296, 1094-1102.

Engelhart, G. (2008). An empirical evaluation of the self-competency assessment of nurse practitioners in women’s healthcare. (Unpublished manuscript).

Inglehart JK, Baron, RB. (2012). Ensuring physicians’ competence--Is maintenance of certification the answer? The New England Journal of Medicine, 367, 2543-2549.

Institute of Medicine, Consensus Report (1999). To Err is Human. Retrieved from http://iom.edu/Reports/1999/To-Err-is-Human-Building-A-Safer-Health-System.aspx.



