Center for Research
on Learning and Teaching
University of Michigan
BEST PRACTICES FOR DESIGNING AND GRADING EXAMS
Mary E. Piontek
The most obvious function of assessment methods such as exams, quizzes, papers, presentations, etc., is to enable instructors to make judgments about the quality of student learning (i.e., assign grades). However, the methods of assessment used by faculty can also have a direct impact on the quality of student learning. Students assume that the focus of exams and assignments reflects the educational goals most valued by an instructor, and they direct their learning and studying accordingly (McKeachie & Svinicki, 2006). Given the importance of assessment for both faculty and student interactions about learning, how can instructors develop exams that provide useful and relevant data about their students' learning and also direct students to spend their time on the important aspects of a course or course unit? How do grading practices further influence this process?
Creating high quality educational assessments requires both art and science: the art of creatively engaging students in assessments that they view as fair and meaningful, and that produce relevant data about student achievement; and the science of assessment design, item writing, and grading procedures (Worthen, Borg, & White, 1993). This Occasional Paper provides an overview of the science of developing valid and reliable exams, especially multiple-choice and essay items. Additionally, the paper describes key issues related to grading: holistic and trait-analytic rubrics, and normative and criterion grading systems.
Guidelines for Designing Valid and Reliable Exams
Ideally, effective exams have four characteristics. They are valid (providing useful information about the concepts they were designed to test), reliable (allowing consistent measurement and discriminating between different levels of performance), recognizable (instruction has prepared students for the assessment), and realistic (concerning time and effort required to complete the assignment) (Svinicki, 1999a). The following six guidelines are designed to help you create such assessments:
1. Focus on the most important content and behaviors you have emphasized during the course (or particular section of the course). Start by asking yourself to identify the primary ideas, issues, and skills encountered by students during a particular course/unit/module. The distribution of items should reflect the relative emphasis you gave to content coverage and skill development (Gronlund & Linn, 1990). If you teach for memorization, then you should assess for memorization or classification; if you emphasize problem-solving strategies, your exams should focus on assessing students' application and analysis skills. As a general rule, assessments that focus too heavily on details (e.g., isolated facts, figures, etc.) can have a negative impact on student learning. "Asking learners to recall particular pieces of the information they've been taught often leads to 'selective forgetting' of related information that they were not asked to recall." In fact, testing for "relatively unimportant points in the belief that 'testing for the footnotes' will enhance learning, will probably lead to better student retention of the footnotes at the cost of the main points" (Halpern & Hakel, 2003, p. 40).

Mary E. Piontek is Evaluation Researcher at the Center for Research on Learning and Teaching (CRLT). She has a Ph.D. in Measurement, Research, and Evaluation.

LP1.100018 3/18/08 2:17 PM Page 1
2. Write clearly and simply. Appropriate language level, sentence structure, and vocabulary are three essential characteristics of all assessment items. The items should be clear and succinct, with simple, declarative sentence structures.
3. Create more items than you will need so that you can choose those that best measure the important content and behaviors.
4. Group the items by format (true-false, multiple-choice, essay) or by content or topical area.
5. Review the items after a cooling-off period so that you can critique each item for its relevance to what you have actually taught (Thorndike, 1997; Worthen, et al., 1993).
6. Prepare test directions carefully to provide students with information about the test's purpose, time allowed, procedures for responding to items and recording answers, and scoring/grading criteria (Gronlund & Linn, 1990; McMillan, 2001).

Table 1: Advantages and Disadvantages of Commonly Used Types of Achievement Test Items

True-False
Advantages: Many items can be administered in a relatively short time. Moderately easy to write; easily scored.
Disadvantages: Limited primarily to testing knowledge of information. Easy to guess correctly on many items, even if material has not been mastered.

Matching
Advantages: Items can be written quickly. A broad range of content can be assessed. Scoring can be done efficiently.
Disadvantages: Higher order cognitive skills are difficult to assess.

Multiple-Choice
Advantages: Can be used to assess broad range of content in a brief period. Skillfully written items can measure higher order cognitive skills. Can be scored quickly.
Disadvantages: Difficult and time consuming to write good items. Possible to assess higher order cognitive skills, but most items assess only knowledge. Some correct answers can be guesses.

Short Answer or Completion
Advantages: Many can be administered in a brief amount of time. Relatively efficient to score. Moderately easy to write.
Disadvantages: Difficult to identify defensible criteria for correct answers. Limited to questions that can be answered or completed in very few words.

Essay
Advantages: Can be used to measure higher order cognitive skills. Relatively easy to write questions. Difficult for respondent to get correct answer by guessing.
Disadvantages: Time consuming to administer and score. Difficult to identify reliable criteria for scoring. Only a limited range of content can be sampled during any one testing period.

Adapted from Table 10.1 of Worthen, et al., 1993, p. 261.
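The first guideline's advice — distribute exam items in proportion to the emphasis each topic received — can be sketched as a simple test blueprint. The topic names and weights below are hypothetical, purely for illustration:

```python
# Sketch of a test blueprint: allocate exam items across topics in
# proportion to the instructional emphasis each one received.

def allocate_items(emphasis, total_items):
    """Distribute total_items proportionally to emphasis weights,
    using largest-remainder rounding so the counts sum exactly."""
    total_weight = sum(emphasis.values())
    exact = {t: total_items * w / total_weight for t, w in emphasis.items()}
    counts = {t: int(x) for t, x in exact.items()}
    # Hand any remaining items to the largest fractional remainders.
    leftover = total_items - sum(counts.values())
    for t in sorted(exact, key=lambda t: exact[t] - counts[t], reverse=True)[:leftover]:
        counts[t] += 1
    return counts

# Hypothetical course: weeks of class time spent on each unit.
emphasis = {"terminology": 1, "methods": 3, "problem solving": 4}
print(allocate_items(emphasis, 40))
# → {'terminology': 5, 'methods': 15, 'problem solving': 20}
```

An explicit blueprint like this also makes it easy to check that the exam is not dominated by easy-to-write recall items.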
Advantages and Disadvantages of Commonly Used Types of Assessment Items
As noted in Table 1, each type of exam item has its advantages and disadvantages in terms of ease of design, implementation, and scoring, and in its ability to measure different aspects of students' knowledge or skills. The following sections of this paper highlight general guidelines for developing multiple-choice and essay items. Multiple-choice and essay items are often used in college-level assessment because they readily lend themselves to measuring higher order thinking skills (e.g., application, justification, inference, analysis, and evaluation). Yet instructors often struggle to create, implement, and score these items (McMillan, 2001; Worthen, et al., 1993).
General Guidelines for Developing Multiple-Choice Items
Multiple-choice items have a number of advantages. First, multiple-choice items can measure various kinds of knowledge, including students' understanding of terminology, facts, principles, methods, and procedures, as well as their ability to apply, interpret, and justify. When carefully designed, multiple-choice items can assess higher-order thinking skills, as shown in Example 1, in which students are required to generalize, analyze, and make inferences about data in a medical patient case.
Multiple-choice items are less ambiguous than short-answer items, thereby providing a more focused assessment of student knowledge. Multiple-choice items are superior to true-false items in several ways: on true-false items, students can receive credit for knowing that a statement is incorrect, without knowing what is correct. Multiple-choice items offer greater reliability than true-false items, as the opportunity for guessing is reduced with the larger number of options. Finally, an instructor can diagnose misunderstanding by analyzing the incorrect options chosen by students.
A disadvantage of multiple-choice items is that they require developing incorrect, yet plausible, options that can be difficult for the instructor to create. In addition, multiple-choice questions do not allow instructors to measure students' ability to organize and present ideas. Finally, because it is much easier to create multiple-choice items that test recall and recognition rather than higher order thinking, multiple-choice exams run the risk of not assessing the deep learning that many instructors consider important (Gronlund & Linn, 1990; McMillan, 2001).
Guidelines for developing multiple-choice items

There are nine primary guidelines for developing multiple-choice items (Gronlund & Linn, 1990; McMillan, 2001). Following these guidelines increases the validity and reliability of multiple-choice items that one might use for quizzes, homework assignments, and/or examinations.
Example 1: A Series of Multiple-Choice Items That Assess Higher Order Thinking:
Patient WC was admitted for 3rd degree burns over 75% of his body. The attending physician asks you to start this patient on antibiotic therapy. Which one of the following is the best reason why WC would need antibiotic prophylaxis?
a. His burn injuries have broken down the innate immunity that prevents microbial invasion.
b. His injuries have inhibited his cellular immunity.
c. His injuries have impaired antibody production.
d. His injuries have induced the bone marrow, thus activating the immune system.
Two days later, WC's labs showed: WBC 18,000 cells/mm3; 75% neutrophils (20% band cells); 15% lymphocytes; 6% monocytes; 2% eosinophils; and 2% basophils.
Which one of the following best describes WC's lab results?
a. Leukocytosis with left shift
b. Normal neutrophil cou