
Center for Research on Learning and Teaching
University of Michigan

BEST PRACTICES FOR DESIGNING AND GRADING EXAMS

Mary E. Piontek

Introduction

The most obvious function of assessment methods such as exams, quizzes, papers, presentations, etc., is to enable instructors to make judgments about the quality of student learning (i.e., assign grades). However, the methods of assessment used by faculty can also have a direct impact on the quality of student learning. Students assume that the focus of exams and assignments reflects the educational goals most valued by an instructor, and they direct their learning and studying accordingly (McKeachie & Svinicki, 2006). Given the importance of assessment for both faculty and student interactions about learning, how can instructors develop exams that provide useful and relevant data about their students' learning and also direct students to spend their time on the important aspects of a course or course unit? How do grading practices further influence this process?

Creating high quality educational assessments requires both art and science: the art of creatively engaging students in assessments that they view as fair and meaningful, and that produce relevant data about student achievement; and the science of assessment design, item writing, and grading procedures (Worthen, Borg, & White, 1993). This Occasional Paper provides an overview of the science of developing valid and reliable exams, especially multiple-choice and essay items. Additionally, the paper describes key issues related to grading: holistic and trait-analytic rubrics, and normative and criterion grading systems.

Guidelines for Designing Valid and Reliable Exams

Ideally, effective exams have four characteristics. They are valid (providing useful information about the concepts they were designed to test), reliable (allowing consistent measurement and discriminating between different levels of performance), recognizable (instruction has prepared students for the assessment), and realistic (concerning time and effort required to complete the assignment) (Svinicki, 1999a). The following six guidelines are designed to help you create such assessments:

1. Focus on the most important content and behaviors you have emphasized during the course (or particular section of the course). Start by identifying the primary ideas, issues, and skills students encountered during a particular course/unit/module. The distribution of items should reflect the relative emphasis you gave to content coverage and skill development (Gronlund & Linn, 1990).

Mary E. Piontek is Evaluation Researcher at the Center for Research on Learning and Teaching (CRLT). She has a Ph.D. in Measurement, Research, and Evaluation.

CRLT Occasional Papers, No. 24


If you teach for memorization, then you should assess for memorization or classification; if you emphasize problem-solving strategies, your exams should focus on assessing students' application and analysis skills. As a general rule, assessments that focus too heavily on details (e.g., isolated facts, figures, etc.) can have a negative impact on student learning. "Asking learners to recall particular pieces of the information they've been taught often leads to 'selective forgetting' of related information that they were not asked to recall." In fact, testing for "relatively unimportant points in the belief that 'testing for the footnotes' will enhance learning, … will probably lead to better student retention of the footnotes at the cost of the main points" (Halpern & Hakel, 2003, p. 40).
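As a rough illustration of matching item counts to emphasis (not part of the original paper; topic names and numbers below are invented), the following Python sketch allocates a hypothetical exam's items across topics in proportion to the instructional time each topic received.

```python
# A minimal test-blueprint sketch: allocate exam items in proportion to the
# instructional emphasis each topic received. Topics, weights, and the total
# item count are hypothetical placeholders, not taken from the paper.

def allocate_items(emphasis: dict[str, float], total_items: int) -> dict[str, int]:
    """Distribute total_items across topics proportionally to emphasis."""
    total_weight = sum(emphasis.values())
    allocation = {
        topic: round(total_items * weight / total_weight)
        for topic, weight in emphasis.items()
    }
    # Rounding can leave the total off by one or two; adjust the largest topic.
    diff = total_items - sum(allocation.values())
    largest = max(allocation, key=allocation.get)
    allocation[largest] += diff
    return allocation

# Example: weeks of class time devoted to each unit.
emphasis = {"Cell structure": 3, "Metabolism": 4, "Genetics": 2, "Evolution": 1}
print(allocate_items(emphasis, total_items=40))
# -> {'Cell structure': 12, 'Metabolism': 16, 'Genetics': 8, 'Evolution': 4}
```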

2. Write clearly and simply. Appropriate language level, sentence structure, and vocabulary are three essential characteristics of all assessment items. The items should be clear and succinct, with simple, declarative sentence structures.

3. Create more items than you will need so that you can choose those that best measure the important content and behaviors.

4. Group items of the same format (true-false, multiple-choice, essay) together, or group items by content or topical area.

5. Review the items after a "cooling off" period so that you can critique each item for its relevance to what you have actually taught (Thorndike, 1997; Worthen, et al., 1993).

6. Prepare test directions carefully to provide students with information about the test's purpose, time allowed, procedures for responding to items and recording answers, and scoring/grading criteria (Gronlund & Linn, 1990; McMillan, 2001).


Table 1: Advantages and Disadvantages of Commonly Used Types of Achievement Test Items

True-False
Advantages: Many items can be administered in a relatively short time. Moderately easy to write; easily scored.
Disadvantages: Limited primarily to testing knowledge of information. Easy to guess correctly on many items, even if material has not been mastered.

Multiple-Choice
Advantages: Can be used to assess broad range of content in a brief period. Skillfully written items can measure higher order cognitive skills. Can be scored quickly.
Disadvantages: Difficult and time consuming to write good items. Possible to assess higher order cognitive skills, but most items assess only knowledge. Some correct answers can be guesses.

Matching
Advantages: Items can be written quickly. A broad range of content can be assessed. Scoring can be done efficiently.
Disadvantages: Higher order cognitive skills are difficult to assess.

Short Answer or Completion
Advantages: Many can be administered in a brief amount of time. Relatively efficient to score. Moderately easy to write.
Disadvantages: Difficult to identify defensible criteria for correct answers. Limited to questions that can be answered or completed in very few words.

Essay
Advantages: Can be used to measure higher order cognitive skills. Relatively easy to write questions. Difficult for respondent to get correct answer by guessing.
Disadvantages: Time consuming to administer and score. Difficult to identify reliable criteria for scoring. Only a limited range of content can be sampled during any one testing period.

Adapted from Table 10.1 of Worthen, et al., 1993, p. 261.



Advantages and Disadvantages of Commonly Used Types of Assessment Items

As noted in Table 1, each type of exam item has its advantages and disadvantages in terms of ease of design, implementation, and scoring, and in its ability to measure different aspects of students' knowledge or skills. The following sections of this paper highlight general guidelines for developing multiple-choice and essay items. Multiple-choice and essay items are often used in college-level assessment because they readily lend themselves to measuring higher order thinking skills (e.g., application, justification, inference, analysis, and evaluation). Yet instructors often struggle to create, implement, and score these items (McMillan, 2001; Worthen, et al., 1993).

General Guidelines for Developing Multiple-Choice Items

Multiple-choice items have a number of advantages. First, multiple-choice items can measure various kinds of knowledge, including students' understanding of terminology, facts, principles, methods, and procedures, as well as their ability to apply, interpret, and justify. When carefully designed, multiple-choice items can assess higher-order thinking skills, as shown in Example 1, in which students are required to generalize, analyze, and make inferences about data in a medical patient case.

Multiple-choice items are less ambiguous than short-answer items, thereby providing a more focused assessment of student knowledge. Multiple-choice items are superior to true-false items in several ways: on true-false items, students can receive credit for knowing that a statement is incorrect, without knowing what is correct. Multiple-choice items offer greater reliability than true-false items as the opportunity for guessing is reduced with the larger number of options. Finally, an instructor can diagnose misunderstanding by analyzing the incorrect options chosen by students.
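As a simple illustration of such distractor analysis (not drawn from the paper; the responses and answer key below are invented), one can tally how often each option was selected for a single item:

```python
from collections import Counter

# Illustrative distractor analysis: count how often each option was chosen
# for one multiple-choice item. Responses and the answer key are invented.
responses = ["a", "c", "b", "c", "c", "d", "c", "a", "c", "b"]
answer_key = "b"

counts = Counter(responses)
for option in sorted(counts):
    flag = " (correct)" if option == answer_key else ""
    print(f"Option {option}: {counts[option]} students{flag}")

# A distractor chosen more often than the key (here 'c') suggests a shared
# misconception worth revisiting in class or rechecking in the item wording.
```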

A disadvantage of multiple-choice items is that they require developing incorrect, yet plausible, options that can be difficult for the instructor to create. In addition, multiple-choice questions do not allow instructors to measure students' ability to organize and present ideas. Finally, because it is much easier to create multiple-choice items that test recall and recognition rather than higher order thinking, multiple-choice exams run the risk of not assessing the deep learning that many instructors consider important (Gronlund & Linn, 1990; McMillan, 2001).

Guidelines for developing multiple-choice items

There are nine primary guidelines for developing multiple-choice items (Gronlund & Linn, 1990; McMillan, 2001). Following these guidelines increases the validity and reliability of multiple-choice items that one might use for quizzes, homework assignments, and/or examinations.


Example 1: A Series of Multiple-Choice Items That Assess Higher Order Thinking

Patient WC was admitted for 3rd degree burns over 75% of his body. The attending physician asks you to start this patient on antibiotic therapy. Which one of the following is the best reason why WC would need antibiotic prophylaxis?
a. His burn injuries have broken down the innate immunity that prevents microbial invasion.
b. His injuries have inhibited his cellular immunity.
c. His injuries have impaired antibody production.
d. His injuries have induced the bone marrow, thus activating the immune system.

Two days later, WC's labs showed: WBC 18,000 cells/mm3; 75% neutrophils (20% band cells); 15% lymphocytes; 6% monocytes; 2% eosinophils; and 2% basophils.

Which one of the following best describes WC's lab results?
a. Leukocytosis with left shift
b. Normal neutrophil count with left shift
c. High eosinophil count in response to allergic reactions
d. High lymphocyte count due to activation of adaptive immunity

(Jeong Park, U-M College of Pharmacy, personal communication, February 4, 2008)


The first four guidelines concern the item "stem," which poses the problem or question to which the choices refer.

1. Write the stem as a clearly described question, problem, or task.

2. Provide the information in the stem and keep the options as short as possible.

3. Include in the stem only the information needed to make the problem clear and specific.

The stem of the question should communicate the nature of the task to the students and present a clear problem or concept. The stem of the question should provide only information that is relevant to the problem or concept, and the options (distractors) should be succinct.

4. Avoid the use of negatives in the stem (use only when you are measuring whether the respondent knows the exception to a rule or can detect errors).

You can word most concepts in positive terms and thus avoid the possibility that students will overlook terms such as "no," "not," or "least" and choose an incorrect option not because they lack knowledge of the concept but because they have misread the stated question. Italicizing, capitalizing, bold-facing, or underlining the negative term makes it less likely to be overlooked.

The remaining five guidelines concern the choices from which students select their answer.

5. Have ONLY one correct answer.

Make certain that the item has one correct answer. Multiple-choice items usually have at least three incorrect options (distractors).

6. Write the correct response with no irrelevant clues.

A common mistake when designing multiple-choice questions is to write the correct option with more elaboration or detail, using more words, or using general terminology rather than technical terminology.

7. Write the distractors to be plausible yet clearly wrong.

An important, and sometimes difficult to achieve, aspect of multiple-choice items is ensuring that the incorrect choices (distractors) appear to be possibly correct. Distractors are best created from common errors or misunderstandings about the concept being assessed, and they should be homogeneous in content and parallel in form and grammar.

8. Avoid using "all of the above," "none of the above," or other special distractors (use only when an answer can be classified as unequivocally correct or incorrect).

"All of the above" and "none of the above" are often added as answer options to multiple-choice items. This technique requires the student to read all of the options and might increase the difficulty of the items, but too often the use of these phrases is inappropriate. "None of the above" should be restricted to items of factual knowledge with absolute standards of correctness. It is inappropriate for questions where students are asked to select "the best" answer. "All of the above" is awkward in that many students will choose it if they can identify at least one of the other options as correct and therefore assume all of the choices are correct, thereby obtaining a correct answer based on partial knowledge of the concept/content (Gronlund & Linn, 1990).

9. Use each alternative as the correct answer about the same number of times.

Check to see whether option "a" is correct about the same number of times as option "b" or "c" or "d" across the instrument. It can be surprising to find that one has created an exam in which the choice "a" is correct 90% of the time. Students quickly find such patterns and increase their chances of "correct guessing" by selecting that answer option by default.
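One quick, informal way to run this check is to count how often each letter appears in the answer key, as in the hypothetical sketch below (the key is invented):

```python
from collections import Counter

# Check whether correct answers are spread roughly evenly across option
# positions. The answer key below is an invented example.
answer_key = ["a", "a", "b", "a", "c", "a", "d", "a", "a", "b",
              "a", "c", "a", "a", "d", "a", "b", "a", "a", "c"]

counts = Counter(answer_key)
for option in "abcd":
    share = counts[option] / len(answer_key)
    print(f"{option}: {counts[option]:2d} items ({share:.0%})")

# If one letter dominates (here 'a' is correct over half the time),
# reassign some correct answers to other positions before giving the exam.
```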


Checklist for Writing Multiple-Choice Items

✓ Is the stem stated as clearly, directly, and simply as possible?
✓ Is the problem self-contained in the stem?
✓ Is the stem stated positively?
✓ Is there only one correct answer?
✓ Are all the alternatives parallel with respect to grammatical structure, length, and complexity?
✓ Are irrelevant clues avoided?
✓ Are the options short?
✓ Are complex options avoided?
✓ Are options placed in logical order?
✓ Are the distractors plausible to students who do not know the correct answer?
✓ Are correct answers spread equally among all the choices?

(McMillan, 2001, p. 150)



General Guidelines for Developing and Scoring Essay Items

Essays can tap complex thinking by requiring students to organize and integrate information, interpret information, construct arguments, give explanations, evaluate the merit of ideas, and carry out other types of reasoning (Cashin, 1987; Gronlund & Linn, 1990; McMillan, 2001; Thorndike, 1997; Worthen, et al., 1993). Table 2 provides examples of essay question stems for assessing a variety of reasoning skills.

In addition to measuring complex thinking and reasoning, advantages of essays include the potential for motivating better study habits and providing the students flexibility in their responses.


Table 2: Sample Essay Item Stems for Assessing Reasoning Skills

Comparing: Describe the similarities and differences between... / Compare the following two methods for...
Relating Cause and Effect: What are the major causes of... / What would be the most likely effects of...
Justifying: Which of the following alternatives do you favor and why? / Explain why you agree or disagree with the following statement.
Summarizing: State the main points included in... / Briefly summarize the contents of...
Generalizing: Formulate several valid generalizations for the following data. / State a set of principles that can explain the following events.
Inferring: In light of the information presented, what is most likely to happen when... / How would person X be likely to react to the following issue?
Classifying: Group the following items according to... / What do the following items have in common?
Creating: List as many ways as you can think of for/to... / Describe what would happen if...
Applying: Using the principles of... as a guide, describe how you would solve the following problem. / Describe a situation that illustrates the principle of...
Analyzing: Describe the reasoning errors in the following paragraph. / List and describe the main characteristics of...
Synthesizing: Describe a plan for proving that... / Write a well-organized report that shows...
Evaluating: Describe the strengths and weaknesses of... / Using the given criteria, write an evaluation of...

Adapted from Figure 7.11 of McMillan, 2001, p. 186.



Instructors can evaluate how well students are able to communicate their reasoning with essay items, and essay items are usually less time consuming to construct than multiple-choice items that measure reasoning. It should, however, be noted that creating a high quality essay item takes a significant amount of skill and time.

The major disadvantages of essays include the amount of time faculty must devote to reading and scoring student responses, and the importance of developing and using carefully constructed criteria/rubrics to ensure reliability of scoring. Essays can assess only a limited amount of content in one testing period/exam due to the length of time required for students to respond to each essay item. As a result, essays do not provide a good sampling of content knowledge across a curriculum (Gronlund & Linn, 1990; McMillan, 2001).

Types of essay items

We can distinguish between two types of essay questions, restricted-response and extended-response. A restricted-response essay focuses on assessing basic knowledge and understanding, and generally requires a relatively brief written response. Restricted-response essay questions assess topics of limited scope, and the nature of the question confines the form of the written response (see Example 2).

Extended-response essay items allow students to construct a variety of strategies, processes, interpretations, and explanations for a given question, and to provide any information they consider relevant (see Example 3). The flexibility of an extended-response item can make it less efficient for measuring specific learning outcomes than a restricted-response item, but it allows for greater opportunity to assess students' organization, integration, and evaluation abilities.

Guidelines for developing essay items

There are six primary guidelines for developing essay items (Gronlund & Linn, 1990; McMillan, 2001; Worthen, et al., 1993).

1. Restrict the use of essay questions to educational outcomes that are difficult to measure using other formats.

2. Construct the item to elicit skills and knowledge in the educational outcomes.

3. Write the item so that students clearly understand the specific task.

Other assessment formats are better for measuring recall knowledge (e.g., true-false, fill-in-the-blank, multiple-choice); the essay is able to measure deep understanding and mastery of complex information. When constructing essay items, start by identifying the specific skills and knowledge that will be assessed. As noted earlier, Table 2 provides examples of essay item question stems for assessing a variety of reasoning skills. Building from these simple stem formats, you can create a restricted-response or extended-response item to assess one or more skills or content topics.

Example 2: Restricted-Response Items

State two hypotheses about why birds migrate. Summarize the evidence supporting each hypothesis.

(Worthen, et al., 1993, p. 277)

Identify and explain the significance of the following terms. For each term, be sure to define what it means, explain why it is important in American politics, and link it to other related concepts that we have covered in class. 1) Conformity costs 2) Voting Age Population (VAP)

(Charles Shipan, UM Department of Political Science, personal communication, February 4, 2008)

Example 3: Extended-Response Items

Compare and contrast the social conditions, prevailing political thought, and economic conditions in the U.S. North and South just prior to the outbreak of the Civil War; defend the issue that you believe was the most significant catalyst to the war.

(Worthen, et al., 1993, p. 276)

The framers of the Constitution strove to create an effective national government that balanced the tension between majority rule and the rights of minorities. What aspects of American politics favor majority rule? What aspects protect the rights of those not in the majority? Drawing upon material from your readings and the lectures, did the framers successfully balance this tension? Why or why not?

(Charles Shipan, UM Department of Political Science, personal communication, February 4, 2008)



Once you have identified the specific skills and knowledge, you should word the question clearly and concisely so that it communicates to the students the specific task(s) you expect them to complete (e.g., state, formulate, evaluate, use the principle of, create a plan for, etc.). If the language is ambiguous or students feel they are guessing at "what the instructor wants me to do," the ability of the item to measure the intended skill or knowledge decreases.

4. Indicate the amount of time and effort students should spend on each essay item.

In essay items, especially when used in multiples and/or combined with other item formats, you should provide students with a general time limit or time estimate to help them structure their responses. Providing estimates of the length of written responses to each item can also help students manage their time, providing cues about the depth and breadth of information that is required to complete the item. In restricted-response items a few paragraphs are usually sufficient to complete a task focusing on a single educational outcome.

5. Avoid giving students options as to which essay questions they will answer.

A common structure in many exams is to provide students with a choice of essay items to complete (e.g., "choose two out of the three essay questions to complete…"). Instructors, and many students, often view essay choice as a way to increase the flexibility and fairness of the exam by allowing students to focus on those items for which they feel most prepared. However, the choice actually decreases the validity and reliability of the instrument because each student is essentially taking a different test.

Creating parallel essay items (from which students choose a subset) that test the same educational objectives (skills, knowledge) is very difficult, and unless students are answering the same questions that measure the same outcomes, scoring the essay items and the inferences made about student ability are less valid. While allowing students a choice gives them the perception that they have the opportunity to do their best work, you must also recognize that choice entails difficulty in drawing consistent and valid conclusions about student answers and performance.

6. Consider using several narrowly focused items rather than one broad item.

For many educational objectives aimed at higher order reasoning skills, creating a series of essay items that elicit different aspects of students' skills and knowledge can be more efficient than attempting to create one question to capture multiple objectives. By using multiple essay items (which all students complete), you can capture a variety of skills and knowledge while also covering a greater breadth of course content.

Guidelines for scoring essay items

There are five primary guidelines for scoring essay items (Gronlund & Linn, 1990; McMillan, 2001; Wiggins, 1998; Worthen, et al., 1993; Writing and grading essay questions, 1990).

1. Outline what constitutes an expected answer (criteria for knowledge and skills).

Developing the criteria for what you expect students to know and be able to demonstrate in advance of giving the exam allows you to refine the wording of the essay questions so that they best elicit the content, style, and format you desire.

Identifying the criteria in advance also decreases the likelihood that you will be influenced by the initial answers you read when you begin to grade the exam. It is easy to become distracted by the first few answers one reads for a given essay item and allow the skills and knowledge (or lack thereof) demonstrated in these few students' answers to set the scoring criteria, rather than relying on sound criteria based on the educational outcomes of the course curricula. Using predetermined criteria increases the fairness, accuracy, and consistency of the scoring process.

2. Select an appropriate scoring method based on the criteria.

Generally we describe scoring methods for written communication as either trait analytic, where each identified criterion is assigned separate points, or holistic, where a single score is generated resulting in an overall judgment of the quality. Restricted-response essays are often graded using the trait analytic method and extended-response essays, the holistic method, but either method can be used with either item type. To implement your approach, you can develop a rubric or scoring key that specifies the educational criteria for scoring (as described in the previous guideline) and the amount of credit or points to be assigned for each criterion.





The validity of a scoring rubric is affected by two factors: how clearly the descriptions in the rubric's categories discriminate between levels of student performance, and the degree of congruence between the criteria and the knowledge and skills embedded in the instruction. In other words, do the criteria outlined in the rubric genuinely reflect the tasks that are important for answering the question (use of examples, rationale for inferences, etc.)? Or do they instead describe general behaviors useful when answering any essay question (engaging tone, interesting format, etc.) but not critical to the specific tasks at hand? Example 4 shows a rubric for grading critical papers that clearly and succinctly describes expectations for conceptual structure, rhetorical structure, thesis statement, evidence and analysis, paragraph/section structure, and sentence mechanics for five levels of grades (A-D, F).

Finally, effective rubrics should avoid the use of comparative language such as "better than, more than, worse than, less than…" in discriminating between the levels of performance. Rather, the rubric should describe specifically those characteristics that define and delineate that level of performance, as is shown in Example 5, a rubric for four levels of performance on critical thinking.

Trait analytic scoring provides students with specific feedback about each criterion and a summed score out of the possible total points, but can be very time consuming for the instructor to implement. Restricting the educational criteria to only those most essential for an individual essay question, perhaps 4-6 criteria, and clearly outlining how points will be awarded (or deducted) streamlines the process (McMillan, 2001).
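To make the trait-analytic bookkeeping concrete, here is a minimal sketch, assuming an invented set of criteria and point values; it simply sums the points awarded per criterion into an item score.

```python
# Minimal trait-analytic scoring sketch: each criterion carries its own points,
# and the item score is their sum. Criteria and point values are invented.

rubric = {                      # criterion -> maximum points
    "Thesis clearly stated": 4,
    "Evidence from readings": 6,
    "Analysis and reasoning": 6,
    "Organization": 4,
}

def score_response(awarded: dict[str, int]) -> int:
    """Sum awarded points, capping each criterion at its maximum."""
    return sum(min(points, rubric[criterion]) for criterion, points in awarded.items())

one_student = {"Thesis clearly stated": 4, "Evidence from readings": 4,
               "Analysis and reasoning": 5, "Organization": 3}
print(score_response(one_student), "/", sum(rubric.values()))   # 16 / 20
```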

Concerns about fairness, accuracy, and consistency of the essay scoring process increase when more than one grader has responsibility for scoring essay items. Such a situation often occurs in large courses where the faculty member and multiple graduate student instructors share the responsibility. Developing criteria and scoring rubrics together and defining how the scoring rubrics will be implemented helps to establish common understandings about expectations for students' work. It is also helpful to have each member of the group grade a small subset of essay responses and then meet together to compare and calibrate their scoring to ensure consistency before completing the scoring for the entire set of exams (Writing and grading essay questions, 1990).
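A very rough way to check calibration on that shared subset (illustrative only; grader names and scores are invented) is to compare each grader's average and the score spread on each essay:

```python
from statistics import mean

# Rough calibration check: all graders score the same small subset of essays,
# then compare means and per-essay spread before dividing up the full stack.
calibration_scores = {           # grader -> scores on the same five essays
    "Grader 1": [8, 6, 9, 5, 7],
    "Grader 2": [7, 6, 8, 5, 7],
    "Grader 3": [9, 8, 10, 7, 9],
}

for grader, scores in calibration_scores.items():
    print(f"{grader}: mean {mean(scores):.1f}")

# Per-essay range across graders; large ranges point to criteria needing discussion.
for i, essay_scores in enumerate(zip(*calibration_scores.values()), start=1):
    print(f"Essay {i}: scores {essay_scores}, range {max(essay_scores) - min(essay_scores)}")
```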

3. Clarify the role of writing mechanics and other factors independent of the educational outcomes being measured.

For example, you can outline for students how various elements of written communication (grammar, spelling, punctuation, organization and flow, use of vocabulary/terminology, use of scientific notation or formulas) figure into the scoring criteria. You should also decide whether you will decrease scores or ignore the inclusion of information irrelevant to the question.

4. Use a systematic process for scoring each essay item.

5. Create anonymity for students’ responses while scoring.

Assessment guidelines suggest scoring all answers for an individual essay question in one continuous process, rather than scoring all answers to all questions for an individual student. In other words, when faced with a set of exams to score, read and score all of essay question #1, then shuffle the set of exams to randomly reorder them, and then read and score all of essay question #2. Grading the same essay question for all students creates a more uniform standard of scoring as it is easier to remember the criteria for scoring each answer. Shuffling decreases the effect of grader fatigue on the accuracy of the scoring, as exams that might otherwise be scored at the end of the pile (e.g., students with last names at the end of the alphabet or higher I.D. numbers) are shuffled throughout the scoring process and scored randomly among the others.

Example 5: Levels of Performance for Critical Thinking

4 = Exemplary: Clearly defines the issue or problem; accurately identifies the core issues; appreciates depth and breadth of problem. Identifies and evaluates relevant significant points of view.
3 = Satisfactory: Defines the issue; identifies the core issues, but does not fully explore their depth and breadth. Identifies and evaluates relevant general points of view.
2 = Below Satisfactory: Defines the issue, but poorly (superficially, narrowly); overlooks some core issues. Identifies other points of view but focuses on insignificant points of view.
1 = Unsatisfactory: Fails to clearly define the issue or problem; does not recognize the core issues. Ignores or superficially evaluates alternate points of view.

(Adapted from Critical Thinking Rubric, 2008)


Example 4: Grading Rubric for Critical Papers (letter grades A, B, C, D, F)

Conceptual Structure
A: cogent analysis, shows command of interpretive and conceptual tasks required by assignment and course materials: ideas original, often insightful, going beyond ideas discussed in lecture and class
B: shows a good understanding of the texts, ideas and methods of the assignment; goes beyond the obvious; may have one minor factual or conceptual inconsistency
C: shows an understanding of the basic ideas and information involved in the assignment; may have some factual, interpretive, or conceptual errors
D: shows inadequate command of materials or has significant factual and conceptual errors; confuses some significant ideas
F: lacks critical understanding of lectures, readings, discussions, or assignments

Rhetorical Structure
A: contains a convincing argument with a compelling purpose; highly responsive to the demands of the specific writing situation; sophisticated use of conventions of academic discipline and genre; anticipates the reader's need for information, explanation, and context
B: addresses audience with a thoughtful argument with a clear purpose; responds directly to the demands of a specific writing situation; competent use of the conventions of academic discipline and genre; addresses the reader's need for information, explanation, context
C: presents adequate response to the essay prompt; pays attention to the basic elements of the writing situation; shows sufficient competence in the conventions of academic discipline and genre; signals the importance of the reader's need for information, explanation, and context
D: shows serious weaknesses in addressing an audience; unresponsive to the specific writing situation; poor articulation of purpose in academic writing; often states the obvious or the inappropriate
F: shows severe difficulties communicating through academic writing

Thesis
A: essay controlled by clear, precise, well-defined thesis; is sophisticated in both statement and insight
B: clear, specific, arguable thesis central to the essay; may have left minor terms undefined
C: general thesis or controlling idea; may not define several central terms
D: thesis vague or not central to argument; central terms not defined
F: no discernible thesis

Evidence and Analysis
A: well-chosen examples; uses persuasive reasoning to develop and support thesis consistently; uses specific quotations, aesthetic details, or citations of scholarly sources effectively; logical connections between ideas are evident
B: pursues explanation and proof of thesis consistently; develops a main argument with explicit major points, appropriate textual evidence, and supporting detail
C: only partially develops the argument; shallow analysis; some ideas and generalizations undeveloped or unsupported; makes limited use of textual evidence; fails to integrate quotations appropriately; warrants missing
D: frequently only narrates; digresses from one topic to another without developing ideas or terms; makes insufficient or awkward use of textual evidence; relies on too few or the wrong type of sources
F: little or no development; may list disjointed facts or misinformation; uses no quotations or fails to cite sources or plagiarizes

Paragraph/Section Structure
A: well-constructed paragraphs; appropriate, clear, and smooth transitions; apt arrangement of organizational elements
B: distinct units of thought in paragraphs controlled by specific, detailed, and arguable topic sentences; clear transitions between developed, cohering, and logically arranged paragraphs
C: some awkward transitions; some brief, weakly unified or undeveloped paragraphs; arrangement may not appear entirely natural; contains extraneous information
D: simplistic, tends to narrate or merely summarize; wanders from one topic to another; illogical arrangement of ideas
F: no transitions; incoherent paragraphs; suggests poor planning or no serious revision

Sentence Mechanics
A: uses sophisticated sentences effectively; usually chooses words aptly; observes professional conventions of written English and manuscript format; makes very few minor or technical errors
B: a few mechanical difficulties or stylistic problems; may make occasional problematic word choices or syntax errors; a few spelling or punctuation errors; usually presents quotations effectively, using appropriate format
C: frequent wordiness; unclear or awkward sentences; imprecise use of words or over-reliance on passive voice; contains rudimentary grammatical errors; makes effort to present quotations accurately
D: some major grammatical or proofreading errors (subject-verb agreement, sentence fragments, word form errors, etc.); repeated inexact word choices; incorrect quotation or citation format
F: numerous major and minor grammatical errors and stylistic problems; does not meet Standard Written English requirement

(T. McElroy and F. Whiting, University of Alabama, personal communication, February 15, 2008)




The shuffling also creates a random order in which to grade each essay item, making it less likely that you will identify a pattern in an individual student's answers or base your score on previous impressions of that student. You might also choose to completely cover students' names/I.D. numbers so that you cannot identify the student until after you have graded the entire exam.
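The sketch below illustrates this workflow with invented data structures: it scores one essay question across all exams, reshuffles the stack between questions, and refers to exams only by an opaque code so the grader does not see student names.

```python
import random

# Illustrative scoring workflow: grade one essay question across all exams,
# reshuffle, then grade the next question. Exam codes stand in for names so
# the grader cannot identify students while scoring. All data are invented.
exams = [{"code": f"EX-{i:03d}", "answers": {"Q1": "...", "Q2": "..."}} for i in range(1, 6)]

def grade(answer_text: str, question: str) -> int:
    """Placeholder for applying the rubric to one answer."""
    return random.randint(0, 10)   # stand-in for a real rubric-based score

scores = {exam["code"]: {} for exam in exams}
for question in ["Q1", "Q2"]:
    random.shuffle(exams)                       # new random order for each question
    for exam in exams:
        scores[exam["code"]][question] = grade(exam["answers"][question], question)

print(scores)
```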

You can also use these guidelines for scoring essay items to create scoring processes and rubrics for students' papers, oral presentations, course projects/products/artifacts, and websites/technological tools.

Normative and Criterion Grading Systems

Fair and accurate grading depends on the development of effective and well-designed exams and assignments as described above. Just as important to the process is the approach you take to assigning grades. In general, we can divide these approaches to grading into two main categories: norm-referenced and criterion-referenced systems. Whatever system you use needs to be fair, distinguish clearly between different levels of achievement, and enable students to track their progress in the course (Grading systems, 1991).

Norm-referenced grading

When using a norm-referenced approach, faculty compare a student's performance to the norm of the class as a whole. A normative system assumes that skills and knowledge are distributed throughout the population (norm group). Assessment is designed to maximize the difference in student performance, for example by avoiding exam questions that almost all students will answer correctly (Brown, 1983; Grading systems, 1991; Gronlund & Linn, 1990; Isaac & Michael, 1990; Svinicki, 1999a; Svinicki, 1999b). Advocates of norm-referenced grading argue it has several benefits. Norm-referenced grading allows for flexibility in assigning grades to cohorts of students that are unusually high- or low-achieving, thus allowing distinctions to be made. It also allows faculty to correct for exams that turn out to be much more or less difficult than anticipated. Norm-referenced grading can potentially combat grade inflation by restricting the number of students receiving the highest grades. Finally, this approach can be valuable because it identifies students who stand out in their cohort, which can be helpful for faculty writing letters of recommendation.

The most common version of this approach is grading on the curve, in which grades on an exam are placed on a standard bell curve where, for example, 10% receive "A" or "F," 25% receive "B" or "D," and 35% receive the average grade of "C." However, a strict bell curve demonstrates the major drawbacks of norm-referenced grading, especially in relationship to student motivation and cooperation. Students realize in normative grading that some students must receive low/failing grades. It can be difficult for an individual student to gauge his/her performance during the semester and develop strategies for remediation because final grades are dependent upon how all others perform in total and not just on that individual student's skills and abilities (Brown, 1983; Grading systems, 1991; Gronlund & Linn, 1990; Isaac & Michael, 1990; Svinicki, 1999a; Svinicki, 1999b). Because students are competing for grades, faculty may also find it very difficult to implement any type of cooperative learning or peer support. In practice, faculty rarely use a strict bell curve in which 35% would receive a D or F. However, even a more forgiving curve has the potential to dampen motivation and cooperation in a classroom.
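To show the mechanics of such a curve (not to endorse the practice), the following sketch assigns letter grades purely by class rank using the 10%/25%/35%/25%/10% split described above; the scores are invented.

```python
# Curve-grading sketch using the 10% A / 25% B / 35% C / 25% D / 10% F split
# described above. Scores are invented; real curves are usually more forgiving.
scores = {"s01": 91, "s02": 84, "s03": 78, "s04": 75, "s05": 73,
          "s06": 70, "s07": 66, "s08": 61, "s09": 55, "s10": 42}

cut_fractions = [("A", 0.10), ("B", 0.25), ("C", 0.35), ("D", 0.25), ("F", 0.10)]

ranked = sorted(scores, key=scores.get, reverse=True)   # highest score first
grades, start = {}, 0
for letter, fraction in cut_fractions:
    count = round(fraction * len(ranked))
    for student in ranked[start:start + count]:
        grades[student] = letter
    start += count
for student in ranked[start:]:          # leftovers from rounding get the lowest grade
    grades[student] = cut_fractions[-1][0]

print(grades)   # grade depends only on rank within the class, not on absolute score
```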

Checklist for Writing Essay Items

✓ Is the targeted reasoning skill measured?
✓ Is the task clearly specified?
✓ Is there enough time to answer the questions?
✓ Are choices among several questions avoided?

Checklist for Scoring Essays

✓ Is the answer outlined before testing students?
✓ Is the scoring method (holistic or trait analytic) appropriate?
✓ Is the role of writing mechanics clarified?
✓ Are items scored one at a time?
✓ Is the order in which the tests are graded changed?
✓ Is the identity of the student anonymous?

(McMillan, 2001, pp. 185, 188)



The consequences, especially for science majors, can be significant, causing students to question whether they should persist in their major. Studies on persistence in the sciences have found that competitive grading designed to sort students – especially in introductory, gateway courses – can have a significant impact on students' decisions to leave the sciences. The effect is particularly pronounced for underrepresented students such as women and minorities in the sciences (Seymour & Hewitt, 1997).

Criterion-referenced grading

Criterion-referenced grading focuses on the absolute performance against predetermined criteria. Criterion systems assume that grades should measure "how much" a student has mastered a specified body of knowledge or set of skills. Assessment items are created to ensure content coverage, irrespective of whether they are easy or difficult (Brown, 1983; Grading systems, 1991; Gronlund & Linn, 1990; Isaac & Michael, 1990; Svinicki, 1999a; Svinicki, 1999b). This approach to grading rewards students for their effort and conveys that anyone in the class can achieve excellent results if they meet the standards. As a result, it is well suited to collaboration rather than competition. It can also support mastery learning (Svinicki, 1998), especially if students understand the criteria in advance so that they can direct their studying accordingly.

Criterion-referenced grading does have drawbacks as well. Students may perceive as arbitrary the system by which cut scores are determined in criterion systems (e.g., an A is 90% and above, thus 89% is a B) and may aggressively pursue faculty for "extra credit" assignments or lobby for additional points. Criterion systems also require instructors to consider the relative difficulty of each assessment and weight them accordingly (i.e., more difficult or comprehensive assessments should be given more weight). In a poorly designed criterion system, a student who performs well on simple, less rigorous assessments (quizzes, homework) may achieve a higher grade than a student who performed well on more difficult assessments (midterm/final exams) if the simple assignments are weighted more heavily in the final grade calculation (Grading systems, 1991). Criterion systems are sometimes seen as contributors to grade inflation, as it is possible for all students to achieve grades of A if they meet the criteria for proficiency.
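The weighting point can be illustrated with a small sketch (component names, weights, and cut scores below are invented): a weighted criterion-referenced final grade in which exams count for more than homework and quizzes, followed by fixed cut scores.

```python
# Criterion-referenced grading sketch: weight components by their rigor and
# apply fixed cut scores. Component names, weights, and scores are invented.
weights = {"homework": 0.15, "quizzes": 0.15, "midterm": 0.30, "final": 0.40}
cut_scores = [(90, "A"), (80, "B"), (70, "C"), (60, "D"), (0, "F")]

def final_grade(percent_scores: dict[str, float]) -> tuple[float, str]:
    """Weighted total (0-100) and the letter grade from fixed cut scores."""
    total = sum(weights[part] * percent_scores[part] for part in weights)
    letter = next(grade for cutoff, grade in cut_scores if total >= cutoff)
    return round(total, 1), letter

# A student strong on low-stakes work but weaker on exams...
print(final_grade({"homework": 98, "quizzes": 95, "midterm": 78, "final": 74}))
# ...ends up with a B here, because the exams carry most of the weight.
```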

In practice, most faculty will use some combination of systems to determine student grades. Faculty using a norm-referenced approach may look at the quality of student work when setting the cutoff points on a curve. When using a criterion-referenced system, faculty may look at the numbers of students receiving very high or very low scores to help them determine how they will set the actual grades. As you make choices about grading systems it might be worth considering the assumptions that underlie your own view of the purpose of grading. As McKeachie and Svinicki (2006) frame the issue, "Is the purpose of the grade to identify the 'best' students in a group (norm referencing), or is it to indicate what each student has achieved (criterion referencing)? Both are legitimate positions and can be and are argued for vociferously" (p. 132).

Conclusion

Regardless of the type of assessment instrument developed to measure students' knowledge and skill in college-level courses, it is important to remember to explain the purpose of assessment to ourselves and our students, understand the importance of using valid and reliable instruments, and carefully consider judgments made about data from those assessment instruments. By following guidelines for designing, implementing, and scoring exams, instructors can enhance the validity, reliability, and utility of the information they collect about students' learning. High quality assessment provides instructors and students opportunities for more meaningful and relevant teaching and learning in the classroom.

References

Brown, F. G. (1983). Principles of educational and psychological testing (3rd ed.). New York: Holt, Rinehart and Winston.

Cashin, W. E. (1987). Improving essay tests. Idea Paper, No. 17. Manhattan, KS: Center for Faculty Evaluation and Development, Kansas State University.

Critical thinking rubric. (2008). Dobson, NC: Surry Community College.

Grading systems. (1991, April). For Your Consideration, No. 10. Chapel Hill, NC: Center for Teaching and Learning, University of North Carolina at Chapel Hill.

Gronlund, N. E., & Linn, R. L. (1990). Measurement and evaluation in teaching (6th ed.). New York: Macmillan Publishing Company.


Halpern, D. H., & Hakel, M. D. (2003). Applying the science of learning to the university and beyond. Change, 35(4), 37-41.

Isaac, S., & Michael, W. B. (1990). Handbook in research and evaluation. San Diego, CA: EdITS Publishers.

McKeachie, W. J., & Svinicki, M. D. (2006). Assessing, testing, and evaluating: Grading is not the most important function. In McKeachie's teaching tips: Strategies, research, and theory for college and university teachers (12th ed., pp. 74-86). Boston: Houghton Mifflin Company.

McMillan, J. H. (2001). Classroom assessment: Principles and practice for effective instruction. Boston: Allyn and Bacon.

Seymour, E., & Hewitt, N. M. (1997). Talking about leaving: Why undergraduates leave the sciences. Boulder, CO: Westview Press.

Svinicki, M. D. (1998). Helping students understand grades. College Teaching, 46(3), 101-105.

Svinicki, M. D. (1999a). Evaluating and grading students. In Teachers and students: A sourcebook for UT-Austin faculty (pp. 1-14). Austin, TX: Center for Teaching Effectiveness, University of Texas at Austin.

Svinicki, M. D. (1999b). Some pertinent questions about grading. In Teachers and students: A sourcebook for UT-Austin faculty (pp. 1-2). Austin, TX: Center for Teaching Effectiveness, University of Texas at Austin.

Thorndike, R. M. (1997). Measurement and evaluation in psychology and education. Upper Saddle River, NJ: Prentice-Hall, Inc.

Wiggins, G. P. (1998). Educative assessment: Designing assessments to inform and improve student performance. San Francisco: Jossey-Bass Publishers.

Worthen, B. R., Borg, W. R., & White, K. R. (1993). Measurement and evaluation in the schools. New York: Longman.

Writing and grading essay questions. (1990, September). For Your Consideration, No. 7. Chapel Hill, NC: Center for Teaching and Learning, University of North Carolina at Chapel Hill.

The CRLT Occasional Papers series is published on a variable schedule by the Center for Research on Learning and Teaching at the University of Michigan. Information about extra copies or back issues can be obtained by writing to: Publications, CRLT, 1071 Palmer Commons, 100 Washtenaw Avenue, Ann Arbor, MI 48109-2218.

Copyright 2008 The University of Michigan

CRLT Occasional Paper No. 24

Center for Research on Learning and Teaching
The University of Michigan
1071 Palmer Commons
100 Washtenaw Avenue
Ann Arbor, MI 48109-2218
