Center for Research on Learning and Teaching
University of Michigan
BEST PRACTICES FOR DESIGNING AND GRADING EXAMS
Mary E. Piontek

Mary E. Piontek is Evaluation Researcher at the Center for Research on Learning and Teaching (CRLT). She has a Ph.D. in Measurement, Research, and Evaluation.

Introduction
The most obvious function of assessment methods such as exams, quizzes, papers, presentations, etc., is to enable instructors to make judgments about the quality of student learning (i.e., assign grades). However, the methods of assessment used by faculty can also have a direct impact on the quality of student learning. Students assume that the focus of exams and assignments reflects the educational goals most valued by an instructor, and they direct their learning and studying accordingly (McKeachie & Svinicki, 2006). Given the importance of assessment for both faculty and student interactions about learning, how can instructors develop exams that provide useful and relevant data about their students' learning and also direct students to spend their time on the important aspects of a course or course unit? How do grading practices further influence this process?
Creating high quality educational assessments requires both art and science: the art of creatively engaging students in assessments that they view as fair and meaningful, and that produce relevant data about student achievement; and the science of assessment design, item writing, and grading procedures (Worthen, Borg, & White, 1993). This Occasional Paper provides an overview of the science of developing valid and reliable exams, especially multiple-choice and essay items. Additionally, the paper describes key issues related to grading: holistic and trait-analytic rubrics, and normative and criterion grading systems.
Guidelines for Designing Valid and Reliable Exams
Ideally, effective exams have four characteristics. They are valid (providing useful information about the concepts they were designed to test), reliable (allowing consistent measurement and discriminating between different levels of performance), recognizable (instruction has prepared students for the assessment), and realistic (concerning time and effort required to complete the assignment) (Svinicki, 1999a). The following six guidelines are designed to help you create such assessments:
1. Focus on the most important content and behaviors you have emphasized during the course (or particular section of the course). Start by asking yourself to identify the primary ideas, issues, and skills encountered by students during a particular course/unit/module. The distribution of items should reflect the relative emphasis you gave to content coverage and skill development (Gronlund & Linn, 1990). If you teach for memorization, then you should assess for memorization or classification; if you emphasize problem-solving strategies, your exams should focus on assessing students' application and analysis skills. As a general rule, assessments that focus too heavily on details (e.g., isolated facts, figures, etc.) can have a negative impact on student learning. "Asking learners to recall particular pieces of the information they've been taught often leads to 'selective forgetting' of related information that they were not asked to recall." In fact, testing for "relatively unimportant points in the belief that 'testing for the footnotes' will enhance learning,…will probably lead to better student retention of the footnotes at the cost of the main points" (Halpern & Hakel, 2003, p. 40).
2. Write clearly and simply. Appropriate language level, sentence structure, and vocabulary are three essential characteristics of all assessment items. The items should be clear and succinct, with simple, declarative sentence structures.
3. Create more items than you will need so that you can choose those that best measure the important content and behaviors.
4. Group the items by format (true-false, multiple-choice, essay) or by content or topical area.
5. Review the items after a "cooling off" period so that you can critique each item for its relevance to what you have actually taught (Thorndike, 1997; Worthen, et al., 1993).
6. Prepare test directions carefully to provide students with information about the test's purpose, time allowed, procedures for responding to items and recording answers, and scoring/grading criteria (Gronlund & Linn, 1990; McMillan, 2001).
Table 1: Advantages and Disadvantages of Commonly Used Types of Achievement Test Items

True-False
Advantages: Many items can be administered in a relatively short time. Moderately easy to write; easily scored.
Disadvantages: Limited primarily to testing knowledge of information. Easy to guess correctly on many items, even if material has not been mastered.

Multiple-Choice
Advantages: Can be used to assess broad range of content in a brief period. Skillfully written items can measure higher order cognitive skills. Can be scored quickly.
Disadvantages: Difficult and time consuming to write good items. Possible to assess higher order cognitive skills, but most items assess only knowledge. Some correct answers can be guesses.

Matching
Advantages: Items can be written quickly. A broad range of content can be assessed. Scoring can be done efficiently.
Disadvantages: Higher order cognitive skills are difficult to assess.

Short Answer or Completion
Advantages: Many can be administered in a brief amount of time. Relatively efficient to score. Moderately easy to write.
Disadvantages: Difficult to identify defensible criteria for correct answers. Limited to questions that can be answered or completed in very few words.

Essay
Advantages: Can be used to measure higher order cognitive skills. Relatively easy to write questions. Difficult for respondent to get correct answer by guessing.
Disadvantages: Time consuming to administer and score. Difficult to identify reliable criteria for scoring. Only a limited range of content can be sampled during any one testing period.

Adapted from Table 10.1 of Worthen, et al., 1993, p. 261.
Advantages and Disadvantages of Commonly Used Types of Assessment Items
As noted in Table 1, each type of exam item has its advantages and disadvantages in terms of ease of design, implementation, and scoring, and in its ability to measure different aspects of students' knowledge or skills. The following sections of this paper highlight general guidelines for developing multiple-choice and essay items. Multiple-choice and essay items are often used in college-level assessment because they readily lend themselves to measuring higher order thinking skills (e.g., application, justification, inference, analysis, and evaluation). Yet instructors often struggle to create, implement, and score these items (McMillan, 2001; Worthen, et al., 1993).
General Guidelines for Developing Multiple-Choice Items
Multiple-choice items have a number of advantages. First, multiple-choice items can measure various kinds of knowledge, including students' understanding of terminology, facts, principles, methods, and procedures, as well as their ability to apply, interpret, and justify. When carefully designed, multiple-choice items can assess higher-order thinking skills, as shown in Example 1, in which students are required to generalize, analyze, and make inferences about data in a medical patient case.
Multiple-choice items are less ambiguous than short-answer items, thereby providing a more focused assessment of student knowledge. Multiple-choice items are superior to true-false items in several ways: on true-false items, students can receive credit for knowing that a statement is incorrect, without knowing what is correct. Multiple-choice items offer greater reliability than true-false items because the opportunity for guessing is reduced with the larger number of options. Finally, an instructor can diagnose misunderstanding by analyzing the incorrect options chosen by students.
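To act on that diagnostic information, you can tally which options students actually chose on a given item. A minimal sketch in Python follows; the responses and the keyed answer are hypothetical, purely for illustration.

```python
from collections import Counter

# A minimal sketch of distractor analysis for one multiple-choice item.
# The student responses and the keyed answer ("c") are hypothetical.
responses = ["b", "a", "b", "b", "c", "b", "d", "b", "b", "a"]

tally = Counter(responses)
for option in "abcd":
    print(f"option {option}: chosen by {tally[option]} students")

# If one distractor (here "b") attracts most responses, the specific
# misconception it encodes is widespread and worth revisiting in class.
```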
A disadvantage of multiple-choice items is that they require developing incorrect, yet plausible, options that can be difficult for the instructor to create. In addition, multiple-choice questions do not allow instructors to measure students' ability to organize and present ideas. Finally, because it is much easier to create multiple-choice items that test recall and recognition rather than higher order thinking, multiple-choice exams run the risk of not assessing the deep learning that many instructors consider important (Gronlund & Linn, 1990; McMillan, 2001).
Guidelines for developing multiple-choice items

There are nine primary guidelines for developing multiple-choice items (Gronlund & Linn, 1990; McMillan, 2001). Following these guidelines increases the validity and reliability of multiple-choice items that one might use for quizzes, homework assignments, and/or examinations.
Example 1: A Series of Multiple-Choice Items That Assess Higher Order Thinking

Patient WC was admitted for 3rd degree burns over 75% of his body. The attending physician asks you to start this patient on antibiotic therapy. Which one of the following is the best reason why WC would need antibiotic prophylaxis?
a. His burn injuries have broken down the innate immunity that prevents microbial invasion.
b. His injuries have inhibited his cellular immunity.
c. His injuries have impaired antibody production.
d. His injuries have induced the bone marrow, thus activating the immune system.

Two days later, WC's labs showed: WBC 18,000 cells/mm3; 75% neutrophils (20% band cells); 15% lymphocytes; 6% monocytes; 2% eosinophils; and 2% basophils.

Which one of the following best describes WC's lab results?
a. Leukocytosis with left shift
b. Normal neutrophil count with left shift
c. High eosinophil count in response to allergic reactions
d. High lymphocyte count due to activation of adaptive immunity

(Jeong Park, U-M College of Pharmacy, personal communication, February 4, 2008)
The first four guidelines concern the item "stem," which poses the problem or question to which the choices refer.
1. Write the stem as a clearly described question, problem, or task.
2. Provide the information in the stem and keep the options as short as possible.
3. Include in the stem only the information needed to make the problem clear and specific.
The stem of the question should communicate the nature of the task to the students and present a clear problem or concept. It should provide only information that is relevant to the problem or concept, and the options (distractors) should be succinct.
4. Avoid the use of negatives in the stem (use only when you are measuring whether the respondent knows the exception to a rule or can detect errors).
You can word most concepts in positive terms and thus avoid the possibility that students will overlook terms such as "no," "not," or "least" and choose an incorrect option not because they lack knowledge of the concept but because they have misread the stated question. Italicizing, capitalizing, bold-facing, or underlining the negative term makes it less likely to be overlooked.
The remaining five guidelines concern the choices from which students select their answer.
5. Have ONLY one correct answer.
Make certain that the item has one correct answer. Multiple-choice items usually have at least three incorrect options (distractors).
6. Write the correct response with no irrelevant clues.
A common mistake when designing multiple-choice questions is to write the correct option with more elaboration or detail, using more words, or using general terminology rather than technical terminology.
7. Write the distractors to be plausible yet clearly wrong.
An important, and sometimes difficult to achieve, aspect of multiple-choice items is ensuring that the incorrect choices (distractors) appear to be possibly correct. Distractors are best created using common errors or misunderstandings about the concept being assessed, and making them homogeneous in content and parallel in form and grammar.
8. Avoid using "all of the above," "none of the above," or other special distractors (use only when an answer can be classified as unequivocally correct or incorrect).
"All of the above" and "none of the above" are often added as answer options to multiple-choice items. This technique requires the student to read all of the options and might increase the difficulty of the items, but too often the use of these phrases is inappropriate. "None of the above" should be restricted to items of factual knowledge with absolute standards of correctness. It is inappropriate for questions where students are asked to select "the best" answer. "All of the above" is awkward in that many students will choose it if they can identify at least one of the other options as correct and therefore assume all of the choices are correct, thereby obtaining a correct answer based on partial knowledge of the concept/content (Gronlund & Linn, 1990).
9. Use each alternative as the correct answer about the same number of times.
Check to see whether option "a" is correct about the same number of times as option "b" or "c" or "d" across the instrument. It can be surprising to find that one has created an exam in which the choice "a" is correct 90% of the time. Students quickly find such patterns and increase their chances of "correct guessing" by selecting that answer option by default.
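A quick tally of the answer key exposes such imbalances before the exam goes out. Here is a minimal sketch in Python; the answer key is hypothetical.

```python
from collections import Counter

# A minimal sketch: count how often each option letter is the keyed answer.
# The answer key below is hypothetical, for illustration only.
answer_key = ["a", "c", "a", "b", "a", "d", "a", "a", "c", "a"]

counts = Counter(answer_key)
for option in "abcd":
    share = counts[option] / len(answer_key)
    print(f"option {option}: keyed correct {counts[option]} times ({share:.0%})")

# A heavily skewed key (here "a" is correct 60% of the time) is a pattern
# students can exploit by guessing "a" whenever they are unsure.
```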
Checklist for Writing Multiple-Choice Items

✓ Is the stem stated as clearly, directly, and simply as possible?
✓ Is the problem self-contained in the stem?
✓ Is the stem stated positively?
✓ Is there only one correct answer?
✓ Are all the alternatives parallel with respect to grammatical structure, length, and complexity?
✓ Are irrelevant clues avoided?
✓ Are the options short?
✓ Are complex options avoided?
✓ Are options placed in logical order?
✓ Are the distractors plausible to students who do not know the correct answer?
✓ Are correct answers spread equally among all the choices?

(McMillan, 2001, p. 150)
General Guidelines for Developing and Scoring Essay Items
Essays can tap complex thinking by requiring students to organize and integrate information, interpret information, construct arguments, give explanations, evaluate the merit of ideas, and carry out other types of reasoning (Cashin, 1987; Gronlund & Linn, 1990; McMillan, 2001; Thorndike, 1997; Worthen, et al., 1993). Table 2 provides examples of essay question stems for assessing a variety of reasoning skills.
In addition to measuring complex thinking and reasoning, advantages of essays include the potential for motivating better study habits and providing the students flexibility in their responses. Instructors can evaluate how well students are able to communicate their reasoning with essay items, and essay items are usually less time consuming to construct than multiple-choice items that measure reasoning. It should, however, be noted that creating a high quality essay item takes a significant amount of skill and time.
Table 2: Sample Essay Item Stems for Assessing Reasoning Skills

Comparing: Describe the similarities and differences between... / Compare the following two methods for...

Relating Cause and Effect: What are the major causes of... / What would be the most likely effects of...

Justifying: Which of the following alternatives do you favor and why? / Explain why you agree or disagree with the following statement.

Summarizing: State the main points included in... / Briefly summarize the contents of...

Generalizing: Formulate several valid generalizations for the following data. / State a set of principles that can explain the following events.

Inferring: In light of the information presented, what is most likely to happen when... / How would person X be likely to react to the following issue?

Classifying: Group the following items according to... / What do the following items have in common?

Creating: List as many ways as you can think of for/to... / Describe what would happen if...

Applying: Using the principles of...as a guide, describe how you would solve the following problem. / Describe a situation that illustrates the principle of...

Analyzing: Describe the reasoning errors in the following paragraph. / List and describe the main characteristics of...

Synthesizing: Describe a plan for proving that... / Write a well-organized report that shows...

Evaluating: Describe the strengths and weaknesses of... / Using the given criteria, write an evaluation of...

Adapted from Figure 7.11 of McMillan, 2001, p. 186.
The major disadvantages of essays include the amount of time faculty must devote to reading and scoring student responses, and the importance of developing and using carefully constructed criteria/rubrics to ensure reliability of scoring. Essays can assess only a limited amount of content in one testing period/exam due to the length of time required for students to respond to each essay item. As a result, essays do not provide a good sampling of content knowledge across a curriculum (Gronlund & Linn, 1990; McMillan, 2001).
Types of essay items

We can distinguish between two types of essay questions, restricted-response and extended-response. A restricted-response essay focuses on assessing basic knowledge and understanding, and generally requires a relatively brief written response. Restricted-response essay questions assess topics of limited scope, and the nature of the question confines the form of the written response (see Example 2).
Extended-response essay items allow students to construct a variety of strategies, processes, interpretations, and explanations for a given question, and to provide any information they consider relevant (see Example 3). The flexibility of an extended-response item can make it less efficient for measuring specific learning outcomes than a restricted-response item, but it allows for greater opportunity to assess students' organization, integration, and evaluation abilities.
Guidelines for developing essay items

There are six primary guidelines for developing essay items (Gronlund & Linn, 1990; McMillan, 2001; Worthen, et al., 1993).
1. Restrict the use of essay questions to educational outcomes that are difficult to measure using other formats.
2. Construct the item to elicit the skills and knowledge in the educational outcomes.
3. Write the item so that students clearly understand the specific task.
Other assessment formats are better for measuring recall knowledge (e.g., true-false, fill-in-the-blank, multiple-choice); the essay is able to measure deep understanding and mastery of complex information. When constructing essay items, start by identifying the specific skills and knowledge that will be assessed. As noted earlier, Table 2 provides examples of essay item question stems for assessing a variety of reasoning skills. Building from these simple stem formats, you can create a restricted-response or extended-response item to assess one or more skills or content topics.
Example 2: Restricted-Response Items

State two hypotheses about why birds migrate. Summarize the evidence supporting each hypothesis.

(Worthen, et al., 1993, p. 277)

Identify and explain the significance of the following terms. For each term, be sure to define what it means, explain why it is important in American politics, and link it to other related concepts that we have covered in class. 1) Conformity costs 2) Voting Age Population (VAP)

(Charles Shipan, UM Department of Political Science, personal communication, February 4, 2008)

Example 3: Extended-Response Items

Compare and contrast the social conditions, prevailing political thought, and economic conditions in the U.S. North and South just prior to the outbreak of the Civil War; defend the issue that you believe was the most significant catalyst to the war.

(Worthen, et al., 1993, p. 276)

The framers of the Constitution strove to create an effective national government that balanced the tension between majority rule and the rights of minorities. What aspects of American politics favor majority rule? What aspects protect the rights of those not in the majority? Drawing upon material from your readings and the lectures, did the framers successfully balance this tension? Why or why not?

(Charles Shipan, UM Department of Political Science, personal communication, February 4, 2008)
Once you have identified the specific skills and knowledge, you should word the question clearly and concisely so that it communicates to the students the specific task(s) you expect them to complete (e.g., state, formulate, evaluate, use the principle of, create a plan for, etc.). If the language is ambiguous or students feel they are guessing at "what the instructor wants me to do," the ability of the item to measure the intended skill or knowledge decreases.
4. Indicate the amount of time and effort students should spend on each essay item.
For essay items, especially when used in multiples and/or combined with other item formats, you should provide students with a general time limit or time estimate to help them structure their responses. Providing estimates of the length of written responses to each item can also help students manage their time, providing cues about the depth and breadth of information that is required to complete the item. In restricted-response items, a few paragraphs are usually sufficient to complete a task focusing on a single educational outcome.
5. Avoid giving students options as to which essay questions they will answer.
A common structure in many exams is to provide students with a choice of essay items to complete (e.g., "choose two out of the three essay questions to complete…"). Instructors, and many students, often view essay choice as a way to increase the flexibility and fairness of the exam by allowing students to focus on those items for which they feel most prepared. However, the choice actually decreases the validity and reliability of the instrument because each student is essentially taking a different test.
Creating parallel essay items (from which students choose a subset) that test the same educational objectives (skills, knowledge) is very difficult, and unless students are answering the same questions that measure the same outcomes, scoring the essay items and the inferences made about student ability are less valid. While allowing students a choice gives them the perception that they have the opportunity to do their best work, you must also recognize that choice entails difficulty in drawing consistent and valid conclusions about student answers and performance.
6. Consider using several narrowly focused items rather than one broad item.
For many educational objectives aimed at higher order reasoning skills, creating a series of essay items that elicit different aspects of students' skills and knowledge can be more efficient than attempting to create one question to capture multiple objectives. By using multiple essay items (which all students complete), you can capture a variety of skills and knowledge while also covering a greater breadth of course content.
Guidelines for scoring essay items

There are five primary guidelines for scoring essay items (Gronlund & Linn, 1990; McMillan, 2001; Wiggins, 1998; Worthen, et al., 1993; Writing and grading essay questions, 1990).
1. Outline what constitutes an expected answer (criteria for knowledge and skills).
Developing the criteria for what you expect students to know and be able to demonstrate in advance of giving the exam allows you to refine the wording of the essay questions so that they best elicit the content, style, and format you desire.
Identifying the criteria in advance also decreases the likelihood that you will be influenced by the initial answers you read when you begin to grade the exam. It is easy to become distracted by the first few answers one reads for a given essay item and allow the skills and knowledge (or lack thereof) demonstrated in these few students' answers to set the scoring criteria, rather than relying on sound criteria based on the educational outcomes of the course curricula. Using predetermined criteria increases the fairness, accuracy, and consistency of the scoring process.
2. Select an appropriate scoring method based on the criteria.
Generally, we describe scoring methods for written communication as either trait analytic, where each identified criterion is assigned separate points, or holistic, where a single score is generated resulting in an overall judgment of the quality. Restricted-response essays are often graded using the trait analytic method and extended-response essays, the holistic method, but either method can be used with either item type. To implement your approach, you can develop a rubric or scoring key that specifies the educational criteria for scoring (as described in the previous guideline) and the amount of credit or points to be assigned for each criterion.
The validity of a scoring rubric is affected by two factors: how clearly the descriptions in the rubric's categories discriminate between levels of student performance, and the degree of congruence between the criteria and the knowledge and skills embedded in the instruction. In other words, do the criteria outlined in the rubric genuinely reflect the tasks that are important for answering the question (use of examples, rationale for inferences, etc.)? Or do they instead describe general behaviors useful when answering any essay question (engaging tone, interesting format, etc.) but not critical to the specific tasks at hand? Example 4 shows a rubric for grading critical papers that clearly and succinctly describes expectations for conceptual structure, rhetorical structure, thesis statement, evidence and analysis, paragraph/section structure, and sentence mechanics for five levels of grades (A-D, F).
Finally, effective rubrics should avoid the use of comparative language such as "better than, more than, worse than, less than…" in discriminating between the levels of performance. Rather, the rubric should describe specifically those characteristics that define and delineate that level of performance, as is shown in Example 5, a rubric for four levels of performance on critical thinking.
Trait analytic scoring provides students with specific feedback about each criterion and a summed score out of the possible total points, but can be very time consuming for the instructor to implement. Restricting the educational criteria to only those most essential for an individual essay question, perhaps 4-6 criteria, and clearly outlining how points will be awarded (or deducted) streamlines the process (McMillan, 2001).
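As a minimal sketch of trait-analytic scoring in Python (the criteria and point values below are hypothetical, not drawn from the paper):

```python
# A minimal sketch of trait-analytic scoring: each criterion carries its own
# points, and the item score is the sum. Criteria and maxima are hypothetical.
rubric = {
    "thesis clearly stated": 4,
    "evidence supports argument": 6,
    "counterarguments addressed": 4,
    "organization and transitions": 3,
    "writing mechanics": 3,
}

def score_response(points_awarded, rubric):
    """Clamp each criterion to its maximum and return (earned, possible)."""
    earned = sum(min(points_awarded.get(criterion, 0), maximum)
                 for criterion, maximum in rubric.items())
    return earned, sum(rubric.values())

earned, possible = score_response(
    {"thesis clearly stated": 4, "evidence supports argument": 5,
     "counterarguments addressed": 2, "organization and transitions": 3,
     "writing mechanics": 2},
    rubric)
print(f"{earned}/{possible}")  # 16/20
```

Keeping the per-criterion scores alongside the total is what makes the feedback specific: students see which of the 4-6 criteria cost them points.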
Concerns about fairness, accuracy, and consistency of the essay scoring process increase when more than one grader has responsibility for scoring essay items. Such a situation often occurs in large courses where the faculty member and multiple graduate student instructors share the responsibility. Developing criteria and scoring rubrics together and defining how the scoring rubrics will be implemented helps to establish common understandings about expectations for students' work. It is also helpful to have each member of the group grade a small subset of essay responses and then meet together to compare and calibrate their scoring to ensure consistency before completing the scoring for the entire set of exams (Writing and grading essay questions, 1990).
3. Clarify the role of writing mechanics and other factors independent of the educational outcomes being measured.
For example, you can outline for students how various elements of written communication (grammar, spelling, punctuation, organization and flow, use of vocabulary/terminology, use of scientific notation or formulas) figure into the scoring criteria. You should also decide whether you will decrease scores or ignore the inclusion of information irrelevant to the question.
4. Use a systematic process for scoring each essay item.
5. Create anonymity for students’ responses while scoring.
Assessment guidelines suggest scoring all answers for an individual essay question in one continuous process, rather than scoring all answers to all questions for an individual student. In other words, when faced with a set of exams to score, read and score all of essay question #1, then shuffle the set of exams to randomly reorder them, and then read and score all of essay question #2. Grading the same essay question for all students creates a more uniform standard of scoring, as it is easier to remember the criteria for scoring each answer. Shuffling decreases the effect of grader fatigue on the accuracy of the scoring, as exams that might otherwise be scored at the end of the pile (e.g., students with last names at the end of the alphabet or higher I.D. numbers) are shuffled throughout the scoring process and scored randomly among the others.
Example 5: Levels of Performance for Critical Thinking

4 = Exemplary: Clearly defines the issue or problem; accurately identifies the core issues; appreciates depth and breadth of problem. Identifies and evaluates relevant significant points of view.
3 = Satisfactory: Defines the issue; identifies the core issues, but does not fully explore their depth and breadth. Identifies and evaluates relevant general points of view.
2 = Below Satisfactory: Defines the issue, but poorly (superficially, narrowly); overlooks some core issues. Identifies other points of view but focuses on insignificant points of view.
1 = Unsatisfactory: Fails to clearly define the issue or problem; does not recognize the core issues. Ignores or superficially evaluates alternate points of view.

(Adapted from Critical Thinking Rubric, 2008)
Example 4: Grading Rubric for Critical Papers

Conceptual Structure
A: Cogent analysis, shows command of interpretive and conceptual tasks required by assignment and course materials: ideas original, often insightful, going beyond ideas discussed in lecture and class.
B: Shows a good understanding of the texts, ideas, and methods of the assignment; goes beyond the obvious; may have one minor factual or conceptual inconsistency.
C: Shows an understanding of the basic ideas and information involved in the assignment; may have some factual, interpretive, or conceptual errors.
D: Shows inadequate command of materials or has significant factual and conceptual errors; confuses some significant ideas.
F: Lacks critical understanding of lectures, readings, discussions, or assignments.

Rhetorical Structure
A: Contains a convincing argument with a compelling purpose; highly responsive to the demands of the specific writing situation; sophisticated use of conventions of academic discipline and genre; anticipates the reader's need for information, explanation, and context.
B: Addresses audience with a thoughtful argument with a clear purpose; responds directly to the demands of a specific writing situation; competent use of the conventions of academic discipline and genre; addresses the reader's need for information, explanation, context.
C: Presents adequate response to the essay prompt; pays attention to the basic elements of the writing situation; shows sufficient competence in the conventions of academic discipline and genre; signals the importance of the reader's need for information, explanation, and context.
D: Shows serious weaknesses in addressing an audience; unresponsive to the specific writing situation; poor articulation of purpose in academic writing; often states the obvious or the inappropriate.
F: Shows severe difficulties communicating through academic writing.

Thesis
A: Essay controlled by clear, precise, well-defined thesis; is sophisticated in both statement and insight.
B: Clear, specific, arguable thesis central to the essay; may have left minor terms undefined.
C: General thesis or controlling idea; may not define several central terms.
D: Thesis vague or not central to argument; central terms not defined.
F: No discernible thesis.

Evidence and Analysis
A: Well-chosen examples; uses persuasive reasoning to develop and support thesis consistently; uses specific quotations, aesthetic details, or citations of scholarly sources effectively; logical connections between ideas are evident.
B: Pursues explanation and proof of thesis consistently; develops a main argument with explicit major points, appropriate textual evidence, and supporting detail.
C: Only partially develops the argument; shallow analysis; some ideas and generalizations undeveloped or unsupported; makes limited use of textual evidence; fails to integrate quotations appropriately; warrants missing.
D: Frequently only narrates; digresses from one topic to another without developing ideas or terms; makes insufficient or awkward use of textual evidence; relies on too few or the wrong type of sources.
F: Little or no development; may list disjointed facts or misinformation; uses no quotations or fails to cite sources or plagiarizes.

Paragraph/Section Structure
A: Well-constructed paragraphs; appropriate, clear, and smooth transitions; apt arrangement of organizational elements.
B: Distinct units of thought in paragraphs controlled by specific, detailed, and arguable topic sentences; clear transitions between developed, cohering, and logically arranged paragraphs.
C: Some awkward transitions; some brief, weakly unified, or undeveloped paragraphs; arrangement may not appear entirely natural; contains extraneous information.
D: Simplistic, tends to narrate or merely summarize; wanders from one topic to another; illogical arrangement of ideas.
F: No transitions; incoherent paragraphs; suggests poor planning or no serious revision.

Sentence Mechanics
A: Uses sophisticated sentences effectively; usually chooses words aptly; observes professional conventions of written English and manuscript format; makes very few minor or technical errors.
B: A few mechanical difficulties or stylistic problems; may make occasional problematic word choices or syntax errors; a few spelling or punctuation errors; usually presents quotations effectively, using appropriate format.
C: Frequent wordiness; unclear or awkward sentences; imprecise use of words or over-reliance on passive voice; contains rudimentary grammatical errors; makes effort to present quotations accurately.
D: Some major grammatical or proofreading errors (subject-verb agreement, sentence fragments, word form errors, etc.); repeated inexact word choices; incorrect quotation or citation format.
F: Numerous major and minor grammatical errors and stylistic problems; does not meet Standard Written English requirement.

(T. McElroy and F. Whiting, University of Alabama, personal communication, February 15, 2008)
The shuffling also creates a random order in which to grade each essay item, making it less likely that you will identify a pattern in an individual student's answers or base your score on previous impressions of that student. You might also choose to completely cover students' names/I.D. numbers so that you cannot identify the student until after you have graded the entire exam.
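Here is a minimal sketch of this score-one-question-then-reshuffle procedure in Python; the exam records and the grade_answer placeholder are hypothetical stand-ins for your actual rubric-based grading.

```python
import random

# A minimal sketch of the question-by-question scoring procedure described
# above: grade every answer to one question, reshuffle the pile, then move
# on to the next question. grade_answer is a hypothetical placeholder.
def grade_exams(exams, num_questions, grade_answer):
    scores = {exam["student_id"]: [] for exam in exams}
    for q in range(num_questions):
        random.shuffle(exams)            # reshuffle the pile for each question
        for exam in exams:               # score question q for every student
            scores[exam["student_id"]].append(
                grade_answer(q, exam["answers"][q]))
    return scores

exams = [{"student_id": sid, "answers": ["...", "..."]}
         for sid in ("s1", "s2", "s3")]
# Dummy grader that awards 5 points to every answer, for demonstration.
print(grade_exams(exams, 2, lambda q, answer: 5))
```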
You can also use these guidelines for scoring essay items to create scoring processes and rubrics for students' papers, oral presentations, course projects/products/artifacts, and websites/technological tools.

Checklist for Writing Essay Items

✓ Is the targeted reasoning skill measured?
✓ Is the task clearly specified?
✓ Is there enough time to answer the questions?
✓ Are choices among several questions avoided?

Checklist for Scoring Essays

✓ Is the answer outlined before testing students?
✓ Is the scoring method (holistic or trait analytic) appropriate?
✓ Is the role of writing mechanics clarified?
✓ Are items scored one at a time?
✓ Is the order in which the tests are graded changed?
✓ Is the identity of the student anonymous?

(McMillan, 2001, pp. 185, 188)
Normative and Criterion Grading Systems
Fair and accurate grading depends on the development of effective and well-designed exams and assignments as described above. Just as important to the process is the approach you take to assigning grades. In general, we can divide these approaches to grading into two main categories: norm-referenced and criterion-referenced systems. Whatever system you use needs to be fair, distinguish clearly between different levels of achievement, and enable students to track their progress in the course (Grading systems, 1991).
Norm-referenced grading

When using a norm-referenced approach, faculty compare a student's performance to the norm of the class as a whole. A normative system assumes that skills and knowledge are distributed throughout the population (norm group). Assessment is designed to maximize the difference in student performance, for example by avoiding exam questions that almost all students will answer correctly (Brown, 1983; Grading systems, 1991; Gronlund & Linn, 1990; Isaac & Michael, 1990; Svinicki, 1999a; Svinicki, 1999b). Advocates of norm-referenced grading argue it has several benefits. Norm-referenced grading allows for flexibility in assigning grades to cohorts of students that are unusually high- or low-achieving, thus allowing distinctions to be made. It also allows faculty to correct for exams that turn out to be much more or less difficult than anticipated. Norm-referenced grading can potentially combat grade inflation by restricting the number of students receiving the highest grades. Finally, this approach can be valuable because it identifies students who stand out in their cohort, which can be helpful for faculty writing letters of recommendation.
The most common version of this approach is grading on the curve, in which grades on an exam are placed on a standard bell curve in which, for example, 10% receive "A" or "F," 25% receive "B" or "D," and 35% receive the average grade of "C." However, a strict bell curve demonstrates the major drawbacks of norm-referenced grading, especially in relationship to student motivation and cooperation. Students realize in normative grading that some students must receive low/failing grades. It can be difficult for an individual student to gauge his/her performance during the semester and develop strategies for remediation because final grades are dependent upon how all others perform in total and not just on that individual student's skills and abilities (Brown, 1983; Grading systems, 1991; Gronlund & Linn, 1990; Isaac & Michael, 1990; Svinicki, 1999a; Svinicki, 1999b). Because students are competing for grades, faculty may also find it very difficult to implement any type of cooperative learning or peer support. In practice, faculty rarely use a strict bell curve in which 35% would receive a D or F. However, even a more forgiving curve has the potential to dampen motivation and cooperation in a classroom.
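To make the mechanics concrete, here is a minimal sketch of rank-based curve grading in Python. The grade bands are illustrative assumptions chosen to sum to 100%, not the distribution quoted above, and any actual cutoffs would be an instructor's policy choice.

```python
# A minimal sketch of grading on the curve: grades are assigned by rank,
# not by absolute score. The bands below are hypothetical (they sum to 1.0).
def curve_grades(scores, bands=(("A", 0.10), ("B", 0.25), ("C", 0.30),
                                ("D", 0.25), ("F", 0.10))):
    """Assign letters by rank: top 10% get A, next 25% B, and so on."""
    n = len(scores)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    grades = [None] * n
    start = 0
    for letter, fraction in bands:
        count = round(fraction * n)
        for i in ranked[start:start + count]:
            grades[i] = letter
        start += count
    for i in ranked[start:]:  # rounding leftovers fall into the lowest band
        grades[i] = bands[-1][0]
    return grades

scores = [93, 88, 82, 79, 75, 71, 66, 60, 55, 48]
print(list(zip(scores, curve_grades(scores))))
# Note that a 79 earns only a "C" here: the grade depends entirely on rank
# within the cohort, which is exactly the property discussed above.
```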
The consequences, especially for science majors, can be significant, causing students to question whether they should persist in their major. Studies on persistence in the sciences have found that competitive grading designed to sort students, especially in introductory, gateway courses, can have a significant impact on students' decisions to leave the sciences. The effect is particularly pronounced for underrepresented students such as women and minorities in the sciences (Seymour & Hewitt, 1997).
Criterion-referenced grading

Criterion-referenced grading focuses on absolute performance against predetermined criteria. Criterion systems assume that grades should measure "how much" a student has mastered of a specified body of knowledge or set of skills. Assessment items are created to ensure content coverage, irrespective of whether they are easy or difficult (Brown, 1983; Grading systems, 1991; Gronlund & Linn, 1990; Isaac & Michael, 1990; Svinicki, 1999a; Svinicki, 1999b). This approach to grading rewards students for their effort and conveys that anyone in the class can achieve excellent results if they meet the standards. As a result, it is well suited to collaboration rather than competition. It can also support mastery learning (Svinicki, 1998), especially if students understand the criteria in advance so that they can direct their studying accordingly.
Criterion-referenced grading does have drawbacks as well. Students may perceive as arbitrary the system by which cut scores are determined in criterion systems (e.g., an A is 90% and above, thus 89% is a B) and may aggressively pursue faculty for "extra credit" assignments or lobby for additional points. Criterion systems also require instructors to consider the relative difficulty of each assessment and weight them accordingly (i.e., more difficult or comprehensive assessments should be given more weight). In a poorly designed criterion system, a student who performs well on simple, less rigorous assessments (quizzes, homework) may achieve a higher grade than a student who performed well on more difficult assessments (midterm/final exams) if the simple assignments are weighted more heavily in the final grade calculation (Grading systems, 1991). Criterion systems are sometimes seen as contributors to grade inflation, as it is possible for all students to achieve grades of A if they meet the criteria for proficiency.
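A minimal sketch of this weighting pitfall in Python; the category weights and student scores below are hypothetical, purely to show how the choice of weights can reorder students.

```python
# A minimal sketch of the weighting pitfall: student A aces the easy quizzes,
# student B aces the rigorous exams. Under quiz-heavy weights, A outranks B
# despite weaker performance on the most demanding assessments.
def final_grade(scores, weights):
    return sum(scores[category] * w for category, w in weights.items())

quiz_heavy = {"quizzes": 0.50, "homework": 0.30, "exams": 0.20}
exam_heavy = {"quizzes": 0.15, "homework": 0.25, "exams": 0.60}

student_a = {"quizzes": 0.98, "homework": 0.95, "exams": 0.70}
student_b = {"quizzes": 0.80, "homework": 0.85, "exams": 0.95}

for label, weights in (("quiz-heavy", quiz_heavy), ("exam-heavy", exam_heavy)):
    print(label,
          round(final_grade(student_a, weights), 3),
          round(final_grade(student_b, weights), 3))
# quiz-heavy: A ≈ 0.915, B ≈ 0.845  -> A ranks higher
# exam-heavy: A ≈ 0.80,  B ≈ 0.90   -> B ranks higher
```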
In practice, most faculty will use some combination of systems to determine student grades. Faculty using a norm-referenced approach may look at the quality of student work when setting the cutoff points on a curve. When using a criterion-referenced system, faculty may look at the numbers of students receiving very high or very low scores to help them determine how they will set the actual grades. As you make choices about grading systems, it might be worth considering the assumptions that underlie your own view of the purpose of grading. As McKeachie and Svinicki (2006) frame the issue, "Is the purpose of the grade to identify the 'best' students in a group (norm referencing), or is it to indicate what each student has achieved (criterion referencing)? Both are legitimate positions and can be and are argued for vociferously" (p. 132).
Conclusion
Regardless of the type of assessment instrument developed to measure students' knowledge and skill in college-level courses, it is important to remember to explain the purpose of assessment to ourselves and our students, understand the importance of using valid and reliable instruments, and carefully consider judgments made about data from those assessment instruments. By following guidelines for designing, implementing, and scoring exams, instructors can enhance the validity, reliability, and utility of the information they collect about students' learning. High quality assessment provides instructors and students opportunities for more meaningful and relevant teaching and learning in the classroom.
References

Brown, F. G. (1983). Principles of educational and psychological testing (3rd ed.). New York: Holt, Rinehart and Winston.
Cashin, W. E. (1987). Improving essay tests. Idea Paper, No. 17. Manhattan, KS: Center for Faculty Evaluation and Development, Kansas State University.
Critical thinking rubric. (2008). Dobson, NC: Surry Community College.
Grading systems. (1991, April). For Your Consideration, No. 10. Chapel Hill, NC: Center for Teaching and Learning, University of North Carolina at Chapel Hill.
Gronlund, N. E., & Linn, R. L. (1990). Measurement and evaluation in teaching (6th ed.). New York: Macmillan Publishing Company.
Halpern, D. F., & Hakel, M. D. (2003). Applying the science of learning to the university and beyond. Change, 35(4), 37-41.

Isaac, S., & Michael, W. B. (1990). Handbook in research and evaluation. San Diego, CA: EdITS Publishers.

McKeachie, W. J., & Svinicki, M. D. (2006). Assessing, testing, and evaluating: Grading is not the most important function. In McKeachie's teaching tips: Strategies, research, and theory for college and university teachers (12th ed., pp. 74-86). Boston: Houghton Mifflin Company.

McMillan, J. H. (2001). Classroom assessment: Principles and practice for effective instruction. Boston: Allyn and Bacon.

Seymour, E., & Hewitt, N. M. (1997). Talking about leaving: Why undergraduates leave the sciences. Boulder, CO: Westview Press.

Svinicki, M. D. (1998). Helping students understand grades. College Teaching, 46(3), 101-105.

Svinicki, M. D. (1999a). Evaluating and grading students. In Teachers and students: A sourcebook for UT-Austin faculty (pp. 1-14). Austin, TX: Center for Teaching Effectiveness, University of Texas at Austin.

Svinicki, M. D. (1999b). Some pertinent questions about grading. In Teachers and students: A sourcebook for UT-Austin faculty (pp. 1-2). Austin, TX: Center for Teaching Effectiveness, University of Texas at Austin.

Thorndike, R. M. (1997). Measurement and evaluation in psychology and education. Upper Saddle River, NJ: Prentice-Hall.

Wiggins, G. P. (1998). Educative assessment: Designing assessments to inform and improve student performance. San Francisco: Jossey-Bass Publishers.

Worthen, B. R., Borg, W. R., & White, K. R. (1993). Measurement and evaluation in the schools. New York: Longman.

Writing and grading essay questions. (1990, September). For Your Consideration, No. 7. Chapel Hill, NC: Center for Teaching and Learning, University of North Carolina at Chapel Hill.

The CRLT Occasional Papers series is published on a variable schedule by the Center for Research on Learning and Teaching at the University of Michigan. Information about extra copies or back issues can be obtained by writing to: Publications, CRLT, 1071 Palmer Commons, 100 Washtenaw Avenue, Ann Arbor, MI 48109-2218.

Copyright 2008 The University of Michigan

CRLT Occasional Paper No. 24