Dr. Andy Hegedus, Senior Manager, Professional Development Data Analytics, NWEA
Fusion 2012, the NWEA summer conference in Portland, Oregon

At times, gaps in educators' understanding of assessment data limit the depth of dialogue about the implications of all kinds of uses for data. More and more often, people are considering including assessment data as a piece of a formal teacher evaluation process. This is a new and complicated area in which educators are beginning to tread. Using a framework for using data in teacher evaluations, we will reinforce some of what you know about assessment data, answer some questions you may have, and deepen your understanding of the strengths and limitations of assessment data.

Learning Outcomes:
- Deepen your understanding of assessment data
- Provide a context for considering the use of assessment results in teacher evaluation programs

Audience:
- New data users
- Experienced data users
- District leadership
- Curriculum and Instruction
Andy Hegedus, Ed.D.
June 2012
Assessment Literacy in a Teacher Evaluation Frame
• How many of you think your literacy with assessments in general is “Good” or better?
• How many of you are currently figuring out how to use assessment data thoughtfully in a Teacher Evaluation process?
Trying to gauge my audience and adjust my speed . . .
• What we've known to be true is now being shown to be true
  – Using data thoughtfully improves student achievement
• There are dangers present, however
  – Unintended consequences

Go forth thoughtfully, with care
“What gets measured (and attended to), gets done”
Remember the old adage?
• NCLB
  – Cast light on inequities
  – Improved performance of "Bubble Kids"
  – Narrowed the taught curriculum
An infamous example
It’s what we do that counts
A patient’s health doesn’t change because we know their blood pressure
It’s our response that makes all the difference
Our nation has moved from a model of education reform that focused on fixing schools to a model that is focused on fixing the teaching profession
Data Use in Teacher Evaluation is our construct for today
Be considerate of the continuum of stakes involved
Support
Compensate
Terminate
Increasing levels of required rigor
Increasing risk
• Growth
  – Depiction of progress over time along a cross-grade scale
• Value-Added
  – A determination of whether growth is greater for a particular student or group of students than would be expected
Let’s get clear on terms
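To make the distinction concrete, here is a minimal sketch in code (the function names, scores, and the expected-growth value are illustrative, not NWEA definitions):

    def growth(fall_score, spring_score):
        # Growth: progress over time along a cross-grade scale
        return spring_score - fall_score

    def value_added(observed_growth, expected_growth):
        # Value-added: growth relative to what would be expected
        # for comparable students (e.g., from growth norms)
        return observed_growth - expected_growth

    # Example: a student grows 8 RIT points where norms project 5
    print(value_added(growth(205, 213), 5.0))  # 3.0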
[Chart: Marcus' growth compared with normal growth, the growth needed, and the college readiness standard]
Is the progress produced by this teacher dramatically different from that of teaching peers who deliver instruction to comparable students in comparable situations?

What question is being answered in support of using data in evaluating teachers?
The Test
The Growth Metric
The Evaluation
The Rating
There are four key steps required to answer this question
The purpose and design of the instrument are significant
• Many assessments are not designed to measure growth
• Others do not measure growth equally well for all students
Both Status and Growth are important
[Diagram: a vertical scale from Beginning Literacy to Adult Reading; a 5th grade student's position at Time 1 shows Status, and the change from Time 1 to Time 2 shows Growth]

Value Added = Teacher Contribution to Growth
Teachers encounter a distribution of student performance
[Diagram: the same Beginning Literacy to Adult Reading scale, with 5th grade students scattered well above and below grade-level performance]
Norm = “Typical” for a reference population
Traditional assessment uses items reflecting the grade level standards
[Diagram: 4th, 5th, and 6th grade standards placed along the Beginning Literacy to Adult Reading scale; the traditional assessment item bank reflects a single grade's standards]
Traditional assessment uses items reflecting the grade level standards
[Diagram: the same scale, showing that overlap between adjacent grade-level standards allows linking and scale construction]
Adaptive testing works differently
Item bank can span full range of achievement
Available item pool depth is crucial

[Diagram: an adaptive test trace showing the estimated RIT rising after each correct response and falling after each incorrect one]
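A highly simplified sketch of that adaptive loop (illustrative only; operational adaptive tests select items with IRT-based rules rather than a fixed shrinking step):

    def run_adaptive_test(answers_correctly, start_rit=200.0, n_items=20):
        # answers_correctly(difficulty) -> True/False for an item
        # presented near the current ability estimate
        est, step = start_rit, 10.0
        for _ in range(n_items):
            if answers_correctly(est):
                est += step              # correct: raise the estimate
            else:
                est -= step              # incorrect: lower the estimate
            step = max(step * 0.7, 1.0)  # shrink steps as evidence accumulates
        return est

    # Example: a student who answers items at or below 215 RIT correctly
    print(run_adaptive_test(lambda difficulty: difficulty <= 215))  # ~215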
Tests are not equally accurate for all students
[Chart: 5th grade level items on the California STAR compared with NWEA MAP]
These differences impact measurement error
[Chart: test information (0.00-0.12) by scale score (165-245) across the Academic Warning, Below, Meets, and Exceeds ranges, comparing an adaptive test with a traditional test; the error is significantly different, with the 1st and 86th percentiles marked]
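One standard psychometric relationship explains why the information curve matters (this is general IRT background, not specific to either test): the conditional standard error of measurement is the inverse square root of test information at that score.

    import math

    def sem(information):
        # Conditional standard error of measurement at a scale score:
        # SEM = 1 / sqrt(test information at that score)
        return 1.0 / math.sqrt(information)

    # Where information falls from 0.08 to 0.02, error doubles
    print(round(sem(0.08), 1))  # 3.5 scale-score points
    print(round(sem(0.02), 1))  # 7.1 scale-score points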
• Think of a high stakes test – a state summative
  – Designed to identify if a student is proficient or not
• Do they do that well?
  – 93% correct on the proficiency determination
• Does it work well off design?
  – 75% correct on the performance levels determination
Error can change your life!
*Testing: Not an Exact Science, Education Policy Brief, Delaware Education Research & Development Center, May 2004, http://dspace.udel.edu:8080/dspace/handle/19716/244
• Assessments must align with the teacher's instructional responsibility
  – Validity: is it assessing what you think it's assessing?
  – Reliability: if we gave it again, would the results be consistent?
What is measured must be aligned to what is being taught
Black, P. and Wiliam, D. (2007) 'Large-scale assessment systems: Design principles drawn from international comparisons', Measurement: Interdisciplinary Research & Perspective, 5(1), 1-53
• …when science is defined in terms of knowledge of facts that are taught in school…(then) those students who have been taught the facts will know them, and those who have not will…not. A test that assesses these skills is likely to be highly sensitive to instruction.
The instrument must be able to detect instruction
Black, P. and Wiliam, D. (2007) 'Large-scale assessment systems: Design principles drawn from international comparisons', Measurement: Interdisciplinary Research & Perspective, 5(1), 1-53
• When ability in science is defined in terms of scientific reasoning…achievement will be less closely tied to age and exposure, and more closely related to general intelligence. In other words, science reasoning tasks are relatively insensitive to instruction.
The more complex, the harder to detect and attribute to one teacher
• Security and Cheating
• Proctoring
• Procedures
Other issues
[Chart: mean spring and fall test duration in minutes by school; y-axis Duration (Min), 0-90, spring term vs. fall term]
[Chart: growth index (RIT) for students taking 10+ minutes longer in spring than in fall versus all other students, by school]

Ten minutes makes a difference ~ one RIT
Testing is complete . . . What is useful to answer our question?
The Test
The Growth Metric
The Evaluation
The Rating
[Chart: difficulty of the New York "Meets" level by grade (2-8) in mathematics and reading, expressed as national percentiles, with College Readiness and Typical reference lines]

The metric matters - Let's go underneath "Proficiency"
[Chart: one district's change in 5th grade mathematics performance relative to the KY proficiency cut scores; number of students by fall RIT, coded Up, Down, or No Change, with Proficiency and College Readiness cuts marked]

What gets measured and attended to really does matter
[Chart: number of 5th grade students in the same district meeting projected mathematics growth; number of students by fall score, split into Below projected growth and Met or above projected growth]
Changing from Proficiency to Growth means all kids matter
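A toy illustration of why the switch matters (the scores, the proficiency cut, and the projected growth value are all hypothetical):

    # (fall RIT, spring RIT) for three hypothetical students
    students = [(180, 188), (208, 211), (230, 232)]
    PROFICIENT_CUT = 210   # hypothetical status cut score
    PROJECTED_GROWTH = 5   # hypothetical growth projection

    for fall, spring in students:
        proficient = spring >= PROFICIENT_CUT
        met_growth = (spring - fall) >= PROJECTED_GROWTH
        print(f"{fall}->{spring}: proficient={proficient}, met growth={met_growth}")

    # A proficiency lens credits only the last two students;
    # a growth lens credits the first, who gained the most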
How can we make it fair?
The Test
The Growth Metric
The Evaluation
The Rating
• What if I skip this step?
  – Comparison is likely against normative data, so the comparison is to "typical kids in typical settings"
• How fair is it to disregard context?
  – Good teacher, bad school
  – Good teacher, challenging kids
How does your performance evaluation consider context?
Consider . . .
• Value-added models control for a variety of classroom, school-level, and other conditions
  – There are over one hundred different value-added models
  – All attempt to minimize error
  – Variables outside the controls are assumed to be random
• Results are not stable
  – The use of multiple years of data is highly recommended
  – Results are more likely to be stable at the extremes
Nothing is perfect
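A bare-bones sketch of the idea behind such models, using a single ordinary-least-squares control for prior score (real value-added models control for many more classroom and school conditions, and the data here are invented):

    import numpy as np

    # Toy reference population: prior and current scores
    rng = np.random.default_rng(0)
    prior = rng.normal(200, 10, 500)
    current = 20 + 0.95 * prior + rng.normal(0, 5, 500)

    # Fit the expectation: predict current score from prior score
    X = np.column_stack([np.ones_like(prior), prior])
    beta, *_ = np.linalg.lstsq(X, current, rcond=None)

    # One teacher's class: a "value added" summary is how far the
    # students land above or below their predicted scores, on average
    cls_prior = np.array([195., 203., 188., 210., 201.])
    cls_actual = np.array([212., 219., 203., 227., 216.])
    predicted = beta[0] + beta[1] * cls_prior
    print((cls_actual - predicted).mean())  # positive => above expectation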
Multiple years of data is necessary for some stability
Typical r values for measures of teaching effectiveness range between .30 and .60 (Brown Center on Education Policy, 2010)
[Chart: number of teachers (0-120) with growth scores in the lowest and highest quintile over two years using NWEA's MAP (493 teachers); Year 1 vs. Year 2 counts]

Vote – Year 2 above or below
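To build intuition for what year-to-year correlations of .30-.60 imply, a quick simulation (purely illustrative; the correlation value and population are invented):

    import numpy as np

    rng = np.random.default_rng(0)
    r, n = 0.45, 100_000   # mid-range correlation, synthetic teachers

    year1 = rng.standard_normal(n)
    year2 = r * year1 + np.sqrt(1 - r**2) * rng.standard_normal(n)

    # Of teachers in the bottom quintile in year 1, how many repeat?
    bottom1 = year1 <= np.quantile(year1, 0.2)
    bottom2 = year2 <= np.quantile(year2, 0.2)
    print(f"{bottom2[bottom1].mean():.0%} stay in the bottom quintile")

Under these assumptions only about four in ten bottom-quintile teachers repeat in year 2, against two in ten by chance, which is why single-year rankings are shaky.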
• Control for statistical error
  – All models attempt to address this issue
  – Error is compounded when combining two test events
  – Nevertheless, many teachers' value-added scores will fall within the range of statistical error

A variety of errors means more stability only at the extremes
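Why combining two test events compounds error (standard error propagation, not specific to any one model): independent errors add in quadrature.

    import math

    def growth_se(se_fall, se_spring):
        # The standard error of a difference of two independent
        # scores is the square root of the sum of squared errors
        return math.sqrt(se_fall**2 + se_spring**2)

    # Two tests with ~3-point standard errors give a growth score
    # with a ~4.2-point standard error, not 3
    print(round(growth_se(3.0, 3.0), 1))  # 4.2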
[Chart: Mathematics Growth Index Distribution by Teacher - Validity Filtered; each teacher's average growth index score and error range (roughly -12 to +12), grouped into quintiles Q1-Q5]
Each line in this display represents a single teacher. The graphic shows the average growth index score for each teacher (green line), plus or minus the standard error of the growth index estimate (black line). We removed students who had tests of questionable validity and teachers with fewer than 20 students.
Range of teacher value-added estimates
With one teacher, error means a lot
• Value-added models assume that variation is caused by randomness if not controlled for explicitly
  – Young teachers are assigned disproportionate numbers of students with poor discipline records
  – Parent requests for the "best" teachers are honored
  – Sound educational reasons for placement are likely to be defensible

Assumption of randomness can have risk implications
• Idiosyncratic cases
  – In self-contained classrooms, one or two idiosyncratic cases can have a large effect on results

Lower numbers can significantly impact a teacher-level analysis
How tests are used to evaluate teachers
The Test
The Growth Metric
The Evaluation
The Rating
• How would you translate a rank order to a rating?
  – Data can be provided
  – A value judgment is ultimately used to set cut scores for points or a rating

Translation into ratings can be difficult to inform with data
Decisions are value based, not empirical
• What is far below a district's expectation is subjective
• What about
  – The obligation to help teachers improve?
  – The quality of replacement teachers?
• The system for combining elements and producing a rating is also a value-based decision
  – Multiple measures and principal judgment must be included
  – Evaluate the extremes to make sure the system makes sense

Even multiple measures need to be used well
• Principal evaluation, state test, and local assessment scores are combined
  – Rating and points are generated separately for each category
  – The principal has 60% of the evaluation
• What happens at the extremes
  – Scoring at the low end of Developing (not Ineffective) on test scores requires a 98% rating by the principal to not fall to Ineffective; Effective needs 95%
  – A highly effective teacher based on test scores needs 50% or higher on the principal evaluation to maintain that rating

NY use of multiple measures provides an example
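A sketch of the points arithmetic that produces such extremes (the 60% principal weight is from the slide; the 20/20 split and the cut scores are invented for illustration and do not reproduce NY's actual tables):

    def composite(principal_pct, state_pts, local_pts):
        # Assumed weights: principal 60%, state test 20%, local 20%
        return 0.60 * principal_pct + 0.20 * state_pts + 0.20 * local_pts

    def rating(score):
        # Hypothetical cut scores on a 0-100 composite
        if score >= 85: return "Highly Effective"
        if score >= 70: return "Effective"
        if score >= 55: return "Developing"
        return "Ineffective"

    # With very low test scores, the rating hinges almost entirely
    # on the principal's score
    print(rating(composite(98, 10, 10)))  # Developing
    print(rating(composite(80, 10, 10)))  # Ineffective

The point is not the specific numbers but that whoever sets the weights and cuts is making a value judgment, exactly as the slides say.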
• Be thoughtful
• Involve a variety of stakeholders
• Use multiple years of student achievement data
• Begin with pilots to understand the accuracy and unintended consequences
• Embrace the formative advantages of growth measurement as well as the summative
Recommendations
• Presentations and other recommended resources are available at:
  – www.nwea.org
  – www.kingsburycenter.org
• Contacting us:
  – NWEA Main Number: 503-624-1951
  – E-mail: [email protected]
More information