Dr. Andy Hegedus, Senior Manager, Professional Development Data Analytics, NWEA
Fusion 2012, the NWEA summer conference in Portland, Oregon

At times, gaps in educators' understanding of assessment data limit the depth of dialogue about the implications of all kinds of uses for data. More and more often, people are considering including assessment data as a piece of a formal teacher evaluation process. This is a new and complicated area in which educators are beginning to tread. Using a framework for using data in teacher evaluations, we will reinforce some of what you know about assessment data, answer some questions you may have, and deepen your understanding of the strengths and limitations of assessment data.

Learning Outcomes:
- Deepen your understanding of assessment data
- Provide a context for considering the use of assessment results in teacher evaluation programs

Audience:
- New data users
- Experienced data users
- District leadership
- Curriculum and Instruction
Andy Hegedus, Ed.D.
June 2012
Assessment Literacy in a Teacher Evaluation Frame
• How many of you think your literacy with assessments in general is “Good” or better?
• How many of you are currently figuring out how to use assessment data thoughtfully in a Teacher Evaluation process?
Trying to gauge my audience and adjust my speed . . .
• What we've known to be true is now being shown to be true
  – Using data thoughtfully improves student achievement
• There are dangers present, however
  – Unintended consequences

Go forth thoughtfully, with care
“What gets measured (and attended to), gets done”
Remember the old adage?
• NCLB
  – Cast light on inequities
  – Improved performance of "Bubble Kids"
  – Narrowed the taught curriculum
An infamous example
It’s what we do that counts
A patient’s health doesn’t change because we know their blood pressure
It’s our response that makes all the difference
Our nation has moved from a model of education reform that focused on fixing schools to a model that is focused on fixing the teaching profession
Data Use in Teacher Evaluation is our construct for today
Be considerate of the continuum of stakes involved
Support
Compensate
Terminate
Increasing levels of required rigor
Increasing risk
• Growth
  – Depiction of progress over time along a cross-grade scale
• Value-Added
  – A determination of whether growth is greater for a particular student or group of students than would be expected
Let’s get clear on terms
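To make the distinction concrete, here is a minimal sketch in code (the function names, scores, and the expected-growth value are illustrative, not NWEA definitions):

    def growth(fall_score, spring_score):
        # Growth: progress over time along a cross-grade scale
        return spring_score - fall_score

    def value_added(observed_growth, expected_growth):
        # Value-added: growth relative to what would be expected
        # for comparable students (e.g., from growth norms)
        return observed_growth - expected_growth

    # Example: a student grows 8 RIT points where norms project 5
    print(value_added(growth(205, 213), 5.0))  # 3.0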
[Chart: Marcus' growth compared with normal growth, the growth needed, and the college readiness standard]
Is the progress produced by this teacher dramatically different from that of teaching peers who deliver instruction to comparable students in comparable situations?

What question is being answered in support of using data in evaluating teachers?
The Test
The Growth Metric
The Evaluation
The Rating
There are four key steps required to answer this question
The purpose and design of the instrument are significant
• Many assessments are not designed to measure growth
• Others do not measure growth equally well for all students
Both Status and Growth are important
[Diagram: a vertical scale from Beginning Literacy to Adult Reading; a 5th grade student's position at Time 1 shows Status, and the change from Time 1 to Time 2 shows Growth]

Value Added = Teacher Contribution to Growth
Teachers encounter a distribution of student performance
[Diagram: the same Beginning Literacy to Adult Reading scale, with 5th grade students scattered well above and below grade-level performance]
Norm = “Typical” for a reference population
Traditional assessment uses items reflecting the grade level standards
[Diagram: 4th, 5th, and 6th grade standards placed along the Beginning Literacy to Adult Reading scale; the traditional assessment item bank reflects a single grade's standards]
Traditional assessment uses items reflecting the grade level standards
[Diagram: the same scale, showing that overlap between adjacent grade-level standards allows linking and scale construction]
Adaptive testing works differently
Item bank can span full range of achievement
Available item pool depth is crucial

[Diagram: an adaptive test trace showing the estimated RIT rising after each correct response and falling after each incorrect one]
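A highly simplified sketch of that adaptive loop (illustrative only; operational adaptive tests select items with IRT-based rules rather than a fixed shrinking step):

    def run_adaptive_test(answers_correctly, start_rit=200.0, n_items=20):
        # answers_correctly(difficulty) -> True/False for an item
        # presented near the current ability estimate
        est, step = start_rit, 10.0
        for _ in range(n_items):
            if answers_correctly(est):
                est += step              # correct: raise the estimate
            else:
                est -= step              # incorrect: lower the estimate
            step = max(step * 0.7, 1.0)  # shrink steps as evidence accumulates
        return est

    # Example: a student who answers items at or below 215 RIT correctly
    print(run_adaptive_test(lambda difficulty: difficulty <= 215))  # ~215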
Tests are not equally accurate for all students
[Chart: 5th grade level items on the California STAR compared with NWEA MAP]
These differences impact measurement error
[Chart: test information (0.00-0.12) by scale score (165-245) across the Academic Warning, Below, Meets, and Exceeds ranges, comparing an adaptive test with a traditional test; the error is significantly different, with the 1st and 86th percentiles marked]
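One standard psychometric relationship explains why the information curve matters (this is general IRT background, not specific to either test): the conditional standard error of measurement is the inverse square root of test information at that score.

    import math

    def sem(information):
        # Conditional standard error of measurement at a scale score:
        # SEM = 1 / sqrt(test information at that score)
        return 1.0 / math.sqrt(information)

    # Where information falls from 0.08 to 0.02, error doubles
    print(round(sem(0.08), 1))  # 3.5 scale-score points
    print(round(sem(0.02), 1))  # 7.1 scale-score points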
• Think of a high stakes test – a state summative
  – Designed to identify if a student is proficient or not
• Do they do that well?
  – 93% correct on the proficiency determination
• Does it work well off design?
  – 75% correct on the performance levels determination
Error can change your life!
*Testing: Not an Exact Science, Education Policy Brief, Delaware Education Research & Development Center, May 2004, http://dspace.udel.edu:8080/dspace/handle/19716/244
• Assessments must align with the teacher's instructional responsibility
  – Validity: is it assessing what you think it's assessing?
  – Reliability: if we gave it again, would the results be consistent?
What is measured must be aligned to what is being taught
Black, P. and Wiliam, D. (2007) 'Large-scale assessment systems: Design principles drawn from international comparisons', Measurement: Interdisciplinary Research & Perspective, 5(1), 1-53
• …when science is defined in terms of knowledge of facts that are taught in school…(then) those students who have been taught the facts will know them, and those who have not will…not. A test that assesses these skills is likely to be highly sensitive to instruction.
The instrument must be able to detect instruction
Black, P. and Wiliam, D. (2007) 'Large-scale assessment systems: Design principles drawn from international comparisons', Measurement: Interdisciplinary Research & Perspective, 5(1), 1-53
• When ability in science is defined in terms of scientific reasoning…achievement will be less closely tied to age and exposure, and more closely related to general intelligence. In other words, science reasoning tasks are relatively insensitive to instruction.
The more complex, the harder to detect and attribute to one teacher
• Security and Cheating
• Proctoring
• Procedures
Other issues
[Chart: mean spring and fall test duration in minutes by school; y-axis Duration (Min), 0-90, spring term vs. fall term]
[Chart: growth index (RIT) for students taking 10+ minutes longer in spring than in fall versus all other students, by school]

Ten minutes makes a difference ~ one RIT
Testing is complete . . . What is useful to answer our question?
The Test
The Growth Metric
The Evaluation
The Rating
[Chart: difficulty of the New York "Meets" level by grade (2-8) in mathematics and reading, expressed as national percentiles, with College Readiness and Typical reference lines]

The metric matters - Let's go underneath "Proficiency"
[Chart: one district's change in 5th grade mathematics performance relative to the KY proficiency cut scores; number of students by fall RIT, coded Up, Down, or No Change, with Proficiency and College Readiness cuts marked]

What gets measured and attended to really does matter
[Chart: number of 5th grade students in the same district meeting projected mathematics growth; number of students by fall score, split into Below projected growth and Met or above projected growth]
Changing from Proficiency to Growth means all kids matter
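A toy illustration of why the switch matters (the scores, the proficiency cut, and the projected growth value are all hypothetical):

    # (fall RIT, spring RIT) for three hypothetical students
    students = [(180, 188), (208, 211), (230, 232)]
    PROFICIENT_CUT = 210   # hypothetical status cut score
    PROJECTED_GROWTH = 5   # hypothetical growth projection

    for fall, spring in students:
        proficient = spring >= PROFICIENT_CUT
        met_growth = (spring - fall) >= PROJECTED_GROWTH
        print(f"{fall}->{spring}: proficient={proficient}, met growth={met_growth}")

    # A proficiency lens credits only the last two students;
    # a growth lens credits the first, who gained the most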
How can we make it fair?
The Test
The Growth Metric
The Evaluation
The Rating
• What if I skip this step?
  – Comparison is likely against normative data, so the comparison is to "typical kids in typical settings"
• How fair is it to disregard context?
  – Good teacher, bad school
  – Good teacher, challenging kids
How does your performance evaluation consider context?
Consider . . .
• Value-added models control for a variety of classroom, school-level, and other conditions
  – There are over one hundred different value-added models
  – All attempt to minimize error
  – Variables outside the controls are assumed to be random
• Results are not stable
  – The use of multiple years of data is highly recommended
  – Results are more likely to be stable at the extremes
Nothing is perfect
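A bare-bones sketch of the idea behind such models, using a single ordinary-least-squares control for prior score (real value-added models control for many more classroom and school conditions, and the data here are invented):

    import numpy as np

    # Toy reference population: prior and current scores
    rng = np.random.default_rng(0)
    prior = rng.normal(200, 10, 500)
    current = 20 + 0.95 * prior + rng.normal(0, 5, 500)

    # Fit the expectation: predict current score from prior score
    X = np.column_stack([np.ones_like(prior), prior])
    beta, *_ = np.linalg.lstsq(X, current, rcond=None)

    # One teacher's class: a "value added" summary is how far the
    # students land above or below their predicted scores, on average
    cls_prior = np.array([195., 203., 188., 210., 201.])
    cls_actual = np.array([212., 219., 203., 227., 216.])
    predicted = beta[0] + beta[1] * cls_prior
    print((cls_actual - predicted).mean())  # positive => above expectation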
Multiple years of data is necessary for some stability
Typical r values for measures of teaching effectiveness range between .30 and .60 (Brown Center on Education Policy, 2010)
[Chart: number of teachers (0-120) with growth scores in the lowest and highest quintile over two years using NWEA's MAP (493 teachers); Year 1 vs. Year 2 counts]

Vote – Year 2 above or below
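To build intuition for what year-to-year correlations of .30-.60 imply, a quick simulation (purely illustrative; the correlation value and population are invented):

    import numpy as np

    rng = np.random.default_rng(0)
    r, n = 0.45, 100_000   # mid-range correlation, synthetic teachers

    year1 = rng.standard_normal(n)
    year2 = r * year1 + np.sqrt(1 - r**2) * rng.standard_normal(n)

    # Of teachers in the bottom quintile in year 1, how many repeat?
    bottom1 = year1 <= np.quantile(year1, 0.2)
    bottom2 = year2 <= np.quantile(year2, 0.2)
    print(f"{bottom2[bottom1].mean():.0%} stay in the bottom quintile")

Under these assumptions only about four in ten bottom-quintile teachers repeat in year 2, against two in ten by chance, which is why single-year rankings are shaky.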
• Control for statistical error
  – All models attempt to address this issue
  – Error is compounded when combining two test events
  – Nevertheless, many teachers' value-added scores will fall within the range of statistical error

A variety of errors means more stability only at the extremes
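Why combining two test events compounds error (standard error propagation, not specific to any one model): independent errors add in quadrature.

    import math

    def growth_se(se_fall, se_spring):
        # The standard error of a difference of two independent
        # scores is the square root of the sum of squared errors
        return math.sqrt(se_fall**2 + se_spring**2)

    # Two tests with ~3-point standard errors give a growth score
    # with a ~4.2-point standard error, not 3
    print(round(growth_se(3.0, 3.0), 1))  # 4.2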
[Chart: Mathematics Growth Index Distribution by Teacher - Validity Filtered; each teacher's average growth index score and error range (roughly -12 to +12), grouped into quintiles Q1-Q5]
Each line in this display represents a single teacher. The graphic shows the average growth index score for each teacher (green line), plus or minus the standard error of the growth index estimate (black line). We removed students who had tests of questionable validity and teachers with fewer than 20 students.
Range of teacher value-added estimates
With one teacher, error means a lot
• Value-added models assume that variation is caused by randomness if not controlled for explicitly
  – Young teachers are assigned disproportionate numbers of students with poor discipline records
  – Parent requests for the "best" teachers are honored
  – Sound educational reasons for placement are likely to be defensible

Assumption of randomness can have risk implications
• Idiosyncratic cases
  – In self-contained classrooms, one or two idiosyncratic cases can have a large effect on results

Lower numbers can significantly impact a teacher-level analysis
How tests are used to evaluate teachers
The Test
The Growth Metric
The Evaluation
The Rating
• How would you translate a rank order to a rating?
  – Data can be provided
  – A value judgment is ultimately used to set cut scores for points or a rating

Translation into ratings can be difficult to inform with data
Decisions are value based, not empirical
• What is far below a district's expectation is subjective
• What about
  – The obligation to help teachers improve?
  – The quality of replacement teachers?
• The system for combining elements and producing a rating is also a value-based decision
  – Multiple measures and principal judgment must be included
  – Evaluate the extremes to make sure the system makes sense

Even multiple measures need to be used well
• Principal evaluation, state test, and local assessment scores are combined
  – Rating and points are generated separately for each category
  – The principal has 60% of the evaluation
• What happens at the extremes
  – Scoring at the low end of Developing (not Ineffective) on test scores requires a 98% rating by the principal to not fall to Ineffective; Effective needs 95%
  – A highly effective teacher based on test scores needs 50% or higher on the principal evaluation to maintain that rating

NY use of multiple measures provides an example
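A sketch of the points arithmetic that produces such extremes (the 60% principal weight is from the slide; the 20/20 split and the cut scores are invented for illustration and do not reproduce NY's actual tables):

    def composite(principal_pct, state_pts, local_pts):
        # Assumed weights: principal 60%, state test 20%, local 20%
        return 0.60 * principal_pct + 0.20 * state_pts + 0.20 * local_pts

    def rating(score):
        # Hypothetical cut scores on a 0-100 composite
        if score >= 85: return "Highly Effective"
        if score >= 70: return "Effective"
        if score >= 55: return "Developing"
        return "Ineffective"

    # With very low test scores, the rating hinges almost entirely
    # on the principal's score
    print(rating(composite(98, 10, 10)))  # Developing
    print(rating(composite(80, 10, 10)))  # Ineffective

The point is not the specific numbers but that whoever sets the weights and cuts is making a value judgment, exactly as the slides say.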
• Be thoughtful
• Involve a variety of stakeholders
• Use multiple years of student achievement data
• Begin with pilots to understand the accuracy and unintended consequences
• Embrace the formative advantages of growth measurement as well as the summative
Recommendations
• Presentations and other recommended resources are available at:
  – www.nwea.org
  – www.kingsburycenter.org
• Contacting us:
  – NWEA Main Number: 503-624-1951
  – E-mail: [email protected]
More information