
Page 1: Comments & Suggestions for the RTTT Assessment Competition

Scott Marion
National Center for the Improvement of Educational Assessment
Race to the Top Assessment Public and Expert Input Meeting
Washington, DC
January 20, 2010

Page 2: Comments & Suggestions for the RTTT Assessment Competition

Introductory Comments
- Ends, not means
- Tough choices
- Theory of action
- Structure the response as an RFP
- Curriculum and instruction


Page 3: Comments & Suggestions for the RTTT Assessment Competition

Ends and Means
- The questions and other USED documents either imply or propose to require a specific way of doing things.
- Unless USED is absolutely sure that this is the only, or even the best, way (for all contexts) of accomplishing the goals, then:
  - Be exceptionally clear about the goals of the proposed system(s).
  - As we discussed last fall, to the extent possible, clarify the purposes and uses of the assessment system, AND
  - Allow the smart proposal writers to be creative (innovative) about the means.
- If you are vague, the writers will be even more vague, and you lower the chances of getting what you want.


Page 4: Comments & Suggestions for the RTTT Assessment Competition

Tough choices
- Again, trying to read into the questions and other documents about the forthcoming NIA, it appears that USED will be asking consortia for…
  - Innovative assessment practices
  - Broad implementation
  - Fast timelines
  - (otherwise known as "all of the above")
- Something will give!
- Follow Lorrie Shepard's guidance from December 2009: allow consortia to propose to do one (relatively) small thing well, e.g., create an innovative assessment system for grades 4-8 mathematics in only a few states.


Page 5: Comments & Suggestions for the RTTT Assessment Competition

Tough choices & ends/means
- Again, trying to read into the questions for today, several ask for things where tough choices and ends/means come together…
  - Question 2 asks about increasing the rigor and quality of high school assessments.
  - Question 3 asks about moving to computer-based testing.
- Requiring #2 might hinder #3, and vice versa.


Page 6: Comments & Suggestions for the RTTT Assessment Competition

A Theory of Action
- Before finalizing the RFP, USED should articulate a clear and explicit theory of action.
- All respondents MUST articulate an explicit theory of action in their proposals, one that:
  - Describes how the particular CLEAR goals will be achieved as a result of the particular assessment system(s)
  - Specifies mechanisms: how do USED, the states, and the consortia expect we will get from A to B? What is the evidence to support this expectation?
  - Explicitly describes prioritized design choices, e.g., influencing and shaping teaching and learning, OR measuring existing knowledge, OR making cross-state comparisons
- The theory of action is a check on the logic of the underlying assumptions.


Page 7: Comments & Suggestions for the RTTT Assessment Competition

Response as an RFP
- The operational requirements of any multi-state consortium are overwhelming.
- No state or set of states has the capacity to design and implement a multi-state assessment system.
- Consortium grantees will have to issue RFPs to support the design, development, and piloting of the many assessments.
- Therefore, I suggest requiring the response from potential consortia to take the form of an RFP.
  - A well-written RFP will make the goals, rationale, and design clear to potential bidders.
  - It will reveal to USED reviewers the extent to which the proposers have thought through the many aspects of the proposed assessment system.
- An alternative would be to provide enough clarity and lead time (I doubt you can) so that consortia can actually include contractors in their proposals.
- Clarity on cost and evaluation: what can RTTT dollars be spent on, and what's off limits?


Page 8: Comments & Suggestions for the RTTT Assessment Competition

Curriculum and Instruction?
- USED has managed to avoid any mention of curriculum and instruction.
- While that might be a political necessity, it doesn't make any practical or conceptual sense if the goal is to really move our educational system forward.
- I recommend requiring all proposals to at least address how their assessment model handles the considerable differences in curriculum and instruction across districts and states:
  - How do the proposers think these differences will affect the validity of their assessment results?
  - How do they propose to deal with these differences, if at all?
  - How do they think their assessment model will further meaningful goals if they do not deal with curriculum and instruction?


Page 9: Comments & Suggestions for the RTTT Assessment Competition

Question 1 (through course)
- While I find some aspects of this approach appealing, I would not require this specific approach in the response.
- What are we (USED, potential consortia) trying to accomplish with the through-course approach?
- All proposals (whether using through-course or not) should be required to submit evidence/rationale in all six categories outlined in the question.
- The through-course system carries with it some unique considerations/sources of evidence compared to a more traditional summative assessment.
  - Inter-rater reliability is NOT one of these additional concerns.


Page 10: Comments & Suggestions  for the RTTT Assessment Competition

Consortia proposing a through-course approach should have to explain/provide additional evidence for:Construct validity: How would this approach enhance

the validity of the score interpretations?Aggregation: We know very little—other than trying to

maximize reliability—about how to best aggregate the scores from the multiple events for both students and schools

OTL: How will the states/consortia deal with potential increased effects due to differences in OTL

Security: Assuming the through course components are used for accountability

Consequences: How will the consortium deal with the potential (likely) negative effects when educators are restricted from using the full potential of the through course components for instructional improvement?

Equating: How will the consortium overcome the significant challenges to valid equating of scores from year-to-year?
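To make the equating challenge concrete, here is a minimal sketch (my illustration, not part of the talk; all numbers invented) of textbook linear equating under a random-groups design: this year's scores are placed on last year's scale by matching means and standard deviations. The catch, and the reason year-to-year equating is hard, is that this is only valid if the two groups are equivalent and the construct has not shifted; with new standards, new item types, and changing populations, neither assumption holds automatically.

```python
# Minimal sketch of linear equating under a random-groups design:
# transform Form Y (this year) scores so their mean and SD match
# Form X (last year). Hypothetical data; illustrative only.
import statistics

def linear_equate(y_scores, x_group, y_group):
    """Return y_scores re-expressed on Form X's scale:
    x = mu_x + (sd_x / sd_y) * (y - mu_y)."""
    slope = statistics.stdev(x_group) / statistics.stdev(y_group)
    intercept = statistics.mean(x_group) - slope * statistics.mean(y_group)
    return [slope * y + intercept for y in y_scores]

# Hypothetical score samples from the two (assumed equivalent) groups
form_x = [52, 61, 47, 70, 55, 64, 58, 49]   # last year's form
form_y = [45, 55, 41, 63, 49, 57, 52, 43]   # this year's (harder) form

# Three raw Form Y scores, expressed on the Form X scale
print(linear_equate([45, 52, 60], form_x, form_y))
```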


Page 11: Comments & Suggestions for the RTTT Assessment Competition

Question 2 (HS EOC rigor)
- I am pleased that USED is thinking about quality and rigor in high school.
- Why "common end-of-course summative exams"?
  - What is the unit for "common": the school, district, state, or consortium?
- The through-course approach from Q1 can potentially enhance rigor and validity. Why limit it to EOC?
- It is more important to focus on rigor and less on "consistent" or "common."
- We have written about the tradeoffs between standardization and flexibility in other contexts (Gong & Marion, 2006), and many of those considerations apply here.
- Consider Amy Gutmann's conception of "threshold."


Page 12: Comments & Suggestions for the RTTT Assessment Competition

Requiring evidentiary support for "rigor"
- Provide evidence that students' performance meets a meaningful threshold.
- What is the system of review for rigor and technical quality? How does the consortium propose to make this work within and across states?
- Has the consortium addressed the balance between flexibility and standardization and offered a convincing case for where it stands?
  - It would help if USED clearly signaled what it thinks is the right balance between standardization and flexibility.
- How will the states/consortia ensure that students have a fair opportunity to meet rigorous thresholds? Validity is threatened if OTL is not provided.


Page 13: Comments & Suggestions for the RTTT Assessment Competition

Question 3 (CBT/CAT)
- Again, don't require it! What are we trying to accomplish with CBT/CAT?
- If USED focuses too much on comparability (e.g., CBT vs. PBT), it will stifle innovation.
  - We can't even ensure computer-to-computer comparability within a single state!
- Infrastructure issues are daunting.
- CBT offers considerable potential for enhancing access for SWD, but it could also increase construct-irrelevant variance.
  - Nimble Tools and others have demonstrated the potential of doing it well.


Page 14: Comments & Suggestions for the RTTT Assessment Competition

Evidence to support CBT
- Are the items designed (or at least working toward a design) to take advantage of technological capability, or are they simply saving paper?
- How will the consortium states move to full implementation of CBT so they can begin using innovative item types?
- What types of designed-in (not add-on) approaches is the consortium proposing for increasing access for SWD and ELL beyond what is available with paper?
- How will the consortium states avoid the negative consequences of losing computers and computer time for instructional purposes?


Page 15: Comments & Suggestions for the RTTT Assessment Competition

Additional evidence for CAT
- How will the consortium determine the optimal size of the potential item bank? How is this concern affected when designing for multiple states instead of a single state?
- How will the consortium monitor potential parameter and scale drift over time?
- How will the CAT be designed to provide instructionally useful information (beyond a scale score)?
- How will the technical aspects of the item selection algorithms be monitored across multiple states? (A sketch of one standard selection rule follows this list.)
- Will "out-of-grade-level" items be allowed?
  - If not, the potential of CAT is limited considerably, at least for one purpose.
  - Of course, this must be balanced with social justice concerns.
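For readers who have not looked inside a CAT engine, here is a minimal sketch (my illustration, not part of the talk) of the most common selection rule: administer the unanswered item with maximum Fisher information at the current ability estimate, shown under a two-parameter logistic (2PL) IRT model with a hypothetical item bank. It also shows why the out-of-grade question matters: for a student far above the on-grade difficulty range, every available item is only weakly informative.

```python
# Minimal sketch of maximum-information item selection in a CAT under a
# 2PL IRT model. The item bank and ability values are hypothetical.
import math

def p_correct(theta, a, b):
    """2PL model: probability of a correct response at ability theta,
    given discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def pick_next_item(theta_hat, bank, administered):
    """Select the unadministered item with maximum information at theta_hat."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta_hat, *bank[i]))

# Hypothetical on-grade bank: (discrimination a, difficulty b), b in [-1, 1]
bank = [(1.2, -1.0), (0.9, -0.5), (1.5, 0.0), (1.1, 0.5), (1.3, 1.0)]

for theta in (0.0, 2.5):
    best = pick_next_item(theta, bank, administered=set())
    print(theta, best, round(item_information(theta, *bank[best]), 3))
# At theta = 0.0 the best item is highly informative (about 0.56); at
# theta = 2.5 even the best on-grade item is far less informative (about
# 0.18), which is the out-of-grade-items issue raised on this slide.
```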


Page 16: Comments & Suggestions for the RTTT Assessment Competition

Question 4 (innovation & timeline)
- Don't encourage (or fund) grants that do not move down a path toward innovation; therefore, require…
  - States/consortia to clearly articulate a vision of what they hope to accomplish with their educational systems in 10 years, and to provide evidence/justification for how the proposed assessment system will support this vision
  - A "map" and a "route"
  - A theory of action that describes how the states/consortium will be able to stay on the route
  - Evidence that states have taken steps to avoid "painting themselves into a corner"
- Another reason for funding multiple consortia: tackling manageable programs!

Page 17: Comments & Suggestions for the RTTT Assessment Competition

Question 5 (research priorities)
- The statistical machinery of VAM has been well studied and does not need special funding (more research won't correct for non-random assignment! The simulation sketch after this list illustrates why).
- However, related areas do need funding…
  - The design and validity of learning progressions to support both formative assessment and measuring growth with summative assessments
  - Assessment designs that allow for meaningful depictions of student progress (particularly related to such learning progressions)
  - How to improve the quality and usefulness of VAM and growth results for instructional improvement and accountability
  - How to integrate VAM results with observational evidence to make valid judgments about educator quality
  - How to better deal with "attribution" challenges, particularly in secondary school
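The following minimal simulation (my illustration, not part of the talk; all numbers invented) makes the non-random-assignment point concrete: two teachers are given identical true effects, but students with higher unobserved growth potential are more likely to be assigned to one of them, and a naive gain-score VAM then reports a teacher difference that does not exist. No amount of additional statistical machinery applied to the same observational data removes this confound.

```python
# Toy simulation of non-random assignment biasing a gain-score VAM.
# Both teachers have the SAME true effect; students are sorted to
# teacher B partly on unobserved growth potential.
import random

random.seed(1)
TRUE_EFFECT = {"A": 5.0, "B": 5.0}  # identical by construction

gains = {"A": [], "B": []}
for _ in range(2000):
    potential = random.gauss(0, 3)             # unobserved by the model
    teacher = "B" if potential + random.gauss(0, 3) > 0 else "A"
    pre = random.gauss(50, 10)
    post = pre + TRUE_EFFECT[teacher] + potential + random.gauss(0, 2)
    gains[teacher].append(post - pre)

for t in ("A", "B"):
    print(t, round(sum(gains[t]) / len(gains[t]), 2))
# Prints roughly 3.3 for A and 6.7 for B, even though both true effects
# are 5.0: the sorting, not the teaching, produces the gap.
```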


Page 18: Comments & Suggestions for the RTTT Assessment Competition

Question 5 (research priorities #2)
- We learned a lot about the generalizability of performance assessments during the 1990s.
- We could certainly stand to learn a lot more about:
  - Integrating performance assessment scores (especially if given at a different time of year) with a range of summative assessment scores
  - Equating designs with performance items or mixed item types
    - If we can learn to do equating well, we can then more readily include performance assessments as part of growth measures
  - Design specifications/requirements for rich and engaging performance tasks


Page 19: Comments & Suggestions for the RTTT Assessment Competition

Question 5 (additional research priorities)
- External accountability systems: Do they work to achieve policy priorities? Do other forms of school reform work better?
- Equating test scores when so much is changing
- Validating "college ready" measures: how do we know when we've reached "good enough"?
- Learning progressions: these will require a massive development and validation program


Page 20: Comments & Suggestions for the RTTT Assessment Competition

For more information
Formal comments will be submitted by January 29, 2010 and will be available on request: [email protected]
