Iowa’s Application of Rubrics to Evaluate Screening and Progress Tools
John L. Hosp, PhD
University of Iowa
Overview of this Webinar
• Share rubrics for evaluating screening and progress tools
• Describe the process the Iowa Department of Education used to apply the rubrics
Purpose of the Review
• Survey of universal screening and progress tools currently being used by LEAs in Iowa
• Review these tools for technical adequacy
• Incorporate one tool into the new state data system
• Provide access to tools for all LEAs in the state
Structure of the Review Process
• Core Group: IDE staff responsible for administration and coordination of the effort
• Vetting Group: Other IDE staff as well as stakeholders from LEAs, AEAs, and IHEs from across the state
• Work Group: IDE and AEA staff who conducted the actual reviews
Overview of the Review Process
• The work group was divided into 3 groups:
  ▫ Group A: key elements of tools (name, what it measures, grades it is used with, how it is administered, cost, time to administer)
  ▫ Group B: technical features (reliability, validity, classification accuracy, relevance of criterion measure)
  ▫ Group C: application features (alignment with the Iowa Core, training time, computer system feasibility, turnaround time for data, sample, disaggregated data)
• Within each group, members worked in pairs
Overview of the Review Process
• Each pair:
  ▫ had a copy of the materials needed to conduct the review
  ▫ reviewed and scored their parts together, then swapped with the other pair in their group
• Pairs within each group met only if there were discrepancies in scoring
  ▫ A lead person from one of the other groups participated to mediate reconciliation
• This allowed each tool to be reviewed by every work group member
Overview of the Review Process
• All reviews will be completed and brought to a full work group meeting
• Results will be compiled and shared
• Final determinations across groups for each tool will be shared with the vetting group two weeks later
• The vetting group will have one month to review the information and provide feedback to the work group
Structure and Rationale of Rubrics
• Separate rubrics for universal screening and progress monitoring
  ▫ Many tools reviewed for both
  ▫ Different considerations
• Common header and descriptive information
• Different criteria for each group (A, B, C)
Universal Screening Rubric
Iowa Department of EducationUniversal Screening Rubric for Reading (Revised 10/24/11)
What is a Universal Screening Tool in Reading: It is a tool that is administered at school with ALL students to identify which students are at risk for reading failure on an outcome measure. It is NOT a placement screener and would not be used with just one group of students (e.g., a language screening test).
Why use a Universal Screening Tool: It tells you which students are at risk for not performing at the proficient level on an end-of-year outcome measure. These students need something more and/or different to increase their chances of becoming proficient readers.
What feature is most critical: Classification Accuracy, because it demonstrates how well a tool predicts who may and may not need something more. It is critical that Universal Screening Tools identify the correct students with the greatest degree of accuracy so that resources are allocated appropriately and students who need additional assistance get it.
Header on cover page
Group A
Information relied on to make determinations: (circle all that apply, minimum of two) Manual from publisher / NCRtI Tool Chart / Buros Mental Measurements Yearbook / On-line publisher info. / Outside resource other than publisher or researcher of tool
Name of Screening Tool:
Skill/Area Assessed with Screener:
Grades: (circle all that apply) K 1 2 3 4 5 6 Above 6
How Screener Administered: (circle one) Group or Individual
Cost (minus administrative fees like printing)
Justification: Tools need to be economically viable, meaning the cost would be considered “reasonable” for the state or a district to use. Funds that are currently available can be used and can be sustained. One-time funding to purchase something would not be considered sustainable.
Score 3: Free
Score 2: $.01 to $1.00 per student
Score 1: $1.01 to $2.00 per student
Score 0: $2.01 to $2.99 per student
Kicked out if: $3.00 or more per student

Student time spent engaged with tool
Justification: The amount of student time required to obtain the data. This does not include set-up and scoring time.
Score 3: ≤ 5 minutes per student
Score 2: 6 to 10 minutes per student
Score 1: 11 to 15 minutes per student
Score 0: > 15 minutes per student
Group B
Criterion Measure used for Classification Accuracy (Sheet for Judging Criterion Measure)
Justification: The measure that is being used as a comparison must be determined to be appropriate as the criterion. In order to make this determination, several features of the criterion measure must be examined.
Score 3: 15-12 points on criterion measure form
Score 2: 11-8 points on criterion measure form
Score 1: 7-4 points on criterion measure form
Score 0: 3-0 points on criterion measure form
Kicked out if: Same test but uses a different subtest or composite, OR same test given at a different time

Classification Accuracy (Sheet for Judging Classification Accuracy for Screening Tool)
Justification: Tools need to demonstrate they can accurately determine which students are in need of assistance based on current performance and predicted performance on a meaningful outcome measure. This is evaluated with Area Under the Curve (AUC), Specificity, and Sensitivity.
Score 3: 9-7 points on classification accuracy form
Score 2: 6-4 points on classification accuracy form
Score 1: 3-1 points on classification accuracy form
Score 0: 0 points on classification accuracy form
Kicked out if: No data provided

Criterion Measure used for Universal Screening Tool (Sheet for Judging Criterion Measure)
Justification: The measure that is being used as a comparison must be determined to be appropriate as the criterion. In order to make this determination, several features of the criterion measure must be examined.
Score 3: 15-12 points on criterion measure form
Score 2: 11-8 points on criterion measure form
Score 1: 7-4 points on criterion measure form
Score 0: 3-0 points on criterion measure form
Kicked out if: Same test but uses a different subtest or composite, OR same test given at a different time
Judging Criterion Measure
Used for: (circle all that apply) Screening: Classification Accuracy / Screening: Criterion Validity / Progress Monitoring: Criterion Validity
Name of Criterion Measure: Gates
How Criterion Administered: (circle one) Group or Individual
Information relied on to make determinations: (circle all that apply) Manual from publisher / NCRtI Tool Chart / Buros Mental Measurements Yearbook / On-line publisher info. / Outside resource other than publisher or researcher of measure
Additional Sheet for Judging the External Criterion Measure (Revised 10/24/11)
1. An appropriate Criterion Measure is:
a) External to the screening or progress monitoring tool
b) A broad skill rather than a specific skill
c) Technically adequate for reliability
d) Technically adequate for validity
e) Validated on a broad sample that would also represent Iowa’s population
Judging Criterion Measure (cont.)

a) External to the Screening or Progress Monitoring Tool
Justification: The criterion measure should be separate from, and not related to, the screening or progress monitoring tool: the outside measure should be by a different author/publisher and use a different sample (e.g., NWF can’t predict ORF by the same publisher).
Higher score: External with no/little overlap (different author/publisher and standardization group)
Lower score: External with some or a lot of overlap (same author/publisher and standardization group)
Kicked out if: Internal (same test using a different subtest or composite, OR same test given at a different time)

b) A broad skill rather than a specific skill
Justification: We are interested in generalizing to a larger domain; therefore, the criterion measure should assess a broad area rather than splinter skills.
Score 3: Broad reading skills are measured (e.g., total reading score on ITBS)
Score 2: Broad reading skills are measured but in one area (e.g., comprehension made up of two subtests)
Score 1: Specific skills measured in two areas (e.g., comprehension and decoding)
Score 0: Specific skill measured in one area (e.g., PA, decoding, vocabulary, spelling)
Judging Criterion Measure (cont.)

c) Technically adequate for Reliability
Justification: Student performance needs to be consistently measured. This is typically demonstrated with reliability across different items (alternate form, split-half, coefficient alpha).
Score 3: Some form of reliability above .80
Score 2: Some form of reliability between .70 and .80
Score 1: Some form of reliability between .60 and .70
Score 0: All forms of reliability below .50

d) Technically adequate for Validity
Justification: The tool measures what it purports to measure. We focused on criterion-related validity to make this determination: the extent to which this criterion measure relates to another external measure that is determined to be good.
Score 3: Criterion ≥ .70
Score 2: Criterion .50-.69
Score 1: Criterion .30-.49
Score 0: Criterion .10-.29

e) A broad sample is used
Justification: The sample used in determining the technical adequacy of a tool should represent a broad audience. While a representative sample by grade is desirable, it is often not reported; therefore, taken as a whole, does the population used represent all students, or is it specific to a region or state?
Score 3: National sample
Score 2: Several states (3 or more) across more than one region
Score 1: States (3, 2, or 1 in one region)
Score 0: Sample of convenience; does not represent a state
Judging Classification Accuracy
Additional Sheet for Judging Classification Accuracy for Screening Tool (Revised 10/24/11)
Assessment: (include name and grade)
Complete the Additional Sheet for Judging the Criterion Measure. If it is not kicked out, complete the review for:
1) Area Under the Curve (AUC)
2) Specificity/Sensitivity
3) Lag time between when the assessments are given
1) Area Under the Curve (AUC)
Technical adequacy is demonstrated for Area Under the Curve.
Justification: Area Under the Curve is one way to gauge how accurately a tool identifies students in need of assistance. It is derived from Receiver Operating Characteristic (ROC) curves and is presented as a number to 2 decimal places. One AUC is reported for each comparison: each grade level, each subgroup, each outcome tool, etc.
Score 3: AUC ≥ .90
Score 2: AUC ≥ .80
Score 1: AUC ≥ .70
Score 0: AUC < .70
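As an informal aside (not part of the rubric itself), the AUC value being scored here can be understood through its rank-sum equivalence: it is the probability that a randomly chosen student who was proficient on the outcome outscored a randomly chosen at-risk student on the screener. A minimal sketch with hypothetical scores:

```python
def auc_from_scores(at_risk, not_at_risk):
    """AUC via the rank-sum equivalence: the fraction of (at-risk,
    not-at-risk) student pairs the screener orders correctly,
    counting ties as half a pair."""
    wins = 0.0
    for hi in not_at_risk:
        for lo in at_risk:
            if hi > lo:
                wins += 1.0
            elif hi == lo:
                wins += 0.5
    return wins / (len(at_risk) * len(not_at_risk))

# Hypothetical screener scores: students later found at risk vs. proficient
print(auc_from_scores([10, 15, 20], [18, 25, 30]))  # 8 of 9 pairs ordered correctly
```

An AUC of .50 means the screener orders pairs no better than chance; 1.0 means every at-risk student scored below every proficient student.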
2) Specificity or Sensitivity
Technical adequacy is demonstrated for Specificity or Sensitivity (see below).
Justification: Specificity/Sensitivity is another way to gauge how accurately a tool identifies students in need of assistance. Specificity and Sensitivity can give the same information depending on how the developer reported the comparisons. Sensitivity is often reported as accuracy of positive prediction (yes on both tools). Therefore, if the developer predicted positive/proficient performance, Sensitivity will express how well the screening tool identifies students who are proficient; if predicting at-risk or non-proficient performance, that is what Sensitivity shows. It is important to verify what the developer is predicting so that consistent comparisons across tools can be made (see below).
Score 3: Sensitivity or Specificity ≥ .90
Score 2: Sensitivity or Specificity ≥ .85
Score 1: Sensitivity or Specificity ≥ .80
Score 0: Sensitivity or Specificity < .80
3) Lag time between when the assessments are given
Lag time: the length of time between when the criterion and screening assessments are given.
Justification: The time between when the assessments are given should be short to minimize effects associated with differential instruction.
Score 3: Under two weeks
Score 2: Between two weeks and 1 month
Score 1: Between 1 month and 6 months
Score 0: Over 6 months
Sensitivity and Specificity Considerations and Explanations
Key:
+ = proficiency/mastery
- = nonproficiency/at-risk
0 = unknown
(shading in the figures below marks the cells used for Sensitivity and for Specificity)
Explanations: “True” means “in agreement between screening and outcome.” So “true” can be negative to negative in terms of student performance (i.e., negative meaning at-risk or nonproficient). This could be considered either positive or negative prediction depending on which the developer intends the tool to predict. As an example, a tool whose primary purpose is identifying students at risk for future failure would probably use “true positives” to mean “those students who were accurately predicted to fail the outcome test.”
Sensitivity = true positives / (true positives + false negatives)
Specificity = true negatives / (true negatives + false positives)
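The two formulas above can be sketched directly in code; the counts below are hypothetical, not taken from any reviewed tool.

```python
def sensitivity(true_pos, false_neg):
    # Of the students who were positive on the outcome,
    # the proportion the screener also classified as positive
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    # Of the students who were negative on the outcome,
    # the proportion the screener also classified as negative
    return true_neg / (true_neg + false_pos)

# Hypothetical counts from comparing a screener to an end-of-year outcome
print(sensitivity(40, 10))   # 0.8
print(specificity(120, 30))  # 0.8
```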
Consideration 1:Determine whether developer is predicting a positive outcome (i.e., proficiency, success, mastery, at or above a criterion or cut score) from a positive performance on the screening tool (i.e., at or above benchmark or a criterion or cut score) or a negative outcome (i.e., failure, nonproficiency, below a criterion or cut score) from negative performance on the screening tool (i.e., below a benchmark, criterion, or cut score). Prediction is almost always positive to positive or negative to negative; however in rare cases it might be positive to negative or negative to positive.
Figure 1a: This is an example of positive to positive prediction. In this case, Sensitivity is positive performance on the screening tool predicting a positive outcome.
Figure 1b: This is the opposite prediction, negative to negative as the main focus. In this case, Sensitivity is negative (or at-risk) performance on the screening tool predicting a negative outcome. Using the same information in these two tables, Sensitivity in the top table will equal Specificity in the second table. Because our purpose is to predict proficiency, in this instance we would use Specificity as the metric for judging.
[Figures 1a and 1b: 2×2 tables of Screening (rows: +, -) by Outcome (columns: + - in Figure 1a; - + in Figure 1b); shaded cells mark Sensitivity and Specificity]
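The symmetry described above (Sensitivity under one labeling equals Specificity under the reversed labeling) can be checked with a small sketch; the student counts are hypothetical.

```python
def sens_spec(pairs, positive):
    """pairs: (screener_label, outcome_label) per student;
    `positive` says which label is treated as the predicted outcome."""
    tp = sum(s == positive and o == positive for s, o in pairs)
    fn = sum(s != positive and o == positive for s, o in pairs)
    tn = sum(s != positive and o != positive for s, o in pairs)
    fp = sum(s == positive and o != positive for s, o in pairs)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical results: '+' = proficient, '-' = at-risk
data = ([('+', '+')] * 45 + [('-', '+')] * 5 +
        [('-', '-')] * 100 + [('+', '-')] * 50)
sens_pos, spec_pos = sens_spec(data, positive='+')  # Figure 1a framing
sens_neg, spec_neg = sens_spec(data, positive='-')  # Figure 1b framing
assert sens_pos == spec_neg and spec_pos == sens_neg
```

Flipping which label is "positive" simply swaps the two statistics, which is why it matters to check what the developer is predicting before comparing tools.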
Consideration 2: Some developers may include a third category, unknown prediction. If this is the case, it is still important to determine whether they are predicting a positive or negative outcome, because Sensitivity and Specificity are still calculated the same way.
Figure 2a: This is an example of positive to positive prediction. In this case, Sensitivity is positive performance on the screening tool predicting a positive outcome. It represents a comparison similar to that in Figure 1a.
Figure 2b: This is the opposite prediction, negative to negative as the main focus. In this case, Sensitivity is negative (or at-risk) performance on the screening tool predicting a negative outcome. It represents a comparison similar to that in Figure 1b. Using the same information in these two tables, Sensitivity in the top table will equal Specificity in the second table. Because our purpose is to predict proficiency, in this instance we would use Specificity as the metric for judging.
[Figures 2a and 2b: 3×3 tables of Screening (rows: +, 0, -) by Outcome (columns: + 0 - in Figure 2a; - 0 + in Figure 2b); shaded cells mark Sensitivity and Specificity]
Consideration 3: In (hopefully) rare cases, the developer will set up the tables in opposite directions (reversing screening and outcome or using a different direction for the positive/negative for one or both). This illustrates why it is important to consider which column or row is positive and negative for both the screening and outcome tools.
Notice that the Screening and Outcome tools are transposed. This makes Sensitivity and Specificity align within rows rather than columns.
[Figure: 3×3 table with Screening as columns (-, 0, +) and Outcome as rows (+, 0, -); Sensitivity and Specificity align within rows rather than columns]
Group B (cont.)

Criterion Validity for Universal Screening Tool (from technical manual)
Justification: Tools need to demonstrate that they actually measure what they purport to measure (i.e., validity). We focused on criterion-related validity because it is a determination of the relation between the screening tool and a meaningful outcome measure.
Score 3: Criterion ≥ .70
Score 2: Criterion .50-.69
Score 1: Criterion .30-.49
Score 0: Criterion .10-.29
Kicked out if: Criterion < .10 or no information provided
Reliability for Universal Screening Tool
Justification: Tools need to demonstrate that the test scores are stable across items and/or forms. We focused on: alternate form, split-half, and coefficient alpha.
Score 3: Alternate form, split-half, or coefficient alpha > .80
Score 2: Alternate form, split-half, or coefficient alpha > .70
Score 1: Alternate form, split-half, or coefficient alpha > .60
Score 0: Alternate form, split-half, or coefficient alpha > .50
Kicked out if: There is no evidence of reliability
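Of the indices named above, coefficient alpha is the easiest to illustrate. A minimal sketch (the item scores are hypothetical and this is not part of the rubric itself):

```python
def coefficient_alpha(items):
    """Cronbach's coefficient alpha. `items` holds one list of scores
    per item, with the same student order in every inner list."""
    def variance(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

    k = len(items)                                    # number of items
    n = len(items[0])                                 # number of students
    totals = [sum(item[i] for item in items) for i in range(n)]
    return (k / (k - 1)) * (1 - sum(variance(it) for it in items) / variance(totals))

# Three hypothetical items scored 0/1 for four students
print(coefficient_alpha([[1, 0, 1, 1], [1, 0, 1, 0], [1, 1, 1, 0]]))
```

Higher alpha means the items rank students more consistently; perfectly redundant items yield an alpha of 1.0.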
Reliability across raters for Universal Screening Tool
Justification: How reliable scores are across raters is critical to the utility of the tool. If the tool is complicated to administer and score, it can be difficult to train people to use it, leading to different scores from person to person.
Score 3: Rater ≥ .90
Score 2: Rater .89-.85
Score 1: Rater .84-.80
Score 0: Rater ≤ .75
Group C
Alignment with Iowa CORE / Demonstrated Content Validity
Justification: It is critical that tools assess skills identified in the Iowa Core.
Literature & Informational: Key Ideas & Details; Craft & Structure; Integration of Knowledge & Ideas; Range of Reading & Level of Text Complexity
Foundational (K-1): Print Concepts; Phonological Awareness; Phonics and Word Recognition; Fluency
Foundational (2-5): Phonics and Word Recognition; Fluency
Higher score: Has a direct alignment with the Iowa CORE (provide broad area and specific skill)
Lower score: Has alignment with the Iowa CORE (provide broad area)
Kicked out if: Has no alignment with the Iowa CORE
Group C (cont.)

Training Required
Justification: The amount of time needed for training is one consideration related to the utility of the tool. Tools that can be learned in a matter of hours, not days, would be considered appropriate.
Score 3: Less than 5 hours of training (1 day)
Score 2: 5.5 to 10 hours of training (2 days)
Score 1: 10.5 to 15 hours of training (3 days)
Score 0: Over 15.5 hours of training (4+ days)
Computer Application (tool and data system)
Justification: Many tools are given on a computer, which can be helpful if schools have computers, the computers are compatible with the software, and the data reporting can be separated from the tool itself. It is also a viable option if hard copies of the tools can be used when computers are not available.
Score 3: Computer or hard copy of tool available; data reporting is separate
Score 2: Computer application only; data reporting is separate
Score 1: Computer or hard copy of tool available; data reporting is part of the system
Score 0: Computer application only; data reporting is part of the system
Data Administration and Data Scoring
Justification: The number of people needed to administer and score the data speaks to the efficiency of how data are collected and the reliability of scoring.
Score 3: Student takes assessment on computer, and it is automatically scored by computer at the end of the test
Score 2: Adult administers assessment to student and enters student’s responses (in real time) into a computer, and it is automatically scored by computer at the end of the test
Score 1: Adult administers assessment to student and then calculates a score at the end of the test by conducting multiple steps
Score 0: Adult administers assessment to student and then calculates a score at the end of the test by conducting multiple steps AND referencing additional materials to get a score (having to look up information in additional tables)
Group C (cont.)

Data Retrieval (time for data to be usable)
Justification: The data need to be available in a timely manner in order to use the information to make decisions about students.
Score 3: Data can be used instantly
Score 2: Data can be used the same day
Score 1: Data can be used the next day
Score 0: Data are not available until 2-5 days later
Kicked out if: Takes 5+ days to use data (have to send data out to be scored)

A broad sample is used
Justification: The sample used in determining the technical adequacy of a tool should represent a broad audience. While a representative sample by grade is desirable, it is often not reported; therefore, taken as a whole, does the population used represent all students, or is it specific to a region or state?
Score 3: National sample
Score 2: Several states (3 or more) across more than one region
Score 1: States (3, 2, or 1 in one region)
Score 0: Sample of convenience; does not represent a state
Disaggregated Data
Justification: Viewing disaggregated data by subgroups (e.g., race, English language learners, economic status, special ed. status) helps determine how the tool works with each group. This information is often not reported, but it should be considered if it is available.
Score 3: Race, economic status, and special ed. status are reported separately
Score 2: At least two disaggregated groups are listed
Score 1: One disaggregated group is listed
Score 0: No information on disaggregated groups
Progress Monitoring Rubric
Header on cover page
Iowa Department of Education Progress Monitoring Rubric (Revised 10/24/11)
Why use Progress Monitoring Tools: They quickly and efficiently provide an indication of a student’s response to instruction. Progress monitoring tools are sensitive to student growth (i.e., skills) over time, allowing for more frequent changes in instruction. They allow teachers to better meet the needs of their students and determine how best to allocate resources.
What feature is most critical: A sufficient number of equivalent forms so that student skills can be measured over time. In order to determine if students are responding positively to instruction, they need to be assessed frequently to evaluate their performance and the rate at which they are learning.
Information relied on to make determinations: (circle all that apply, minimum of two) Manual from publisher / NCRtI Tool Chart / Buros Mental Measurements Yearbook / On-line publisher info. / Outside resource other than publisher or researcher of tool
Name of Progress Monitoring Tool:
Skill/Area Assessed with Progress Monitoring Tool:
Grades: (circle all that apply) K 1 2 3 4 5 6 Above 6
How Progress Monitoring Administered: (circle one) Group or Individual
Name of Criterion Measure:
How Criterion Administered: (circle one) Group or Individual
Descriptive info on each work group’s section
Group A

Number of equivalent forms
Justification: Progress monitoring requires frequently assessing a student’s performance and making determinations based on their growth (i.e., rate of progress). In order to assess students’ learning frequently, progress monitoring is typically conducted once a week. Therefore, most progress monitoring tools have 20 to 30 alternate forms.
Score 3: 20 or more alternate forms
Score 2: 15-19 alternate forms
Score 1: 10-14 alternate forms
Score 0: 9 alternate forms
Kicked out if: Fewer than 9 alternate forms

Cost (minus administrative fees like printing)
Justification: Tools need to be economically viable, meaning the cost would be considered “reasonable” for the state or a district to use. Funds that are currently available can be used and can be sustained. One-time funding to purchase something would not be considered sustainable.
Score 3: Free
Score 2: $.01 to $1.00 per student
Score 1: $1.01 to $2.00 per student
Score 0: $2.01 to $2.99 per student
Kicked out if: $3.00 or more per student

Student time spent engaged with tool
Justification: The amount of student time required to obtain the data. This does not include set-up and scoring time. Tools need to be efficient to use. This is especially true of measures that teachers would be using on a more frequent basis.
Score 3: ≤ 5 minutes per student
Score 2: 6 to 10 minutes per student
Score 1: 11 to 15 minutes per student
Score 0: > 15 minutes per student
Group B

Forms are of Equivalent Difficulty (need to provide detail of what these are when the review is published)
Justification: Alternate forms need to be of equivalent difficulty to be useful in a progress monitoring tool. Having many forms of equivalent difficulty allows a teacher to determine how the student is responding to instruction, because a change in score can be attributed to student skill rather than a change in the measure. Approaches include: readability formulae (e.g., Flesch-Kincaid, Spache, Lexile, FORCAST), Euclidean distance, equipercentiles, and stratified item sampling.
Score 3: Addressed equating in multiple ways
Score 2: Addressed equating in 1 way that is reasonable
Score 1: Addressed equating in a way that is NOT reasonable
Score 0: Does not provide any indication of equating forms
Judgment of Criterion Measure (see separate sheet for judging criterion measure)
Justification: The measure that is being used as a comparison must be determined to be appropriate as the criterion. In order to make this determination, several features of the criterion measure must be examined.
Score 3: 15-12 points on criterion measure form
Score 2: 11-8 points on criterion measure form
Score 1: 7-4 points on criterion measure form
Score 0: 3-0 points on criterion measure form

Technical Adequacy is Demonstrated for Validity of Performance Score (sometimes called Level)
Justification: A performance score is a student’s performance at a given point in time rather than a measure of his/her performance over time (i.e., rate of progress). We focused on criterion-related validity to make this determination because it is a determination of the relation between the progress monitoring tool and a meaningful outcome.
Score 3: Criterion ≥ .70
Score 2: Criterion .50-.69
Score 1: Criterion .30-.49
Score 0: Criterion .10-.29
Group B (cont.)

Technical Adequacy is Demonstrated for Reliability of Performance Score
Justification: Tools need to demonstrate that the test scores are stable across item samples/forms, raters, and time. Across item samples/forms: coefficient alpha, split-half, KR-20, alternate forms. Across raters: interrater (i.e., interscorer, interobserver). Across time: test-retest.
Score 3: Item samples/forms ≥ .80; Rater ≥ .90; Time ≥ .80
Score 2: Item samples/forms .79-.70; Rater .89-.85; Time .79-.70
Score 1: Item samples/forms .69-.60; Rater .84-.80; Time .69-.60
Score 0: Item samples/forms ≤ .59; Rater ≤ .75; Time ≤ .59
Kicked out if: Fewer than 2 of the 3 areas are reported, OR a score of 0 in 2 or more areas. (No tool would be kicked out due to lack of any one.)
Technical Adequacy is Demonstrated for Reliability of Slope
Justification: The reliability of the slope tells us how well the slope represents a student’s rate of improvement. Two criteria are used:
1. Number of observations, that is, student data points needed to calculate the slope.
2. Coefficients, that is, reliability for the slope. This should be reported via HLM (also called LMM or MLM) results. If calculated via OLS, the coefficients are likely to be lower.*
Score 3: 10 or more observations/data points; coefficient > .80
Score 2: 9-7 observations/data points; coefficient > .70
Score 1: 6-4 observations/data points; coefficient > .60
Score 0: 3 or fewer observations/data points; coefficient ≤ .59
Group B (cont.)
* HLM = Hierarchical Linear Modeling; LMM = Linear Mixed Modeling; MLM = Multilevel Modeling; OLS = Ordinary Least Squares. HLM, LMM, and MLM are three different names for a similar approach to analysis. Reliability of the slope should be reported as the proportion of variance accounted for by the repeated measurement over time. These methods take into account that the data points are related to one another because they come from the same individual. OLS does not take this into account and, as such, would ascribe the extra variation to measurement error rather than to the relation among data points.
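For reference, the OLS approach the note above cautions against amounts to fitting a separate least-squares line for each student. A minimal sketch with hypothetical weekly scores (an HLM/LMM/MLM analysis would instead pool all students' data points in one multilevel model):

```python
def ols_slope(scores):
    """Ordinary least-squares slope of one student's scores against
    week number 0..n-1: the estimated rate of improvement per week."""
    n = len(scores)
    x_mean = (n - 1) / 2
    y_mean = sum(scores) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(scores))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

# Ten hypothetical weekly progress-monitoring scores (e.g., words read correctly)
print(ols_slope([22, 25, 24, 28, 30, 29, 33, 35, 34, 38]))
```

Because each line is fit to one student's noisy data in isolation, slope estimates bounce around from student to student, which is why the multilevel approaches yield higher reported reliability for the slope.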
Group C

Alignment with Iowa CORE / Demonstrated Content Validity
Justification: It is critical that tools assess skills identified in the Iowa Core.
Literature & Informational: Key Ideas & Details; Craft & Structure; Integration of Knowledge & Ideas; Range of Reading & Level of Text Complexity
Foundational (K-1): Print Concepts; Phonological Awareness; Phonics and Word Recognition; Fluency
Foundational (2-5): Phonics and Word Recognition; Fluency
Higher score: Has a direct alignment with the Iowa CORE (provide broad area and specific skill)
Lower score: Has alignment with the Iowa CORE (provide broad area)
Kicked out if: Has no alignment with the Iowa CORE
Training Required
Justification: The amount of time needed for training is one consideration related to the utility of the tool. Tools that can be learned in a matter of hours, not days, would be considered appropriate.
Score 3: Less than 5 hours of training (1 day)
Score 2: 5.5 to 10 hours of training (2 days)
Score 1: 10.5 to 15 hours of training (3 days)
Score 0: Over 15.5 hours of training (4+ days)
Computer Application (tool and data system)
Justification: Many tools are given on a computer, which can be helpful if schools have computers, the computers are compatible with the software, and the data reporting can be separated from the tool itself. It is also a viable option if hard copies of the tools can be used when computers are not available.
Score 3: Computer or hard copy of tool available; data reporting is separate
Score 2: Computer application only; data reporting is separate
Score 1: Computer or hard copy of tool available; data reporting is part of the system
Score 0: Computer application only; data reporting is part of the system
Group C (cont.)

Data Administration and Data Scoring
Justification: The number of people needed to administer and score the data speaks to the efficiency of how data are collected and the reliability of scoring.
Score 3: Student takes assessment on computer, and it is automatically scored by computer at the end of the test
Score 2: Adult administers assessment to student and enters student’s responses (in real time) into a computer, and it is automatically scored by computer at the end of the test
Score 1: Adult administers assessment to student and then calculates a score at the end of the test by conducting multiple steps (adding together scores across many assessments, subtracting errors to get a total score)
Score 0: Adult administers assessment to student and then calculates a score at the end of the test by conducting multiple steps AND referencing additional materials to get a score (having to look up information in additional tables)
Data Retrieval (time for data to be usable)
Justification: The data need to be available in a timely manner in order to use the information to make decisions about students.
Score 3: Data can be used instantly
Score 2: Data can be used the same day
Score 1: Data can be used the next day
Score 0: Data are not available until 2-5 days later
Kicked out if: Takes 5+ days to use data (have to send data out to be scored)
Group C (cont.)

A broad sample is used
Justification: The sample used in determining the technical adequacy of a tool should represent a broad audience. While a representative sample by grade is desirable, it is often not reported; therefore, taken as a whole, does the population used represent all students, or is it specific to a region or state?
Score 3: National sample
Score 2: Several states (3 or more) across more than one region
Score 1: States (3, 2, or 1 in one region)
Score 0: Sample of convenience; does not represent a state
Disaggregated Data
Justification: Viewing disaggregated data by subgroups (e.g., race, English language learners, economic status, special ed. status) helps determine how the tool works with each group. This information is often not reported, but it should be considered if it is available.
Score 3: Race, economic status, and special ed. status are reported separately
Score 2: At least two disaggregated groups are listed
Score 1: One disaggregated group is listed
Score 0: No information on disaggregated groups
Findings
• Many of the tools reported are not sufficient (or appropriate) for universal screening or progress monitoring
• Some tools are appropriate for both
• No tool (so far) is “perfect”
• There are alternatives from which to choose
Live Chat
• Thursday, April 26, 2012
• 2:00-3:00 EDT
• Go to rti4success.org for more details