Page 1: Misadministration of standardized achievement tests: Can we count on test scores for the evaluation of principals and teachers? Eliot Long A*Star Audits,

Misadministration of standardized achievement tests:

Can we count on test scores for the evaluation of principals and teachers?

Eliot LongA*Star Audits, LLC - Brooklyn, NYwww.astaraudits.com

CREATE - National Evaluation InstituteAnnual Conference – October 7-9, 2010

Assessment and Evaluation for Learning

Test Item Response Patterns: Comparison of Class to Norm

[Figure: item response pattern plot comparing a Class to its Norm]

Page 2

Finding Meaning In the Difference Between Two Test Scores

Schools experience erratic, inexplicable variations in measures of achievement gains.

“This volatility results in some schools being recognized as outstanding and other schools identified as in need of improvement simply as the result of random fluctuations. It also means that strategies of looking to schools that show large gains for clues of what other schools should do to improve student achievement will have little chance of identifying those practices that are most effective.”

Robert L. Linn and Carolyn Haug (Spring 2002). Stability of school-building accountability scores and gains. Educational Evaluation and Policy Analysis, 24(1), 29-36.

What is the contribution of test administration practices?

Page 3

Misadministration of Tests: A broad range of behaviors with cheating at one end

Standardized test administration procedures
- Follow an approved script of test directions
- Follow approved procedures for use of materials and timing
- Provide no unauthorized assistance to students

Misadministration of tests
- Add directions for guessing (how to answer when you don’t know how to answer)
- Rephrase directions and test questions
- Provide hints and instruction on test content
- Modify timing as deemed necessary
- Suggest corrections for obvious errors
- Provide answers to difficult questions
- Fill in blanks / change answers following the test administration

There is no ‘bright line’ for cheating, yet all forms of misadministration undermine test score reliability.

Page 4

Identifying / Evaluating Misadministration of Tests
“How do we know it is misadministration – or cheating?”

Methods of investigation

- Interviews with teachers, students and school administrators

- Erasure analysis
- Retesting
- Statistical analysis

Confirmation for statistical analysis

- Management Information Report, Jan. 2010, Dept. of Education, Office of Inspector General. The OIG data analytics project investigated 106 test administrators indicated by the A*Star method; 83 were identified by the OIG, while a number of others were eliminated due to their small number of test administrations or the statute of limitations. See Report at: www2.ed.gov/about/offices/list/oig/alternativeproducts/x11j0002.pdf

Page 5

The A*Star Method

Evaluation is based on all student groups tested with the same test and same set of standardized test administration procedures.

Steps:
1. Determine normative test item response patterns by group achievement level
2. Measure each student group (i.e. classroom, school) against the group’s achievement level norm
3. Identify those groups that significantly differ from the norm
4. Evaluate the nature of response pattern deviations
5. Identify test-takers and test items subject to improper influence

Page 6

The A*Star Method: Based on group test item response patterns

A*Star response pattern analysis:

A simple plot of the percent correct (p-value) for each test question provides a highly stable response pattern and describes the group’s success with each test item.

[Figure: p-value plot, ordered from easier items to more difficult items]
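A plot like this can be built from a scored item matrix. A minimal sketch in Python (the data layout below is my own assumption; the actual A*Star implementation is not published):

```python
def item_p_values(scores):
    """Percent correct (p-value) per test item for a group.

    scores: one row per student; each row holds 0/1 scores for the
    same test items (a hypothetical layout, not A*Star's own format).
    """
    n_students = len(scores)
    n_items = len(scores[0])
    return [sum(row[i] for row in scores) / n_students
            for i in range(n_items)]

# Three students, four items: earlier items are easier here.
group = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
]
print(item_p_values(group))  # p-values fall from 1.0 toward 0.0
```

Plotting these p-values in item order gives the response pattern curve shown on the slide.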

Page 7

Comparison to a Peer Group Norm

Skill Level Norm: All classrooms at the same achievement level set a peer group or ‘skill level’ norm.

P-value correlation: One method of comparison is a correlation of group and skill level p-values. Here, for a 50 item test, n = 50; r = .95.

Percent attempted: The line with stars indicates the percent of students in the group who answer each item.

[Figure: class response pattern plotted against the Skill Level Norm]
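The p-value correlation is presumably a standard Pearson r over the item p-values; a sketch (the five p-values below are made-up, not from the study):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical 5-item excerpt: classroom p-values vs. skill level norm.
class_p = [0.90, 0.75, 0.60, 0.40, 0.20]
norm_p = [0.88, 0.78, 0.55, 0.45, 0.25]
print(round(pearson_r(class_p, norm_p), 2))
```

On a real 50-item test the correlation is computed over all 50 item p-values, as in the r = .95 example above.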

Page 8

A Range of Norms for a Range of Achievement

Test-taker groups (i.e. classrooms) at different levels of achievement are grouped to provide a number of different peer group (or skill level) norms.

Norms confirm test reliability. Norm patterns illustrate internal consistency.

Peer group norms improve the measurement of test-taker groups and the interpretation of the results.

8 of 27 skill level norms determined for a 2001 grade 5 math test are illustrated here.
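Forming the peer-group norms amounts to bucketing test-taker groups by achievement level; a minimal sketch (the raw-score banding rule below is my own assumption — how A*Star derives its 27 levels is not specified):

```python
from collections import defaultdict

def group_by_skill_level(groups, band_width=2):
    """Bucket classrooms into skill levels by mean raw score.

    groups: mapping of group name -> list of student raw scores.
    band_width: width of each raw-score band (an assumed choice).
    """
    levels = defaultdict(list)
    for name, raw_scores in groups.items():
        mean_rs = sum(raw_scores) / len(raw_scores)
        levels[int(mean_rs // band_width)].append(name)
    return dict(levels)

# Hypothetical classrooms and raw scores.
rooms = {"5A": [30, 32, 31], "5B": [22, 24, 23], "5C": [31, 33, 29]}
print(group_by_skill_level(rooms))  # {15: ['5A', '5C'], 11: ['5B']}
```

All classrooms in a bucket would then be pooled to compute that level's norm p-values.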

Page 9

Regular Response Patterns: 4 classroom patterns representing a range of achievement

Test score reliability – and our experience – lead us to expect group response patterns to closely follow the norm at all skill levels.

[Figure: four classroom patterns at raw scores RS 23, RS 26, RS 30, and RS 34]

Page 10

Irregular Response Patterns: Encouraged guessing disrupts measurement

When student responses are subject to a significant, improper influence, the group response pattern deviates from the norm in measurable ways.

The class below has a poor correlation with the norm (.74). Guessing by some students, and teacher actions to encourage it, contradict norm patterns.

[Figure: Full class: n = 18; RS = 22.3; r Corr. = .74. Subgroup: n = 8; RS = 29.4; r Corr. = .80. Subgroup: n = 10; RS = 16.6; r Corr. = .44. Reference line at 25% correct]

Page 11

Improper Influence: Subject Group Analysis

When test administrators provide a significant level of improper assistance, the response patterns become clearly irregular.

A ‘Subject Group Analysis’ (SGA) may identify subsets of students and test answers that are highly unlikely to occur without assistance.

[Figure: Full class: n = 22; RS = 27.2; r Corr. = .83. Subject group: n = 10; RS = 32.4; r Corr. = .66; SGA P = 1.8E-08. Remaining group: n = 12; RS = 22.9; r Corr. = .82]
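The slides do not specify how an SGA probability such as P = 1.8E-08 is computed. One plausible sketch is a binomial tail probability for a subgroup succeeding on items their skill-level norm rarely gets right; every number and the independence assumption below are illustrative, not the A*Star method itself:

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): chance that at least k of n
    students answer an item correctly when each succeeds with prob p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

# Illustration: 10 students all correct on an item whose norm p-value
# is .15; then naively treating 3 such items as independent events.
per_item = binom_tail(10, 10, 0.15)
print(per_item, per_item ** 3)
```

Even this crude calculation yields probabilities far below one in a million, which is the intuition behind flagging such subgroups.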

Page 12

Improper Influence Comes in Many Forms & Levels

Influence that is limited to the last test items may indicate frustration that built up over the test session. Influence that begins with the early items and continues is more likely a purposeful effort to raise test scores.

[Figures: (1) n = 18; RS = 33.4; r Corr. = .61; SGA: n = 9; P = 7.5E-14. (2) n = 23; RS = 29.6; r Corr. = .73; SGA: n = 12; P = 3.7E-22. (3) n = 27; RS = 32.0; r Corr. = .75; SGA: n = 21; P < E-45]

Page 13

Consistency of Test Administration: Grade 5 Math 2001 - Urban School District

Consistency in test administration
When all test-taker groups are correlated with their appropriate skill level norms, the distribution of correlation coefficients indicates the consistency of the test administrations.

Group correlations are expected to be high – .90 or better. Correlations below .85 likely indicate problems in test administration.

Classrooms and Schools
A comparison of classroom groups with school groups indicates a lower consistency in classroom test administrations.
- Classrooms: median r = .900
- Schools: median r = .960
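The consistency screen above reduces to computing quantiles of the per-group correlations and flagging the low tail; a sketch using made-up coefficients:

```python
from statistics import median, quantiles

def consistency_summary(rs, flag_below=0.85):
    """Median, first quartile, and below-threshold groups for a set
    of group-vs-norm correlation coefficients."""
    first_quartile = quantiles(rs, n=4)[0]
    flagged = [r for r in rs if r < flag_below]
    return median(rs), first_quartile, flagged

# Hypothetical classroom correlations with their skill level norms.
rs = [0.96, 0.94, 0.92, 0.91, 0.90, 0.89, 0.88, 0.86, 0.82, 0.74]
print(consistency_summary(rs))
```

The flagged groups (here those below .85) would be the candidates for the closer Subject Group Analysis described earlier.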

Page 14

Classrooms & Schools: It is easier to identify misadministration in small groups

Classrooms show more volatility as compared to schools because:

- Classrooms are where the action is – by students and teachers

- Classrooms are smaller – individual student behavior may make a greater difference

- In school response patterns, problems in one classroom may be masked by good data from other classrooms.

Conversely:
- Improper influence by school administrators (passing out answers before the test session, changing answers afterward) will create highly improbable patterns involving large numbers of students, crossing over classrooms.

Page 15

Comparing 2001 to 2008: Based on school level response patterns

School correlations with their respective skill level norms

                                                         No.      MC Items      Correlation with the Norm
Year  Assessment Program                Assessment       Schools  Pct. Correct  Med.   1st Q.
2001  East coast urban school district  grade 5 math       667    68.7%         .96    .94
2008  East coast state, statewide       grade 4 math     1,311    73.4%         .90    .87
2008  Midwest state, statewide          grade 5 math     1,702    59.2%         .89    .85

Note: The east coast urban school district is not in the east coast state.

School correlations with their appropriate response pattern norms are substantially lower in 2008 as compared with 2001.

Low correlations may indicate confusion, excessive guessing and various test-taking strategies – and they may indicate purposeful efforts to raise test scores.

Low correlations always mean lower test score reliability.

Page 16

School Level Response Patterns: 2008 Irregularities in Grade 5 Math

Small school: n = 23; RS = 38.1; r Corr. = .42. Subject Group: n = 8; P = 3.3E-23.

Small to medium size school: n = 47; RS = 31.8; r Corr. = .18. Compare MC to OE (constructed response) items.

Page 17

School Level Response Patterns: 2008 Irregularities in Grade 5 Math

Medium size school: n = 69; RS = 30.9; r Corr. = .70. Subject Group: n = 30; P = 5.0E-21.

Large size school: n = 253; RS = 26.7; r Corr. = .87. Subject Group: n = 68; P = 4.9E-19.

Page 18

Identifying & Measuring Misadministration

What constitutes a “significant” case of misadministration (cheating)?

Number of test items affected
Improper influence on any test item is wrong, but influence on only a few items is more likely an effort to facilitate the test administration rather than to materially raise test scores.

Number of students involved
My sense of it is that a large number of items for a few students is a greater problem than a few items for a large number of students – the latter may be a perceived problem with the items, while the former is an effort to raise the scores of lower performing students.

Improbability of response pattern
Any probability less than 1 in 10,000 is significant, but common, wrong answers create unusually low probabilities that may overshadow more important problems. A “six sigma” approach is conservative.

Definition used here:
- Minimum 10% of test items
- Minimum #SGA students times #SGA items = 5% of all responses
- Probability less than 1 in 100,000 (less than 10 in one million)
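The screening definition above translates directly into code (a sketch; the parameter names are my own, and 1 in 100,000 is written as 1e-5):

```python
def significant_influence(total_items, total_responses,
                          sga_items, sga_students, sga_prob):
    """Apply the screening definition above:
    - SGA involves at least 10% of the test items,
    - SGA students x SGA items cover at least 5% of all responses,
    - the response pattern probability is below 1 in 100,000.
    """
    return (sga_items >= 0.10 * total_items
            and sga_students * sga_items >= 0.05 * total_responses
            and sga_prob < 1e-5)

# 50-item test, 20 students (1,000 responses); 8 students share
# improbable answers on 9 items with P = 1.8E-08.
print(significant_influence(50, 1000, 9, 8, 1.8e-8))  # True
```

All three conditions must hold, so a tiny probability alone (e.g. from common wrong answers on one or two items) does not trigger the flag.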

Page 19

Analysis: Random and Extended Samples

Frequency of significant influence in the assessment setting
SGA applied to random samples:
- 2001: approximately 12% of all classrooms and 45% of all schools in the urban district.
- 2008: approximately 30% of all schools in statewide reviews.

Frequency of significant influence by school size
SGA applied to extended samples selected based on school size:
(a) Number of classrooms (2001)
(b) Number of students tested (2001 & 2008)

Frequency of school administration influence
SGA applied to extended samples selected based on:
(a) Response pattern characteristics suggestive of irregularities.
(b) Selected school districts by location and size.

Page 20

Table of Results: Frequency of Significant Influence

In 2001, approximately 3% of grade 5 classrooms and 2% of elementary schools in a large urban school district are identified as involving a significant misadministration of grade 5 math tests.

In 2008, approximately 14% of elementary schools in one state and 34% of elementary schools in another state are identified as involving a significant misadministration of grade 4 and grade 5 math tests, respectively.

A portion of the identified cases of misadministration may be the result of test-taking strategies not generally regarded as cheating. All are most likely to involve the active efforts of teachers or school administrators outside of the standardized test administration procedures, and all necessarily result in a loss of test score reliability.

All Classrooms & Schools*

                      2001 Grade 5 Math              2008 Gd. 4 Math   2008 Gd. 5 Math
                      Urban Dist.    Urban Dist.     State 1           State 2
                      Classrooms     Schools         Schools           Schools
All Groups
  Number                  2,446          667             1,311             1,702
  Avg. Size                24.1         89.3              66.1              61.4
  Median r                 0.90         0.96              0.90              0.89
Random Sample
  Number                    300          300               410               513
  Pct. of All Groups      12.3%        45.0%             31.3%             30.1%
  Median r                 0.91         0.96              0.90              0.88
  1st Qtr. r               0.88         0.94              0.87              0.85
  Sig. Influence              9            5                59               174
  Pct. of Sample           3.0%         1.7%             14.4%             33.9%

* All students tested under standardized test administration procedures. Minimum 15 students per group.

Page 21

Small Schools - 2001 & 2008

School size
The median correlation declines for small schools in the urban district and in both states, with the 1st quartile correlation dropping below .85 in the state samples.

The frequency of significant misadministration rises among small schools for both the urban district and State #1, but declines for State #2.

The low correlations in State #2 represent misadministration of the test, yet the form is more likely to include confusion, excessive guessing, and misdirection as compared to larger schools in the same state. Nevertheless, the frequency of significant misadministration remains exceptionally high.

Schools with 15 to 50 Students*

                      2001              2008 Gd. 4 Math   2008 Gd. 5 Math
                      Urban Dist.       State 1           State 2
                      Schools           Schools           Schools
All Groups
  Number                  102               506               775
  Avg. Size              36.5              36.1              35.0
  Median r               0.93              0.87              0.87
Random Sample
  Number                   47               170               239
  Pct. of All Groups    46.1%             33.6%             30.8%
  Median r               0.93              0.88              0.86
  1st Qtr. r             0.90              0.83              0.82
  Sig. Influence            2                30                62
  Pct. of Sample         4.3%             17.6%             25.9%

* All students tested under standardized test administration procedures.

Page 22

Small Schools in 2001: Significant Influence is more often found in small schools

In 2001, where classroom identification is available:
- Schools with 1 or 2 classrooms: 9.5% significant misadministration
- Schools with 6 or more classrooms: 3.2% significant misadministration

All Classrooms with SGA Results*
Significant Influence based on Excp. Items >= 5; SGA Pct. >= 5%; SGA Prob. <= E-06

Classrooms  Nbr. of  Total       Avg. Class         SL       Classrooms in   SGA Pct. of     Nbr. with   Pct. Sig. Infl.  Sig. Infl.
per School  Schools  Classrooms  Size        RS     r Corr.  the SGA Sample  All Classrooms  Sig. Infl.  of Sample        Freq.
 1              41        41      21.6       34.2   0.868         41            100.0%           6           14.6%
 2             102       204      22.2       34.4   0.884        107             52.5%           8            7.5%         9.5%
 3             189       567      23.4       34.4   0.892        219             38.6%          15            6.8%
 4             143       572      24.2       34.2   0.893        236             41.3%          12            5.1%         6.1%
 5              86       430      24.8       34.4   0.897        165             38.4%          11            6.7%
 6              55       330      25.0       34.2   0.896        142             43.0%           4            2.8%
 7              23       161      24.7       34.4   0.898         78             48.4%           3            3.8%
 8              13       104      24.9       32.8   0.904         55             52.9%           2            3.6%         3.2%
 9               3        27      24.9       35.4   0.883         27            100.0%           1            3.7%
10               1        10      23.8       30.6   0.901         10            100.0%           0            0.0%
Totals         656     2,446      24.1       34.2   0.893      1,080             44.2%          62            5.7%         5.7%

* All classrooms have 15 or more students tested without accommodations.

Page 23

School Administration Influence: A more frequent element in misadministration

Administration influence: 2001, 0%; 2008, 27% - 38%. The low probability of the SGA results suggests that the influence is directed by one person or under the direction of one person. When the number of students in the SGA is large (i.e. > 30), the source of the influence is likely to be outside of the classroom – i.e. the school administration.

Number of Students in All Classrooms with SGA Results
SGA definition: Excep. Items >= 10% of all test items; SGA Prob. <= E-06

Nbr. of               Urban      State     State
Test-takers           District     1         2
Up to 10                  15        10        15
11 to 20                   0        47        81
21 to 30                   0        31        83
31 to 40                   0        16        52
41 to 50                   0         9        22
51 +                       0         8        37
Total                     15       121       290
SGA Sample N             529       728       921
Pct. of Sample          2.8%     16.6%     31.5%
Med. Nbr. TT             6.0      21.0      26.0
More than one
  class >= 31              0        33       111
Pct. of SGA Sample      0.0%     27.3%     38.3%

Expanded sample
The SGA method has been applied to a substantially expanded sample of schools, though on a selective, non-random basis. The goal is to expand the number of observed cases of significant influence to evaluate their nature.

The frequency of significant influence in the expanded sample is similar to the random sample and illustrates a marked difference in the number of test-takers involved in the SGA from 2001 to 2008.

Page 24

Discussion: It’s not the teachers or the tests – it’s the system.

Misadministration of high-stakes tests is a major part of the problem of volatility in test score results.

Misadministration of high-stakes tests preceded NCLB at a modest, but significant, level and markedly increased from 2001 to 2008.

The character of misadministration has changed from the entrepreneurial efforts of individual teachers to more often include the direct participation of school administrators.

Misadministration includes many forms of deviation from standardized procedures, including informal strategies to raise scores recommended by test-prep writers and school authorities, leaving a fuzzy line for where cheating begins.

Principals and teachers are not given thorough instructions on test administration do’s and don’ts, and are left to sort out informal recommendations, improvise, and scramble during test administration sessions.