The Case for Performance-Based Tasks without Equating

Paper Presented at the National Council on Measurement in Education, Vancouver, British Columbia, Canada

Walter D. Way, Daniel Murphy, Sonya Powers, and Leslie Keng

April 2012


Abstract

Significant momentum exists for next-generation assessments to increasingly utilize technology

to develop and deliver performance-based assessments. Many traditional challenges with this

assessment approach still apply, including psychometric concerns related to performance-based tasks (PBTs), such as low reliability, the efficiency of measurement, and the comparability of different tasks. This paper proposes a model for performance-based assessments that assumes random selection of PBTs from a large pool and treats tasks as comparable without equating. The model assumes that if a large number of PBTs can be randomly assigned,

then task-to-task variation across individuals will average out at the group (i.e., classroom and

school) level. The model was evaluated empirically using simulations involving a re-analysis of

data from a statewide assessment. A set of G-theory analyses was conducted to assess the

reliability of average school performance on PBTs and evaluate how variance due to the

randomly-assigned tasks compared to other sources of variation. Analysis based on the linear

student growth percentiles (SGP) model was used to assess the degree to which the model

assumption of randomly-equivalent tasks held by comparing school classifications based on PBT

growth estimates with three alternative school-level measures. The study findings support the viability of the proposed model for next-generation performance-based assessments whose scores are used to make group-level inferences.

Keywords: performance-based tasks, G theory, student growth percentiles, Common Core


The Case for Performance-Based Tasks without Equating

Significant momentum exists in the United States for next-generation assessments that go

beyond traditional multiple-choice and constructed-response item types. The focus on college

and career readiness and the increased emphasis on new skills and abilities that are needed to

succeed in the 21st century have propelled a renewed interest in performance-based assessment

(PBA; Darling-Hammond & Pecheone, 2010). This movement is further supported by the

increasingly broad ways that technology is being used to present instruction and assess learning.

It is therefore not surprising that both consortia that have been funded by the federal government

to develop assessments measuring the Common Core standards – the Partnership for Assessment

of Readiness for College and Careers (PARCC) and the SMARTER Balanced Assessment

Consortium (SBAC) – plan to include performance tasks as part of their summative tests.

Although next-generation PBA has potential, many of the traditional challenges with this

assessment approach still apply. When PBA is combined with objectively-scored assessments to

produce summative scores, which both the PARCC and SBAC plan to do, complex issues related

to aggregating scores arise (Wise, 2011). These issues are exacerbated by psychometric

concerns related to performance-based tasks, which include low reliability, efficiency of

measurement (i.e., the amount of testing time needed to achieve a desired level of reliability),

and the comparability of different tasks. Comparability from task-to-task has long been a concern

with performance-based assessments. Green (1995) was frank in concluding that a performance-

based assessment is not well-suited to “maintaining the aspects of a testing procedure that are

congenial to the equating process.” He summarized the concern as follows:


In the language of factor analysis, each test item or task usually has a small

amount of common variance and has substantial specific variance. A test combines the

results of many items or tasks, building up the common variance and washing out the

specific variance, which is usually treated as a source of error. The fewer items, or tasks,

there are, the less advantage can be taken of the immense power of aggregation to

overwhelm such error. (Green, 1995, p. 14)

The traditional notion of performance-based tasks, developed in the late 1980s and early

1990s, assumed that for a given assessment program, students would typically take the same

tasks at the same time. Thus, when an assessment was repeated with a new task, either for the

same students at a later time or for a new cohort of students, comparisons of student performance

were extremely difficult because of task-to-task variation.

The next generation of PBA has the opportunity to approach task variation differently

because of advances both in technology and psychometric approaches. For example, evidence-

centered design (ECD; Huff, Steinberg, & Matts, 2010) approaches can reduce task-to-task

variation and provide templates to aid task development. Technology can further assist

development efforts both by automatically generating variants of tasks and by making it possible

to randomly select specific tasks from an available pool of tasks. However, to take full advantage

of these features, changes in traditional models for PBA will be needed. In this paper, we

propose and evaluate a model for performance-based assessments that assumes random selection

of performance-based tasks (PBTs) from a large pool, and that assumes tasks are comparable

without equating for the purposes of aggregating scores at the group (e.g., classroom or school)

level. These two assumptions capitalize on an underlying expectation that task-to-task variation

across individuals will average out at the group level.


A Model for Performance-Based Assessments

It seems reasonable to assume that next-generation development of performance-based

tasks will be strongly influenced by ECD. For example, Mislevy and Haertel speak of “the

exploitation of efficiencies from reuse and compatibility” (Mislevy & Haertel, 2006, p. 22) that

is afforded by ECD. Luecht and his colleagues (cf. Luecht, Burke, & Shu, 2010) have coined

the term “Assessment Engineering” (AE), which involves construct maps, evidence models, task

models and templates as a means of generating extremely large numbers of complex

performance exercises. The Literacy Design Collaborative (LDC) is a Gates Foundation project

whose purpose is to develop literacy template tasks that can be filled with curriculum content

from varied subjects. These templates can be used for teaching in the classroom but also extend

to performance task design. The potential to develop large numbers of tasks from specified task

templates is further aided by technology, which provides tools that test developers can use to

support assessment task authoring (Liu & Haertel, 2011).

If performance-based tasks are to measure deep thinking and 21st century skills, they will

still require significant assessment time. Thus, although ECD approaches can reduce task-to-task

variation, this variation will still exist at the individual student level because each student can take only a limited number of PBTs within practical time limits. However, if a large

number of PBTs can be randomly assigned, then task-to-task variation across individuals will

average out when scores are aggregated across students. This is an important test design

consideration for the Common Core assessments, which must produce student achievement data

and student growth data for determinations of school, principal, and teacher effectiveness. With

large pools of PBTs, random assignment can represent a kind of domain sampling, and scores

across classes and schools can represent estimates of average domain scores. From this


viewpoint, equating is not necessary because the aggregated scores are comparable within

calculable estimates of standard errors.

An additional benefit of having a large pool of PBTs is a significant reduction in security

concerns. With a large enough task pool, it would be possible to disclose tasks along with the

actual student work and the resulting scores. Furthermore, the PBT raw scores could be

interpreted directly in terms of the applied scoring rubrics. This would greatly facilitate another

requirement of the Common Core assessments, which is to produce data that informs teaching,

learning and program improvement.

Research Questions

The purpose of this paper is to describe and illustrate a model for performance-based

assessments that assumes randomly selected assessment tasks from a large pool of PBTs are

comparable such that equating is not necessary. The model was evaluated empirically using

simulations involving a re-analysis of data from a statewide assessment. The analyses sought to

answer two main research questions. The first research question focused on the reliability of test

scores under the proposed PBT model. Specifically, how does the impact of task-to-task

variation on the reliability of PBT scores at an aggregate (e.g., school) level compare to other

sources of variation? Also, states often make inferences about schools by placing them in

performance categories based on their students’ achievement growth using standardized

measures. Our second research question therefore asked, to what degree does the assumption of

randomly-equivalent tasks hold when classifying schools based on their students’ growth on test

scores that incorporate PBTs?


Method

Data Source

An empirical simulation was conducted to explore the impact of unequated performance-

based tasks used for summative assessment purposes. The empirical simulation used real

response data from statewide mathematics and science tests administered in grade 10 in 2009 and grade 11 in 2010. Students were matched across years, so that the data set included data for each of the four tests.

Because the assessments consist of only multiple-choice items, the performance-based

task scores were “simulated” by combining randomly selected subsets of items from the math

and science tests. For each test, 50 random samples of 12 items were selected with replacement

from the complete tests. These sets represented the simulated PBTs. The simulated PBTs were

used in different ways for the two sets of analyses that followed. For the generalizability analyses

(to be described below) student scores on all 50 sets were included. For the growth analyses, it

was assumed that students took one of the sets of 12 items and the remaining test items were

considered the simulated summative test. Table 1 presents the number of items and coefficient

alpha reliabilities of the full tests that were used for the simulations, as well as the Spearman-

Brown projected reliabilities of the shortened tests and the simulated performance tasks.

Table 1. Test Length and Reliability of the Full and Shortened Tests

                      Full Test          Shortened Test      Simulated Task
Test                  # items   ρxx′     # items   ρ*xx′     # items   ρ*xx′
Science Grade 10         55     0.91        43     0.89         12     0.69
Math Grade 10            56     0.93        44     0.91         12     0.74
Science Grade 11         55     0.88        43     0.85         12     0.61
Math Grade 11            60     0.90        48     0.88         12     0.65

Note. ρxx′ refers to coefficient alpha reliability; ρ*xx′ refers to Spearman-Brown adjusted reliability.
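The projected reliabilities in Table 1 are consistent with the standard Spearman-Brown length adjustment; as a check against the tabled values, the Grade 10 science entries can be reproduced as follows:

% Spearman-Brown projection when test length changes by a factor k:
\[
\rho^{*}_{xx'} = \frac{k\,\rho_{xx'}}{1 + (k - 1)\,\rho_{xx'}},
\qquad
k = \frac{n_{\text{items, new}}}{n_{\text{items, full}}}.
\]
% Grade 10 science (rho_xx' = .91, 55 items):
%   shortened test, k = 43/55, gives rho* ~ .89;
%   simulated task, k = 12/55, gives rho* ~ .69 (matching Table 1).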


Data Generation

For each test at each grade level, the data were generated for the study as follows (a brief code sketch of these steps appears after the list):

1. Fifty sets of 12 items were selected at random and with replacement from the full test.

Each of these 12-item sets was assumed to represent a simulated PBT.

2. The 0/1 responses for each of the 50 simulated PBTs were summed and saved.

3. For each student, the entire set of 50 simulated PBTs was used for the generalizability

analyses.

4. For each student, one of the 50 simulated PBTs was randomly selected for use in the

growth analyses.

5. For the growth analyses, the 12 items contributing to the assigned simulated PBT were

coded as “not presented” in the student’s response data matrix for each summative test.

6. For each summative test, Rasch item parameters obtained operationally were used to

recalibrate student abilities using the data matrices with the 12-item sets excluded. For

each test, the resulting ability estimates were rescaled to have a mean of 500. Note that

the operational tests were not vertically scaled between grades 10 and 11.
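A minimal sketch of these data-generation steps in code, assuming a students-by-items 0/1 response matrix; the Rasch recalibration and rescaling in step 6 are omitted, and all names are illustrative rather than taken from the study:

import numpy as np

rng = np.random.default_rng(2012)  # fixed seed for reproducibility

def simulate_pbts(responses, n_pbts=50, pbt_len=12):
    """Steps 1-5 for one test: build simulated PBTs from a students-by-items
    0/1 response matrix and randomly assign one PBT per student."""
    n_students, n_items = responses.shape

    # Step 1: fifty 12-item sets drawn at random from the full test. Items
    # within a set are kept distinct here; sets are drawn independently, so
    # the same item can recur across sets.
    pbt_items = [rng.choice(n_items, size=pbt_len, replace=False)
                 for _ in range(n_pbts)]

    # Step 2: sum the 0/1 responses on each simulated PBT (raw scores 0-12).
    pbt_scores = np.column_stack([responses[:, items].sum(axis=1)
                                  for items in pbt_items])

    # Step 4: randomly assign one simulated PBT per student for the growth
    # analyses (step 3 simply uses all 50 columns of pbt_scores).
    assigned = rng.integers(n_pbts, size=n_students)

    # Step 5: mark the assigned PBT's items as not presented in the matrix
    # used to rescore the summative test (NaN stands in for "not presented").
    summative = responses.astype(float)
    for student, pbt in enumerate(assigned):
        summative[student, pbt_items[pbt]] = np.nan

    return pbt_items, pbt_scores, assigned, summative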

Thus, for the growth analyses, each student received four scores for each subject area

(math and science): a simulated PBT raw score and an equated summative test scale score

across two years of administration (2009 and 2010). The use of data across years allowed student

growth to be calculated from 2009 to 2010. Table 2 provides descriptive statistics for the

simulated PBT raw scores and equated summative test scale scores.


Table 2. Descriptive Statistics for Simulated PBTs and Summative Tests (Student Level)

                               Summative Test (Scale Score)         PBT (Raw Score)
Subject    N        Grade      Mean (SD)    Minimum   Maximum       Mean (SD)     Minimum   Maximum
Math       246,017  10         500 (100)        12       806        8.32 (2.69)         0        12
                    11         500 (100)       111       841        8.87 (2.33)         0        12
Science    245,438  10         500 (100)        31       849        8.60 (2.46)         0        12
                    11         500 (100)       -43       877        8.93 (2.18)         0        12

In addition, the availability of each student’s campus identifier allowed us to aggregate

the simulated test results at the school level. Table 3 provides the descriptive information for the

simulated PBTs raw scores and equated summative test scale scores aggregated at the school

level.

Table 3. Descriptive Statistics for Simulated PBTs and Summative Tests (School Level)

                             Summative Test Scale Score               PBT Raw Score
Subject    N       Grade     Mean (SD)         Minimum   Maximum      Mean (SD)     Minimum   Maximum
Math       1,086   10        492.61 (37.94)    381.81    673.38       8.15 (0.94)    4.73     11.33
                   11        494.00 (36.69)    360.65    674.83       8.77 (0.77)    5.53     11.32
Science    1,086   10        493.28 (38.64)    382.24    637.40       8.48 (0.87)    5.78     10.81
                   11        494.30 (36.29)    364.81    629.86       8.84 (0.71)    5.97     10.88

Generalizability Analyses of PBT Scores

A number of analyses were conducted on the simulated data to address the research

questions. Generalizability theory (G-theory; Cronbach, Gleser, Nanda & Rajaratnam, 1972;

Feldt & Brennan, 1989) analyses were applied to assess the reliability of average school


performance on the PBTs. (See also Kane & Brennan, 1977; Kane, Gillmore, & Crooks, 1976 for

a discussion of G-theory in the context of estimating the reliability of class means.)

A generalizability (G) study was designed which included students (persons [p]), schools

(s), PBTs (t), and grade level (occasions [o]) as measurement facets. The structure of the study

data was such that students were nested within schools, and tasks were nested within occasions.

Because student scores were matched across two years (i.e., grade 10 to grade 11), all students

included in the analyses took all items on both occasions. The G study design can be abbreviated

as: (p:s) x (t:o). In order to have a balanced design, and to be consistent with the criteria used in

the growth model analysis (see next subsection), schools with fewer than 30 students were

eliminated. In schools with more than 30 students, 30 students were randomly sampled for

inclusion in the study. The variance component for persons was therefore estimated based on 30

replications. There were 1,008 schools with 30 or more students for math and 1,016 schools for

science that were used to estimate the variability due to schools. In addition, for this portion of

the study, scores on each of the 50 PBTs used in the simulation were calculated for each student.

As a result, for both subjects, 50 simulated PBTs were used to estimate task variability. Finally,

the science and math assessments were given at grade 10 and grade 11, resulting in 2 replications

for the occasions facet.
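For reference, the per-occasion univariate decomposition underlying the variance components reported later in Tables 4 and 5 can be sketched in standard G-theory notation (e.g., Brennan, 2001a) as:

% Persons nested in schools, crossed with tasks ((p:s) x t), one occasion:
\[
X_{pst} = \mu + \nu_{s} + \nu_{p:s} + \nu_{t} + \nu_{st} + \nu_{pt:s,e},
\]
\[
\sigma^{2}(X_{pst}) = \sigma^{2}(s) + \sigma^{2}(p{:}s) + \sigma^{2}(t)
                    + \sigma^{2}(st) + \sigma^{2}(pt{:}s,e),
\]
% where the final component confounds the (person:school) x task interaction
% with residual error; the five terms correspond, row for row, to Tables 4 and 5.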

Variance components were estimated using the mGENOVA software (Brennan, 2001b). The multivariate counterpart to the univariate (p:s) x (t:o) design was used; following the notation of Brennan (2001a), it is represented as (p•:s•) x tº, where solid circles (•) indicate facets

crossed with the occasions facet and open circles (º) indicate facets nested within the occasions

facet.


Several decision (D) studies were used to evaluate the reliability that can be expected

under a variety of measurement replications. The replications that were considered were those

that might be possible operationally. Although 50 simulated tasks were used to estimate

variance components, PBTs are time consuming to administer and score. Therefore, D studies

were conducted using one to four PBTs per person. Also, it was expected that as the number of

students increased within a school, the average school performance would be more reliable

because of decreased sampling variability. Thirty students per school were used to estimate the

variability due to persons, but for D studies, sample sizes of 10, 25, 50, 75, and 100 were also

considered. Ten students represented a particularly small class size, 25 represented an

average class size for a school with a single class per grade level, and 50, 75, and 100

represented schools with two to four classrooms per grade level. Additionally, reliability

estimates were calculated based on the Grade 10 test (occasion 1) only, the Grade 11 test

(occasion 2) only, and based on both occasions. Because schools are often rank ordered for

comparisons, the generalizability coefficient was used as the estimate of reliability for a single

occasion, and the composite generalizability coefficient was used as the estimate of reliability

across the two occasions. The index of dependability (also known as the phi coefficient) was

also calculated for each condition, as this index would be more appropriate for situations where

schools are held to an absolute criterion, such as adequate yearly progress (AYP).
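For a single occasion, the two school-level coefficients used in these D studies follow the standard definitions sketched below, where n′p and n′t denote the D-study numbers of students per school and tasks per student (the cross-year composite relies on the multivariate components and is not shown):

% Relative and absolute error variances for a school mean:
\[
\sigma^{2}(\delta) = \frac{\sigma^{2}(p{:}s)}{n'_{p}}
                   + \frac{\sigma^{2}(st)}{n'_{t}}
                   + \frac{\sigma^{2}(pt{:}s,e)}{n'_{p}\,n'_{t}},
\qquad
\sigma^{2}(\Delta) = \sigma^{2}(\delta) + \frac{\sigma^{2}(t)}{n'_{t}}.
\]
% Generalizability coefficient (relative decisions) and index of dependability
% (absolute decisions) for school-level scores:
\[
E\rho^{2} = \frac{\sigma^{2}(s)}{\sigma^{2}(s) + \sigma^{2}(\delta)},
\qquad
\Phi = \frac{\sigma^{2}(s)}{\sigma^{2}(s) + \sigma^{2}(\Delta)}.
\]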

Growth Analyses

To provide data for the growth analyses, composite scores that combined the PBT raw

score and summative scale score for each student were created. Two composite scores that

standardized the PBT and summative test measures were considered: 1) a composite score

that was the (unweighted) sum of the two standardized scores, and 2) a weighted composite score


in which the standardized summative test score was weighted to count three times as much as the

standardized PBT score.
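A minimal sketch of the two composites in code, assuming NumPy arrays of student-level PBT raw scores and summative scale scores (names are illustrative):

import numpy as np

def zscore(x):
    # Standardize across students: mean 0, SD 1.
    return (x - x.mean()) / x.std()

def composite_scores(pbt_raw, summative_scale, weight=3.0):
    """Unweighted and weighted composites of the standardized measures."""
    z_pbt = zscore(np.asarray(pbt_raw, dtype=float))
    z_sum = zscore(np.asarray(summative_scale, dtype=float))
    unweighted = z_pbt + z_sum            # equal weights
    weighted = weight * z_sum + z_pbt     # summative counts three times as much
    return unweighted, weighted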

A linear student growth percentile (SGP) model was used to estimate student and school

growth. The SGP model uses quantile regression (Koenker, 2005) to estimate a conditional linear

quantile function,

\[
Q_{Y}(\tau \mid X = x) = x'\beta(\tau) \qquad (1)
\]

where Q_Y(τ | X = x) is the τth conditional quantile of the random variable Y and β(τ) is the set of regression coefficients. The quantile regression procedure minimizes an asymmetric loss function for each τ in a specified set T ⊂ (0, 1); in particular, for this analysis, T = {.01, .02, .03, …, .99}. In this

analysis, each student received an estimated growth percentile, which was the τ that minimized

the distance between the student’s observed grade 11 score and a predicted grade 11 score based

on the model. To measure school growth, the students’ growth percentiles were aggregated

within each school and the median growth percentile was calculated. Schools with fewer than 30

students were removed from the analyses.
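The linear SGP step can be sketched as follows; the paper does not specify software, so this uses the QuantReg class in statsmodels as one possible quantile-regression implementation, and array and function names are illustrative:

import numpy as np
import statsmodels.api as sm

def student_growth_percentiles(score_y1, score_y2, taus=np.arange(0.01, 1.00, 0.01)):
    """Linear SGP: fit a conditional quantile of the grade 11 score given the
    grade 10 score for each tau; a student's SGP is the tau whose predicted
    quantile is closest to the observed grade 11 score."""
    score_y1, score_y2 = np.asarray(score_y1, float), np.asarray(score_y2, float)
    X = sm.add_constant(score_y1)  # intercept plus prior-year score
    preds = np.column_stack([sm.QuantReg(score_y2, X).fit(q=t).predict(X)
                             for t in taus])
    closest = np.abs(preds - score_y2[:, None]).argmin(axis=1)
    return np.round(taus[closest] * 100).astype(int)  # growth percentiles 1-99

def school_median_sgp(sgp, school_id, min_n=30):
    """Median growth percentile per school, dropping schools with < min_n students."""
    sgp, school_id = np.asarray(sgp), np.asarray(school_id)
    return {s: float(np.median(sgp[school_id == s]))
            for s in np.unique(school_id)
            if (school_id == s).sum() >= min_n}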

The next step divided the schools into 5 equally sized groups and assigned them a grade

of A, B, C, D, or F based on the median growth percentiles for each of the measures. We

compared the school grade classifications based on students’ PBT growth estimates with the

classifications based on their growth estimates using three alternative school-level measures: the

mean summative test scale score, the mean (unweighted) composite score, and the mean

weighted composite score. It was hypothesized that the variation across PBTs would cancel out

when aggregated at the school level, in which case school classifications based on PBTs would

lead to inferences similar to those based on the summative and composite measures.
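The quintile grading and agreement tabulation can be sketched as follows, assuming a pandas DataFrame of school median growth percentiles (column names are hypothetical):

import pandas as pd

GRADES = ["F", "D", "C", "B", "A"]  # lowest to highest growth quintile

def assign_grades(median_sgp):
    """Split schools into five equally sized groups by median growth percentile."""
    return pd.qcut(median_sgp, q=5, labels=GRADES)

def agreement_rates(grades_a, grades_b):
    """Agreement between two A-F classifications, in the categories of Tables 8-9."""
    diff = (grades_a.cat.codes - grades_b.cat.codes).abs()  # codes: 0 = F ... 4 = A
    return {"Exact": (diff == 0).mean(),
            "Adjacent": (diff == 1).mean(),
            "Exact + Adjacent": (diff <= 1).mean(),
            "Within 2 Categories": (diff == 2).mean(),
            "Extreme Variation": (diff >= 3).mean()}

# Hypothetical usage with a DataFrame of school median growth percentiles:
# schools["grade_pbt"] = assign_grades(schools["median_sgp_pbt"])
# schools["grade_sum"] = assign_grades(schools["median_sgp_summative"])
# print(agreement_rates(schools["grade_pbt"], schools["grade_sum"]))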


Results

Generalizability Analyses

Variance component estimates resulting from the generalizability analyses are provided

in Table 4 for math and in Table 5 for science. The school level correlation between PBT scores

in Grades 10 and 11 was 0.95 for math and 0.94 for science, indicating that student performance

was very similar from year to year within a school. Universe score variance was greater in

Grade 10 than in Grade 11 for both science and math, but the error variance was similar in the

two grades. This led to higher reliability estimates in Grade 10. A comparison of the variance

estimates indicated that relative to other sources, there was very little variability in performance

across PBTs. Likewise, there was very little interaction between schools and PBTs. The major

source of error variance came from the variability of students within schools. These data

indicated that there was more variability in student performance within a school than variability

across schools. Thus, the G study results suggested that the most efficient way to decrease error

variance would be to include as many students as possible in the averages used to evaluate

school-level performance.

Table 4. Variance Estimates for Math

Variance Component Occasion 1 (Grade 10) Occasion 2 (Grade 11)

School 0.69 0.39

Person : School 4.58 3.03

Task 0.23 0.25

School x Task 0.03 0.02

(Person : School) x Task 1.50 1.47


Table 5. Variance Estimates for Science

Variance Component Occasion 1 (Grade 10) Occasion 2 (Grade 11)

School 0.64 0.38

Person : School 3.55 2.63

Task 0.16 0.16

School x Task 0.03 0.03

(Person : School) x Task 1.51 1.42

Generalizability coefficients provide an estimate of reliability based on relative error

variance. These coefficients were provided because in many cases schools are evaluated based

on their rank order. However, in the case of AYP, schools are also evaluated against an absolute

criterion, making the index of dependability, which is based on absolute error variance, the more

conceptually appropriate estimate of reliability. Both coefficients are provided in Table 6 for

math and Table 7 for science. For the D study designs considered below, more sources of error

contribute to the calculation of absolute error variance than to the calculation of

relative error variance. For this reason, the index of dependability (phi coefficient) is always

lower than the generalizability coefficient. The implication of this is that additional replications

of the measurement procedure should be used when making school comparisons based on an

absolute criterion like AYP to achieve the same reliability obtained when making normative

comparisons of schools.
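As a concrete check, the single-occasion entries of Table 6 can be approximately reproduced from the rounded Grade 10 math components in Table 4 using the D-study formulas sketched in the Method section:

def d_study(var_s, var_ps, var_t, var_st, var_pst, n_p, n_t):
    """Single-occasion school-level generalizability (GC) and phi coefficients."""
    rel_err = var_ps / n_p + var_st / n_t + var_pst / (n_p * n_t)  # relative error
    abs_err = rel_err + var_t / n_t                                # absolute error adds task variance
    return var_s / (var_s + rel_err), var_s / (var_s + abs_err)

# Grade 10 math variance components from Table 4 (rounded to two decimals).
math_g10 = dict(var_s=0.69, var_ps=4.58, var_t=0.23, var_st=0.03, var_pst=1.50)

for n_p in (10, 25, 50, 75, 100):
    for n_t in (1, 4):
        gc, phi = d_study(**math_g10, n_p=n_p, n_t=n_t)
        print(f"n_p={n_p:3d}  n_t={n_t}:  GC={gc:.2f}  Phi={phi:.2f}")
# e.g., n_p=50, n_t=4 gives GC = 0.87 and Phi = 0.81, matching Table 6; other
# cells agree to within rounding of the tabled variance components.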

Generalizability and phi coefficients were provided for the five levels of student sample

size (10, 25, 50, 75, and 100), two of the task conditions (1 and 4), and for Grade 10, Grade 11,

and the composite of the two years. As previously mentioned, the Grade 10 scores were slightly


more reliable. The composite scores (based on the PBTs across years) were only slightly more

reliable than the scores for either of the two years because the correlation between grade 10 and

11 school-level scores was so high. The reliability of school-level scores was above 0.90 for

schools with 100 students or more, given four or more tasks taken by students per occasion. As

the number of students and the number of tasks decreased, the reliability also decreased. For

schools with 10 students, the reliability was quite low, even with four tasks.

Table 6. Generalizability and Phi Coefficients for Math

                        Occasion 1 (Grade 10)    Occasion 2 (Grade 11)    Composite
N Persons    N Tasks    GC        Phi            GC        Phi            GC        Phi
100          4          0.93      0.86           0.91      0.79           0.93      0.88
100          1          0.89      0.69           0.85      0.55           0.91      0.75
75           4          0.91      0.84           0.89      0.78           0.91      0.87
75           1          0.87      0.68           0.83      0.54           0.89      0.74
50           4          0.87      0.81           0.84      0.74           0.87      0.83
50           1          0.83      0.65           0.78      0.52           0.85      0.71
25           4          0.77      0.73           0.74      0.66           0.78      0.75
25           1          0.72      0.58           0.66      0.46           0.74      0.64
10           4          0.58      0.55           0.53      0.49           0.59      0.57
10           1          0.52      0.45           0.45      0.35           0.55      0.49


Table 7. Generalizability and Phi Coefficients for Science

                        Occasion 1 (Grade 10)    Occasion 2 (Grade 11)    Composite
N Persons    N Tasks    GC        Phi            GC        Phi            GC        Phi
100          4          0.93      0.88           0.91      0.83           0.94      0.90
100          1          0.89      0.73           0.84      0.62           0.91      0.79
75           4          0.91      0.87           0.89      0.81           0.92      0.89
75           1          0.87      0.72           0.81      0.61           0.89      0.78
50           4          0.88      0.84           0.85      0.78           0.89      0.86
50           1          0.83      0.69           0.77      0.58           0.85      0.75
25           4          0.80      0.76           0.75      0.69           0.80      0.78
25           1          0.73      0.62           0.66      0.52           0.76      0.68
10           4          0.62      0.59           0.55      0.52           0.62      0.61
10           1          0.55      0.48           0.46      0.39           0.57      0.52

Generalizability coefficients are plotted with black lines in Figure 1 for each combination

of student sample size and number of tasks, for math. Phi coefficients are plotted for each

combination using red lines. The same information is provided in Figure 2 for science. It is

clear from Figures 1 and 2 that school sample size had a substantial impact on the reliability of

school-level scores. The increase in reliability from schools with 10 students to schools with 25

students was around 0.2. However, the difference between the reliability obtained with 75

students and the reliability obtained with 100 students was much smaller. A further increase in

sample size would have negligible impact on the reliability of school-level scores.

Increasing the number of tasks had a much less dramatic impact on reliability. This is

expected given that the variability attributable to differences among the PBTs was much smaller


than the variability amongst students. Increasing the number of PBTs from one to four increased

phi coefficients more than generalizability coefficients because of the differences in how the task

variability contributed to the calculation of error variance for the two coefficients. However, the

improvement in reliability from including more than 2 tasks was very modest. These results

suggest that few PBTs are needed to obtain reliable information about school-level performance

as long as the number of students included in school-level scores is sufficient.

[Figure omitted: line plot of reliability (y-axis, 0.4 to 1.0) against number of tasks (x-axis, 1 to 4), showing generalizability (GC) and phi coefficients for school sizes of 10, 25, 50, 75, and 100 students.]

Figure 1. Generalizability and Phi Coefficients for Math PBTs by School Size and Number of Tasks.


[Figure omitted: line plot of reliability (y-axis, 0.5 to 1.0) against number of tasks (x-axis, 1 to 4), showing generalizability (GC) and phi coefficients for school sizes of 10, 25, 50, 75, and 100 students.]

Figure 2. Generalizability and Phi Coefficients for Science PBTs by School Size and Number of Tasks.

Growth Analyses

The histograms in Figures 3 and 4 illustrate the school median growth percentile

distributions across the four measures (summative scale score, PBT raw score, composite score,

weighted composite score) for math and science respectively. One aspect of the SGP analysis of

PBT growth to note is that the restricted range of scores across the PBT assessments did not

supply enough score points to make use of the full distribution of student growth percentiles. For

example, the PBT math student growth percentiles included only 50 of the possible 99

percentiles, and the PBT science student growth percentiles included only 48. A result of this

range restriction is evident in the histograms for the math and science PBTs in Figures 3 and 4,

where the school median growth percentile distributions of PBTs do not approximate normality

as well as do those of the other measures.


Figure 3. Histograms depicting school median growth percentiles for math across the four simulated measures.


Figure 4. Histograms depicting school median growth percentiles for science across the four simulated measures.


The restriction of PBT raw score range is also evident in Figure 5, which presents

scatterplots of the PBT and summative test student growth percentiles. The vertical gaps in the

scatter plots represent the unassigned student growth percentiles. These plots suggest that

although SGP models are well suited to the raw score scales of PBTs, in practice some thought

should be given to the range of quantiles used. It is possible that less fine-grained SGP analysis

(e.g., by using deciles) may be adequate when modeling growth for instruments with few score

points.

Nevertheless, a comparison of Figures 5 and 6 indicates that, as expected, when a large

number of PBTs was randomly assigned, task‐to‐task variation that could be considerable at the

student level tended to average out at the school level. The correlation between the summative test and PBT student growth percentiles at the student level was .23 for math and .22 for science, as depicted in Figure 5. By contrast, the correlations between the summative test and PBT school

median growth percentiles increased to .67 for math and .64 for science as depicted in Figure 6.

Figure 5. Scatter plots depicting the relationship between the student growth percentiles for the math (left) and science (right) PBTs and summative tests.


Because differences in student-level variation tended to average out at the school level, it

was expected that inferences based on PBT growth would be similar to inferences based on

summative or composite measure growth. The school classification agreement rates based on the

median growth percentiles for math and science across the different measures are presented in

Tables 8 and 9 respectively.

Table 8. School Performance Agreement Rates Based on Median Growth Percentiles for Math

                            PBT Growth Compared to:
Agreement Type              Summative Test Growth    Weighted Composite Measure Growth    Composite Measure Growth
Exact                       38%                      42%                                  50%
Adjacent                    41%                      43%                                  40%
Exact + Adjacent            80%                      85%                                  91%
Within 2 Categories         15%                      12%                                   8%
Extreme Variation            5%                       3%                                   1%

Figure 6. Scatter plots depicting the relationship between the school median growth percentiles for the math (left) and science (right) PBTs and summative tests.


Table 9. School Performance Agreement Rates Based on Median Growth Percentiles for Science

                            PBT Growth Compared to:
Agreement Type              Summative Test Growth    Weighted Composite Measure Growth    Composite Measure Growth
Exact                       40%                      40%                                  44%
Adjacent                    41%                      43%                                  47%
Exact + Adjacent            81%                      83%                                  91%
Within 2 Categories         14%                      13%                                   8%
Extreme Variation            5%                       3%                                   1%

The exact agreement rates among median growth percentiles based on PBTs and the

other measures (e.g., schools rated an A for growth under the PBT and other measures) examined

in the study ranged from 38% to 50%. Therefore, classifications based on median growth

percentiles for PBTs would be likely to place schools into different performance categories than

classifications based on median growth percentiles for summative tests and composite measures.

However, there seemed to be a reasonable amount of consistency among the classifications. The

percentages of schools classified to either the same or within one category of each other (e.g.,

schools rated an A for growth under the PBT and a B for growth under the other measures)

ranged from 80% to 91% across conditions.

Furthermore, none of the comparisons demonstrated high rates of extreme variation in

school ratings (e.g., a school rated an A based on PBT growth being rated a D or F on other measures). Therefore, the results suggested that, in practice, inferences based on PBT growth would be similar to those based on growth using summative measures or measures that combine the

two types of scores.


Discussion

The results of the analyses help answer the two research questions of interest about the

proposed model for performance-based assessments that assumes randomly equivalent PBTs

selected from a large pool such that equating is not necessary.

First, results from the G theory analyses indicate very little variability in performance at

the school level due to PBTs or the interaction between schools and PBTs. Therefore, the results

imply that the impact of task-to-task variation on the reliability of school-level PBT scores is

small. The results also suggest that few PBTs were needed to obtain reliable information about

school-level performance. The primary source of error variance identified in the G study was

attributed to variability of student performance on the PBTs within schools. This finding is

consistent with previous research on the reliability of class means (Kane, Gillmore, & Crooks,

1976), and suggests that the best way to increase the reliability of school level PBT measurement

is to increase the number of students observed within schools. Reliability estimates were low for

samples of 25 students per school but reached acceptable levels for samples of 50 students or

more.

Second, in comparing the school classifications based on the four different growth

estimates (PBT raw score, summative scale score, composite score, and weighted composite

score), the study found that inferences based on PBT growth estimates would be similar in

practice to inferences based on growth estimates from the other three types of measures.

Therefore, the growth analysis results support the G-study finding that school-level measurement

using randomly equivalent PBTs appears to be a viable option given sufficient sample size per

school. This is an important test design consideration for the Common Core assessments, which

must produce student achievement data and student growth data for determinations of school,


principal, and teacher effectiveness. The results suggest that random assignment from large pools

of PBTs can represent a kind of domain sampling, and scores across classes and schools can

represent estimates of average domain scores.

Next-generation assessments will increasingly utilize technology to develop and deliver

PBTs, supported by new assessment design approaches such as ECD. By generating large pools

of PBTs according to task models and templates, the psychometric assumptions associated with

performance-based task scoring, reporting, and aggregation can be simplified. This paper serves

as an initial investigation into the use of a performance-based assessment model in which

equating is not required. The findings support the viability of such a model with the potential to

support next-generation performance-based assessments where scores are aggregated across

groups to make inferences about teacher or school performance.

It is important to recognize that the results of this study are specific to the conditions

evaluated. It should be noted that only one student cohort was evaluated (albeit across two

years) in this study and one content area was considered at a time. A more complex G study

design could incorporate additional cohorts and examine content area as a facet. In addition, the

study was limited in that PBTs were simulated from multiple-choice response data. Clearly,

further investigation of the model using real PBTs is needed. Finally, although results support the

reliability of randomly equivalent PBTs when used to measure performance at the aggregate

level, the results suggested relatively unreliable results at the student level using the PBTs.

Further research regarding the appropriate use of PBT measurement at the student level,

particularly within summative assessment systems, is warranted.


References

Betebenner, D. W. (2009). Norm- and criterion-referenced student growth. Educational Measurement: Issues and Practice, 28, 42-51.

Brennan, R. L. (2001a). Generalizability theory. New York: Springer-Verlag.

Brennan, R. L. (2001b). mGENOVA [Computer software and manual]. Iowa City, IA: Center for Advanced Studies in Measurement and Assessment, The University of Iowa. (Available on http://www.education.uiowa.edu/casma).

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measures: Theory of generalizability for scores and profiles. New York: Wiley.

Darling-Hammond, L., & Pecheone, R. (2010, March). Developing an internationally comparable balanced assessment system that supports high-quality learning. Paper presented at the National Conference on Next-Generation K-12 Assessment Systems. Available at: http://www.k12center.org/rsc/pdf/Darling-HammondPechoneSystemModel.pdf.

Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed.), pp. 105-146. New York: Macmillan.

Green, B.F. (1995). Comparability of scores from performance assessments. Educational Measurement: Issues and Practice, 14, 13-15.

Huff, K., Steinberg, L., & Matts, T. (2010). The promises and challenges of implementing evidence-centered design in large-scale assessment. Applied Measurement in Education, 23, 310-324.

Kane, M. T., & Brennan, R. L. (1977). The generalizability of class means. Review of Educational Research, 47, 267-292.

Kane, M. T., Gillmore, G. M., & Crooks, T. J. (1976). Student evaluations of teaching: The generalizability of class means. Journal of Educational Measurement, 13, 171-183.

Koenker, R. (2005). Quantile regression. New York, NY: Cambridge University Press.

Liu, M., & Haertel, G. (2011). Design patterns: A tool to support assessment task authoring (Draft Large-Scale Assessment Technical Report 11). Menlo Park, CA: SRI International.

Luecht, R., Burke, M., & Shu, Z. (2010, April). Controlling difficulty and security for complex computerized performance exercises using Assessment Engineering. Paper presented at the annual meeting of the National Council on Measurement in Education, Denver, CO.


Mislevy, R., & Haertel, G. (2006). Implications of evidence-centered design for educational testing (Draft PADI Technical Report 17). Menlo Park, CA: SRI International.

U. S. Department of Education. Overview information: Race to the Top Fund Assessment Program; Notice inviting applications for new awards for fiscal year (FY) 2010. 75 Federal Register, 18171-18185. (April 9, 2010).

Wise, L.L. (2011, February). Picking up the pieces: Aggregating results from through-course assessments. Paper presented at the Invitational Research Symposium on Through-Course Assessments. Available at: http://www.k12center.org/rsc/pdf/TCSA_Symposium_Final_Paper_Wise.pdf.