



CONSORTIUM FOR POLICY RESEARCH IN EDUCATION
University of Pennsylvania • Harvard University • Stanford University • University of Michigan • University of Wisconsin-Madison

Wisconsin Center for Education Research, University of Wisconsin-Madison 1025 West Johnson Street, Room 653, Madison, WI, 53706-1796 ■ Phone 608.263.4260 ■ Fax 608.263.6448

The Relationship Between Standards-Based Teacher Evaluation Scores and Student Achievement: Replication and Extensions at Three Sites

Anthony T. Milanowski
Consortium for Policy Research in Education
University of Wisconsin-Madison
Madison, WI 53706
(608) 262-9872
[email protected]

Steven M. Kimball
Consortium for Policy Research in Education
University of Wisconsin-Madison
Madison, WI 53706
(608) 265-6201
[email protected]

Brad White
Consortium for Policy Research in Education
University of Wisconsin-Madison
Madison, WI 53706

March 2004

CPRE-UW Working Paper Series TC-04-01

This paper was prepared for the Consortium for Policy Research in Education, Wisconsin Center for Education Research, University of Wisconsin-Madison for presentation at the American Educational Research Association annual meeting held April 12-16, 2004 in San Diego, California. The research reported in this paper was supported by a grant from the U.S. Department of Education, Office of Educational Research and Improvement, National Institute on Educational Governance, Finance, Policymaking and Management, to the Consortium for Policy Research in Education (CPRE) and the Wisconsin Center for Education Research, School of Education, University of Wisconsin-Madison (Grant No. OERI-R308A60003). The opinions expressed are those of the authors and do not necessarily reflect the view of the National Institute on Educational Governance, Finance, Policymaking and Management, Office of Educational Research and Improvement, U.S. Department of Education, the institutional partners of CPRE, or the Wisconsin Center for Education Research.


Standards-based teacher evaluation represents one strategy for both improving instruction and

complying with the expectations of external stakeholders that teachers be held accountable for their

performance. Consistent with the movement for standards for students, this approach starts with a

comprehensive model or description of what teachers should know and be able to do, represented by

explicit standards covering multiple domains and including multiple levels of performance defined by

detailed behavioral rating scales. It typically requires more intensive collection of evidence, including

frequent observations of classroom practice and use of artifacts such as lesson plans and samples of

student work, in order to provide a richer picture of teacher performance. Besides the movement toward

standards for students, the roots of standards-based evaluation also include the desire to represent a more

complex conception of teaching and learning for teacher licensing and certification (Porter, Youngs, and

Odden, 2001) and the need for a comprehensive practice model to guide new teacher induction and

mentoring. Dissatisfaction with evaluation approaches that provide little guidance for teachers’ efforts to

improve practice (Moore Johnson, 1990; Stiggins and Duke, 1988) has also been an influence. One

prominent embodiment of the standards-based evaluation concept is the Framework for Teaching

(Danielson, 1996; Danielson and McGreal, 2000).

We have argued elsewhere that standards-based teacher evaluation systems constitute a

performance competency model with the potential to improve instruction by affecting teacher selection

and retention, motivating teachers to improve their skills, and promoting a shared conception of good

teaching (Milanowski and Kimball, 2003; Kimball, Milanowski, and Heneman, 2003). In essence,

standards-based teacher evaluation systems provide both incentives and guidance for teachers to change

their practice toward the model embodied in the standards. But the potential of standards-based teacher

evaluation for improving student achievement depends on the link between practices described by the

standards and student learning. Unless teaching according to the standards leads to more student learning,

implementing a standards-based evaluation system will not contribute to improved student achievement.

One type of evidence that would support the case that standards-based evaluation can lead to more student


learning is a significant empirical relationship between teaching according to the standards (as measured

by the teacher evaluation scores) and value-added measures of student achievement.

As school organizations move to standards-based evaluation systems, they should also be

interested in the reliability and validity of the evaluation scores produced, especially when these scores

are used for decisions with consequences for teachers, such as termination, tenure, and pay for

performance. One aspect of validity is the relationship between teacher performance as measured by the

evaluation system, and student learning: whether students of teachers whose performance has been rated

higher learn more. To the extent that teacher evaluation scores are empirically related to measures of

student achievement, an organization using such scores for consequential decisions has criterion-related

or empirical validity evidence that this use is justified.

From a research perspective, Odden, Borman, and Fermanich (2004) have argued that standards-

based teacher evaluation scores might be useful in research on teacher effects on student learning.

Teachers’ scores from well-designed, practice-based teacher evaluation systems could be considered

measures of instructional practice that can be used in studies that try to identify the effects of

communities, schools, and teachers on student learning (Odden, Borman, and Fermanich, 2004). But

given the poor reputation of teacher evaluation for validity (Peterson, 2000), the relationship between

standards-based teacher evaluation scores and measures of student achievement needs to be demonstrated

before using these scores in research on teacher effects or teacher quality.

We have been studying standards-based teacher evaluation in three school organizations using a

standards-based evaluation system: the Cincinnati (Ohio) Public Schools, the Vaughn Next Century

Learning Center (a charter school in Los Angeles, California) and the Washoe County (Nevada) School

District. The Framework for Teaching served as the foundation for two of the three evaluation systems we

have been studying and substantially influenced the third. Besides looking at implementation issues and

teacher reactions to the systems, we have assessed the relationship between teachers’ evaluation scores

and student achievement, as measured by value-added methods. Our initial results were reported at last

year’s AERA meeting and in articles by Gallagher (2004), Kimball, White, Milanowski, and Borman


(2004) and Milanowski (2004). These studies used a value-added approach in which elementary and

middle school students’ test scores in the year the teacher was evaluated were modeled as a function of

prior year test scores and student characteristics such as gender, ethnicity, participation in free/reduced

price lunch programs, special education status and limited English proficiency. Two-level random

intercept hierarchical linear models were used to estimate the relationship of test scores to evaluation

scores, represented as a level 2 variable. Other models with no predictors at level 2 were used to estimate

empirical Bayes intercept residuals (representing average classroom student achievement, controlling for

prior year test scores and student characteristics) that were then correlated with evaluation scores.

Table 1 summarizes the results from our first set of analyses of the teacher evaluation score-

student achievement relationship in the form of average correlations between evaluation scores and

empirical Bayes intercept residuals that represent the average relative level of measured student

achievement in the classroom, controlling for prior learning and student characteristics.

Table 1
Average Correlations Between Teacher Evaluation Scores and Estimates of Average Student Achievement Within Classrooms for Three Research Sites

                         Tested Subject
Site          Reading     Math     Other
Cincinnati      .32        .43      .27 (Science)
Vaughn          .50        .21      .18 (Language Arts)
Washoe          .21        .19       -

These results suggest that scores from standards-based teacher evaluation systems can have a

substantial relationship with measures of the student achievement of the teachers’ students. However, the

size of the relationship varied across research sites, across grades within sites, and across academic

subjects. Some of this variation is likely attributable to differences in programs across sites and across

evaluators for different grade levels or subjects (especially in Cincinnati, where evaluators from outside

each school were assigned to evaluate specific teachers based on the evaluator’s grade level and subject

experience). Measurement error in the tests of student achievement and the evaluation scores might also

explain some of the variability, and a considerable portion could be due to sampling error, since the


number of teachers included in each site, subject, and grade analysis was not that large. The largest

samples were found in Washoe, where data for about 120 teachers per grade were available, but samples

in Cincinnati and Vaughn were much smaller, at 20-40 per grade and subject. As Hunter and Schmidt

(1990) argue, much of the variation among studies of this type can be due to sampling error when samples

are this small. Thus, though the results we obtained from the first wave of data collected were promising,

they are in need of replication.

This paper reports on the results of analysis of an additional year of evaluation and student

achievement data at these research sites. It addresses three research questions:

1) What are the relationships between student achievement and teacher evaluation scores in a second set of data, and how do these relationships compare with our previously-reported results?

2) How much does the relationship between teacher evaluation scores and student achievement change when removing controls for student characteristics (other than pre-test)?

3) What is the relationship of teacher evaluation scores to teacher experience, and does controlling for experience change the relationship between evaluation scores and student achievement?

The first question addresses the issue of replication or stability of the teacher evaluation – student

achievement relationship. Knowing whether this relationship is consistent from year to year is important

in understanding the construct validity of the evaluation scores as well as allowing users of these

standards-based systems to assess whether the criterion-related validity of the evaluation scores is limited

to specific teachers, students, and years.

The second question is motivated by concerns raised by some reviewers of our initial studies.

While most researchers looking for teacher effects on student achievement appear to believe that

controlling for student characteristics (e.g. ethnicity and socio-economic status) that may influence test

scores provides a more accurate way to assess teacher effects, it has also been argued that these

characteristics should not be considered in the assessment of teacher performance because teachers should

produce similar learning gains for all groups of students. It has also been argued that controlling for

factors like ethnicity sends the message that lower levels of achievement are expected for some students.


To assess the degree to which controlling for student characteristics affects the relationship between

evaluation scores and student achievement, an analysis was also conducted which did not include the

gender, ethnic, special education, and free/reduced lunch controls.

The third question is of interest because it could be that the teacher evaluation scores are

measuring little more than teacher experience, which in turn is related to student achievement (though, it

must be added, the research record is not consistent in linking experience and student achievement). If

teacher evaluation scores do not add to the prediction of student achievement over and above the level of

experience, then the value of investing resources in standards-based evaluation systems for use in

differentiating teacher pay is questionable. The relationship is also of interest because it bears on the

construct validity of the teacher evaluation scores. If teaching performance increases with experience in

the first few years of teaching, after which the relationship weakens (see King Rice, 2003 for a summary

of research on experience and student achievement), we would expect to see this pattern in the

relationship between evaluation scores and teacher experience at these sites.

Site Background

Cincinnati Public Schools (CPS) is a large urban district with 70 schools and programs enrolling

about 48,000 students, and employing over 3,000 teachers. It has low student achievement relative to

surrounding suburban districts, and a high proportion of African-American students and students eligible

for free or reduced-price lunch. State accountability programs and public expectations put pressure on the

district to raise average levels of student test scores. The district developed its teacher evaluation system

based on the Framework for Teaching (Danielson, 1996), utilizing Danielson’s four domains: planning

and preparation, creating an environment for learning, teaching for learning, and professionalism.

However, CPS reduced the number of standards from 24 to 16. Teachers receive comprehensive

evaluations in their first and third years, and every fifth year thereafter. The comprehensive evaluation

involves an assessment by a teacher from outside the school as well as a building administrator. In

interim years, teachers receive a less rigorous assessment from a building administrator (principal or


assistant principal). The evaluation system was initially designed to serve as the basis for a performance

pay plan as well as for formative and professional development purposes. For the period under study,

however, the evaluation system has not been used for pay purposes, and has been applied primarily to less

experienced teachers.

Vaughn Next Century Learning Center is a public charter school in San Fernando, California.

Previously a public school in the Los Angeles Unified School District, the school converted to charter

status in July of 1993. During the period of our study, the school served about 1,200 students in pre-K

through grade 5. The student population is 94% Hispanic, and most students are not considered to be

English proficient. Almost all students are eligible for free or reduced-price lunch. Vaughn has more

than 70 staff, of which about 40 are K-5 teachers. Vaughn began developing its evaluation system during

the 1997-98 school year and implemented it for volunteers in the 1998-99 school year. The school began to

use the system for all teachers the next year. Teachers are evaluated yearly, and the evaluation results are

used as the basis for a pay for performance system, as well as developmental and accountability purposes.

Washoe County School District is a large western district that includes the cities of Reno and

Sparks, Nevada and surrounding communities. There are 88 schools, over 60,000 students and about

3,300 teachers in the district. The district has been using a standards-based teacher evaluation system

adapted from the Framework for Teaching since 2000. According to state law, teachers must undergo a

performance evaluation each year. Teachers in Washoe County are evaluated on different

performance domains depending on their stage in the evaluation cycle. Teachers in their first or second

year of probation (pre-tenure) are evaluated on all four performance domains. Post-probationary teachers

are evaluated on one or two domains, depending on whether they are in a “minor” or “major” evaluation

year. The evaluation process was designed to provide a common framework for evaluation discussions

and promote teacher reflection on practice. Evaluation decisions have no direct bearing on salary, but do

serve as the basis for summative evaluation decisions, such as contract renewal and tenure.


Further information about these sites and references to additional descriptive material can be

found in articles by Gallagher (2004), Kimball (2002), Kimball, White, Milanowski, and Borman (2004),

Milanowski (2004), and Milanowski and Heneman (2001).

Method

Analyses

All of the analyses were based on the value-added paradigm, using two-level hierarchical linear

models in which individual student achievement on a subject test is represented as a function of the prior

year’s test score in that subject, and a variety of student-level control variables intended to represent

factors associated with test performance but that are not in teachers’ control, such as ethnicity and English

proficiency. At level two (the classroom level), one variation of the analysis incorporated teacher

evaluation score as a predictor of the random intercepts for each classroom, representing the average level

of student achievement, controlling for prior year test score and student demographic characteristics.

Another variation used a random intercept model to obtain empirical Bayes (EB) intercept residuals,

representing the average student performance in each teacher’s classroom, and then correlated these

residuals with teacher evaluation scores. For each approach, the basic level 1 model was:

Post-test = β0 + β1(pretest) + β2X2 + … + βnXn + R

where X2 … Xn represent various student characteristics such as gender, ethnicity, or free and reduced

price lunch status. All level 1 predictors were grand-mean centered. It should be noted that different

control variables were available and appropriate at each site. Information on the specific models used for

each site is shown in the Appendices. The first set of analyses, used to calculate the correlations

between the EB intercept residuals and teacher evaluation scores, used a simple level 2 specification:

β0j = γ00 + u0j


At level two, the u0j represented the teacher-specific differences from the average of the group intercepts.

The EB residuals from this model were used as the measure of the average student performance relevant

to each teacher (i.e. classroom average achievement). Given the grand mean centering, the EB intercept

residuals represent the difference for the “average” student: average in prior year test score and other

characteristics at level 1. The slopes for all Level 1 variables were treated as fixed. For two sites,

Cincinnati and Washoe, the analyses were done by grade and subject.
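To make this estimation pipeline concrete, the sketch below fits a random intercept model and extracts the classroom intercept residuals. It is a minimal illustration rather than the study's actual code: it assumes Python with statsmodels (whose mixed-model random effects are shrunken BLUP estimates, analogous to HLM's empirical Bayes residuals) and hypothetical column names (post, pre, female, frl, teacher_id, eval_score).

```python
import pandas as pd
import statsmodels.formula.api as smf

# df: one row per student; teacher_df: one row per teacher with eval_score.
# Grand-mean center all level 1 predictors, as described above.
for col in ["pre", "female", "frl"]:
    df[col + "_c"] = df[col] - df[col].mean()

# Level 1: post = b0j + b1*pre_c + ... + R.  Level 2: b0j = g00 + u0j.
m0 = smf.mixedlm("post ~ pre_c + female_c + frl_c",
                 data=df, groups=df["teacher_id"]).fit()

# Shrunken intercept residuals u0j, one per classroom ('Group' is the
# statsmodels label for the random intercept).
eb = pd.Series({g: re["Group"] for g, re in m0.random_effects.items()},
               name="eb_intercept")

# Correlate classroom residuals with teacher evaluation scores.
merged = teacher_df.set_index("teacher_id").join(eb)
print(merged["eval_score"].corr(merged["eb_intercept"]))
```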

The empirical Bayes intercept residuals were then correlated with teacher evaluation scores for

those teachers for whom evaluation scores were available. Correlations were then combined across grades

in order to obtain a summary estimate of the relationship between evaluation scores and student

achievement. By analogy with meta-analysis, each grade within a subject was treated as a separate study

and the correlations combined using the standard formulas for a random effects treatment.1 Upper and

lower bounds for the 95% confidence intervals were also calculated.

1 An r to z transformation was done and a weighted average of the z's was calculated with the inverse of the variances as weights. Standard errors and 95% confidence intervals were calculated for this average. These values were then transformed back into correlation coefficients. See Shadish and Haddock (1994) for a description of the details of these calculations.
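A sketch of this combination step, again in Python. The exact formulas follow Shadish and Haddock (1994), as noted in the footnote; the version below uses the Fisher r-to-z transformation with inverse-variance weights and the common DerSimonian-Laird estimate of between-grade variance as one plausible random effects treatment, so it should be read as an approximation of the paper's procedure rather than a transcription of it.

```python
import numpy as np

def combine_correlations(rs, ns):
    """Random effects combination of per-grade correlations rs, sample sizes ns."""
    z = np.arctanh(np.asarray(rs, dtype=float))    # r -> z
    v = 1.0 / (np.asarray(ns, dtype=float) - 3.0)  # large-sample variance of z
    w = 1.0 / v
    z_fixed = np.sum(w * z) / np.sum(w)
    # DerSimonian-Laird estimate of the between-grade variance tau^2.
    q = np.sum(w * (z - z_fixed) ** 2)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(z) - 1)) / c)
    w_re = 1.0 / (v + tau2)                        # random effects weights
    z_bar = np.sum(w_re * z) / np.sum(w_re)
    se = (1.0 / np.sum(w_re)) ** 0.5
    lo, hi = z_bar - 1.96 * se, z_bar + 1.96 * se
    return np.tanh(z_bar), (np.tanh(lo), np.tanh(hi))  # back to the r scale

# For example, the Cincinnati reading correlations by grade (Appendix Table 3):
r_bar, ci = combine_correlations([.35, .07, .23, .35, .42, .27],
                                 [35, 24, 24, 24, 15, 12])
```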

Another approach used a more complex level 2 specification to obtain estimates of the effect of

rated teacher performance and experience differences on test scores. These analyses included the

evaluation score and a measure of teacher experience, as level two predictors of the random intercepts

representing classroom average student achievement:

β0j = γ00 + γ01(evaluation score) + γ02(experience measure) + u0j

Here, the coefficients for evaluation score and experience estimate the effect in test score points of

variations in rated teacher performance and teacher experience. Again, the slopes for all Level 1 variables

were treated as fixed.
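In software terms, a level 2 predictor in a random intercept model is simply a fixed effect that is constant within each classroom, so the specification above can be fit by adding the teacher-level variables to the formula. A hedged sketch, reusing the hypothetical columns from the earlier example and assuming eval_score and experience have been merged onto each student row:

```python
# eval_score and experience vary between, not within, classrooms, so their
# fixed-effect coefficients estimate gamma_01 and gamma_02 at level 2.
m1 = smf.mixedlm("post ~ pre_c + female_c + frl_c + eval_score + experience",
                 data=df, groups=df["teacher_id"]).fit()

print(m1.params["eval_score"])   # test-score points per evaluation-score point
print(m1.params["experience"])   # test-score points per unit of experience
```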

These models were used to maximize comparability to the results we obtained in our previous

studies. Last year, though we investigated the variance in slopes for some level one predictors, especially



prior test score, there was no consistent indication of reliable slope variance across grades and subjects.

So for this round of analyses, we continued to use fixed slope models, though we did investigate some

models with random slopes. The results of these analyses are discussed below and in the paper by

Borman and Kimball.

Measures

Teacher Evaluation Scores

In Cincinnati, teachers undergoing the comprehensive evaluation received a score on each one of

four domains: planning and preparation, creating an environment for learning, teaching for learning, and

professionalism, based on scores on the dimensions, called standards, within each domain. For each

standard, raters considered a four-level rating scale or ‘rubric’ defining unsatisfactory, basic, proficient,

and distinguished performance. For two domains (environment for learning and teaching for learning)

teachers’ performance was evaluated based on six classroom observations. Four of these were made by a

teacher evaluator from outside the school who had subject matter and grade level expertise similar to that

of the teacher being evaluated. Building administrators (principals and assistant principals) did the other

two observations. Based on summaries of the six observations, teacher evaluators made a final summative

rating on each of the standards in these domains. Administrators, based on a portfolio including lesson

and unit plans, attendance records, student work, family contact logs, and documentation of professional

development activities, rated teachers on the standards in the planning and professionalism domains.

Standard-level scores were then aggregated to a domain level score for each of the four domains using

tables provided by the district. As mentioned above, the standards and rubrics were adapted by the district

from Danielson’s Framework for Teaching. The scores on the four domains were added to yield a

composite evaluation score to represent overall teacher performance. The average intercorrelation

between domain scores for all the teachers evaluated in 2002-03 was .56, and coefficient alpha was .84.
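These two reliability figures can be cross-checked against each other: for k scores with average intercorrelation r, the standardized form of coefficient alpha is kr / (1 + (k-1)r). A quick check in Python, using the Cincinnati figures here and the Washoe figures reported later in this section:

```python
def standardized_alpha(k, r_bar):
    """Standardized coefficient alpha from the mean intercorrelation of k scores."""
    return k * r_bar / (1 + (k - 1) * r_bar)

print(round(standardized_alpha(4, 0.56), 2))  # 0.84: Cincinnati's four domains
print(round(standardized_alpha(4, 0.72), 2))  # 0.91: Washoe's four dimensions
```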



Because CPS only evaluates a subset of teachers each year, complete evaluation scores were

available for 318 teachers for the 2002-03 school year. But because most of the teachers evaluated

taught subjects or grades for which no state or district standardized tests were given, and because teachers

with fewer than 3 students tested were excluded, evaluation scores for only 131 unique teachers were

included in the analysis. (Some teachers appear in two or more subject/grade analyses.)

At Vaughn, the teacher evaluation system included 12 domains: lesson planning and classroom

management, plus subject-specific domains covering literacy, language development, mathematics,

special education, history and social science, science, instruction in primary language for English

learners, arts, technology, physical education, and teaming. Vaughn teachers are assessed two times per

year using a four-level rating scale with unsatisfactory, basic, proficient, and distinguished levels. An

administrator, a peer, and the teacher her/himself rate performance on the applicable domains based on classroom

observations, discussions, and review of artifacts. The average for the two semesters was used as the

evaluation score for each domain. Since not all the domains apply to all teachers, our analyses have

concentrated on the five that are applied to all: lesson planning, classroom management, literacy,

language development, and mathematics. For the main analyses, we assessed the relationship of student

reading achievement to the literacy teacher evaluation scores, of math achievement to the math evaluation

scores, and language arts achievement to language development evaluation scores. This was done

because it was expected that a subject-specific evaluation would be more strongly related to student

achievement than scores from other subject-specific evaluations or from the more generic planning and

classroom management domains, and to replicate Gallagher’s (2004) analysis.

In Washoe, the four Framework domains of Planning and Preparation, Classroom Environment,

Instruction, and Professional Responsibilities are the basis for the system. Each domain contains multiple

elements which are rated by principals or assistant principals, using rubrics closely based on those in

Danielson’s 1996 book. Evidence may include a teacher self-assessment, a pre-observation data sheet

(lesson plan), classroom and non-classroom observations with pre- and post-observation conferences,

instructional artifacts (e.g., assignments and student work), a reflection form, a three-week unit plan, and


logs of professional activities and parent contacts. The system provides for three types of evaluation:

probationary, post-probationary major, and post-probationary minor. Teachers new to the district are

considered probationary and are evaluated on all four of the performance domains, where they must meet

at least level 1 (target for growth) scores on all 68 elements. Probationary teachers are observed at least

nine times over three periods of the year. Teachers in post-probationary status undergo a ‘major

evaluation’ on two performance domains. They are formally observed three times over the course of the

year. In the next two years, they receive “minor” evaluations, focusing on one domain and involving at

least one observation during the year. Over the course of the three year major-minor cycle, teachers are

evaluated on all four domains, but most are not evaluated on all domains each year. However, if a teacher

is not evaluated on the Instruction Domain, they are evaluated using a supplemental evaluation form with

four dimensions consisting of selected components and elements from the Planning and Preparation and

Instruction domains. Evaluators rate these dimensions using four performance designations (i.e.,

unsatisfactory, target for growth, proficient, and area of strength). Because these dimension ratings were

available for almost all teachers, they were used as the measure of teacher performance in our analyses.

The scores on the four performance dimensions were averaged to derive a single indicator of teacher

performance. The average correlation among the dimensions was .72, and coefficient alpha was .91.

Teacher Experience

At each site, information on teacher experience was also collected. In Cincinnati and Washoe,

teachers’ position on the salary schedule (their ‘step’) was used as an indicator of experience. This

measure was chosen because a count of years employed by the district would not include teaching

experience in other districts. Because both sites gave teachers credit on the pay schedule for prior

experience, the step is a better indicator of total experience for many teachers than years of service with

the district. However, it should be noted that once teachers ‘top out’ at the highest step, the step

no longer represents their relative experience. This was less of a problem in Cincinnati than in Washoe,

because relatively few highly senior teachers in Cincinnati were evaluated during 2002-03, and the


Cincinnati schedule had more steps. At Vaughn, the only measure of experience available was the

teacher’s years of service with the school. Since some teachers were hired with prior experience, the years

of school service measure is not always an accurate reflection of total professional experience.

Student Achievement

For each site, student achievement was represented by test scores on standardized, largely

closed-response instruments. In Cincinnati, we used scores on state criterion-referenced tests given in grades 4

and 6, and district ‘off-year’ tests developed by test publishers or the TerraNova, administered in grades

2, 3, 5, 6, and 7. For Vaughn, the SAT-9 was used for all grades and both years. In Washoe, scores on

state criterion-referenced tests from grades 3 and 5, district-developed criterion-referenced tests for grade

6, and the TerraNova test given in grade 4, were used. The tests used for each site and grade are listed in

Appendix Table 1. These data were provided by the districts or the test publishers.

Other Student Characteristics

Data on student characteristics, typically gender, ethnicity, English proficiency, special education

status, and free and reduced lunch were provided to us by the sites.

Results

Replication of Previous Analyses

This section summarizes the results of the analyses of the teacher evaluation score – student

achievement relationship we have done on a second year of data from our Cincinnati, Vaughn, and

Washoe sites. Table 2 reports the average correlations within each site between the total teacher

evaluation scores and the empirical Bayes intercept residuals from the models in which controls for

student characteristics were included. These results are the most directly comparable to those from our first round

of analyses, as shown in Table 1 above.

Table 2
Correlations Between Teacher Evaluation Scores and Estimates of Average Student Achievement Based on Empirical Bayes Intercept Residuals from Models with Controls for Student Characteristics

                         Tested Subject
Site          Reading     Math     Other
Cincinnati      .28        .34     -.02* (Science)
Vaughn          .61        .45      .38 (Language Arts)
Washoe          .25        .24       -
* Confidence interval includes 0

More detailed results, including results by grade, are reported in Appendix Table 2.

Except for science in Cincinnati, the correlations are all positive and of roughly similar

magnitude to those from our first round of analysis (see Table 1 above). Notable differences include a

much lower correlation for science in Cincinnati, and higher correlations for mathematics and language

arts for Vaughn.

In order to provide an idea of the potential importance of having a more highly-rated teacher, we

also calculated the average estimated change in student achievement associated with a one level change in

teacher evaluation score. Specifically, we calculated the number of standard deviations in test score that

were associated with a change in teacher ratings of one overall level (i.e. from Basic to Proficient or

Proficient to Distinguished on all domains). Table 3 presents these results. This provides a type of effect

size measure that allows comparisons across grades, subjects, and sites.

Table 3
Effect of a One Level Change in Teacher Evaluation Score on Student Achievement (in Standard Deviation Units)

                         Tested Subject
Site          Reading     Math     Other
Cincinnati      .14        .18     -.01 (Science)
Vaughn          .25        .37      .21 (Language Arts)
Washoe          .14        .19       -

These effects, though small to moderate in size, could add up to a substantial advantage for a student with

two or three consecutive teachers rated at the ‘distinguished’ rather than ‘proficient’ level, or the

‘proficient’ rather than ‘basic’ level. As we found last year, there are considerable differences across

grades and subjects within each site (see the Appendices for these details).
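The effect sizes in Table 3 are a rescaling of the level 2 coefficient. As a hedged illustration with placeholder numbers (not the study's estimates): in Cincinnati the composite score is the sum of four domain scores, so moving up one level on every domain raises the composite by four points, and the effect size divides the implied test-score change by the standard deviation of the test.

```python
def effect_size_per_level(gamma_eval, composite_points_per_level, sd_test):
    """SD units of achievement per one-level change in the overall rating."""
    return gamma_eval * composite_points_per_level / sd_test

# Hypothetical values: gamma_01 = 1.2 test points per composite point, a
# composite that moves 4 points per overall level, and a test SD of 30.
print(effect_size_per_level(1.2, 4, 30.0))  # 0.16 SD per level
```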

As mentioned in the Methods section, we also experimented with models that allowed for a

random slope across classrooms for the effect of prior year score. Unlike our first round of analyses, this


year there was consistent evidence for significant slope variation at both Cincinnati and Vaughn.

Generally, however, correlations between EB intercept residuals from these models and teacher

evaluation scores are similar to those based on the random intercept models, and teacher evaluation scores

have little consistent correlation with EB slope residuals. It remains to be explored what variables might

be associated with slope variations across classrooms at these sites. The companion paper by Borman and

Kimball explores the issue of random slopes in Washoe.
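For reference, a random slope on prior year score can be added to the earlier random intercept sketch with one extra argument. This is an illustration of the model class under the same hypothetical column names, not our estimation code:

```python
# Random intercept plus a classroom-specific slope on centered prior score.
m2 = smf.mixedlm("post ~ pre_c + female_c + frl_c",
                 data=df, groups=df["teacher_id"],
                 re_formula="~pre_c").fit()

# Each classroom now has an intercept residual ('Group') and a slope
# residual ('pre_c') that can be examined separately.
slopes = pd.Series({g: re["pre_c"] for g, re in m2.random_effects.items()})
```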

Results from Models without Controls for Student Demographic Characteristics

To assess whether controlling for student characteristics affects the relationship between evaluation scores

and student achievement, an analysis was also conducted which did not include the gender, ethnic, special

education, and free/reduced lunch controls at level 1. Table 4 reports the average correlations within each

site between the empirical Bayes intercept residuals from the models that did not control for these student

demographic characteristics, and the total teacher evaluation scores.

Table 4
Correlations Between Teacher Evaluation Scores and Estimates of Average Student Achievement Based on Empirical Bayes Intercept Residuals from Models without Controls for Student Characteristics

                         Tested Subject
Site          Reading     Math     Other
Cincinnati      .27        .31      .02* (Science)
Vaughn          .60        .43      .32 (Language Arts)
Washoe          .26        .25       -
* Confidence interval includes 0

Again, we also calculated the average estimated change in student achievement (number of

standard deviations) associated with a one level change in teacher evaluation score. Table 5 reports these

estimates.

Table 5
Effect of a One Level Change in Teacher Evaluation Score on Student Achievement (in Standard Deviation Units)

                         Tested Subject
Site          Reading     Math     Other
Cincinnati      .16        .20      .02 (Science)
Vaughn          .25        .37      .21 (Language Arts)
Washoe          .16        .21       -

Comparing the results reported in Tables 4 and 5 with those reported for the models with student-level

demographic controls shows that there is little difference between the correlations or achievement effects

estimated with and without these controls. It is likely that most of the effects of factors such as socio-

economic status are highly correlated with prior year test scores, so that controlling for these scores

eliminates much of the effect of the demographic characteristics on current year test scores.

Relationship Between Evaluation Scores, Teacher Experience, and Student Achievement

Our first analysis relevant to the question of the relationship of evaluation scores to teacher

experience was simply to correlate the scores with our (admittedly imperfect) measure of teacher

experience. Table 6 reports the correlation coefficients.

Table 6
Correlation of Teacher Evaluation Scores with Teacher Experience

Site                              Correlation
Cincinnati                           .11
Vaughn (average of 5 domains)        .33
Washoe                               .39

Note that in Cincinnati and Vaughn most of the teachers who were evaluated were also low in experience.

In Cincinnati, this is by design; most senior teachers were exempted from comprehensive evaluation after

the 2000-01 school year. At Vaughn, relatively high turnover of senior teachers a few years ago has

required hiring many less experienced teachers. At all sites, there are relatively few experienced teachers

with low scores, which would be expected if low performers tend to leave the organization or are

terminated in the first few years. In general, plots of evaluation scores versus experience show a pattern in

which evaluation scores increase in the first three to five years then level off. This is what one would


expect if new teachers need a few years of experience to develop proficient performance and if those that

do not do so leave.

We also examined the correlation between experience and student achievement, as measured by

the EB residuals from the random intercept models with demographic controls. Table 7 shows the

correlations.

Table 7
Correlations Between Teacher Experience and Estimates of Average Student Achievement Based on Empirical Bayes Intercept Residuals from Models with Controls for Student Characteristics

                         Tested Subject
Site          Reading     Math     Other
Cincinnati     -.19       -.20     -.02 (Science)
Vaughn          .20        .32      .23 (Language Arts)
Washoe          .15        .16       -

These correlations are lower than the evaluation score – EB residual correlations reported in Table 2. This

suggests that experience, at least as measured here, is not as good a predictor of student performance as

the teacher evaluation scores. Surprisingly, for most grades, the correlations were negative in Cincinnati,

and the combined correlations are also negative. A possible explanation for this result is greater leniency

in the evaluation of highly-senior teachers. Because most senior teachers were exempt from

comprehensive evaluation, those who were evaluated had either been identified as having performance

problems by school administrators, or were seeking status as lead teachers. Both situations required a

comprehensive evaluation. In either case, evaluators may have been more lenient with these teachers due

to the stakes attached.

We also tried to assess the impact of teacher experience by including it along with the teacher

evaluation score as a predictor of classroom intercepts at level 2 of our models. We then calculated the

change in student achievement associated with a one level increase in teacher evaluation score,

controlling for teacher experience. Table 8 below reports the number of standard deviations in test scores

that are associated with a change in one level on all the domains (Cincinnati and Washoe) or the subject


specific domain (Vaughn) of the evaluation systems. These are compared with the effects estimated for

evaluation score alone, as reported in Table 3 above.

Table 8
Effect on Student Achievement of a One Level Increase in Teacher Evaluation Score, with and without Controlling for Teacher Experience (Weighted Average across Grades)

                     Cincinnati               Vaughn                 Washoe
                 TES     Controlling     TES     Controlling    TES     Controlling
                 Alone   for Experience  Alone   for Experience Alone   for Experience
Reading          .14     .16             .23     .23            .14     .12
Math             .18     .22             .22     .16            .19     .17
Science         -.01    -.00              -       -              -       -
Language Arts     -       -              .21     .18             -       -

As the table shows, the differences in effect are small. The estimated effect of teacher evaluation score

declines in Washoe and at Vaughn, but increases in Cincinnati, as might be expected given the positive

correlations between student achievement and experience at the former two sites and the possible leniency

toward senior teachers in Cincinnati. These results suggest that teacher evaluation scores are measuring

something besides teacher experience, and that scores from standards-based teacher evaluation systems

can explain variance in teachers’ average student achievement above and beyond that explained by

teacher experience.

Discussion

The results reported above show that the scores produced by these standards-based teacher

evaluation systems have a substantial positive relationship with the achievement of the evaluated

teachers’ students. The results are comparable to those obtained from similar analyses we presented at

last year’s AERA meeting. Though there were some notable differences (i.e. higher correlations between

evaluation scores and student achievement in mathematics and language arts at Vaughn, and a near zero

correlation for science achievement in Cincinnati) we consider the results to represent a constructive

replication of our earlier results. So with respect to our first research question, there is the expected


relationship between student achievement and teacher evaluation scores, and this relationship is largely

similar to that found in our previous study. Together, these studies suggest that evaluation scores from

well-designed and implemented standards-based teacher evaluation systems can be used for decisions

about teachers, and have potential for use in studies of teacher effects on student achievement.

It should be noted, however, that the strength of the relationship again differs this

year across subjects, and across grades within the Cincinnati and Washoe sites. For Cincinnati especially,

the variation in the strength of the evaluation score-student achievement relationships across subjects and

grades is considerable. Though sampling and measurement error are undoubtedly responsible for some of

this, the relationship for science is again the weakest. Last year, the correlations for fourth grade reading

were also small. In Washoe, no similar pattern has yet emerged.

Our second research question was whether not controlling for student characteristics associated

with test scores such as ethnicity, special education status, and English proficiency changes the

relationship between teacher evaluation scores and student achievement. Our results showed only small

differences in the evaluation score – student achievement relationships. While the correlations and effect

estimates were typically slightly lower without these controls, in a few cases they were slightly higher.

Our findings last year were largely similar. At least at these sites, it does not appear to make a big

difference overall whether these student characteristics are included as controls or not, probably because

most of their effects are already included in the prior year test scores used as controls. This is not to say

that models intended to estimate teacher effects on student achievement shouldn’t include student

characteristics outside teachers’ control. But it does suggest that the relationship between teacher

evaluation scores and student achievement does not depend on taking into account these student

characteristics. So this relationship does not need to be discounted by those who object, for substantive or

symbolic reasons, to taking student background characteristics into account in evaluating teacher

performance. A related question, however, deserves more analysis: do teachers whose students are

primarily low income, non-white, or limited in English proficiency have less opportunity to teach in ways

that allow them to earn high evaluation scores? Those who have studied disproportionate certification


rates for non-white teachers by the National Board for Professional Teaching Standards have raised this

possibility with respect to the National Board Standards (Bond, 1998). We intend to investigate this issue

by analyzing teachers’ evaluation scores as a function of classroom and school demographic composition

and prior achievement. This will give an indication of whether evaluation scores are lower for teachers of

low-income, non-white, or limited English proficient students.

Our third research question involved the relationship between teacher evaluation scores and

teacher experience, and whether controlling for experience changes the relationship between evaluation

scores and student achievement. With respect to the first issue, there was, as might be expected, a

relationship between rated performance and experience, but the relationship weakens as experience

increases past 5 or so years. That beginning teachers are on the average weaker performers should be

expected, both because experience is needed to apply the lessons of teacher preparation, and because

poorer performers are more likely to leave over time. One can interpret this finding to support the

construct validity of the teacher evaluation scores as measures of teacher performance because they have

the expected relationship with experience: stronger at lower and weaker at higher experience levels.

With respect to the second part of the question, controlling for teacher experience reduced the

strength of the evaluation score – student achievement relationship slightly at Vaughn and in Washoe, but

strengthened it slightly in Cincinnati. This is evidence that the teacher evaluation scores are measuring

something different than experience. They add to the explanation of student achievement over and above

teacher experience. In our analyses, teacher position on the step part of the salary schedule had a weaker

relationship to student achievement than evaluation scores. This suggests that pay for performance could

be a more appropriate basis for rewarding teachers for their contribution to student achievement than systems

based on seniority, as argued by Odden and Kelley (2002).

Limitations and Directions for Future Research

As in our previous research, a substantial number of students in Cincinnati and Washoe could not

be included in the analyses because of missing test information (especially prior year test scores),


movement between schools during the year, and difficulties linking students to teachers. While we doubt

that the overall results would have changed very much, it would be reassuring to have been able to

include a higher proportion of the enrolled students. Future work will explore the use of propensity scores

and missing data imputation to estimate the possible bias in our current results from excluding students

with missing data.

Another limitation is that all of our analyses have involved teachers and students at the

elementary and (in Cincinnati) middle school levels. This limitation is due to the lack of testing in all

grades at the high school level in Cincinnati and Washoe, and difficulties in linking teachers with students

for specific subjects after sixth grade in Washoe. Thus we do not know if scores from these evaluation

systems for high school teachers also predict average student achievement. This is an important issue

because in our interviews with teachers, some at the high school level told us that the evaluation standards

were oriented to the elementary and middle school levels.

As mentioned briefly above, we also experimented with models that included random slopes for

the effect of prior year student achievement on current year achievement. Unlike the results we reported

last year, this year there appeared to be more consistent evidence for differences in slopes across

classrooms. Though our preliminary results (not reported here due to time and space limitations)

suggest that including random slopes does not alter the correlations between teacher evaluation scores and

EB intercept residuals (representing average classroom student achievement) very much, it does

complicate the question of the relationship between teacher evaluation scores and student achievement. If

all classrooms have similar relationships between current and prior year test scores, the correlation

between EB intercept residuals and evaluation scores tells the essential story about how the evaluation

scores relate to student achievement (controlling for prior achievement). But if the relationship between

current and prior student achievement varies with the level of prior achievement, it is possible that

teachers are differentially effective for those with low versus high prior achievement. The average level

of classroom achievement may no longer be a sufficient criterion for assessing the validity of the teacher

evaluation scores. If one teacher’s classroom has high average achievement (after controlling for prior


achievement) but students with lower prior achievement do worse than similar students of a teacher

whose classroom has lower average achievement, it is simplistic to conclude that the former is the ‘better’

teacher. Random slopes for ethnicity and free/reduced price lunch status raise a similar concern.

Preliminary analyses of the relationship between classroom slopes and teacher evaluation scores have not

shown a consistent relationship. The issue is addressed further using data from Washoe in the companion

paper by Borman and Kimball. We also hope to do more work on this issue using data from Cincinnati

and Vaughn.

Lastly, it would be useful to replicate these results again. We hope to collect one more year of

evaluation and student achievement data for Cincinnati and Washoe, and two more for Vaughn. Since

conditions at each of the sites are constantly changing, the persistence of the evaluation scores – student

achievement relationship cannot be taken for granted. Not only are minor modifications made in the

evaluation systems, but new evaluators are introduced and the population of teachers evaluated changes.

Since evaluators are an important part of these relatively high inference performance assessment systems,

including more evaluators by analyzing more years of data helps to provide a more generalizable estimate

of the evaluation score-student achievement relationship. We are also studying the ‘validity’ of specific

evaluators’ scores, and the process by which evaluators make rating decisions. We hope this work will

contribute to a better understanding of standards-based teacher evaluation as it plays out in the field.

References

Bond, L. (1998). Disparate impact and teacher certification. Journal of Personnel Evaluation in Education, 12(2), 211-220.

Danielson, C. (1996). Enhancing Professional Practice: A Framework for Teaching. Alexandria, VA: Association for Supervision and Curriculum Development.

Danielson, C., and McGreal, T.L. (2000). Teacher Evaluation to Enhance Professional Practice. Alexandria, VA: Association for Supervision and Curriculum Development.

Gallagher, H.A. (2004). Vaughn Elementary's innovative teacher evaluation system: Are teacher evaluation scores related to growth in student achievement? To be published in the Peabody Journal of Education, Spring, 2004.

Hunter, J.E., and Schmidt, F.L. (1990). Methods of Meta-Analysis: Correcting Error and Bias in Research Findings. Newbury Park, CA: Sage.

Kimball, S.M. (2002). Analysis of feedback, enabling conditions and fairness perceptions of teachers in three school districts with new standards-based evaluation systems. Journal of Personnel Evaluation in Education, 16(4), 241-268.

Kimball, S.M., Milanowski, A.T., and Heneman, H.G. III. (2003, November). Research results and formative recommendations from the study of the Washoe County teacher performance evaluation system. Paper presented at the American Evaluation Association Seventeenth Annual Meeting, Sparks, Nevada.

Kimball, S.M., White, B., Milanowski, A.T., and Borman, G. (2004). Examining the relationship between teacher evaluation and student assessment results in Washoe County. To be published in the Peabody Journal of Education, Spring, 2004.

King Rice, J. (2003). Teacher Quality: Understanding the Effectiveness of Teacher Attributes. Washington, DC: Economic Policy Institute.

Milanowski, A.T. (2004). The relationship between teacher performance evaluation scores and student achievement: Evidence from Cincinnati. To be published in the Peabody Journal of Education, Spring, 2004.

Milanowski, A.T., and Heneman, H.G. III. (2001). Assessment of teacher reactions to a standards-based teacher evaluation system: A pilot study. Journal of Personnel Evaluation in Education, 15(3), 193-212.

Milanowski, A.T., and Kimball, S.M. (2003, April). The Framework-based teacher performance assessment systems in Cincinnati and Washoe. Paper presented at the Annual Meeting of the American Educational Research Association, Chicago, IL.

Moore Johnson, S. (1990). Teachers at Work: Achieving Success in Our Schools. New York: Basic Books.

Odden, A., Borman, G., and Fermanich, M. (2004). Assessing teacher, classroom, and school effects, including fiscal effects. To be published in the Peabody Journal of Education, Spring, 2004.

Odden, A., and Kelley, C. (2002). Paying Teachers for What They Know and Do: New and Smarter Compensation Strategies to Improve Schools (2nd ed.). Thousand Oaks, CA: Corwin Press.

Peterson, K.D. (2000). Teacher Evaluation: A Comprehensive Guide to New Directions and Practices (2nd ed.). Thousand Oaks, CA: Corwin Press.

Porter, A.C., Youngs, P., and Odden, A. (2001). Advances in teacher assessments and their uses. In V. Richardson (Ed.), Handbook of Research on Teaching (4th ed., pp. 259-297). Washington, DC: American Educational Research Association.

Shadish, W.R., and Haddock, C.K. (1994). Combining estimates of effect size. In H. Cooper and L.V. Hedges (Eds.), The Handbook of Research Synthesis. New York: Russell Sage Foundation.

Stiggins, R.J., and Duke, D. (1988). The Case for Commitment to Teacher Growth: Research on Teacher Evaluation. Albany, NY: State University of New York Press.


Appendix 1 – Cincinnati Details

Model Used to Generate Empirical Bayes Intercept Residuals Used in Analyses

At level 1, the test score in the year evaluated was modeled as a function of the same-subject test score from the prior year, the number of days the student was enrolled at the school where testing took place, and dummy variables for gender, limited English proficiency, receipt of free/reduced price lunch, special education status, and African-American ethnicity. At level two, a random intercept was specified, but no predictors were included. All slopes were considered fixed.

Measures of Student Achievement

The measures of student achievement used are shown in Table 1. Because no science test was given in third grade for 2001-02, both the prior year reading and mathematics tests given in grade three were used as controls for prior student achievement in the fourth grade science analysis.

Table 1
Student Achievement Measures by Grade

Grade   2002-03 Test             2001-02 Test (Control for Prior Achievement)
3       District Test            District Test (reading and math for science)
4       State Proficiency Test   District Test
5       TerraNova                State Proficiency Test
6       State Proficiency Test   District Test
7       District Test            State Proficiency Test
8       TerraNova                District Test

Number of Students in Analyses

As was the case in last year's analyses, a considerable number of students on the district roster in the Spring of 2003 could not be included due to missing data. Most students were lost from the analyses because there was no test data from the prior school year. A considerable number was lost because they changed schools mid-year. As was done last year, students not enrolled in the school where they were tested for at least 71 days were dropped from the analyses. A small number of students were dropped because they could not be matched to a teacher for the subjects tested. A few students (no more than 10-15 per grade/subject combination) were excluded because their scores on the current year test or the prior year test were outliers. Table 2 below shows the number of students in the district's schools by grade and subject, the number tested in 2002-03, and the number used in the analyses.


Table 2
Number of Students on Roster, Tested, and Whose Scores Were Used in Analyses – Cincinnati

                                     3rd     4th     5th     6th     7th     8th
Total Number of Students on
CPS Roster, Spring 2003            3,556   3,353   3,215   3,337   3,449   3,054
Number Tested in 02-03
  Reading                          3,374   3,270   3,171   3,197   2,953   2,629
  Math                                 -   3,268   3,164   3,200   2,929   2,611
  Science                              -   3,265   3,150   3,191   2,919   2,601
Number Used in HLM Models
  Reading                          2,827   2,506   2,364   2,373   2,129   1,975
  Math                                 -   2,410   2,386   2,410   2,102   1,951
  Science                              -   2,394   2,400   2,397   2,044   1,994

Analysis Results by Grade and Subject

Table 3 shows the correlations between the EB intercept residuals and the total teacher evaluation score by grade and subject, along with the combined correlations across grades by subject, the standard errors of the combined correlations, and the confidence intervals. These correlations were calculated using the EB intercept residuals of those teachers with three or more students tested. As mentioned in the main body of the paper, there is considerable variation in the size and even the sign of the correlations. No results for grade 3 in science and math are shown because no tests in those subjects were given in grade 2 in the prior year.

Table 3
Correlations Between Empirical Bayes Intercept Residuals and Total TES Score, by Grade and Subject
(Number of teachers in parentheses)

Grade                       Reading       Math         Science
3                           .35 (35)      -            -
4                           .07 (24)      .32 (19)     .12 (20)
5                           .23 (24)      .09 (22)     .13 (20)
6                           .35 (24)      .48 (22)     -.15 (20)
7                           .42 (15)      .25 (14)     -.29 (11)
8                           .27 (12)      .74 (8)      -.32 (5)
Combined                    .28           .34          -.02
Standard Error              .11           .12          .12
95% Confidence Interval     .11 to .44    .12 to .53   -.09 to .46
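The combination formula behind the last three rows is not shown in the paper. The bare-bones meta-analytic approach of Hunter and Schmidt (1990), which appears in the reference list, would weight each grade-level correlation by its sample size; that this is exactly what was done here is our assumption:

$$\bar{r} = \frac{\sum_{g} N_{g}\,r_{g}}{\sum_{g} N_{g}}, \qquad \hat{\sigma}_{r}^{2} = \frac{\sum_{g} N_{g}\,(r_{g}-\bar{r})^{2}}{\sum_{g} N_{g}}, \qquad SE(\bar{r}) \approx \frac{\hat{\sigma}_{r}}{\sqrt{G}}$$

where $r_{g}$ and $N_{g}$ are the correlation and the number of teachers in grade $g$, $G$ is the number of grades combined, and the 95% confidence interval is $\bar{r} \pm 1.96\,SE(\bar{r})$.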


Table 4
Effect of a One-Level Change in Teacher Evaluation Score(a) on Student Achievement, by Grade and Subject, in Standard Deviation Units

Grade             Reading   Math   Science
3                 .24       -      -
4                 .04       .37    .09
5                 .12       .06    .09
6                 .14       .20    -.05
7                 .12       .07    -.24
8                 .08       .21    -.13
Weighted Avg.     .14       .18    -.01

(a) For example, from 'proficient' to 'distinguished' on all four domains.

Model Parameter Estimates

A complete set of results from the HLM analyses of the data, including model coefficients and standard errors, is available from the authors on request.
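A closing note on Table 4: the conversion to standard deviation units is not spelled out in the appendix. A plausible reading, given the models above, is that the level 2 coefficient on the evaluation score is multiplied by a one-level change and divided by the student-level standard deviation of the outcome test:

$$\Delta_{SD} = \frac{\hat{\gamma}_{01} \times 1\ \text{level}}{SD(Y)}$$

For example (hypothetical numbers), a coefficient of 10 scale-score points per evaluation level on a test with a standard deviation of 50 points would imply an effect of 0.20 student-level standard deviations.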


Appendix 2 – Vaughn Details

Site-Specific Two-Level Models

At level 1, the basic models used in the analyses of the data from this site had the reading, mathematics, or language arts test score of the student in the year the teacher was evaluated on the left-hand side, and on the right-hand side the corresponding prior-year test score and dummy variables for grade level (with grade 2 as the left-out category). Models including student demographic characteristics added days of attendance and dummy variables for gender, special education status, retention in grade from the prior year, beginning English proficiency status, fully proficient status (so the left-out category for proficiency was partial proficiency), and whether the parent had finished high school. Following the initial analysis of Vaughn data done by Gallagher (2004), missing values for the attendance and parental education variables were imputed based on school means. Dummy variables were added to indicate imputation, though none of these had coefficients that were statistically significant at conventional levels (a code sketch of this imputation follows Table 1 below). Variables for free/reduced price lunch and ethnicity were not included, since all students were eligible for the former and the vast majority of students are of Hispanic ethnicity.

At level 2, all models specified a random intercept and fixed slopes for all level 1 predictors. The models used to obtain the empirical Bayes residuals representing classroom average student achievement had no predictors at level 2. Other models included evaluation scores and years of tenure at Vaughn as predictors of the random intercepts.

Student Achievement

At Vaughn, the Stanford 9 was the measure of student achievement used in all grades in both years. Because of this uniformity, and because this test is intended to be scaled so that one year's scores are comparable with the next's, we combined grades in the Vaughn models and included grade-level dummy variables, rather than estimating separate models by grade.

Number of Students in the Analyses

At Vaughn, there was proportionately less loss of students due to missing data than at the other sites; the students who were not included were excluded mainly because of missing prior-year test scores. Table 1 shows the number of students on the roster in grades 2-5, and those tested and used in the analyses, by subject.

Table 1
Number of Students on Roster, Tested, and Whose Scores Were Used in Analyses – Vaughn

                                           Reading   Math   Language Arts
Total Number of Students on Roster,
Spring 2003                                  580      580        580
Number Tested in 02-03                       561      570        578
Number Used in HLM Models                    537      549        541
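The dummy-variable imputation described above can be sketched in code as follows. This is our illustration of the procedure, with made-up column names; since Vaughn is a single school, the "school mean" is simply the mean over all students in the file:

```python
import pandas as pd

def impute_with_flag(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Fill missing values in `col` with the school-wide mean and add a
    0/1 dummy recording which values were imputed."""
    df = df.copy()
    df[f"{col}_imputed"] = df[col].isna().astype(int)  # imputation indicator
    df[col] = df[col].fillna(df[col].mean())           # school-mean imputation
    return df

# Hypothetical usage for the two variables with missing data:
# students = impute_with_flag(students, "attendance_days")
# students = impute_with_flag(students, "father_hs_grad")
```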


Parameter Estimates for Random Intercept Model Including Student Demographic Characteristics and Subject-Specific Domain Evaluation Scores as Level 2 Predictors of Random Intercepts

Table 2
Model Coefficients and Standard Errors – Random Intercept with Controls for Student Characteristics
(Standard errors in parentheses)

                                      Reading        Math           Language Arts
Level 1 Variables
Intercept                             568.13 (-)     564.69 (-)     582.65 (-)
Prior-Year Test Score                 0.71 (0.3)     0.68 (0.03)    0.56 (0.03)
Special Ed Status (=1)                -2.49 (5.46)   1.70 (6.39)    -0.11 (5.47)
Gender (F=1)                          2.46 (1.60)    0.39 (1.76)    0.77 (1.88)
Retained (=1)                         -2.09 (5.62)   2.42 (6.01)    6.89 (6.13)
Grade 3 (=1)                          -3.57 (2.66)   -19.12 (5.88)  -6.02 (4.35)
Grade 4 (=1)                          6.52 (2.90)    -7.04 (6.10)   15.68 (4.43)
Grade 5 (=1)                          -3.61 (3.31)   -11.20 (6.51)  4.91 (4.72)
Beginning English (=1)                -7.25 (2.54)   -4.91 (3.02)   -11.08 (3.14)
Proficient English (=1)               -0.11 (3.42)   1.86 (3.87)    3.84 (3.84)
Attendance (Days)                     -0.01 (0.12)   0.04 (0.13)    0.27 (0.14)
Father HS Grad (=1)                   0.60 (1.74)    -2.52 (1.96)   -1.28 (2.09)
Imputed Attendance                    -4.44 (4.33)   -3.74 (5.13)   -3.25 (5.33)
Imputed Father Educ.                  2.18 (3.58)    7.66 (5.42)    12.99 (5.03)
Level 2 Variables
Subject-Specific Evaluation Score     10.23 (2.22)   14.10 (2.93)   7.78 (3.28)

Note: All Level 1 variables were grand mean centered. A complete set of results from the other models, including coefficients and standard errors, is available from the authors on request.
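Because all level 1 variables were grand mean centered, that is,

$$x_{ij}^{*} = x_{ij} - \bar{x}_{\cdot\cdot},$$

the intercepts in Table 2 can be read as the predicted scale score for a student at the sample average on every level 1 variable, and the level 2 coefficient as the predicted change in classroom mean achievement per one-point change in the subject-specific evaluation score (roughly 10 scale-score points in reading, for example).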


Appendix 3 – Washoe Details

Washoe Two-Level Model

At level 1, the models regressed the 2002-03 test results on the prior-year test results in the same subject and on dummy variables representing student gender, non-white ethnicity, special education status, and qualification for free and reduced-price lunch. At level 2, all models specified a random intercept and fixed slopes for each level 1 predictor. The models for obtaining the empirical Bayes residuals used to represent classroom average student achievement had no predictors at level 2. To predict the average intercept in other level 2 models, variables were added for the teacher evaluation composite measure and teaching experience, measured as the step designation on the district salary schedule.

Measures of Student Achievement

The student achievement measures used in the analysis are shown in Table 1 below. The tests included criterion-referenced tests developed by the district for grades 4 and 6, state criterion-referenced tests developed by Harcourt Brace for grades 3 and 5, and the TerraNova norm-referenced exam (Comprehensive Test of Basic Skills) administered in grade 4. The exam sequence is intended to assess student performance relative to state and district academic standards. Results from these assessments for mathematics and reading were used in the analyses.

Table 1
Student Achievement Measures by Grade

Grade   2002-03 Test                         2001-02 Test
4       District Criterion-Referenced Test   State Proficiency Test
5       State Proficiency Test               TerraNova
6       District Criterion-Referenced Test   State Proficiency Test
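For reference, in a random-intercept model with no level 2 predictors, the empirical Bayes residual for classroom $j$ takes the familiar shrinkage form (our notation, from standard HLM treatments; the same logic applies at all three sites):

$$\hat{u}_{0j}^{EB} = \hat{\lambda}_{j}\,\hat{u}_{0j}^{OLS}, \qquad \hat{\lambda}_{j} = \frac{\hat{\tau}_{00}}{\hat{\tau}_{00} + \hat{\sigma}^{2}/n_{j}}$$

where $\hat{u}_{0j}^{OLS}$ is the average level 1 residual for classroom $j$ and $n_{j}$ is its number of students, so estimates for classrooms with fewer students are shrunk more strongly toward zero.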

Number of Students in Analyses

A large number of students who were tested on the outcome measures could not be included in the study. Most of these students were lost because their teachers were not assessed on the evaluation measure used in the analyses. Others were lost because they did not have pre-test results or because their teacher could not be identified from the district student database available at the time of this study. Table 2 shows the number of students tested and the number included in the models at each grade level.

Table 2
Number of Students on Roster, Tested, and Whose Scores Were Used in Analyses – Washoe

                                       4th     5th     6th
Total Number of Students on
Roster, Spring 2003                  4,764   4,910   4,914
Number Tested in 02-03
  Reading                            4,500   4,474   4,520
  Math                               4,539   4,484   4,533
Number Used in HLM Models*
  Reading                            2,527   2,176   2,632
  Math                               2,527   2,176   2,632

* Listwise exclusion was used in creating the models.

Table 3
Correlations Between Empirical Bayes Intercept Residuals and Total TES Score, by Grade and Subject
(Number of teachers in parentheses)

Grade                       Reading       Math
4                           .29 (131)     .22 (131)
5                           .09 (135)     .32 (135)
6                           .27 (131)     .17 (131)
Combined                    .25           .24
Standard Error              .05           .05
95% Confidence Interval     .15 to .34    .14 to .33

Table 4
Effect of a One-Level Change in Teacher Evaluation Score(a) on Student Achievement, by Grade and Subject

Grade               Reading   Math
4                   .16       .17
5                   .11       .24
6                   .22       .21
Weighted Average    .16       .21

(a) For example, from 'proficient' to 'distinguished' on all four performance components.
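The paper does not state the weights behind the "Weighted Average" row, but the printed values are consistent with weighting each grade's effect by the number of teachers reported in Table 3. A quick check (our code, illustrative only):

```python
# Reproduce Table 4's weighted averages using the Table 3 teacher
# counts (131, 135, 131 for grades 4, 5, and 6) as weights.
teachers = [131, 135, 131]
effects = {"reading": [0.16, 0.11, 0.22], "math": [0.17, 0.24, 0.21]}

for subject, vals in effects.items():
    avg = sum(n * v for n, v in zip(teachers, vals)) / sum(teachers)
    print(subject, round(avg, 2))  # reading 0.16, math 0.21
```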


Model Parameter Estimates

Table 5 provides results from the full random intercept model, with student pretest and demographic characteristics at Level 1 and the teacher evaluation score composite and experience measure as Level 2 predictors of the random intercepts (average classroom achievement).

Table 5
Grade 4-6 Model Coefficients and Standard Errors – Random Intercept with Controls for Student Characteristics
(Standard errors in parentheses)

                                      Grade Four                  Grade Five                   Grade Six
                                Reading       Math          Reading        Math          Reading      Math
Level 1 Variables
Intercept                       25.23 (.87)   27.36 (1.03)  292.30 (5.88)  275.70 (9.49) 23.81 (.93)  21.33 (2.0)
Pretest                         .06 (.00)     .05 (.00)     1.13 (.03)     1.28 (.03)    .05 (.00)    .05 (.00)
Free and reduced lunch (=1)     -.43 (.23)    -.40 (.27)    -6.69 (1.86)   -1.43 (2.29)  -1.03 (.23)  -.46 (.24)
Special education (=1)          -3.21 (.39)   -2.77 (.36)   -11.10 (4.00)  -14.18 (3.77) -3.16 (.45)  -3.86 (.42)
Gender (F=1)                    .15 (.18)     -.26 (.18)    4.02 (1.61)    -.70 (1.62)   .13 (.16)    -.40 (.19)
Ethnicity (non-white=1)         -1.21 (.21)   -.90 (.22)    -3.35 (1.92)   -7.52 (1.86)  -1.47 (.23)  -.82 (.19)
Level 2 Variables
Teacher Evaluation Composite    1.08 (.35)    .94 (.45)     5.16 (2.53)    13.68 (3.96)  .86 (.37)    .97 (.82)
Teacher Experience              .01 (.02)     .03 (.03)     .07 (.21)      .04 (.29)     .03 (.02)    .06 (.05)

Note: All Level 1 variables were grand mean centered. A complete set of results from the other models, including coefficients and standard errors, is available from the authors on request.