C O N S O R T I U M F O R P O L I C Y R E S E A R C H I N E D U C A T I O N University of Pennsylvania • Harvard University • Stanford University University of Michigan • University of Wisconsin-Madison
Wisconsin Center for Education Research, University of Wisconsin-Madison 1025 West Johnson Street, Room 653, Madison, WI, 53706-1796 ■ Phone 608.263.4260 ■ Fax 608.263.6448
The Relationship Between Standards-Based
Teacher Evaluation Scores and Student Achievement: Replication and Extensions at Three Sites
Anthony T. Milanowski
Consortium for Policy Research In Education University of Wisconsin-Madison
Madison, WI 53706 (608) 262-9872
Steven M. Kimball Consortium for Policy Research In Education
University of Wisconsin-Madison Madison, WI 53706
(608) 265-6201 [email protected]
Brad White
Consortium for Policy Research In Education University of Wisconsin-Madison
Madison, WI 53706
March 2004 CPRE-UW Working Paper Series
TC-04-01

This paper was prepared for the Consortium for Policy Research in Education, Wisconsin Center for Education Research, University of Wisconsin-Madison for presentation at the American Educational Research Association annual meeting held April 12-16, 2004 in San Diego, California. The research reported in this paper was supported by a grant from the U.S. Department of Education, Office of Educational Research and Improvement, National Institute on Educational Governance, Finance, Policymaking and Management, to the Consortium for Policy Research in Education (CPRE) and the Wisconsin Center for Education Research, School of Education, University of Wisconsin-Madison (Grant No. OERI-R308A60003). The opinions expressed are those of the authors and do not necessarily reflect the view of the National Institute on Educational Governance, Finance, Policymaking and Management, Office of Educational Research and Improvement, U.S. Department of Education, the institutional partners of CPRE, or the Wisconsin Center for Education Research.
Standards-based teacher evaluation represents one strategy for both improving instruction and
complying with the expectations of external stakeholders that teachers be held accountable for their
performance. Consistent with the movement for standards for students, this approach starts with a
comprehensive model or description of what teachers should know and be able to do, represented by
explicit standards covering multiple domains and including multiple levels of performance defined by
detailed behavioral rating scales. It typically requires more intensive collection of evidence, including
frequent observations of classroom practice and use of artifacts such as lesson plans and samples of
student work, in order to provide a richer picture of teacher performance. Besides the movement toward
standards for students, the roots of standards-based evaluation also include the desire to represent a more
complex conception of teaching and learning for teacher licensing and certification (Porter, Youngs, and
Odden, 2001) and the need for a comprehensive practice model to guide new teacher induction and
mentoring. Dissatisfaction with evaluation approaches that provide little guidance for teachers’ efforts to
improve practice (Moore Johnson, 1990; Stiggins and Duke, 1988) has also been an influence. One
prominent embodiment of the standards-based evaluation concept is the Framework for Teaching
(Danielson, 1996; Danielson and McGreal, 2000).
We have argued elsewhere that standards-based teacher evaluation systems constitute a
performance competency model with the potential to improve instruction by affecting teacher selection
and retention, motivating teachers to improve their skills, and promoting a shared conception of good
teaching (Milanowski and Kimball, 2003; Kimball, Milanowski, and Heneman, 2003). In essence,
standards-based teacher evaluation systems provide both incentives and guidance for teachers to change
their practice toward the model embodied in the standards. But the potential of standards-based teacher
evaluation for improving student achievement depends on the link between practices described by the
standards and student learning. Unless teaching according to the standards leads to more student learning,
implementing a standards-based evaluation system will not contribute to improved student achievement.
One type of evidence that would support the case that standards-based evaluation can lead to more student
learning is a significant empirical relationship between teaching according to the standards (as measured
by the teacher evaluation scores) and value-added measures of student achievement.
As school organizations move to standards-based evaluation systems, they should also be
interested in the reliability and validity of the evaluation scores produced, especially when these scores
are used for decisions with consequences for teachers, such as termination, tenure, and pay for
performance. One aspect of validity is the relationship between teacher performance as measured by the
evaluation system, and student learning: whether students of teachers whose performance has been rated
higher learn more. To the extent that teacher evaluation scores are empirically related to measures of
student achievement, an organization using such scores for consequential decisions has criterion-related
or empirical validity evidence that this use is justified.
From a research perspective, Odden, Borman, and Fermanich (2004) have argued that standards-
based teacher evaluation scores might be useful in research on teacher effects on student learning.
Teachers’ scores from well-designed, practice-based teacher evaluation systems could be considered
measures of instructional practice that can be used in studies that try to identify the effects of
communities, schools, and teachers on student learning (Odden, Borman, and Fermanich, 2004). But
given the poor reputation of teacher evaluation for validity (Peterson, 2002), the relationship between
standards-based teacher evaluation scores and measures of student achievement needs to be demonstrated
before using these scores in research on teacher effects or teacher quality.
We have been studying standards-based teacher evaluation in three school organizations using a
standards-based evaluation system: the Cincinnati (Ohio) Public Schools, the Vaughn Next Century
Learning Center (a charter school in Los Angeles, California) and the Washoe County (Nevada) School
District. The Framework for Teaching served as the foundation for two of the three evaluation systems we
have been studying and substantially influenced the third. Besides looking at implementation issues and
teacher reactions to the systems, we have assessed the relationship between teachers’ evaluation scores
and student achievement, as measured by value-added methods. Our initial results were reported at last
year’s AERA meeting and in articles by Gallagher (2004), Kimball, White, Milanowski, and Borman
(2004) and Milanowski (2004). These studies used a value-added approach in which elementary and
middle school students’ test scores in the year the teacher was evaluated were modeled as a function of
prior year test scores and student characteristics such as gender, ethnicity, participation in free/reduced
price lunch programs, special education status and limited English proficiency. Two-level random
intercept hierarchical linear models were used to estimate the relationship of test scores to evaluation
scores, represented as a level 2 variable. Other models with no predictors at level 2 were used to estimate
empirical Bayes intercept residuals (representing average classroom student achievement, controlling for
prior year test scores and student characteristics) that were then correlated with evaluation scores.
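The two-step procedure described above can be sketched in outline. The original analyses were run with hierarchical linear modeling software on district data; the following is a minimal numpy illustration on simulated data, in which classroom mean residuals are shrunk toward zero by the classical reliability weight τ²/(τ² + σ²/n) to approximate empirical Bayes intercept residuals. All numbers, sample sizes, and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 30 classrooms of 20 students each.
n_teachers, n_students = 30, 20
eval_scores = rng.normal(10, 2, n_teachers)  # teacher evaluation scores
# True classroom effects partly driven by rated performance (assumption).
teacher_effect = 0.4 * (eval_scores - eval_scores.mean()) + rng.normal(0, 1, n_teachers)

teacher = np.repeat(np.arange(n_teachers), n_students)
pretest = rng.normal(0, 1, teacher.size)
posttest = 5 + 0.7 * pretest + teacher_effect[teacher] + rng.normal(0, 2, teacher.size)

# Level 1: fixed-slope regression on the grand-mean-centered pre-test.
X = np.column_stack([np.ones(teacher.size), pretest - pretest.mean()])
beta, *_ = np.linalg.lstsq(X, posttest, rcond=None)
resid = posttest - X @ beta

# Classroom mean residuals, then EB shrinkage by tau2 / (tau2 + sigma2 / n_j).
means = np.array([resid[teacher == j].mean() for j in range(n_teachers)])
sigma2 = np.mean([resid[teacher == j].var(ddof=1) for j in range(n_teachers)])
tau2 = max(means.var(ddof=1) - sigma2 / n_students, 0.0)  # method of moments
eb_resid = (tau2 / (tau2 + sigma2 / n_students)) * means

# Correlate EB intercept residuals with teacher evaluation scores.
r = np.corrcoef(eb_resid, eval_scores)[0, 1]
print(round(r, 2))
```

With equal classroom sizes the shrinkage factor is a constant, so the correlation equals that of the raw classroom means; with unequal sizes, the differential shrinkage matters.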
Table 1 summarizes the results from our first set of analyses of the teacher evaluation score-
student achievement relationship in the form of average correlations between evaluation scores and
empirical Bayes intercept residuals that represent the average relative level of measured student
achievement in the classroom, controlling for prior learning and student characteristics.
Table 1
Average Correlations Between Teacher Evaluation Scores and Estimates of Average
Student Achievement Within Classrooms for Three Research Sites

                       Tested Subject
Site         Reading    Math    Other
Cincinnati     .32       .43     .27 (Science)
Vaughn         .50       .21     .18 (Language Arts)
Washoe         .21       .19      -
These results suggest that scores from standards-based teacher evaluation systems can have a
substantial relationship with measures of the student achievement of the teachers’ students. However, the
size of the relationship varied across research sites, across grades within sites, and across academic
subjects. Some of this variation is likely attributable to differences in programs across sites and across
evaluators for different grade levels or subjects (especially in Cincinnati, where evaluators from outside
each school were assigned to evaluate specific teachers based on the evaluator’s grade level and subject
experience). Measurement error in the tests of student achievement and the evaluation scores might also
explain some of the variability, and a considerable portion could be due to sampling error, since the
number of teachers included in each site, subject, and grade analysis was not that large. The largest
samples were found in Washoe, where data for about 120 teachers per grade were available, but samples
in Cincinnati and Vaughn were much smaller, at 20-40 per grade and subject. As Hunter and Schmidt
(1990) argue, much of the variation among studies of this type can be due to sampling error when samples
are this small. Thus, though the results we obtained from the first wave of data collected were promising,
they are in need of replication.
This paper reports on the results of analysis of an additional year of evaluation and student
achievement data at these research sites. It addresses three research questions:
1) What are the relationships between student achievement and teacher evaluation scores in a second set of data, and how do these relationships compare with our previously reported results?

2) How much does the relationship between teacher evaluation scores and student achievement change when controls for student characteristics (other than the pre-test) are removed?
3) What is the relationship of teacher evaluation scores to teacher experience, and does controlling for experience change the relationship between evaluation scores and student achievement?
The first question addresses the issue of replication or stability of the teacher evaluation – student
achievement relationship. Knowing whether this relationship is consistent from year to year is important
in understanding the construct validity of the evaluation scores as well as allowing users of these
standards-based systems to assess whether the criterion-related validity of the evaluation scores is limited
to specific teachers, students, and years.
The second question is motivated by concerns raised by some reviewers of our initial studies.
While most researchers looking for teacher effects on student achievement appear to believe that
controlling for student characteristics (e.g. ethnicity and socio-economic status) that may influence test
scores provides a more accurate way to assess teacher effects, it has also been argued that these
characteristics should not be considered in the assessment of teacher performance because teachers should
produce similar learning gains for all groups of students. It has also been argued that controlling for
factors like ethnicity sends the message that lower levels of achievement are expected for some students.
To assess the degree to which controlling for student characteristics affects the relationship between
evaluation scores and student achievement, an analysis was also conducted which did not include the
gender, ethnic, special education, and free/reduced lunch controls.
The third question is of interest because it could be that the teacher evaluation scores are
measuring little more than teacher experience, which in turn is related to student achievement (though, it
must be added, the research record is not consistent in linking experience and student achievement). If
teacher evaluation scores do not add to the prediction of student achievement over and above the level of
experience, then the value of investing resources in standards-based evaluation systems for use in
differentiating teacher pay is questionable. The relationship is also of interest because it bears on the
construct validity of the teacher evaluation scores. If teaching performance increases with experience in
the first few years of teaching, after which the relationship weakens (see King Rice, 2003 for a summary
of research on experience and student achievement), we would expect to see this pattern in the
relationship between evaluation scores and teacher experience at these sites.
Site Background
Cincinnati Public Schools (CPS) is a large urban district with 70 schools and programs enrolling
about 48,000 students, and employing over 3,000 teachers. It has low student achievement relative to
surrounding suburban districts, and a high proportion of African-American students and students eligible
for free or reduced-price lunch. State accountability programs and public expectations put pressure on the
district to raise average levels of student test scores. The district developed its teacher evaluation system
based on the Framework for Teaching (Danielson, 1996), utilizing Danielson’s four domains: planning
and preparation, creating an environment for learning, teaching for learning, and professionalism.
However, CPS reduced the number of standards from 24 to 16. Teachers receive comprehensive
evaluations in their first and third years, and every fifth year thereafter. The comprehensive evaluation
involved an assessment by a teacher from outside the school as well as a building administrator. In
interim years, teachers receive a less rigorous assessment from a building administrator (principal or
assistant principal). The evaluation system was initially designed to serve as the basis for a performance
pay plan as well as for formative and professional development purposes. For the period under study,
however, the evaluation system has not been used for pay purposes, and has been applied primarily to less
experienced teachers.
Vaughn Next Century Learning Center is a public charter school in San Fernando, California.
Previously a public school in the Los Angeles Unified School District, the school converted to charter
status in July of 1993. During the period of our study, the school served about 1,200 students in pre-K
through grade 5. The student population is 94% Hispanic, and most students are not considered to be
English proficient. Almost all students are eligible for free or reduced-price lunch. Vaughn has more
than 70 staff, of which about 40 are K-5 teachers. Vaughn began developing its evaluation system during
the 1997-98 school year and implemented it for volunteers in the 1998-99 school year. The school began to
use the system for all teachers the next year. Teachers are evaluated yearly, and the evaluation results are
used as the basis for a pay for performance system, as well as developmental and accountability purposes.
Washoe County School District is a large western district that includes the cities of Reno and
Sparks, Nevada and surrounding communities. There are 88 schools, over 60,000 students and about
3,300 teachers in the district. The district has been using a standards-based teacher evaluation system
adapted from the Framework for Teaching since 2000. According to state law, teachers must undergo a
performance evaluation each year. Teachers in Washoe County are evaluated on different
performance domains depending on their stage in the evaluation cycle. Teachers in their first or second
year of probation (pre-tenure) are evaluated on all four performance domains. Post-probationary teachers
are evaluated on one or two domains, depending on whether they are in a “minor” or “major” evaluation
year. The evaluation process was designed to provide a common framework for evaluation discussions
and promote teacher reflection on practice. Evaluation decisions have no direct bearing on salary, but do
serve as the basis for summative evaluation decisions, such as contract renewal and tenure.
Further information about these sites and references to additional descriptive material can be
found in articles by Gallagher (2004), Kimball (2002), Kimball, White, Borman and Milanowski (2004),
Milanowski (2004), and Milanowski and Heneman (2001).
Method
Analyses
All of the analyses were based on the value-added paradigm, using two-level hierarchical linear
models in which individual student achievement on a subject test is represented as a function of the prior
year’s test score in that subject, and a variety of student-level control variables intended to represent
factors associated with test performance but that are not in teachers’ control, such as ethnicity and English
proficiency. At level two (the classroom level), one variation of the analysis incorporated teacher
evaluation score as a predictor of the random intercepts for each classroom, representing the average level
of student achievement, controlling for prior year test score and student demographic characteristics.
Another variation used a random intercept model to obtain empirical Bayes (EB) intercept residuals,
representing the average student performance in each teacher’s classroom, and then correlated these
residuals with teacher evaluation scores. For each approach, the basic level 1 model was:
Post-test = β0 + β1(pretest) + β2X2 + … + βnXn + R

where X2 … Xn represent various student characteristics such as gender, ethnicity, or free and reduced
price lunch status. All level 1 predictors were grand-mean centered. It should be noted that different
control variables were available and appropriate at each site. Information on the specific models used for
each site is shown in the Appendices. The first set of analyses, used to calculate the correlations
between the EB intercept residuals and teacher evaluation scores, used a simple level 2 specification:
β0j = γ00 + u0j
At level two, u0j represents the teacher-specific deviation from the average of the group intercepts.
The EB residuals from this model were used as the measure of the average student performance relevant
to each teacher (i.e. classroom average achievement). Given the grand mean centering, the EB intercept
residuals represent the difference for the “average” student: average in prior year test score and other
characteristics at level 1. The slopes for all Level 1 variables were treated as fixed. For two sites,
Cincinnati and Washoe, the analyses were done by grade and subject.
The empirical Bayes intercept residuals were then correlated with teacher evaluation scores for
those teachers for which evaluation scores were available. Correlations were then combined across grades
in order to obtain a summary estimate of the relationship between evaluation scores and student
achievement. By analogy with meta-analysis, each grade within a subject was treated as a separate study
and the correlations combined using the standard formulas for a random effects treatment.1 Upper and
lower bounds for the 95% confidence intervals were also calculated.
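The combination step can be sketched as follows. The paper cites Shadish and Haddock (1994) for the formulas; this sketch assumes the common DerSimonian-Laird estimator for the between-grade variance, which may differ in detail from the computation actually used. The example correlations and sample sizes are hypothetical.

```python
import numpy as np

def combine_correlations(rs, ns):
    """Random-effects average of per-grade correlations via Fisher's r-to-z,
    with inverse-variance weights and a 95% confidence interval."""
    rs, ns = np.asarray(rs, float), np.asarray(ns, float)
    z = np.arctanh(rs)            # r-to-z transformation
    v = 1.0 / (ns - 3)            # within-grade variance of z
    w = 1.0 / v
    z_fixed = np.sum(w * z) / np.sum(w)
    # DerSimonian-Laird estimate of between-grade variance tau^2 (assumption).
    q = np.sum(w * (z - z_fixed) ** 2)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max((q - (len(rs) - 1)) / c, 0.0)
    w_re = 1.0 / (v + tau2)       # random-effects weights
    z_bar = np.sum(w_re * z) / np.sum(w_re)
    se = np.sqrt(1.0 / np.sum(w_re))
    lo, hi = z_bar - 1.96 * se, z_bar + 1.96 * se
    # Transform the average and its bounds back to the correlation metric.
    return np.tanh(z_bar), (np.tanh(lo), np.tanh(hi))

# Hypothetical per-grade correlations with 25 teachers per grade.
r_avg, ci = combine_correlations([0.20, 0.35, 0.30], [25, 25, 25])
```

When the observed between-grade variation is no larger than sampling error implies, tau² is truncated at zero and the result coincides with the fixed-effects average.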
Another approach used a more complex level 2 specification to obtain estimates of the effect of
rated teacher performance and experience differences on test scores. These analyses included the
evaluation score and a measure of teacher experience, as level two predictors of the random intercepts
representing classroom average student achievement:
β0j = γ00 + γ01 evaluation score + γ02 experience measure + u0j.
Here, the coefficients for evaluation score and experience estimate the effect in test score points of
variations in rated teacher performance and teacher experience. Again, the slopes for all Level 1 variables
were treated as fixed.
These models were used to maximize comparability to the results we obtained in our previous
studies. Last year, though we investigated the variance in slopes for some level one predictors, especially
1 An r-to-z transformation was done and a weighted average of the z’s was calculated with the inverses of the variances as weights. Standard errors and 95% confidence intervals were calculated for this average. These values
prior test score, there was no consistent indication of reliable slope variance across grades and subjects.
So for this round of analyses, we continued to use fixed slope models, though we did investigate some
models with random slopes. The results of these analyses are discussed below and in the paper by
Borman and Kimball.
Measures
Teacher Evaluation Scores
In Cincinnati, teachers undergoing the comprehensive evaluation received a score on each one of
four domains: planning and preparation, creating an environment for learning, teaching for learning, and
professionalism, based on scores on the dimensions, called standards, within each domain. For each
standard, raters considered a four-level rating scale or ‘rubric’ defining unsatisfactory, basic, proficient,
and distinguished performance. For two domains (environment for learning and teaching for learning)
teachers’ performance was evaluated based on six classroom observations. Four of these were made by a
teacher evaluator from outside the school who had subject matter and grade level expertise similar to that
of the teacher being evaluated. Building administrators (principals and assistant principals) did the other
two observations. Based on summaries of the six observations, teacher evaluators made a final summative
rating on each of the standards in these domains. Administrators, based on a portfolio including lesson
and unit plans, attendance records, student work, family contact logs, and documentation of professional
development activities, rated teachers on the standards in the planning and professionalism domains.
Standard-level scores were then aggregated to a domain level score for each of the four domains using
tables provided by the district. As mentioned above, the standards and rubrics were adapted by the district
from Danielson’s Framework for Teaching. The scores on the four domains were added to yield a
composite evaluation score to represent overall teacher performance. The average intercorrelation
between domain scores for all the teachers evaluated in 2002-03 was .56, and coefficient alpha was .84.
were then transformed back into correlation coefficients. See Shadish and Haddock (1994) for a description of the details of these calculations.
Because CPS only evaluates a subset of teachers each year, complete evaluation scores were
available for 318 teachers for the 2002-03 school year. But because most of the teachers evaluated
taught subjects or grades for which no state or district standardized tests were given, and because teachers
with fewer than three students tested were excluded, evaluation scores for only 131 unique teachers were
included in the analysis. (Some teachers appear in two or more subject/grade analyses.)
At Vaughn, the teacher evaluation system included 12 domains: lesson planning and classroom
management, plus subject-specific domains covering literacy, language development, mathematics,
special education, history and social science, science, instruction in primary language for English
learners, arts, technology, physical education, and teaming. Vaughn teachers are assessed two times per
year using a four-level rating scale with unsatisfactory, basic, proficient, and distinguished levels. An
administrator, a peer, and the teacher her/himself rate performance on the applicable domains based on classroom
observations, discussions, and review of artifacts. The average for the two semesters was used as the
evaluation score for each domain. Since not all the domains apply to all teachers, our analyses have
concentrated on the five that are applied to all: lesson planning, classroom management, literacy,
language development, and mathematics. For the main analyses, we assessed the relationship of student
reading achievement to the literacy teacher evaluation scores, of math achievement to the math evaluation
scores, and language arts achievement to language development evaluation scores. This was done
because it was expected that a subject-specific evaluation would be more strongly related to student
achievement than scores from other subject-specific evaluations or from the more generic planning and
classroom management scores, and to replicate Gallagher’s (2004) analysis.
In Washoe, the four Framework domains of Planning and Preparation, Classroom Environment,
Instruction, and Professional Responsibilities are the basis for the system. Each domain contains multiple
elements which are rated by principals or assistant principals, using rubrics closely based on those in
Danielson’s 1996 book. Evidence may include a teacher self-assessment, a pre-observation data sheet
(lesson plan), classroom and non-classroom observations with pre- and post-observation conferences,
instructional artifacts (e.g., assignments and student work), a reflection form, a three-week unit plan, and
logs of professional activities and parent contacts. The system provides for three types of evaluation:
probationary, post-probationary major, and post-probationary minor. Teachers new to the district are
considered probationary and are evaluated on all four of the performance domains, where they must meet
at least level 1 (target for growth) scores on all 68 elements. Probationary teachers are observed at least
nine times over three periods of the year. Teachers in post-probationary status undergo a ‘major
evaluation’ on two performance domains. They are formally observed three times over the course of the
year. In the next two years, they receive “minor” evaluations, focusing on one domain and involving at
least one observation during the year. Over the course of the three year major-minor cycle, teachers are
evaluated on all four domains, but most are not evaluated on all domains each year. However, if a teacher
is not evaluated on the Instruction Domain, they are evaluated using a supplemental evaluation form with
four dimensions consisting of selected components and elements from the Planning and Preparation and
Instruction domains. Evaluators rate these dimensions using four performance designations (i.e.,
unsatisfactory, target for growth, proficient, and area of strength). Because these dimension ratings were
available for almost all teachers, they were used as the measure of teacher performance in our analyses.
The scores on the four performance dimensions were averaged to derive a single indicator of teacher
performance. The average correlation among the dimensions was .72, and coefficient alpha was .91.
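As a check on the internal-consistency figures reported above for Cincinnati and Washoe, coefficient alpha can be recovered from the average intercorrelation using the standardized (Spearman-Brown) form, assuming roughly equal component variances:

```python
def standardized_alpha(avg_r, k):
    """Standardized coefficient alpha for k components with
    average intercorrelation avg_r (Spearman-Brown form)."""
    return k * avg_r / (1 + (k - 1) * avg_r)

# Cincinnati: four domain scores with average intercorrelation .56
print(round(standardized_alpha(0.56, 4), 2))  # → 0.84
# Washoe: four dimension ratings with average correlation .72
print(round(standardized_alpha(0.72, 4), 2))  # → 0.91
```

Both reported alphas match the values implied by the reported average correlations, which is what one would expect if the component variances are similar.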
Teacher Experience
At each site, information on teacher experience was also collected. In Cincinnati and Washoe,
teachers’ position on the salary schedule (their ‘step’) was used as an indicator of experience. This
measure was chosen because a count of years employed by the district would not include teaching
experience in other districts. Because both sites gave teachers credit on the pay schedule for prior
experience, the step is a better indicator of total experience for many teachers than years of service with
the district. However, it should be noted that teachers ‘top out’ at the highest step, after which the step
no longer represents their relative experience. This was less of a problem in Cincinnati than in Washoe,
because relatively few highly senior teachers in Cincinnati were evaluated during 2002-03, and the
Cincinnati schedule had more steps. At Vaughn, the only measure of experience available was the
teacher’s years of service with the school. Since some teachers were hired with prior experience, the years
of school service measure is not always an accurate reflection of total professional experience.
Student Achievement
For each site, student achievement was represented by test scores on standardized, largely closed-response
instruments. In Cincinnati, we used scores on state criterion-referenced tests given in grades 4
and 6, and district ‘off-year’ tests, developed by test publishers or the TerraNova, administered in grades
2, 3, 5, 6, and 7. For Vaughn, the SAT-9 was used for all grades and both years. In Washoe, scores on
state criterion-referenced tests from grades 3 and 5, district-developed criterion-referenced tests for grade
6, and the TerraNova test given in grade 4, were used. The tests used for each site and grade are listed in
Appendix Table 1. These data were provided by the districts or the test publishers.
Other Student Characteristics

Data on student characteristics, typically gender, ethnicity, English proficiency, special education
status, and free and reduced lunch were provided to us by the sites.
Results
Replication of Previous Analyses
This section summarizes the results of the analyses of the teacher evaluation score – student
achievement relationship we have done on a second year of data from our Cincinnati, Vaughn, and
Washoe sites. Table 2 reports the average correlations within each site between the total teacher
evaluation scores and the empirical Bayes intercept residuals from the models in which controls for
student characteristics were included. These results are the best comparisons to those from our first round
of analyses, as shown in Table 1 above.
Table 2
Correlations Between Teacher Evaluation Scores and Estimates of Average Student
Achievement Based on Empirical Bayes Intercept Residuals from Models with
Controls for Student Characteristics
                       Tested Subject
Site         Reading    Math    Other
Cincinnati     .28       .34    -.02* (Science)
Vaughn         .61       .45     .38 (Language Arts)
Washoe         .25       .24      -
* Confidence interval includes 0

More detailed results, including results by grade, are reported in Appendix Table 2.
Except for science in Cincinnati, the correlations are all positive and of roughly similar
magnitude to those from our first round of analysis (see Table 1 above). Notable differences include a
much lower correlation for science in Cincinnati, and higher correlations for mathematics and language
arts for Vaughn.
In order to provide an idea of the potential importance of having a more highly-rated teacher, we
also calculated the average estimated change in student achievement associated with a one level change in
teacher evaluation score. Specifically, we calculated the number of standard deviations in test score that
were associated with a change in teacher ratings of one overall level (i.e. from Basic to Proficient or
Proficient to Distinguished on all domains). Table 3 presents these results. This provides a type of effect
size measure that allows comparisons across grades, subjects, and sites.
Table 3
Effect of a One-Level Change in Teacher Evaluation Score on Student Achievement
(in Standard Deviation Units)

                       Tested Subject
Site         Reading    Math    Other
Cincinnati     .14       .18    -.01 (Science)
Vaughn         .25       .37     .21 (Language Arts)
Washoe         .14       .19      -

These effects, though small to moderate in size, could add up to a substantial advantage for a student with
two or three consecutive teachers rated at the ‘distinguished’ rather than ‘proficient’ level, or the
‘proficient’ rather than ‘basic’ level. As we found last year, there are considerable differences across
grades and subjects within each site. (See the Appendices for these details.)
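The effect-size conversion described above amounts to simple arithmetic: the level-2 coefficient (test-score points per evaluation-score point) times the number of score points a one-level change represents, divided by the test's standard deviation. Since the Cincinnati composite sums four domain scores, a one-level change on every domain moves the composite four points. The numerical values below are hypothetical, not taken from the site analyses.

```python
def effect_size_per_level(gamma_eval, points_per_level, sd_test):
    """Convert a level-2 coefficient (test-score points per evaluation-score
    point) into standard deviation units for a one-level rating change."""
    return gamma_eval * points_per_level / sd_test

# Hypothetical inputs: gamma_eval and sd_test are illustrative, not estimates
# from the paper; points_per_level = 4 for a four-domain summed composite.
es = effect_size_per_level(gamma_eval=1.4, points_per_level=4, sd_test=40)
```

Expressing effects this way puts grades, subjects, and sites on a common metric even though the underlying tests have different score scales.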
As mentioned in the Methods section, we also experimented with models that allowed for a
random slope across classrooms for the effect of prior year score. Unlike our first round of analyses, this
year there was consistent evidence for significant slope variation at both Cincinnati and Vaughn.
Generally, however, correlations between EB intercept residuals from these models and teacher
evaluation scores are similar to those based on the random intercept models, and teacher evaluation scores
have little consistent correlation with EB slope residuals. It remains to be explored what variables might
be associated with slope variations across classrooms at these sites. The companion paper by Borman and
Kimball explores the issue of random slopes in Washoe.
Results from Models without Controls for Student Demographic Characteristics
To assess whether controlling for student characteristics affects the relationship between evaluation scores
and student achievement, an analysis was also conducted which did not include the gender, ethnic, special
education, and free/reduced lunch controls at level 1. Table 4 reports the average correlations within each
site between the empirical Bayes intercept residuals from the models that did not control for these student
demographic characteristics, and the total teacher evaluation scores.
Table 4
Correlations Between Teacher Evaluation Scores and Estimates of Average Student Achievement Based on Empirical Bayes Intercept Residuals from Models without Controls for Student Characteristics

                     Tested Subject
Site         Reading   Math    Other
Cincinnati     .27      .31     .02* (Science)
Vaughn         .60      .43     .32  (Language Arts)
Washoe         .26      .25      -

* Confidence interval includes 0
Again, we also calculated the average estimated change in student achievement (number of
standard deviations) associated with a one level change in teacher evaluation score. Table 5 reports these
estimates.
Table 5
Effect of a One Level Change in Teacher Evaluation Score on Student Achievement (in Standard Deviation Units)

                     Tested Subject
Site         Reading   Math    Other
Cincinnati     .16      .20     .02  (Science)
Vaughn         .25      .37     .21  (Language Arts)
Washoe         .16      .21      -

Comparing the results reported in Tables 4 and 5 with those reported for the models with student-level
demographic controls shows that there is little difference between the correlations or achievement effects
estimated with and without these controls. It is likely that the effects of factors such as socioeconomic status are already largely reflected in prior year test scores, so that controlling for these scores eliminates much of the effect of the demographic characteristics on current year test scores.
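The point about prior-year scores absorbing demographic effects can be illustrated with a toy simulation (our construction; the data-generating values are invented): if socioeconomic status influences achievement mainly through a path already reflected in the prior score, adding it to a model that controls for the prior score barely moves the estimates.

```python
# Toy simulation: SES is baked into the prior-year score and has no direct
# effect on the current-year score. Adding SES to a regression that already
# controls for the prior score changes the prior-score slope very little.
import numpy as np

rng = np.random.default_rng(2)
n = 5000
ses = rng.normal(0, 1, n)
prior = 50 + 8 * ses + rng.normal(0, 5, n)    # SES reflected in prior score
post = 5 + 0.9 * prior + rng.normal(0, 5, n)  # no direct SES effect

X1 = np.column_stack([np.ones(n), prior])        # prior score only
X2 = np.column_stack([np.ones(n), prior, ses])   # prior score plus SES
b1 = np.linalg.lstsq(X1, post, rcond=None)[0]
b2 = np.linalg.lstsq(X2, post, rcond=None)[0]
print(abs(b1[1] - b2[1]) < 0.05)  # prior-score slope barely moves: True
```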
Relationship Between Evaluation Scores, Teacher Experience, and Student Achievement

Our first analysis relevant to the question of the relationship of evaluation scores to teacher
experience was simply to correlate the scores with our (admittedly imperfect) measure of teacher
experience. Table 6 reports the correlation coefficients.
Table 6
Correlation of Teacher Evaluation Scores with Teacher Experience

Site                            Correlation
Cincinnati                          .11
Vaughn (average of 5 domains)       .33
Washoe                              .39

Note that in Cincinnati and Vaughn most of the teachers who were evaluated were also low in experience.
In Cincinnati, this is by design; most senior teachers were exempted from comprehensive evaluation after
the 2000-01 school year. At Vaughn, relatively high turnover of senior teachers a few years ago has
required hiring many less experienced teachers. At all sites, there are relatively few experienced teachers
with low scores, which would be expected if low performers tend to leave the organization, or were
terminated in the first few years. In general, plots of evaluation scores versus experience show a pattern in
which evaluation scores increase in the first three to five years then level off. This is what one would
expect if new teachers need a few years of experience to develop proficient performance and if those that
do not do so leave.
We also examined the correlation between experience and student achievement, as measured by
the EB residuals from the random intercept models with demographic controls. Table 7 shows the
correlations.
Table 7
Correlations Between Teacher Experience and Estimates of Average Student Achievement Based on Empirical Bayes Intercept Residuals from Models with Controls for Student Characteristics

                     Tested Subject
Site         Reading   Math    Other
Cincinnati    -.19     -.20    -.02  (Science)
Vaughn         .20      .32     .23  (Language Arts)
Washoe         .15      .16      -

These correlations are lower than the evaluation score – EB residual correlations reported in Table 2. This
suggests that experience, at least as measured here, is not as good a predictor of student performance as
the teacher evaluation scores. Surprisingly, for most grades, the correlations were negative in Cincinnati,
and the combined correlations are also negative. A possible explanation for this result is greater leniency
in the evaluation of highly senior teachers. Because most senior teachers were exempt from
comprehensive evaluation, those who were evaluated had either been identified as having performance
problems by school administrators, or were seeking status as lead teachers. Both situations required a
comprehensive evaluation. In either case, evaluators may have been more lenient with these teachers due
to the stakes attached.
We also tried to assess the impact of teacher experience by including it along with the teacher
evaluation score as a predictor of classroom intercepts at level 2 of our models. We then calculated the
change in student achievement associated with a one level increase in teacher evaluation score,
controlling for teacher experience. Table 8 below reports the number of standard deviations in test scores
that are associated with a change of one level on all the domains (Cincinnati and Washoe) or the subject-specific domain (Vaughn) of the evaluation systems. These are compared with the effects estimated for
evaluation score alone, as reported in Table 3 above.
Table 8
Effect on Student Achievement of a One Level Increase in Teacher Evaluation Score, with and without Controlling for Teacher Experience (Weighted Average across Grades)

                    Cincinnati             Vaughn                 Washoe
                TES      Controlling   TES      Controlling   TES      Controlling
                Alone    for Exper.    Alone    for Exper.    Alone    for Exper.
Reading          .14        .16         .23        .23         .14        .12
Math             .18        .22         .22        .16         .19        .17
Science         -.01       -.00          -          -           -          -
Language Arts     -          -          .21        .18          -          -
As the table shows, the differences in effect are small. The estimated effect of teacher evaluation score
declines in Washoe and at Vaughn, but increases in Cincinnati, as might be expected given the positive
correlations between student achievement and experience at the former two sites and the possible leniency
toward senior teachers in Cincinnati. These results suggest that teacher evaluation scores are measuring
something besides teacher experience, and that scores from standards-based teacher evaluation systems
can explain variance in teachers’ average student achievement above and beyond that explained by
teacher experience.
Discussion

The results reported above show that the scores produced by these standards-based teacher
evaluation systems have a substantial positive relationship with the achievement of the evaluated
teachers’ students. The results are comparable to those obtained from similar analyses we presented at
last year’s AERA meeting. Though there were some notable differences (i.e. higher correlations between
evaluation scores and student achievement in mathematics and language arts at Vaughn, and a near zero
correlation for science achievement in Cincinnati) we consider the results to represent a constructive
replication of our earlier results. So with respect to our first research question, there is the expected
relationship between student achievement and teacher evaluation scores, and this relationship is largely
similar to that found in our previous study. Together, these studies suggest that evaluation scores from
well-designed and implemented standards-based teacher evaluation systems can be used for decisions
about teachers, and have potential for use in studies of teacher effects on student achievement.
It should be noted, however, that the strength of the relationship again differs this year across subjects, and across grades within the Cincinnati and Washoe sites. For Cincinnati especially,
the variation in the strength of the evaluation score-student achievement relationships across subjects and
grades is considerable. Though sampling and measurement error are undoubtedly responsible for some of
this, the relationship for science is again the weakest. Last year, the correlations for fourth grade reading
were also small. In Washoe, no similar pattern has yet emerged.
Our second research question was whether omitting controls for student characteristics associated with test scores, such as ethnicity, special education status, and English proficiency, changes the relationship between teacher evaluation scores and student achievement. Our results showed only small
differences in the evaluation score – student achievement relationships. While the correlations and effect
estimates were typically slightly lower without these controls, in a few cases they were slightly higher.
Our findings last year were largely similar. At least at these sites, it does not appear to make a big
difference overall whether these student characteristics are included as controls or not, probably because
most of their effects are already included in the prior year test scores used as controls. This is not to say
that models intended to estimate teacher effects on student achievement shouldn’t include student
characteristics outside teachers’ control. But it does suggest that the relationship between teacher
evaluation scores and student achievement does not depend on taking into account these student
characteristics. So this relationship does not need to be discounted by those who object, for substantive or
symbolic reasons, to taking student background characteristics into account in evaluating teacher
performance. A related question, however, deserves more analysis: do teachers whose students are
primarily low income, non-white, or limited in English proficiency have less opportunity to teach in ways
that allow them to earn high evaluation scores? Those who have studied disproportionate certification
rates for non-white teachers by the National Board for Professional Teaching Standards have raised this
possibility with respect to the National Board Standards (Bond, 1998). We intend to investigate this issue
by analyzing teachers’ evaluation scores as a function of classroom and school demographic composition
and prior achievement. This will give an indication of whether evaluation scores are lower for teachers of
low income, non-white, etc. students.
Our third research question involved the relationship between teacher evaluation scores and
teacher experience, and whether controlling for experience changes the relationship between evaluation
scores and student achievement. With respect to the first issue, there was, as might be expected, a
relationship between rated performance and experience, but the relationship weakens as experience
increases past 5 or so years. That beginning teachers are on the average weaker performers should be
expected, both because experience is needed to apply the lessons of teacher preparation, and because
poorer performers are more likely to leave over time. One can interpret this finding to support the
construct validity of the teacher evaluation scores as measures of teacher performance because they have
the expected relationship with experience: stronger at lower and weaker at higher experience levels.
With respect to the second part of the question, controlling for teacher experience reduced the
strength of the evaluation score – student achievement relationship slightly at Vaughn and in Washoe, but
strengthened it slightly in Cincinnati. This is evidence that the teacher evaluation scores are measuring
something different than experience. They add to the explanation of student achievement over and above
teacher experience. In our analyses, teacher position on the step part of the salary schedule had a weaker
relationship to student achievement than evaluation scores. This suggests that pay for performance could
be a more appropriate basis for rewarding teachers for their contribution to student achievement than systems
based on seniority, as argued by Odden and Kelley (2002).
Limitations and Directions for Future Research
As in our previous research, a substantial number of students in Cincinnati and Washoe could not
be included in the analyses because of missing test information (especially prior year test scores),
movement between schools during the year, and difficulties linking students to teachers. While we doubt
that the overall results would have changed very much, it would be reassuring to have been able to
include a higher proportion of the enrolled students. Future work will explore the use of propensity scores
and missing data imputation to estimate the possible bias in our current results from excluding students
with missing data.
Another limitation is that all of our analyses have involved teachers and students at the
elementary and (in Cincinnati) middle school levels. This limitation is due to the lack of testing in all
grades at the high school level in Cincinnati and Washoe, and difficulties in linking teachers with students
for specific subjects after sixth grade in Washoe. Thus we do not know if scores from these evaluation
systems for high school teachers also predict average student achievement. This is an important issue
because in our interviews with teachers, some at the high school level told us that the evaluation standards
were oriented to the elementary and middle school levels.
As mentioned briefly above, we also experimented with models that included random slopes for
the effect of prior year student achievement on current year achievement. Unlike the results we reported
last year, this year there appeared to be more consistent evidence for differences in slopes across
classrooms. Though our preliminary results (not reported here due to time and space limitations)
suggest that including random slopes does not alter the correlations between teacher evaluation scores and
EB intercept residuals (representing average classroom student achievement) very much, it does
complicate the question of the relationship between teacher evaluation scores and student achievement. If
all classrooms have similar relationships between current and prior year test scores, the correlation
between EB intercept residuals and evaluation scores tells the essential story about how the evaluation
scores relate to student achievement (controlling for prior achievement). But if the relationship between
current and prior student achievement varies with the level of prior achievement, it is possible that
teachers are differentially effective for those with low versus high prior achievement. The average level
of classroom achievement may no longer be a sufficient criterion for assessing the validity of the teacher
evaluation scores. If one teacher’s classroom has high average achievement (after controlling for prior
achievement) but students with lower prior achievement do worse than similar students of a teacher
whose classroom has lower average achievement, it is simplistic to conclude that the former is the ‘better’
teacher. Random slopes for ethnicity and free/reduced price lunch status raise a similar concern.
Preliminary analyses of the relationship between classroom slopes and teacher evaluation scores have not
shown a consistent relationship. The issue is addressed further using data from Washoe in the companion
paper by Borman and Kimball. We also hope to do more work on this issue using data from Cincinnati
and Vaughn.
Lastly, it would be useful to replicate these results again. We hope to collect one more year of
evaluation and student achievement data for Cincinnati and Washoe, and two more for Vaughn. Since
conditions at each of the sites are constantly changing, the persistence of the evaluation scores – student
achievement relationship cannot be taken for granted. Not only are minor modifications made in the
evaluation systems, but new evaluators are introduced and the population of teachers evaluated changes.
Since evaluators are an important part of these relatively high inference performance assessment systems,
including more evaluators by analyzing more years of data helps to provide a more generalizable estimate
of the evaluation score-student achievement relationship. We are also studying the ‘validity’ of specific
evaluators’ scores, and the process by which evaluators make rating decisions. We hope this work will
contribute to a better understanding of standards-based teacher evaluation as it plays out in the field.
References

Bond, L. (1998). Disparate impact and teacher certification. Journal of Personnel Evaluation in Education, 12(2), 211-220.

Danielson, C. (1996). Enhancing Professional Practice: A Framework for Teaching. Alexandria, VA: Association for Supervision and Curriculum Development.

Danielson, C., and McGreal, T. L. (2000). Teacher Evaluation to Enhance Professional Practice. Alexandria, VA: Association for Supervision and Curriculum Development.

Gallagher, H. A. (2004). Vaughn Elementary’s innovative teacher evaluation system: Are teacher evaluation scores related to growth in student achievement? To be published in the Peabody Journal of Education, Spring 2004.

Hunter, J. E., and Schmidt, F. L. (1990). Methods of Meta-Analysis: Correcting Error and Bias in Research Findings. Newbury Park, CA: Sage.

Kimball, S. M. (2002). Analysis of feedback, enabling conditions and fairness perceptions of teachers in three school districts with new standards-based evaluation systems. Journal of Personnel Evaluation in Education, 16(4), 241-268.

Kimball, S. M., Milanowski, A. T., and Heneman, H. G. III. (2003, November). Research results and formative recommendations from the study of the Washoe County teacher performance evaluation system. Paper presented at the American Evaluation Association Seventeenth Annual Meeting, Sparks, Nevada.

Kimball, S. M., White, B., Milanowski, A. T., and Borman, G. (2004). Examining the relationship between teacher evaluation and student assessment results in Washoe County. To be published in the Peabody Journal of Education, Spring 2004.

King Rice, J. (2003). Teacher Quality: Understanding the Effectiveness of Teacher Attributes. Washington, DC: Economic Policy Institute.

Milanowski, A. T. (forthcoming). The relationship between teacher performance evaluation scores and student achievement: Evidence from Cincinnati. To be published in the Peabody Journal of Education, Spring 2004.

Milanowski, A. T., and Heneman, H. G. III. (2001). Assessment of teacher reactions to a standards-based teacher evaluation system: A pilot study. Journal of Personnel Evaluation in Education, 15(3), 193-212.

Milanowski, A. T., and Kimball, S. (2003, April). The framework-based teacher performance assessment systems in Cincinnati and Washoe. Paper presented at the Annual Meeting of the American Educational Research Association, Chicago, IL.

Moore Johnson, S. (1990). Teachers at Work: Achieving Success in Our Schools. New York: Basic Books.

Odden, A., Borman, G., and Fermanich, M. (2004). Assessing teacher, classroom, and school effects, including fiscal effects. To be published in the Peabody Journal of Education, Spring 2004.

Odden, A., and Kelley, C. (2002). Paying Teachers for What They Know and Do: New and Smarter Compensation Strategies to Improve Schools (2nd ed.). Thousand Oaks, CA: Corwin Press.

Peterson, K. D. (2000). Teacher Evaluation: A Comprehensive Guide to New Directions and Practice (2nd ed.). Thousand Oaks, CA: Corwin Press.

Porter, A. C., Youngs, P., and Odden, A. (2001). Advances in teacher assessments and their uses. In V. Richardson (Ed.), Handbook of Research on Teaching (4th ed., pp. 259-297). Washington, DC: American Educational Research Association.

Stiggins, R. J., and Duke, D. (1988). The Case for Commitment to Teacher Growth: Research on Teacher Evaluation. Albany, NY: State University of New York Press.
Appendix 1 – Cincinnati Details

Model Used to Generate Empirical Bayes Intercept Residuals Used in Analyses

At level 1, the test score in the year evaluated was modeled as a function of the same-subject test score from the prior year, the number of days the student was enrolled at the school where testing took place, and dummy variables for gender, limited English proficiency, receipt of free/reduced price lunch, special education status, and African-American ethnicity. At level 2, a random intercept was specified, but no predictors were included. All slopes were considered fixed.

Measures of Student Achievement

The measures of student achievement used are shown in Table 1. Because no science test was given in third grade in 2001-02, both the prior year reading and mathematics tests given in grade three were used as controls for prior student achievement in the fourth grade science analysis.

Table 1
Student Achievement Measures by Grade

Grade   2002-03 Test             2001-02 Test (Control for Prior Achievement)
3       District Test            District Test (reading and math for science)
4       State Proficiency Test   District Test
5       TerraNova                State Proficiency Test
6       State Proficiency Test   District Test
7       District Test            State Proficiency Test
8       TerraNova                District Test

Number of Students in Analyses

As was the case in last year’s analyses, a considerable number of students on the district roster in the Spring of 2003 could not be included due to missing data. Most students were lost from the analyses because there was no test data from the prior school year. A considerable number were lost because they changed schools mid-year. As was done last year, students not enrolled for at least 71 days in the school where they were tested were dropped from the analyses. A small number of students were dropped because they could not be matched to a teacher for the subjects tested. A few students (no more than 10-15 per grade/subject combination) were excluded because their scores on the current year test or the prior year test were outliers. Table 2 below shows the number of students in the district’s schools by grade and subject, the number tested in 2002-03, and the number used in the analyses.
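The exclusion rules described above can be sketched in pandas. This is our illustration, not the study's code; the column names and the 3-standard-deviation outlier screen are hypothetical.

```python
# Sketch of the three exclusion rules: drop students without prior-year
# scores, students enrolled fewer than 71 days at the testing school, and
# extreme outliers (here, a crude |z| > 3 screen; the paper does not
# specify the outlier rule used).
import pandas as pd

def apply_exclusions(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna(subset=["prior_score"])   # no prior-year test score
    df = df[df["days_enrolled"] >= 71]       # mid-year movers
    for col in ["score", "prior_score"]:     # crude outlier screen
        z = (df[col] - df[col].mean()) / df[col].std()
        df = df[z.abs() <= 3]
    return df

demo = pd.DataFrame({
    "prior_score":   [50.0, None, 55.0],
    "score":         [52.0, 60.0, 58.0],
    "days_enrolled": [100,  120,  90],
})
print(len(apply_exclusions(demo)))  # 2: the student with no prior score is dropped
```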
Table 2
Number of Students on Roster, Tested, and Whose Scores Were Used in Analyses – Cincinnati

                               3rd     4th     5th     6th     7th     8th
Total Number of Students on
CPS Roster, Spring 2003       3,556   3,353   3,215   3,337   3,449   3,054

Number Tested in 02-03
  Reading                     3,374   3,270   3,171   3,197   2,953   2,629
  Math                          -     3,268   3,164   3,200   2,929   2,611
  Science                       -     3,265   3,150   3,191   2,919   2,601

Number Used in HLM Models
  Reading                     2,827   2,506   2,364   2,373   2,129   1,975
  Math                          -     2,410   2,386   2,410   2,102   1,951
  Science                       -     2,394   2,400   2,397   2,044   1,994

Analyses Results by Grade and Subject

Table 3 shows the correlations between the EB intercept residuals and the total teacher evaluation score by grade and subject, the combined correlations across grades by subject, the standard errors of the combined correlations, and the confidence intervals. These correlations were calculated using the EB intercept residuals of those teachers with three or more students tested. As mentioned in the main body of the paper, there is considerable variation in the size and even the sign of the correlations. No results for grade 3 in science and math are shown because no tests in those subjects were given in grade 2 in the prior year.

Table 3
Correlations Between Empirical Bayes Intercept Residuals and Total TES Score, by Grade and Subject (Number of teachers in parentheses)

Grade                     Reading     Math        Science
3                         .35 (35)      -            -
4                         .07 (24)   .32 (19)     .12 (20)
5                         .23 (24)   .09 (22)     .13 (20)
6                         .35 (24)   .48 (22)    -.15 (20)
7                         .42 (15)   .25 (14)    -.29 (11)
8                         .27 (12)   .74 (8)     -.32 (5)
Combined                    .28        .34         -.02
Standard Error              .11        .12          .12
95% Confidence Interval  .11 - .44  .12 - .53   -.09 - .46
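A combined correlation of the kind shown in Table 3 can be computed as a sample-size-weighted average of the per-grade correlations, in the spirit of Hunter and Schmidt (1990); whether the authors used exactly this estimator is an assumption on our part. Using the reading correlations and teacher counts above:

```python
# Sample-size-weighted average of the per-grade reading correlations from
# Table 3 (grades 3-8). This reproduces the combined value of .28.
rs = [0.35, 0.07, 0.23, 0.35, 0.42, 0.27]   # per-grade correlations
ns = [35, 24, 24, 24, 15, 12]               # teachers per grade
r_bar = sum(r * n for r, n in zip(rs, ns)) / sum(ns)
print(round(r_bar, 2))  # 0.28
```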
Table 4
Effect of a One Level Change in Teacher Evaluation Score(a) on Student Achievement, by Grade and Subject, in Standard Deviation Units

Grade              Reading   Math   Science
3                    .24       -       -
4                    .04      .37     .09
5                    .12      .06     .09
6                    .14      .20    -.05
7                    .12      .07    -.24
8                    .08      .21    -.13
Weighted Average     .14      .18    -.01

(a) For example, from ‘proficient’ to ‘distinguished’ on all four domains

Model Parameter Estimates

A complete set of results from the HLM analyses of the data, including model coefficients and standard errors, is available from the authors on request.
Appendix 2 – Vaughn Details

Site-specific Two-Level Models

At level 1, the basic models used in the analyses of the data from this site had the reading, mathematics, or language arts test score of the student in the year the teacher was evaluated as the outcome, with the corresponding prior year test score and dummy variables for grade level (with grade 2 as the left-out category) as predictors. Models including student demographic characteristics added days of attendance and dummy variables for gender, special education status, retention in grade from the prior year, beginning English proficiency status, fully proficient English status (so the left-out category for proficiency was partial proficiency), and whether the parent had finished high school. Following the initial analysis of Vaughn data done by Gallagher (2004), missing values for the attendance and parental education variables were imputed based on school means. Dummy variables were added to indicate imputation, though none of these had coefficients that were statistically significant at conventional levels. Variables for free/reduced price lunch and ethnicity were not included since all students were eligible for the former and the vast majority of students are of Hispanic ethnicity.

At level 2, all models specified a random intercept and fixed slopes for all level 1 predictors. The models used to obtain the empirical Bayes residuals representing classroom average student achievement had no predictors at level 2. Other models included evaluation scores and years of tenure at Vaughn as predictors of the random intercepts.

Student Achievement

At Vaughn, the Stanford 9 test was the measure of student achievement used in all grades in both years. Because of this uniformity, and because this test is supposed to be scaled so that scores from one grade are comparable with those from another, we combined grades in the Vaughn models and included grade-level dummy variables, rather than estimating separate models by grade.
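The school-mean imputation with missingness indicators described above can be sketched in pandas. This is our illustration, not the authors' code; the column names are hypothetical.

```python
# Sketch of school-mean imputation: missing attendance values are replaced
# by the school's mean, and a dummy variable flags which rows were imputed
# (so the model can absorb any systematic difference for imputed cases).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "school": ["A", "A", "B", "B"],
    "attendance": [170.0, np.nan, 160.0, 165.0],
})
df["imputed_attendance"] = df["attendance"].isna().astype(int)  # indicator dummy
df["attendance"] = df.groupby("school")["attendance"].transform(
    lambda s: s.fillna(s.mean()))
print(df["attendance"].tolist())  # [170.0, 170.0, 160.0, 165.0]
```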
Number of Students in the Analyses

At Vaughn, there was proportionately less loss of students due to missing data than at the other sites, though some students could not be included in the analyses, mainly because of missing prior year test scores. Table 1 shows the number of students on the roster in grades 2-5, and the numbers tested and used in the analyses, by subject.

Table 1
Number of Students on Roster, Tested, and Whose Scores Were Used in Analyses – Vaughn

                                 Reading   Math   Language Arts
Total Number of Students on
Roster, Spring 2003                580      580       580
Number Tested in 02-03             561      570       578
Number Used in HLM Models          537      549       541
Parameter Estimates for Random Intercept Model Including Student Demographic Characteristics and Subject-Specific Domain Evaluation Scores as Level 2 Predictors of Random Intercepts

Table 2
Model Coefficients and Standard Errors – Random Intercept with Controls for Student Characteristics

                             Reading             Math            Language Arts
Level 1 Variables        Coeff.  Std. Err.  Coeff.  Std. Err.  Coeff.  Std. Err.
Intercept                568.13     -       564.69     -       582.65     -
Prior Year Score           0.71   0.03        0.68   0.03        0.56   0.03
Special Ed Status (=1)    -2.49   5.46        1.70   6.39       -0.11   5.47
Gender (F=1)               2.46   1.60        0.39   1.76        0.77   1.88
Retained (=1)             -2.09   5.62        2.42   6.01        6.89   6.13
Grade 3 (=1)              -3.57   2.66      -19.12   5.88       -6.02   4.35
Grade 4 (=1)               6.52   2.90       -7.04   6.10       15.68   4.43
Grade 5 (=1)              -3.61   3.31      -11.20   6.51        4.91   4.72
Beginning English (=1)    -7.25   2.54       -4.91   3.02      -11.08   3.14
Proficient English (=1)   -0.11   3.42        1.86   3.87        3.84   3.84
Attendance (Days)         -0.01   0.12        0.04   0.13        0.27   0.14
Father HS Grad (=1)        0.60   1.74       -2.52   1.96       -1.28   2.09
Imputed Attendance        -4.44   4.33       -3.74   5.13       -3.25   5.33
Imputed Father Educ.       2.18   3.58        7.66   5.42       12.99   5.03
Level 2
Subject-Specific
Evaluation Score          10.23   2.22       14.10   2.93        7.78   3.28

Note: All Level 1 variables were grand mean centered. A complete set of results from the other models, including coefficients and standard errors, is available from the authors on request.
Appendix 3 – Washoe Details

Washoe Two-Level Model

At level 1, the models included the 2002-03 test results regressed on the prior year test results in the same subject, and dummy variables representing student gender, non-white ethnicity, special education status, and qualification for free and reduced price lunch. At level 2, all models specified a random intercept and fixed slopes for each level 1 predictor. The models for obtaining the empirical Bayes residuals used to represent classroom average student achievement had no predictors at level 2. To predict the average intercept in other level 2 models, variables were added for the teacher evaluation composite measure and teaching experience, measured as the step designation on the district salary schedule.

Measures of Student Achievement

The student achievement measures used in the analysis are shown in Table 1 below. The tests included criterion-referenced tests developed by the district for grades 4 and 6, state criterion-referenced tests developed by Harcourt Brace for grades 3 and 5, and the TerraNova norm-referenced exam (Comprehensive Test of Basic Skills) administered in grade 4. The exam sequence is intended to assess student performance relative to state and district academic standards. Results from these assessments for mathematics and reading were used in the analyses.

Table 1
Student Achievement Measures by Grade

Grade   2002-03 Test                         2001-02 Test
4       District Criterion Referenced Test   State Proficiency Test
5       State Proficiency Test               TerraNova
6       District Criterion Referenced Test   State Proficiency Test
Number of Students in Analyses

A large number of students who were tested on the outcome measures could not be included in the study. Most of the missing students were lost because their teachers were not assessed on the evaluation measure used in the analyses. Other students were lost because they did not have pre-test results or because their teacher could not be identified from the district student database made available at the time of this study. Table 2 shows the number of students tested and the number included in the models at each grade level.

Table 2
Number of Students on Roster, Tested, and Whose Scores Were Used in Analyses – Washoe

                               4th     5th     6th
Total Number of Students on
Roster, Spring 2003           4,764   4,910   4,914

Number Tested in 02-03
  Reading                     4,500   4,474   4,520
  Math                        4,539   4,484   4,533

Number Used in HLM Models*
  Reading                     2,527   2,176   2,632
  Math                        2,527   2,176   2,632

* Listwise exclusion was used in creating the models.

Table 3
Correlations Between Empirical Bayes Intercept Residuals and Total TES Score, by Grade and Subject (Number of teachers in parentheses)

Grade                     Reading     Math
4                         .29 (131)   .22 (131)
5                         .09 (135)   .32 (135)
6                         .27 (131)   .17 (131)
Combined                    .25         .24
Standard Error              .05         .05
95% Confidence Interval  .15 - .34   .14 - .33

Table 4
Effect of a One Level Change in Teacher Evaluation Score(a) on Student Achievement, by Grade and Subject

Grade              Reading   Math
4                    .16      .17
5                    .11      .24
6                    .22      .21
Weighted Average     .16      .21

(a) For example, from ‘proficient’ to ‘distinguished’ on all four performance components

Model Parameter Estimates
Table 5 provides results from the full random intercept model including student pretest and demographic characteristics at level 1, and the teacher evaluation score composite and experience measure as level 2 predictors of the random intercepts (average classroom achievement).

Table 5
Grade 4-6 Model Coefficients and Standard Errors – Random Intercept with Controls for Student Characteristics (Standard Errors in Parentheses)

                               Grade Four                  Grade Five                 Grade Six
                           Reading       Math         Reading        Math         Reading      Math
Level 1 Variables
Intercept                25.23 (.87)  27.36 (1.03)  292.30 (5.88)  275.70 (9.49)  23.81 (.93)  21.33 (2.0)
Pretest                    .06 (.00)    .05 (.00)     1.13 (.03)     1.28 (.03)    .05 (.00)    .05 (.00)
Free/reduced lunch (=1)   -.43 (.23)   -.40 (.27)    -6.69 (1.86)   -1.43 (2.29)  -1.03 (.23)  -.46 (.24)
Special education (=1)   -3.21 (.39)  -2.77 (.36)   -11.10 (4.00)  -14.18 (3.77)  -3.16 (.45)  -3.86 (.42)
Gender (F=1)               .15 (.18)   -.26 (.18)     4.02 (1.61)    -.70 (1.62)    .13 (.16)   -.40 (.19)
Ethnicity (non-white=1)  -1.21 (.21)   -.90 (.22)    -3.35 (1.92)   -7.52 (1.86)  -1.47 (.23)   -.82 (.19)
Level 2 Variables
Teacher Evaluation
  Composite               1.08 (.35)    .94 (.45)     5.16 (2.53)   13.68 (3.96)    .86 (.37)    .97 (.82)
Teacher Experience         .01 (.02)    .03 (.03)      .07 (.21)      .04 (.29)     .03 (.02)    .06 (.05)

Note: All Level 1 variables were grand mean centered. A complete set of results from the other models, including coefficients and standard errors, is available from the authors on request.