
Journal of Development Economics 99 (2012) 486–496


The impact of teacher subject knowledge on student achievement: Evidence from within-teacher within-student variation☆

Johannes Metzler a, Ludger Woessmann b,⁎
a University of Munich and Monitor Group, Germany
b University of Munich and Ifo Institute for Economic Research, CESifo and IZA, Poschingerstr. 5, 81679 Munich, Germany

☆ We thank Sandy Black, Jere Behrman, David Card, Paul Glewwe, Eric Hanushek, Patrick Kline, Michael Kremer, Javier Luque, Lant Pritchett, Martin West, the editor Duncan Thomas, two anonymous referees, and seminar participants at UC Berkeley, Harvard University, the University of Munich, the Ifo Institute, the Glasgow congress of the European Economic Association, the CESifo Area Conference on Economics of Education, the IZA conference on Cognitive and Non-Cognitive Skills, and the Frankfurt conference of the German Economic Association for helpful discussion and comments. Special gratitude goes to Andrés Burga León of the Ministerio de Educación del Perú for providing us with reliability statistics for the teacher tests. Woessmann is thankful for the support and hospitality by the W. Glenn Campbell and Rita Ricardo-Campbell National Fellowship of the Hoover Institution, Stanford University. Support by the Pact for Research and Innovation of the Leibniz Association and the Regional Bureau for Latin America and the Caribbean of the United Nations Development Programme are also gratefully acknowledged.
⁎ Corresponding author.

E-mail addresses: [email protected] (J. Metzler), [email protected] (L. Woessmann).

URL: http://www.cesifo.de/woessmann (L. Woessmann).

0304-3878/$ – see front matter © 2012 Elsevier B.V. All rights reserved.
doi:10.1016/j.jdeveco.2012.06.002

Article info

Article history: Received 18 October 2011; received in revised form 8 June 2012; accepted 9 June 2012

JEL classification: I20, O15

Keywords: Teacher knowledge; Student achievement; Peru

Abstract

Teachers differ greatly in how much they teach their students, but little is known about which teacher attributes account for this. We estimate the causal effect of teacher subject knowledge on student achievement using within-teacher within-student variation, exploiting a unique Peruvian 6th-grade dataset that tested both students and their teachers in two subjects. Observing teachers teaching both subjects in one-classroom-per-grade schools, we circumvent omitted-variable and selection biases using a correlated random effects model that identifies from differences between the two subjects. After measurement-error correction, one standard deviation in subject-specific teacher achievement increases student achievement by about 9% of a standard deviation in math. Effects in reading are significantly smaller and mostly not significantly different from zero. Effects also depend on the teacher-student match in ability and gender.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

One of the biggest puzzles in educational production today is the teacher quality puzzle: while there is clear evidence that teacher quality is a key determinant of student learning, little is known about which specific observable characteristics of teachers can account for this impact (e.g., Hanushek and Rivkin, 2010; Rivkin et al., 2005; Rockoff, 2004). In particular, there is little evidence that those characteristics most often used in hiring and salary decisions, namely teachers' education and experience, are crucial for teacher quality. Virtually the only attribute that has been shown to be more frequently significantly correlated with student achievement is teachers' academic skills measured by scores on achievement tests (cf. Eide et al., 2004; Hanushek and Rivkin, 2006; Wayne and Youngs, 2003). The problem with the latter evidence, however, is that issues of omitted variables and non-random selection are very hard to address when estimating causal effects of teacher characteristics.1 In this paper, we estimate the impact of teachers' academic skills on student achievement, circumventing problems from unobserved teacher traits and non-random sorting of students to teachers.

We use a unique primary-school dataset from Peru that contains test scores in two academic subjects not only for each student, but also for each teacher. This allows us to identify the impact of teachers' academic performance in a specific subject on students' academic performance in the subject, while at the same time holding constant any student characteristics and any teacher characteristics that do not differ across subjects. We can observe whether the same student taught by the same teacher in two different academic subjects performs better in one of the subjects if the teacher's knowledge is relatively better in this subject. We develop a correlated random effects model that generalizes conventional fixed-effects models with fixed effects for students, teachers, and subjects. Our model identifies the effect based on within-teacher within-student variation and allows us to test the overidentification restrictions implicit in the fixed-effects models.

1 Recently, Lavy (2011) estimates the effect of teaching practices on student achievement.


We can additionally restrict our analysis to small, mostly rural schools with at most one teacher per grade, excluding any remaining possibility that parents may have chosen a specific teacher for their students in both subjects based on the teacher's specific knowledge in one subject, or of bias from non-random classroom assignments more generally (cf. Rothstein, 2010).

We find that teacher subject knowledge exerts a statistically and quantitatively significant impact on student achievement. Once corrected for measurement error in teacher test scores, our results suggest that a one standard deviation increase in teacher test scores raises student test scores by about 9% of a standard deviation in math. Results in reading are significantly smaller and indistinguishable from zero in most specifications. We also find evidence that the effect depends on the ability and gender match between teachers and students, in that above-median students are not affected by variation in teacher subject knowledge when taught by below-median teachers and that female students are not affected by variation in teacher subject knowledge when taught by male teachers. The average math result means that if a student was moved, for example, from a teacher at the 5th percentile of the distribution of teacher math knowledge to a teacher at the median of this distribution, the student's math achievement would increase by 14.3% of a standard deviation by the end of the school year. Thus, teacher subject knowledge is a relevant observable factor that is part of overall teacher quality.
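The back-of-the-envelope arithmetic behind this 14.3% figure can be reproduced directly: assuming teacher knowledge is approximately normally distributed, the gap between the 5th percentile and the median is about 1.645 teacher standard deviations, which multiplied by an effect size of roughly 0.087 (a hypothetical point estimate close to the "about 9%" figure, chosen here for illustration) gives about 14.3% of a student standard deviation.

```python
from scipy.stats import norm

# Distance from the 5th percentile to the median of a standard normal
# distribution, in standard-deviation units.
gap_sd = norm.ppf(0.50) - norm.ppf(0.05)  # ≈ 1.645

# Hypothetical effect size: student SDs gained per teacher SD
# (close to the ~9% math estimate reported in the paper).
beta = 0.087

print(round(gap_sd, 3))          # ≈ 1.645
print(round(beta * gap_sd, 3))   # ≈ 0.143, i.e. 14.3% of a student SD
```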

Methodologically, our identification strategy for teacher effects extends the within-student comparisons in two different subjects proposed by Dee (2005, 2007) as a way of holding student characteristics constant.2 Because the teacher characteristics analyzed by Dee —

gender, race, and ethnicity — do not vary within teachers, that approach uses variation across different teachers, so that the identifying variation may still be affected by selection and unobserved teacher characteristics. By contrast, our approach uses variation within individual teachers, a feature made possible by the fact that subject knowledge does vary within teachers. In our correlated random effects model, we actually reject the overidentification restriction implied in the fixed-effects model that teacher knowledge effects are the same across subjects. While in addition to teacher subject knowledge per se, the identified estimate may also capture the effect of other features correlated with teacher subject knowledge such as passion for the subject or experience in teaching the subject curriculum, robustness tests confirm that results are not driven by observed between-subject differences in instruction time, teaching methods, and student motivation.

The vast literature on education production functions hints at teacher knowledge as one — possibly the only — factor quite consistently associated with growth in student achievement.3 In his early review of the literature, Hanushek (1986, p. 1164) concluded that “the closest thing to a consistent finding among the studies is that ‘smarter’ teachers, ones who perform well on verbal ability tests, do better in the classroom” — although even on this account, the existing evidence is not overwhelming.4 Important studies estimating the association of teacher test scores with student achievement gains in the

2 Further examples using this between-subject method to identify effects of specific teacher attributes such as gender, ethnicity, credentials, teaching methods, instruction time, and class size include Altinok and Kingdon (2012), Ammermüller and Dolton (2006), Clotfelter et al. (2010), Dee and West (2011), Lavy (2010), and Schwerdt and Wuppermann (2011).

3 The Coleman et al. (1966) report, which initiated this literature, first found that verbal skills of teachers were associated with better student achievement.

4 A decade later, based on 41 estimates of teacher test-score effects, Hanushek (1997, p. 144) finds that “of all the explicit measures [of teachers and schools] that lend themselves to tabulation, stronger teacher test scores are most consistently related to higher student achievement.” Eide et al. (2004, p. 233) suggest that, compared to more standard measures of teacher attributes, “a stronger case can be made for measures of teachers' academic performance or skill as predictors of teachers' effectiveness.” See also Hanushek and Rivkin (2006).

United States include Ehrenberg and Brewer (1995), Ferguson (1998), Ferguson and Ladd (1996), Hanushek (1971, 1992), Murnane and Phillips (1981), Rockoff et al. (2011), Rowan et al. (1997), and Summers and Wolfe (1977). Examples of education production function estimates with teacher test scores in developing countries include Harbison and Hanushek (1992) in rural Northeast Brazil, Tan et al. (1997) in the Philippines, Bedi and Marshall (2002) in Honduras, and Behrman et al. (2008) in rural Pakistan.5

However, the existing evidence is still likely to suffer from bias due to unobserved student characteristics, omitted school and teacher variables, and non-random sorting and selection into classrooms and schools (cf. Glewwe and Kremer, 2006). Obvious examples of such bias include incidents where better-motivated teachers incite more learning and also accrue more subject knowledge; where parents with a strong preference for educational achievement choose schools or classrooms with better teachers and also further their children's learning in other ways; and where principals assign students with higher learning gains to better teachers. Estimates of teacher effects that convincingly overcome such biases are so far missing in the literature.6

An additional problem when estimating teacher effects is measurement error, which is likely to attenuate estimated effects (cf. Glewwe and Kremer, 2006, p. 989). The available measure may proxy only poorly for the concept of teachers' subject knowledge, as in most existing studies, the examined skill is not subject-specific. Furthermore, any specific test will measure teacher subject knowledge only with considerable noise. In this paper, we have access to separate tests of teachers' subject-specific knowledge in math and reading that are each generated from batteries of test questions using psychometric modeling. These provide consistent estimates of the internal reliability metric of the tests, which we use to adjust for biases due to measurement error based on classical measurement-error theory.

Understanding the causes of student achievement is important not least because of its substantial economic importance. Differences in student achievement tests can account for an important part of the difference in long-run economic growth across countries, in particular between Latin America and other world regions as well as within Latin America (Hanushek and Woessmann, 2011, 2012). Peru has been doing comparatively well in terms of school attendance, but not in terms of educational quality as measured by achievement tests (cf. Luque, 2008; World Bank, 2007). Its average performance on international student achievement tests is dismal compared to developed countries (cf. OECD, 2003) and just below the average of other Latin American countries. Of the 16 countries participating in a Latin American comparative study in 2006, Peruvian 6th-graders ranked 9 and 10, respectively, in math and reading (LLECE, 2008).

In what follows, Section 2 derives our econometric identification strategy. Section 3 describes the dataset and descriptive statistics. Section 4 reports our results and robustness tests.

2. Empirical identification

2.1. The correlated random effects model

We consider an education production function with an explicit focus on teacher skills:

y_{i1} = \beta_1 T_{t1} + \gamma U_{t1} + \alpha Z_i + \delta X_{i1} + \mu_i + \tau_{t1} + \varepsilon_{i1}   (1a)
y_{i2} = \beta_2 T_{t2} + \gamma U_{t2} + \alpha Z_i + \delta X_{i2} + \mu_i + \tau_{t2} + \varepsilon_{i2}   (1b)

5 See Behrman (2010) and Glewwe and Kremer (2006) for recent reviews of the literature on education production functions in developing countries.

6 Note that because of the association of teacher subject knowledge with other teacher traits, even an experimental setup that randomly assigns teachers to students would not be able to identify the effect of teacher subject knowledge.


where y_{i1} and y_{i2} are test scores of student i in subjects 1 and 2 (math and reading in our application below), respectively. Teachers t are characterized by subject-specific teacher knowledge T_{tj} and non-subject-specific teacher characteristics U_t such as pedagogical skills and motivation. (The latter differ across the two equations only if the student is taught by different teachers in the two subjects.) Additional factors are non-subject-specific (Z_i) and subject-specific (X_{ij}) characteristics of students and schools. The error term is composed of a student-specific component μ_i, a teacher-specific component τ_t, and a subject-specific component ε_{ij}.

The coefficient vectors β1, β2, and γ characterize the impact of all subject-specific and non-subject-specific teacher characteristics that constitute the overall teacher quality effect as estimated by value-added studies (cf. Hanushek and Rivkin, 2010). As indicated earlier, several sources of endogeneity bias, stemming from unobserved student and teacher characteristics or from non-random sorting of students to schools and teachers, are likely to hamper identification of the effect of teacher subject knowledge in Eqs. (1a) and (1b).

Following Chamberlain (1982), we model the potential correlation of the unobserved student effect μ_i with the observed inputs in a general setup:

\mu_i = \eta_1 T_{t1} + \eta_2 T_{t2} + \theta_1 U_{t1} + \theta_2 U_{t2} + \chi Z_i + \phi X_{i1} + \phi X_{i2} + \omega_i   (2)

where ω_i is uncorrelated with the observed inputs. We allow the η and θ parameters to differ across subjects, but assume that the parameters on the subject-specific student characteristics are the same across subjects.

Substituting Eq. (2) into Eqs. (1a) and (1b) and collecting terms yields the following correlated random effects models:7

y_{i1} = (\beta_1 + \eta_1) T_{t1} + \eta_2 T_{t2} + (\gamma + \theta_1) U_{t1} + \theta_2 U_{t2} + (\alpha + \chi) Z_i + (\delta + \phi) X_{i1} + \phi X_{i2} + \tau_{t1} + \varepsilon'_{i1}   (3a)

y_{i2} = \eta_1 T_{t1} + (\beta_2 + \eta_2) T_{t2} + \theta_1 U_{t1} + (\gamma + \theta_2) U_{t2} + (\alpha + \chi) Z_i + \phi X_{i1} + (\delta + \phi) X_{i2} + \tau_{t2} + \varepsilon'_{i2}   (3b)

where ε′_{ij} = ε_{ij} + ω_i. In this model, teacher scores in each subject enter the reduced-form equation of both subjects. The η coefficients capture the extent to which standard models would be biased due to the omission of unobserved teacher factors, while the β parameters represent the effect of teacher subject knowledge.
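As a consistency check on this derivation, the substitution of Eq. (2) into Eq. (1a) can be carried out symbolically. The sketch below (using sympy, with symbol names chosen only to mirror the paper's notation) confirms that the reduced-form coefficient on T_{t1} is β1 + η1, on X_{i1} is δ + φ, and so on, as in Eq. (3a):

```python
import sympy as sp

# Symbols mirroring the paper's notation (subjects 1 and 2; the student and
# teacher indices i and t are dropped from the symbol names)
beta1, gamma, alpha, delta = sp.symbols('beta1 gamma alpha delta')
eta1, eta2, th1, th2, chi, phi = sp.symbols('eta1 eta2 theta1 theta2 chi phi')
T1, T2, U1, U2, Z, X1, X2 = sp.symbols('T1 T2 U1 U2 Z X1 X2')
omega, tau1, eps1 = sp.symbols('omega tau1 epsilon1')

# Eq. (2): correlated random effect for the student component mu_i
mu = eta1*T1 + eta2*T2 + th1*U1 + th2*U2 + chi*Z + phi*X1 + phi*X2 + omega

# Eq. (1a) with mu_i substituted in
y1 = sp.expand(beta1*T1 + gamma*U1 + alpha*Z + delta*X1 + mu + tau1 + eps1)

# Reduced-form coefficients should match Eq. (3a)
assert sp.simplify(y1.coeff(T1) - (beta1 + eta1)) == 0
assert sp.simplify(y1.coeff(T2) - eta2) == 0
assert sp.simplify(y1.coeff(U1) - (gamma + th1)) == 0
assert sp.simplify(y1.coeff(Z) - (alpha + chi)) == 0
assert sp.simplify(y1.coeff(X1) - (delta + phi)) == 0
print("Eq. (3a) coefficients verified")
```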

The model setup with correlated random effects allows us to test the overidentification restrictions implicit in conventional fixed-effects models, which are nested within the unrestricted correlated random effects model (see Ashenfelter and Zimmerman, 1997). Since the seminal contribution to cross-subject identification of teacher effects by Dee (2005), available fixed-effects estimators (commonly implemented in first-differenced estimations) implicitly impose that teacher effects are the same across subjects. Rather than initially imposing such a restriction, after estimating the system of Eqs. (3a) and (3b) we can straightforwardly test whether β1=β2=β and whether η1=η2=η. Only if these overidentification restrictions cannot be rejected can we also specify correlated random effects models that restrict the β coefficients, the η coefficients, or both of them to be the same across subjects in Eqs. (3a) and (3b).

A correlated random effects model that imposes both of these restrictions is equivalent to conventional fixed-effects models that perform within-student comparisons in two subjects to eliminate bias from unobserved non-subject-specific student characteristics. These have been applied to estimate effects of non-subject-specific teacher attributes such as gender, ethnicity, and credentials (e.g., Clotfelter

7 This correlated random effects specification is similar to siblings models of earnings effects of education (Ashenfelter and Krueger, 1994; Card, 1999). We thank David Card for alerting us to this interpretation of our model.

et al., 2010; Dee, 2005, 2007). Note, however, that in order to identify β and γ in specifications like Eqs. (3a) and (3b) that restrict coefficients to be the same across subjects, it has to be assumed that the assignment of students to teachers is random and, in particular, is not related to students' subject-specific propensity for achievement. Otherwise, omitted teacher characteristics such as pedagogical skills and motivation, comprised in the teacher-specific error component τ_t, could bias estimates of the observed teacher attributes.

In order to avoid such bias from omitted teacher characteristics, we propose to restrict the analysis to samples of students who are taught by the same teacher in the two subjects. In such a setting, U_{t1}=U_{t2}=U_t and τ_{t1}=τ_{t2}=τ_t, so that the education production functions simplify to:

y_{i1} = (\beta_1 + \eta_1) T_{t1} + \eta_2 T_{t2} + (\gamma + \theta_1 + \theta_2) U_t + (\alpha + \chi) Z_i + (\delta + \phi) X_{i1} + \phi X_{i2} + \tau_t + \varepsilon'_{i1}   (4a)

y_{i2} = \eta_1 T_{t1} + (\beta_2 + \eta_2) T_{t2} + (\gamma + \theta_1 + \theta_2) U_t + (\alpha + \chi) Z_i + \phi X_{i1} + (\delta + \phi) X_{i2} + \tau_t + \varepsilon'_{i2}   (4b)

With the additional restrictions that β1=β2=β and η1=η2=η, this model has a straightforward first-differenced representation:

y_{i1} − y_{i2} = \beta (T_{t1} − T_{t2}) + \delta (X_{i1} − X_{i2}) + \varepsilon'_{i1} − \varepsilon'_{i2}   (5)

which is equivalent to including both student and teacher fixed effects in a pooled specification. In this specification, any teacher characteristic that is not subject-specific (U_t and τ_t) drops out.

Identification of teacher effects is still possible in our setting because teacher subject knowledge varies across subjects. Note that this specification can be identified only in samples where students are taught by the same teacher in the two subjects. This makes it impossible to identify teacher effects for attributes that do not vary across the two subjects within the same teacher. Thus, this specification cannot identify effects of teacher attributes that are not subject-specific, such as gender and race.8 But it allows eliminating bias from standard omitted teacher variables when estimating the effect of teacher subject knowledge.
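Under these restrictions, the estimator in Eq. (5) is simply OLS on within-student differences across the two subjects. A minimal simulation (all variable names and parameter values invented for illustration) shows why this works: a teacher-level confounder U that is correlated with teacher knowledge T in both subjects would bias a naive cross-sectional regression of y on T, but it differences out along with the student effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000                       # hypothetical student-teacher pairs
beta, delta = 0.10, 0.30         # illustrative "true" effects

# Teacher quality U is correlated with subject knowledge in both subjects
U = rng.normal(size=n)
T1 = 0.7 * U + rng.normal(size=n)
T2 = 0.7 * U + rng.normal(size=n)
X1, X2 = rng.normal(size=n), rng.normal(size=n)
mu = rng.normal(size=n)          # student effect; drops out below
y1 = beta * T1 + 0.5 * U + delta * X1 + mu + rng.normal(size=n)
y2 = beta * T2 + 0.5 * U + delta * X2 + mu + rng.normal(size=n)

# Eq. (5): regress the within-student score difference on the
# within-teacher knowledge difference; U and mu difference out.
D = np.column_stack([T1 - T2, X1 - X2])
coef, *_ = np.linalg.lstsq(D, y1 - y2, rcond=None)
print(coef)   # close to the true values [0.10, 0.30]
```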

A final remaining concern might be that the matching of students and teachers is not random with respect to their respective relative performance in the two subjects. For example, if there are two classrooms in the same grade in a school, in principle it is conceivable that students who are better in math than in reading are assigned to one of the classes and that a teacher who is better in math than in reading is assigned to teach them. Therefore, to avoid bias from non-random allocation of teachers to students, we finally restrict our main sample to schools that have only one classroom per grade. This eliminates any bias from sorting between classrooms within the grade of a school. While this restriction comes at the cost of identification from a sample of relatively small and mostly rural schools that are not necessarily representative of the total population, additional analyses suggest that the sample restriction is not driving our results. By restricting the analysis to schools in mostly rural regions where parents have access to only one school, the restricted sample additionally eliminates any possible concern of non-random selection of schools by parents based on the relative performance in math vs. reading of the schools and the students. Note that, by eliminating non-random teacher selection, the sample restriction also rules out bias from prior differences in student achievement, as students cannot be

8 The within-teacher identification also accounts for possible differences between teachers in their absenteeism from the classroom, an important topic of recent research on teachers in developing countries (cf. Banerjee and Duflo, 2006). Note also that in a group of six developing countries with data, teacher absenteeism rates are lowest in Peru (Chaudhury et al., 2006).


10 For a derivation of the measurement-error correction formula in the first-differenced model, see the working-paper version of this paper (Metzler and Woessmann, 2010). While in contrast to the standard first-differenced model, our correlated random effects model has more than one explanatory variable, results of the correlated random effects model without controls are very close to the ones reported here.
11 In Rasch modeling, there are additional alternative measures of the reliability ratio, such as the person separation reliability. In the EN 2004 teacher test, these are very close to Cronbach's α (at 0.73–0.77 in math and 0.58–0.62 in reading), so that applying them provides very similar (even slightly larger) estimates of the true effect.


allocated on grounds of within-student performance differences between the subjects to appropriate teachers.9

One way to test that student–teacher assignment is indeed random in our sample with respect to their relative capabilities in the two subjects is to include measures of subject-specific student characteristics (X_{ij}) in the model. Therefore, in robustness specifications we will include a subject-specific index of student motivation in the model. Furthermore, the observed between-subject variation in teacher subject knowledge may correlate with other, partly unobserved subject-specific teacher characteristics, such as passion for the subject or experience in teaching the curriculum of the subject. The estimates of our models are thus best interpreted as the effect of teacher subject knowledge and other subject-specific traits that go with it. To the extent that features like subject-specific passion are the outcome of the knowledge level, rather than vice versa, estimates that do not control for other subject-specific teacher characteristics capture the full reduced-form effect of teacher subject knowledge. But to test to what extent subject-specific teacher knowledge may capture more than just plain knowledge, we also present robustness specifications that include teacher characteristics other than subject knowledge that might vary between subjects, namely subject-specific measures of teaching hours and teaching methods. Finally, teachers whose first language is not Spanish may not only be relatively better in math than in reading, but may also put relatively less effort into teaching Spanish. Since the native language is likely correlated between students and their teachers, we will test for robustness in a sample restricted to students whose first language is Spanish.

2.2. Correcting for measurement error

Any given test can only be an imperfect measure of our explanatory variable of interest, teachers' subject knowledge T. A first source of measurement error arises if the available measure proxies only poorly for the concept of teachers' subject knowledge. In most existing studies, the tested teacher skills can be only weakly tied to the academic knowledge in the subject in which student achievement is examined and cannot distinguish between subject-specific cognitive and general non-cognitive teacher skills (see Rockoff et al., 2011 for a notable exception). Often, the examined skill is not subject-specific, either when based on a rudimentary measure of verbal ability (e.g., Coleman et al., 1966) or when aggregated across a range of skills including verbal and pedagogic ability as well as knowledge in different subjects (e.g., Summers and Wolfe, 1977). A second source of measurement error arises because any specific test will measure teacher subject knowledge only with considerable noise. This is most evident when the result of one single math question is used as an indicator of math skills (cf. Rowan et al., 1997), but it applies more generally because the reliability of any given item battery is not perfect.

As is well known, classical measurement error in the explanatory variable will give rise to downward bias in the estimated effects. Glewwe and Kremer (2006, p. 989) argue that this is likely to be a serious problem in estimating school and teacher effects. Assume that, in any given subject s, true teacher test scores T* are measured with classical measurement error e:

T_{is} = T^*_{is} + e_{is}, \quad E(e_{is}) = Cov(T^*_{is}, e_{is}) = 0   (6)

Suppose that T* is the only explanatory variable in an educational production function like Eqs. (1a) and (1b) with mean-zero variables, and e and ε are uncorrelated. Classical measurement-error theory tells us that using measured T instead of true T* as the explanatory

9 A remaining potential source of non-random sorting might be selective drop-out based on differences in teacher subject knowledge between the two subjects. Such selectivity seems quite unlikely, though, and drop-out by grade 6 is very uncommon in Peru. In the 2004 Demographic and Health Survey, 97.0% of 12-year-olds — the relevant age for grade 6 — are currently enrolled in school, and 93.4% of those aged 15–19 have attained grade 6 (Filmer, 2007).

variable leads to well-known attenuation bias, where the true effect β is asymptotically attenuated by the reliability ratio λ (e.g., Angrist and Krueger, 1999):

y_{is} = \beta_s \lambda_s T_{is} + \tilde{\varepsilon}_{is}, \quad \lambda_s = \frac{Cov(T^*_{is}, T_{is})}{Var(T_{is})} = \frac{Var(T^*_{is})}{Var(T^*_{is}) + Var(e_{is})}   (7)

Thus, the unbiased effect of T* on y is given by β_s = \hat{\beta}_s / λ_s.10 The problem is that in general, we do not know the reliability λ with which a variable is measured.

But in the current case, where the explanatory variable is a test score derived using psychometric modeling, the underlying psychometric test theory in fact provides estimates of the reliability ratio, the most commonly used statistic being Cronbach's α (Cronbach, 1951). The method underlying Cronbach's α is based on the idea that reliability can be estimated by splitting the underlying test items in two halves and treating them as separate measures of the underlying concept. Cronbach's α is the mean of all possible split-half coefficients resulting from different splittings of a test and as such estimates reliability as the ratio of true variance to observed variance of a given test. It is routinely calculated as a measure of the reliability of test metrics, and provided as one of the psychometric properties in our dataset.11
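Cronbach's α is straightforward to compute from an item-response matrix. A common computational form (equivalent to the split-half idea described above) is α = k/(k−1) · (1 − Σ item variances / variance of the total score). The sketch below applies it to made-up binary item data generated from a simple logistic item model; all data and parameter choices are invented for illustration:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (examinees x items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of total score
    return k / (k - 1) * (1 - item_vars / total_var)

# Made-up example: 500 examinees, 20 binary items driven by one latent trait
rng = np.random.default_rng(42)
ability = rng.normal(size=(500, 1))
difficulty = rng.normal(size=(1, 20))
p = 1 / (1 + np.exp(-(ability - difficulty)))     # logistic response model
responses = (rng.random((500, 20)) < p).astype(float)

print(f"alpha = {cronbach_alpha(responses):.2f}")
```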

Of course, such reliability measures cannot inform us about the test's validity — how good a measure it is of the underlying concept (teacher subject knowledge in our case). But they provide us with information on the measurement error contained within the test metric derived from the underlying test items. Note that differences between the test metric and the underlying concept only reduce total reliability further, rendering our adjustments still a conservative estimate of the “true” effect of teacher subject knowledge on student achievement.
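Given a reliability estimate λ, the correction described in this subsection is a single division of the estimated coefficient by λ, as implied by Eq. (7). A small simulation (all numbers invented for illustration; λ = 0.75 is merely in the neighborhood of the reliabilities quoted for the math test) shows the attenuation and its reversal:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
beta_true = 0.09      # hypothetical true effect, close to the math estimate
lam = 0.75            # assumed reliability ratio of the teacher test

# True knowledge T* and a noisy measured score T = T* + e, with the noise
# variance calibrated so that Var(T*) / (Var(T*) + Var(e)) = lam.
T_star = rng.normal(size=n)
T = T_star + rng.normal(scale=np.sqrt(1 / lam - 1), size=n)
y = beta_true * T_star + rng.normal(size=n)

# The OLS slope of y on mismeasured T is attenuated toward lam * beta_true...
beta_hat = np.cov(y, T)[0, 1] / np.var(T, ddof=1)
# ...and dividing by the reliability ratio undoes the attenuation (Eq. 7).
beta_corrected = beta_hat / lam
print(round(beta_hat, 3), round(beta_corrected, 3))
```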

3. Data and descriptive statistics

To implement our identification strategy, which is rather demanding in terms of specific samples and testing in two separate subjects of both students and teachers, we employ data from the 2004 Peruvian national evaluation of student achievement, the “Evaluación nacional del rendimiento estudiantil” (EN 2004). The study tested a representative sample of 6th-graders in mathematics and reading, covering a total of over 12,000 students in nearly 900 randomly sampled primary schools. The sample is representative at the national level and for comparisons of urban vs. rural areas, public vs. private, and multi-grade12 vs. “complete” schools.

The two tested subjects, math and reading, are subjects that are separately taught in the students' curriculum. The student tests were scheduled for 60 min each and were developed to be adequate for the curriculum and knowledge of the tested 6th-graders. As a unique feature of EN 2004, not only students but also their teachers were required to take tests in their respective subjects. The teacher tests were developed independently from the student tests, so that

12 Multi-grade schools, in which several grades are served in the same class by the same teacher, are a wide-spread phenomenon in many developing countries where parts of the population live in sparsely populated areas (cf. Hargreaves et al., 2001). For example, the remoteness of communities in the Andes and the Amazon basin means a lack of critical mass of students to provide complete schools that have at least one classroom per grade.


the two do not have common items. While in most studies in the existing literature, tested teacher skills can be only weakly tied to the academic knowledge in the subject in which student achievement is examined (cf. Wayne and Youngs, 2003), the separate subject-specific teacher tests in EN 2004 allow for an encompassing measurement of teacher subject knowledge in two specific subjects.13 Both student and teacher tests in both subjects were scaled using Rasch modeling by the Unit for Quality Measurement (UMC) of the Peruvian Ministry of Education.

Of the 12,165 students in the nationally representative dataset, 56% (6819 students) were taught by the same teacher in math and reading. Of the latter, 63% (4302 students) are in schools that have only one class in 6th grade. The combination of being served by the same teacher in both subjects in schools with only one classroom in 6th grade is quite frequent, as the Peruvian school system is characterized by many small schools spread out over the country to serve the dispersed population. As the main focus of our analyses will be on this “same-teacher one-classroom” (STOC) sample (although we also report cross-sectional estimates for the full sample below), we focus on this sample here.14

Table A1 in Appendix A reports descriptive statistics. Both student and teacher test scores are scaled to a mean of 0 and standard deviation of 1. The first language was Spanish for 87% of the students, and a native language for the others. Descriptive test scores reveal that Spanish-speaking students fare significantly better than students whose first language is not Spanish, although their teachers perform only slightly better. Both students and teachers score better in urban (compared to rural) areas, in private (compared to public) schools, and in complete (compared to multi-grade) schools. In terms of teacher education, 28% of the teachers hold a university degree, compared to a degree from a teaching institute. Both math and reading are taught at an average of about 6 h per week.

Subject-specific inspection (not shown) reveals only a few substantive subject-specific differences in sub-populations. Boys and girls score similarly in reading, but boys outperform girls on average in math. While male and female teachers score similarly on the math test, male teachers score worse than female teachers in reading. Teachers with a university degree score worse in math but slightly better in reading compared to teachers with a degree from an institute.

Throughout the paper, we express results as standard deviations of student achievement per standard deviation of teacher achievement. While no external benchmark is available to provide context to the size of the achievement variation for either test in EN 2004, it is worth noting that on international tests, the average learning gain of students during one year of schooling (the "grade-level equivalent") is usually equivalent to roughly 25–35% of a standard deviation of student achievement. For Peruvian 15-year-olds in 10th and 11th grade in PISA 2009, the grade-level equivalent equals 31.0% of a standard deviation (no similar data are available for 6th-graders). Apart from this, we can contextualize the test-score variations in terms of the observed differences in teacher and student test scores between sub-groups of the population. For example, the difference in teacher math knowledge between rural and urban areas is 41.3% of a teacher standard deviation, whereas the difference in student knowledge between rural and urban areas is 79.2% of a student standard deviation (similar values in reading). This suggests that, at least on the available measures, the teacher population appears more homogeneous than the student population in the rural–urban comparison.

13 Evidence on content-related validity comes from the fact that both teacher test measures do not show a significant departure from unidimensionality. Principal component analysis of the Rasch standardized residuals shows a first eigenvalue of 1.7 in math and 1.5 in reading, indicating that both tests are measuring a dominant dimension — math and reading subject knowledge, respectively. Additional validity stems from the fact that at least five subject experts in math and reading, respectively, certified adequacy of test items for both teacher tests.
14 While the restricted STOC sample is certainly not a random sample of the full population of Peruvian 6th-graders (see Table A2 in Appendix A), a matching procedure reported below indicates that the restriction is not driving our results.

4. Results

4.1. Main results

As a reference point for the existing literature and the subsequent analyses, Table 1 reports cross-sectional regressions of student test scores on teacher test scores and different sets of regressors in the two subjects. Throughout the paper, standard errors are clustered at the classroom level. As the first four columns show, there are statistically significant associations between student test scores and teacher test scores in both subjects in the full sample, both in a bivariate setting and after adding a set of controls. The associations are substantially smaller when controlling for the background characteristics, in particular the three school characteristics (urban location, private operation, and being a multi-grade school), and the reading estimate remains only marginally significant. Student achievement in both subjects is significantly positively associated with Spanish as a first language, urban location, private operation, being a complete school, and a student motivation index. Female students perform better in reading but worse in math, and reading (but not math) achievement is significantly positively associated with having a female teacher. As the subsequent columns reveal, the significant associations between student and teacher test scores hold also in the sample of students taught by the same teacher in both subjects and in the same-teacher one-classroom (STOC) sample, although point estimates are reduced (and become insignificant in reading in the STOC sample), indicating possible selection bias in the full sample.

Table 2 presents the central result of our paper, the correlated random effects model of Eqs. (3a) and (3b) estimated by seemingly unrelated regressions. By restricting the student sample to those who are taught by the same teacher in both subjects in schools with only one classroom per grade (STOC sample), we avoid bias stemming from within-school sorting.15 In our model, the effect of teacher subject knowledge on student achievement in math, β1, is given by the regression coefficient on the teacher math test score in the math equation minus the regression coefficient on the teacher math test score in the reading equation (see Eqs. (3a) and (3b)), and vice versa for reading. In the STOC sample, the point estimates are surprisingly close to the cross-sectional estimates with basic controls, supporting the view that the teacher-student assignment is indeed as good as random in this sample.
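The identification logic of Eqs. (3a) and (3b) can be sketched with simulated data (a toy illustration; all parameter values below are hypothetical and nothing here uses the EN 2004 data). Because the two equations share the same regressors, equation-by-equation least squares reproduces the SUR point estimates, so plain OLS suffices for the sketch:

```python
# Toy simulation of the correlated random effects identification:
# unobserved teacher quality q raises scores in both subjects, so the
# coefficient on the other-subject teacher score absorbs the selection
# term eta, and differencing the coefficients recovers beta.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta_math, beta_read, gamma = 0.06, 0.02, 0.30  # hypothetical true values

q = rng.normal(size=n)                 # unobserved teacher quality
t_math = q + rng.normal(size=n)        # teacher math test score
t_read = q + rng.normal(size=n)        # teacher reading test score
y_math = beta_math * t_math + gamma * q + rng.normal(size=n)
y_read = beta_read * t_read + gamma * q + rng.normal(size=n)

X = np.column_stack([np.ones(n), t_math, t_read])
coef_m, *_ = np.linalg.lstsq(X, y_math, rcond=None)  # (const, psi_M, eta_M)
coef_r, *_ = np.linalg.lstsq(X, y_read, rcond=None)  # (const, eta_R, psi_R)

# Implied betas: same-subject coefficient minus other-subject coefficient
implied_beta_math = coef_m[1] - coef_r[1]  # recovers ~0.06
implied_beta_read = coef_r[2] - coef_m[2]  # recovers ~0.02
```

The cross-subject coefficients (the η terms) pick up the selection on unobserved teacher quality; differencing them out is exactly what the implied β in Table 2 does.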

The estimates of this within-teacher within-student specification indicate that an increase in teacher math test scores by one standard deviation raises student math test scores by about 6.4% of a standard deviation. A χ2 test reveals that this effect is statistically highly significant. In reading, albeit also positive, the estimate is much lower and does not reach statistical significance. In fact, the tests for overidentification restrictions reject the hypothesis that the effect of interest is the same in the two subjects, advising against estimating the simplified first-differenced model of Eq. (5).16

15 Results on the βs are qualitatively similar in the larger same-teacher sample without restriction to schools with only one classroom per grade (see Table A3 in Appendix A). In that sample, student math scores are significantly related to teacher math scores after controlling for teacher reading scores, and student reading scores are significantly related to teacher reading scores after controlling for teacher math scores. While the coefficient estimates on the differing-subject teacher scores are positive in both models in that sample, they are smaller than the same-subject coefficients and statistically insignificant in the unrestricted model.
16 If we denote the coefficient on the same-subject teacher test score in Eqs. (3a) and (3b) by ψj, this test is implemented by testing ψ1 − η1 = ψ2 − η2, where the ηs are estimated as the coefficients on the other-subject teacher test score in each equation. Given that ψj = ηj + βj, this provides a test of β1 = β2.


Table 1
Cross-sectional regressions.

                               Full sample                             Same-teacher sample                     Same-teacher one-classroom (STOC)
                               OLS                 SUR                 OLS                 SUR                 OLS                 SUR
                               Math      Reading   Math      Reading   Math      Reading   Math      Reading   Math      Reading   Math      Reading
                               (1)       (2)       (3)       (4)       (5)       (6)       (7)       (8)       (9)       (10)      (11)      (12)

Teacher test score             0.264⁎⁎⁎  0.219⁎⁎⁎  0.094⁎⁎⁎  0.027⁎    0.147⁎⁎⁎  0.182⁎⁎⁎  0.073⁎⁎⁎  0.032⁎    0.119⁎⁎   0.158⁎⁎⁎  0.061⁎⁎   0.017
                               (0.027)   (0.023)   (0.019)   (0.018)   (0.041)   (0.032)   (0.022)   (0.018)   (0.054)   (0.042)   (0.025)   (0.023)
Student female                                     −0.094⁎⁎⁎ 0.063⁎⁎                       −0.074⁎⁎  0.057⁎                        −0.098⁎⁎⁎ 0.024
                                                   (0.025)   (0.027)                       (0.029)   (0.030)                       (0.035)   (0.035)
Student 1st language Spanish                       0.709⁎⁎⁎  0.748⁎⁎⁎                      0.765⁎⁎⁎  0.741⁎⁎⁎                      0.639⁎⁎⁎  0.623⁎⁎⁎
                                                   (0.076)   (0.066)                       (0.081)   (0.071)                       (0.079)   (0.072)
Urban area                                         0.323⁎⁎⁎  0.392⁎⁎⁎                      0.332⁎⁎⁎  0.382⁎⁎⁎                      0.321⁎⁎⁎  0.364⁎⁎⁎
                                                   (0.066)   (0.067)                       (0.069)   (0.069)                       (0.077)   (0.078)
Private school                                     0.798⁎⁎⁎  0.708⁎⁎⁎                      0.701⁎⁎⁎  0.570⁎⁎⁎                      0.725⁎⁎⁎  0.562⁎⁎⁎
                                                   (0.063)   (0.055)                       (0.099)   (0.077)                       (0.115)   (0.092)
Complete school                                    0.262⁎⁎⁎  0.281⁎⁎⁎                      0.273⁎⁎⁎  0.290⁎⁎⁎                      0.360⁎⁎⁎  0.371⁎⁎⁎
                                                   (0.069)   (0.069)                       (0.077)   (0.076)                       (0.081)   (0.077)
Teacher female                                     0.025     0.104⁎⁎⁎                      −0.010    0.100⁎                        −0.110    0.066
                                                   (0.040)   (0.038)                       (0.059)   (0.053)                       (0.070)   (0.066)
Teacher university degree                          0.022     0.039                         0.004     0.024                         0.060     0.058
                                                   (0.045)   (0.041)                       (0.064)   (0.060)                       (0.079)   (0.075)
Student motivation                                 0.517⁎⁎⁎  0.195⁎⁎⁎                      0.701⁎⁎⁎  0.315⁎⁎⁎                      0.638⁎⁎⁎  0.382⁎⁎⁎
                                                   (0.065)   (0.046)                       (0.076)   (0.058)                       (0.083)   (0.077)
Hours                                              0.017     0.006                         0.022     0.014                         0.001     −0.003
                                                   (0.012)   (0.010)                       (0.015)   (0.012)                       (0.018)   (0.016)
Teaching method (8 indicators)                     Yes       Yes                           Yes       Yes                           Yes       Yes
F (OLS) / χ2 (SUR)             92.46     89.82     806.76              12.79     32.71     522.98              4.96      13.97     470.36
Observations (students)        12,165    12,165    6461                6819      6819      4987                4302      4302      3121
Classrooms (clusters)          893       893       500                 521       521       398                 346       346       261

Dependent variable: student test score in math and reading, respectively. Clustered standard errors in the SUR models are estimated by maximum likelihood. Robust standard errors (adjusted for clustering at classroom level) in parentheses: significance at ⁎⁎⁎1, ⁎⁎5, and ⁎10%.


Table 2
The effect of teacher test scores: correlated random effects models.

                                        Unrestricted model       Model restricting η1=η2
                                        (1)                      (2)
                                        Math       Reading       Math       Reading

Implied βsubject                        0.064⁎⁎⁎   0.014         0.060⁎⁎⁎   0.019
χ2 (βsubject=0)                         13.138     0.599         9.585      1.095
Prob>χ2                                 [0.0003]   [0.439]       [0.002]    [0.295]
Regression estimates:
  Teacher test score in same subject    0.047      0.039         0.063⁎⁎⁎   0.022
                                        (0.037)    (0.034)       (0.025)    (0.020)
  Teacher test score in other subject   0.026      −0.017        0.004      0.004
                                        (0.036)    (0.032)       (0.019)    (0.019)
χ2 (ηmath=ηreading)                     0.585                    –
Prob>χ2                                 [0.444]                  –
χ2 (βmath=βreading)                     6.948⁎⁎⁎                 3.477⁎
Prob>χ2                                 [0.008]                  [0.062]
χ2                                      353.45                   354.57
Observations (students)                 3936                     3936
Classrooms (clusters)                   316                      316
βsubject, corrected for
  measurement error                     0.087⁎⁎⁎   0.022         0.081⁎⁎⁎   0.029
λ (Cronbach's α)                        0.74       0.64          0.74       0.64

Dependent variable: student test score in math and reading, respectively. Regressions in the two subjects estimated by seemingly unrelated regressions (SUR). Sample: Same-teacher one-classroom (STOC). Implied β represents the effect of the teacher test score, given by the difference between the estimate on the teacher test score in the respective subject between the equation of the student test score in the respective subject and the equation of the student test score in the other subject (see Eqs. (3a) and (3b)). Regressions include controls for student gender, student 1st language, urban area, private school, complete school, teacher gender, and teacher university degree. Clustered standard errors in the SUR models are estimated by maximum likelihood. Robust standard errors (adjusted for clustering at classroom level) in parentheses: significance at ⁎⁎⁎1, ⁎⁎5, and ⁎10%.

We can only speculate about the reasons why results differ markedly between math and reading. A first option is that teacher subject knowledge may be genuinely more relevant in the production function of math rather than reading skills. Another possibility is that by 6th grade, reading skills are already far developed and less responsive to teacher influence, whereas math skills still develop strongly. It could be that the teacher reading test has more noise than the math test, so that estimates suffer from greater attenuation bias. It could also be that the underlying true distribution of teacher knowledge is tighter in reading than in math (relative to the student distribution in each subject), so that a one standard-deviation change in teacher scores reflects smaller variation in actual knowledge in reading than in math.

As discussed in Section 2.2 above, the estimates still suffer from attenuation bias due to measurement error in the teacher test scores. Following Eq. (7), the unbiased effect of teacher subject knowledge on student achievement is given by βs = β̂s/λs. In the EN 2004, the reliability ratio (Cronbach's α) of the teacher math test is λM = 0.74, and the reliability ratio of the teacher reading test is λR = 0.64 (in line with the argument just made). Thus, as reported at the bottom of Table 2, the true effect of teacher math knowledge on student math achievement turns out to be 0.087: an increase in student math achievement by 8.7% of a standard deviation for a one standard deviation increase in teacher math knowledge.
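The correction itself is a one-line calculation; a minimal sketch using the rounded values from Table 2 (the published 0.087 is based on unrounded inputs, so the third decimal differs slightly here):

```python
# Attenuation correction of Eq. (7): beta_s = beta_hat_s / lambda_s,
# with the reliability ratios (Cronbach's alpha) of the EN 2004 teacher tests.
beta_hat = {"math": 0.064, "reading": 0.014}   # rounded estimates, Table 2, column (1)
reliability = {"math": 0.74, "reading": 0.64}  # lambda_M, lambda_R

corrected = {s: beta_hat[s] / reliability[s] for s in beta_hat}
# corrected["math"] ≈ 0.086 and corrected["reading"] ≈ 0.022
```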

The overidentification tests do not reject that the selection term η (given by the coefficient on other-subject teacher scores in each equation) is the same in the two subjects, allowing us to estimate a restricted model that assumes η1=η2 in column (2). Results are hardly different in this restricted model, although similarity of results across subjects is only marginally rejected in this model.17 The estimate of η is close to zero in the STOC sample, a further indication that selection effects may indeed be eliminated by the sample restriction to students who have the same teacher in both subjects in schools that have only one classroom in the grade. In the larger same-teacher sample (shown in Table A3 in Appendix A), the estimate of η is positive and (marginally) significant, indicating positive selection effects in that sample.

17 Table A4 in Appendix A shows results of basic first-differenced specifications, which would be warranted when effects do not differ across subjects and which effectively represent an average of our estimates in math and in reading. The working-paper version (Metzler and Woessmann, 2010) provides further details on first-differenced specifications.

19 The student motivation index, scaled 0–1, is calculated as a simple average of five survey questions corresponding to each subject such as "I like to read in my free time" (reading) or "I have fun solving mathematical problems" (math), on which students

4.2. Evidence on heterogeneous effects

To test whether the main effect hides effect heterogeneity in sub-populations, we estimate our model separately by stratified sub-samples (Table 3). We start by looking at the following sub-groups: female vs. male students, students with Spanish vs. a native language as their first language, urban vs. rural areas, private vs. public schools, complete vs. multi-grade schools, male vs. female teachers, and teachers with a university vs. institute degree. With only two exceptions, both math and reading results are robust within the sub-samples, and the difference between the respective split samples is not statistically significant.18 The exceptions are that the math results are no longer statistically significant in the two smallest sub-samples, private schools and teachers with a university degree. While the small sample sizes are most likely behind this pattern, the point estimates in these two sub-samples are also close to zero, and the difference of the effect for teachers with a university degree compared to the effect for teachers without a university degree is statistically significant. With this exception, there is little evidence that the effect of teacher subject knowledge differs across these sub-populations.

The exception of the non-effect for teachers holding a university degree might indicate that the effect of teacher subject knowledge flattens out at higher levels of teacher skills. To test whether the effect of teacher subject knowledge is non-linear, Table 3 also reports results for sub-samples of teachers below and above the median in terms of their knowledge level, averaged across the two subjects. There is no evidence for heterogeneous effects. The same is true in models that interact the teacher test-score difference with a continuous measure of the average test-score level or with test-score levels of the two subjects separately. We also experimented with additional types of non-linear specifications, with similar results. At least within our sample of Peruvian teachers, there is no indication that the effect of teacher subject knowledge levels out at higher knowledge levels.

Table 3 also reports results separately for students achieving below and above the median achievement level, averaged across the two subjects. The math results are significant in both sub-samples. The effect in reading becomes marginally significant in the sub-sample of students above the median level and is significantly higher than the effect for low-performing students.

To analyze this pattern further, the left panel of Table 4 looks deeper into the "ability match" between teachers and students. It provides estimates for four sub-samples based on whether students and teachers, respectively, have below- or above-median achievement. Here, the sub-sample of above-median students taught by below-median teachers falls out of the general pattern in that the point estimate in math is small and statistically insignificant. This suggests that the teacher-student ability match matters: if teachers are relatively low-performing, their subject knowledge does not matter much for students who are already relatively high-performing. This type of heterogeneity is an interesting aspect of education production functions that is frequently overlooked.

Similarly, the right panel of Table 4 looks at the teacher-student gender match. Given the apparent descriptive gender differences in math vs. reading achievement, and given the recent evidence that being assigned to a same-gender teacher may affect student outcomes (Carrell et al., 2010; Dee, 2005, 2007), it shows estimates for similar sub-sample models based on the gender match between teachers and students. Results suggest that female students do not profit from higher teacher math knowledge when they are taught by a male teacher. They do so when taught by a female teacher, as do male students independent of the teachers' gender. This suggests that the importance of teacher subject knowledge also depends on the teacher-student gender match: for girls, student–teacher gender interactions may be important in facilitating the transmission of math knowledge from teachers to students.

18 We also do not find that the effect of teacher subject knowledge interacts significantly with teaching hours.

Although the findings generally suggest that results can be generalized to most sub-populations, the fact that our main sample of analysis, the same-teacher one-classroom (STOC) sample, is not a representative sample of the Peruvian student population raises the question whether results derived on the STOC sample can be generalized to the Peruvian student population at large. Schools with only one 6th-grade class that employ the same teacher to teach both subjects are more likely to be found in rural rather than urban areas and in multi-grade schools, and accordingly on average perform lower on the tests (see Table A2 in Appendix A). To test whether the sample specifics prevent generalization, we perform the following analysis. We first draw a 25-percent random sample from the original EN 2004 population, which is smaller than the STOC sample. We then employ nearest-neighbor propensity score matching (caliper 0.01) to pick those observations from the STOC sample that are comparable to the initial population. The variables along which this STOC sub-sample is made comparable to the initial population are student test scores in both subjects, teacher test scores in both subjects, and dummies for different school types (urban, public, and multi-grade). On this sub-sample of the STOC sample that is comparable to the whole population, we estimate our within-teacher within-student specification. In a bootstrapping exercise, we repeated this procedure 1000 times. The regression analyses yield an average teacher math score effect of 0.065, with an average χ2 of 7.524 (statistically significant at the 3% level), for an average number of 1739 observations (detailed results available on request). These results suggest that the teacher test-score effect of our main specification is not an artifact of the STOC sample, but is likely to hold at a similar magnitude in the Peruvian 6th-grade student population at large.
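The matching step of this exercise can be sketched as follows; the propensity scores below are random placeholders (in the paper they are estimated from student and teacher test scores and school-type dummies), and `match_within_caliper` is an illustrative helper, not code from the study:

```python
# Nearest-neighbor propensity score matching with a 0.01 caliper:
# for each observation in the population draw, pick the closest STOC
# observation, and keep the match only if it lies within the caliper.
import numpy as np

def match_within_caliper(p_target, p_pool, caliper=0.01):
    """Return, for each target score, the index of the nearest pool score
    within the caliper, or -1 if no pool observation qualifies."""
    idx = np.full(len(p_target), -1)
    for i, p in enumerate(p_target):
        dist = np.abs(p_pool - p)
        j = int(np.argmin(dist))
        if dist[j] <= caliper:
            idx[i] = j
    return idx

rng = np.random.default_rng(1)
p_population = rng.uniform(0.1, 0.9, size=500)  # stand-in for the 25% draw
p_stoc = rng.uniform(0.0, 1.0, size=800)        # stand-in for STOC scores
matches = match_within_caliper(p_population, p_stoc)
stoc_subsample = np.unique(matches[matches >= 0])  # STOC units to re-estimate on
```

Repeating this draw-match-estimate loop 1000 times gives the bootstrap distribution described above.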

4.3. Specification tests

As a test for the randomness of the student–teacher assignment, we can add subject-specific measures of student and teacher attributes to the within-teacher within-student specification. Such extended models also address the question to what extent subject-specific teacher knowledge may capture other subject-specific features correlated with it. Results are reported in Table A5 in Appendix A.

We first introduce an index of subject-specific student motivation (column (1)).19 Subject-specific differences in student motivation may be problematic if they are pre-existing and systematically correlated with the difference in teacher subject knowledge between subjects. On the other hand, they may also be the result of the quality of teaching in the respective subject, so that controlling for differences in student motivation between the two subjects would lead to an underestimation of the full effect of teacher subject knowledge. In any event, although student motivation is significantly associated with student test scores in both subjects, and although the motivation effect is significantly larger in math than in reading, its inclusion hardly affects the estimated effects of teacher subject knowledge in the two subjects.

19 (continued) can choose disagree/sometimes true/agree. The resulting answer is coded as 0 for the answer displaying low motivation, 0.5 for medium motivation, and 1 for high motivation.

Table 3
The effect of teacher test scores in sub-samples.

                                      No                                     Yes                                    Difference
                                      Implied β  Prob>χ2   Obs.   Clusters   Implied β  Prob>χ2   Obs.   Clusters   Diff.     Prob>χ2
                                      (1)        (2)       (3)    (4)        (5)        (6)       (7)    (8)        (9)       (10)

Math:
Student female                        0.078⁎⁎⁎   [0.0002]  2048   205        0.050⁎⁎    [0.043]   1888   301        0.028     [0.346]
Student 1st language Spanish          0.096⁎⁎    [0.025]   530    114        0.060⁎⁎⁎   [0.003]   3406   293        0.036     [0.461]
Urban area                            0.076⁎⁎⁎   [0.001]   1859   203        0.063⁎⁎    [0.014]   2077   113        0.013     [0.710]
Private school                        0.073⁎⁎⁎   [0.0001]  3436   283        −0.022     [0.753]   500    33         0.094     [0.177]
Complete school                       0.071⁎⁎⁎   [0.006]   1823   213        0.067⁎⁎⁎   [0.005]   2113   103        0.005     [0.898]
Teacher male                          0.082⁎⁎⁎   [0.0004]  2100   153        0.039      [0.131]   1836   163        0.043     [0.214]
Teacher university degree             0.087⁎⁎⁎   [0.00003] 2815   227        0.003      [0.931]   1121   89         0.084⁎⁎   [0.030]
Teacher average score above median    0.059⁎⁎    [0.036]   2062   167        0.103⁎⁎⁎   [0.0003]  1874   149        −0.044    [0.263]
Student average score above median    0.057⁎⁎⁎   [0.010]   1970   290        0.076⁎⁎⁎   [0.006]   1966   242        −0.019    [0.585]

Reading:
Student female                        −0.005     [0.846]   2048   205        0.033      [0.180]   1888   301        −0.038    [0.250]
Student 1st language Spanish          0.026      [0.637]   530    114        0.011      [0.551]   3406   293        0.015     [0.795]
Urban area                            0.007      [0.805]   1859   203        0.025      [0.268]   2077   113        −0.018    [0.623]
Private school                        0.016      [0.417]   3436   283        −0.016     [0.745]   500    33         0.032     [0.550]
Complete school                       0.037      [0.219]   1823   213        0.000      [0.990]   2113   103        0.037     [0.320]
Teacher male                          0.024      [0.313]   2100   153        0.005      [0.863]   1836   163        0.019     [0.600]
Teacher university degree             0.020      [0.350]   2815   227        0.003      [0.919]   1121   89         0.017     [0.626]
Teacher average score above median    −0.006     [0.872]   2062   167        0.009      [0.771]   1874   149        −0.015    [0.756]
Student average score above median    −0.017     [0.520]   1970   290        0.035      [0.109]   1966   242        −0.052⁎   [0.100]

Dependent variable: student test score in math and reading, respectively. For each sub-sample, regressions in the two subjects are estimated jointly by seemingly unrelated regressions (SUR). Sample: Same-teacher one-classroom (STOC), stratified in two sub-samples based on whether the characteristic in the head column is true or not. Implied β represents the effect of the teacher test score, given by the difference between the estimate on the teacher test score in the respective subject between the equation of the student test score in the respective subject and the equation of the student test score in the other subject (see Eqs. (3a) and (3b)). Regressions include controls for student gender, student 1st language, urban area, private school, complete school, teacher gender, and teacher university degree. Clustered standard errors (adjusted for clustering at classroom level) in the SUR models are estimated by maximum likelihood. Significance at ⁎⁎⁎1, ⁎⁎5, and ⁎10%.

Next, we introduce subject-specific teacher measures (column (2) of Table A5). First, we add weekly teaching hours to control for the possibility that teachers may prefer to teach more in the subject they know better, or get to know better the subject which they teach more. The effect of teaching hours is, however, not significant in either subject. Second, we add controls for how teachers teach the curriculum in the two subjects.20 Teachers report for each subject whether they use subject-specific books, work books, local, regional, or national guidelines, and other help to design their subject curriculum. The eight indicators of teaching methods enter jointly statistically significantly in both subjects. However, the effect of teacher subject knowledge is again unaffected by controlling for these subject-specific teacher characteristics. Together, the models with subject-specific controls confirm the significant causal effect of teacher subject knowledge in math.

Another concern with the identification, given the relatively large share (13%) of students for whom Spanish is not the main language, may lie in the fact that teachers who are not native Spanish speakers may not only be relatively worse in teaching Spanish compared to math, but may also exert less effort in teaching Spanish (even after controlling for their knowledge of Spanish). As long as the native language is correlated between students and their teachers (due to local concentrations of indigenous peoples), one way to test for such bias is to compare the full-sample estimate to an estimate on the sample of native Spanish-speaking students only (shown in Table 3). The effect of teacher test scores on student achievement in this reduced sample is 0.06 (very close to the full sample) and highly significant, indicating that lower relative effort of non-native Spanish teachers in teaching reading is unlikely to drive our results.

20 EN 2004 asks teachers to describe which items they use to design the subject curriculum: working books, school curriculum, institutional educational projects, regional educational projects, 3rd cycle basic curriculum structure, and/or readjusted curricular programs.

4.4. Interpretation of effect size

In order to assess whether the overall effect stems from an exposure of a student to a teacher for one year or for several years, we can observe in the data how long each teacher has been with the tested class. A caveat with such an analysis is that 6th-grade teachers' score differences between math and reading may be correlated with previous teachers' score differences between subjects, although it is not obvious that this is a substantial issue in the subject-differenced analysis. It turns out that 39% of the STOC sample has been taught by the specific teacher for only one year, an additional 38% for two years, and only the remaining 23% for more than two years.

Table A6 in Appendix A reports results of estimating the model separately for the three sub-groups. The point estimate in the sub-sample of students who have been taught by the teacher for only one year is even larger than the point estimate in our main specification, suggesting that the estimated effect mostly captures a one-year effect. Note that in this sub-sample, there is actually also a (marginally) significant effect of teacher subject knowledge on student achievement in reading. At the same time, the math effects are of a similar magnitude in the sub-samples of teachers that have been with the class for longer than one year. With limited statistical power in these smaller samples, it is however hard to reject that the effect increases with tenure, at least when there is some fadeout of teacher effects (Kane and Staiger, 2008; see Metzler and Woessmann, 2010 for a more extensive discussion).

Table 4
The effect of teacher test scores in sub-samples matching students and teachers on achievement and gender.

Achievement level:
                                          Implied β  Prob>χ2   Obs.   Clusters
Math:
  Teacher low level, student low level    0.091⁎⁎⁎   [0.009]   1132   159
  Teacher low level, student high level   0.017      [0.642]   930    117
  Teacher high level, student low level   0.068      [0.130]   838    131
  Teacher high level, student high level  0.133⁎⁎⁎   [0.002]   1036   125
Reading:
  Teacher low level, student low level    −0.039     [0.358]   1132   159
  Teacher low level, student high level   0.040      [0.472]   930    117
  Teacher high level, student low level   −0.022     [0.612]   838    131
  Teacher high level, student high level  0.020      [0.619]   1036   125

Gender:
                                          Implied β  Prob>χ2   Obs.   Clusters
Math:
  Teacher female, student female          0.093⁎⁎⁎   [0.002]   1012   146
  Teacher female, student male            0.073⁎⁎⁎   [0.010]   1088   148
  Teacher male, student female            −0.014     [0.698]   876    155
  Teacher male, student male              0.083⁎⁎⁎   [0.005]   960    157
Reading:
  Teacher female, student female          0.037      [0.230]   1012   146
  Teacher female, student male            0.011      [0.713]   1088   148
  Teacher male, student female            0.030      [0.390]   876    155
  Teacher male, student male              −0.028     [0.454]   960    157

Dependent variable: student test score in math and reading, respectively. For each sub-sample, regressions in the two subjects are estimated jointly by seemingly unrelated regressions (SUR). Sample: Same-teacher one-classroom (STOC), stratified in four sub-samples based on whether the combined teacher-student characteristic in the head column is true or not. Teacher/student low/high level is defined by whether the teacher/student average test score is below/above the median. Implied β represents the effect of the teacher test score, given by the difference between the estimate on the teacher test score in the respective subject between the equation of the student test score in the respective subject and the equation of the student test score in the other subject (see Eqs. (3a) and (3b)). Regressions include controls for student gender, student 1st language, urban area, private school, complete school, teacher gender, and teacher university degree. Clustered standard errors (adjusted for clustering at classroom level) in the SUR models are estimated by maximum likelihood. Significance at ⁎⁎⁎1, ⁎⁎5, and ⁎10%.

When the estimates of our main specification are viewed as the effects of having been with a teacher of a certain subject knowledge for one year, the measurement-error corrected estimate of the effect in math means that if a student who is taught by a teacher at, e.g., the 5th percentile of the subject knowledge distribution was moved to a teacher with median subject knowledge, the student's math achievement would be raised by 14.3% of a standard deviation by the end of the school year. If one raised all 6th-grade teachers who are currently below the 95th (75th, 50th) percentile on the subject knowledge distribution to that percentile, average student performance would increase by 13.9 (7.0, 3.5) percent of a standard deviation. Additional improvements could be expected from raising teacher knowledge in grades 1 to 5. Of course, this would mean substantial improvements in teacher knowledge, but ones that appear conceivable considering the low average teacher performance in the given developing-country context.
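The first of these thought experiments is easy to reproduce under the simplifying assumption that teacher subject knowledge is standard-normally distributed:

```python
# Moving a teacher from the 5th percentile to the median of a standard
# normal knowledge distribution, scaled by the corrected math effect.
from statistics import NormalDist

beta = 0.087  # measurement-error-corrected math effect
gap = NormalDist().inv_cdf(0.50) - NormalDist().inv_cdf(0.05)  # ≈ 1.645 teacher SDs
gain = beta * gap  # ≈ 0.143, i.e. 14.3% of a student standard deviation
```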

It is also illuminating to compare our estimated size of the effect of teacher subject knowledge to the size of the recent estimates of the overall teacher effect based on relative student test score increases exerted by different teachers. Rockoff (2004, pp. 247–248) concludes for the U.S. school system that a "one-standard-deviation increase in teacher quality raises test scores by approximately 0.1 standard deviations in reading and math on nationally standardized distributions of achievement." Rivkin et al. (2005) find a similar magnitude. It is impossible to relate our results directly to this finding, both because the Peru finding may not generalize to the United States and because the line-up of teachers along the dimension of student test-score increases will not perfectly match the line-up along the dimension of teacher subject knowledge, so that the two metrics do not necessarily match. But it is still interesting to note that our measurement-error corrected estimate of the effect of teacher subject knowledge falls in the same ballpark.

5. Conclusion

The empirical literature analyzing determinants of student learning has tackled the issue of teacher impacts from two sides: measuring the impact of teachers as a whole and measuring the impact of distinct teacher characteristics. Due to recent advances in the first stream of literature using newly available rich panel datasets, it has been firmly established that overall teacher quality is an important determinant of student outcomes, i.e., that teachers differ strongly in their impact on student learning (Hanushek and Rivkin, 2010). The second stream of literature examines which specific teacher characteristics may be responsible for these strong effects and constitute the unobserved bundle of overall teacher quality. Answers to this question are particularly useful for educational administration and policy-making. In the empirical examination of teacher attributes such as education, experience, salaries, test scores, and certification, only teacher knowledge measured by test scores has reasonably consistently been found to be associated with student achievement (Hanushek, 1986; Hanushek and Rivkin, 2006). However, identification of causal effects of teacher characteristics is often hampered by omitted student and teacher characteristics and by non-random selection into classrooms.

This paper proposed a new way to identify the effect of teacher subject knowledge on student achievement, drawing on the variation in subject-matter knowledge of teachers across two subjects and the commensurate achievement across the subjects by their individual students. By restricting the sample to students who are taught by the same teacher in both subjects, and in schools that have only one classroom per grade, this identification approach is able to circumvent the usual bias from omitted student and teacher variables and from non-random selection and placement. Furthermore, possible bias from measurement error was addressed by reverting to psychometric test statistics on reliability ratios of the underlying teacher tests.
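The identification idea can be illustrated with a minimal simulation. This is an illustrative sketch, not the paper's actual data or estimation code: the data-generating process, variable names, and noise levels are all hypothetical. The point it demonstrates is twofold: differencing test scores across subjects within the same student-teacher pair removes any student- or teacher-level factor common to both subjects, and dividing the attenuated estimate by the reliability ratio of the differenced teacher score undoes classical measurement error.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000   # student-teacher pairs (one teacher per student, for simplicity)
beta = 0.09   # true effect of teacher subject knowledge on student achievement

# True teacher knowledge in math and reading, correlated across subjects
common = rng.normal(size=n)
T_math = 0.7 * common + 0.3 * rng.normal(size=n)
T_read = 0.7 * common + 0.3 * rng.normal(size=n)

# Student scores: a student-level factor mu (ability, family background, school)
# enters BOTH subjects and would bias a naive cross-sectional regression
mu = rng.normal(size=n)
y_math = beta * T_math + mu + 0.5 * rng.normal(size=n)
y_read = beta * T_read + mu + 0.5 * rng.normal(size=n)

# Teacher tests are noisy: observed score = true knowledge + classical error
noise_sd = 0.3
S_math = T_math + noise_sd * rng.normal(size=n)
S_read = T_read + noise_sd * rng.normal(size=n)

# Within-teacher within-student differencing: mu cancels from the outcome,
# but measurement error in the regressor attenuates the OLS estimate
dy, dS = y_math - y_read, S_math - S_read
beta_attenuated = np.cov(dy, dS)[0, 1] / np.var(dS)

# Errors-in-variables correction: divide by the reliability of the differenced
# regressor (computed here from the simulated truth; the paper instead takes
# reliability ratios from published psychometric statistics on the tests)
dT = T_math - T_read
reliability = np.var(dT) / np.var(dS)
beta_corrected = beta_attenuated / reliability

print(f"attenuated: {beta_attenuated:.3f}, corrected: {beta_corrected:.3f}")
```

With these parameter choices the differenced regressor has reliability 0.5, so the raw differenced estimate is attenuated to roughly half the true effect, and the correction recovers a value near 0.09.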

Drawing on data on math and reading achievement of 6th-grade students and their teachers in Peru, we find a significant effect of teacher math knowledge on student math achievement, but (with few exceptions among sub-groups) not in reading. A one-standard-deviation increase in teacher math knowledge raises student achievement by about 9% of a standard deviation. The results suggest that teacher subject knowledge is indeed one observable factor that is part of what makes up as-yet unobserved teacher quality, at least in math.

While the results are robust in most sub-populations, we also find interesting signs of effect heterogeneity depending on the teacher-student match. First, the effect does not seem to be present when high-performing students are taught by low-performing teachers. Second, female students are not affected by the level of teacher math knowledge when taught by male teachers. Thus, effects of teacher subject knowledge may depend on the subject and on the ability and gender match between teachers and students. Such aspects of education production are frequently overlooked, and understanding the mechanisms behind them provides interesting directions for future research.

Overall, the results suggest that teacher subject knowledge should be on the agenda of educational administrators and policy-makers. Attention to teacher subject knowledge seems to be in order in recruitment policies, teacher training practices, and compensation schemes. However, additional knowledge about the relative cost of improving teacher knowledge (both of different means of improving teacher knowledge and compared to other means of improving student achievement) is needed before policy priorities can be established.

The results have to be interpreted in the given developing-country context with relatively low academic achievement. Although average student achievement in Peru is close to the average of Latin American countries (LLECE, 2008), it is far below that of the average developed country. For example, when Peruvian 15-year-olds took the Programme for International Student Assessment (PISA) test in 2002, their average math performance was a full two standard deviations below the average OECD country, at the very bottom of the 41 (mostly developed) countries that had taken the test by that time (OECD, 2003). The extent to which the current results generalize to developed countries thus remains an open question for future research.

Appendix A. Supplementary material

Supplementary material to this article can be found online at http://dx.doi.org/10.1016/j.jdeveco.2012.06.002.

References

Altinok, Nadir, Kingdon, Geeta, 2012. New evidence on class size effects: a pupil fixed effects approach. Oxford Bulletin of Economics and Statistics 74 (2), 203–234.

Ammermüller, Andreas, Dolton, Peter, 2006. Pupil–teacher gender interaction effects on scholastic outcomes in England and the USA. ZEW Discussion Paper 06-060. Centre for European Economic Research, Mannheim.

Angrist, Joshua D., Krueger, Alan B., 1999. Empirical strategies in labor economics. In: Ashenfelter, Orley, Card, David (Eds.), Handbook of Labor Economics, vol. 3A. North-Holland, Amsterdam, pp. 1277–1366.

Ashenfelter, Orley, Krueger, Alan B., 1994. Estimates of the economic return to schooling from a new sample of twins. The American Economic Review 84 (5), 1157–1173.

Ashenfelter, Orley, Zimmerman, David J., 1997. Estimates of the returns to schooling from sibling data: fathers, sons, and brothers. The Review of Economics and Statistics 79 (1), 1–9.

Banerjee, Abhijit, Duflo, Esther, 2006. Addressing absence. The Journal of Economic Perspectives 20 (1), 117–132.

Bedi, Arjun S., Marshall, Jeffery H., 2002. Primary school attendance in Honduras. Journal of Development Economics 69 (1), 129–153.

Behrman, Jere R., 2010. Investment in education: inputs and incentives. In: Rodrik, Dani, Rosenzweig, Mark (Eds.), Handbook of Development Economics, vol. 5. North-Holland, Amsterdam, pp. 4883–4975.

Behrman, Jere R., Ross, David, Sabot, Richard, 2008. Improving quality versus increasing the quantity of schooling: estimates of rates of return from rural Pakistan. Journal of Development Economics 85 (1–2), 94–104.

Card, David, 1999. The causal effect of education on earnings. In: Ashenfelter, Orley, Card, David (Eds.), Handbook of Labor Economics, vol. 3A. North-Holland, Amsterdam, pp. 1801–1863.

Carrell, Scott E., Page, Marianne E., West, James E., 2010. Sex and science: how professor gender perpetuates the gender gap. Quarterly Journal of Economics 125 (3), 1101–1144.

Chamberlain, Gary, 1982. Multivariate regression models for panel data. Journal of Econometrics 18 (1), 5–46.

Chaudhury, Nazmul, Hammer, Jeffrey, Kremer, Michael, Muralidharan, Karthik, Rogers, F. Halsey, 2006. Missing in action: teacher and health worker absence in developing countries. The Journal of Economic Perspectives 20 (1), 91–116.

Clotfelter, Charles T., Ladd, Helen F., Vigdor, Jacob L., 2010. Teacher credentials and student achievement in high school: a cross-subject analysis with student fixed effects. The Journal of Human Resources 45 (3), 655–681.

Coleman, James S., Campbell, Ernest Q., Hobson, Carol F., McPartland, James M., Mood, Alexander M., Weinfeld, Frederic D., et al., 1966. Equality of Educational Opportunity. Government Printing Office, Washington, D.C.

Cronbach, Lee J., 1951. Coefficient alpha and the internal structure of tests. Psychometrika 16 (3), 297–334.

Dee, Thomas S., 2005. A teacher like me: does race, ethnicity, or gender matter? The American Economic Review 95 (2), 158–165.

Dee, Thomas S., 2007. Teachers and the gender gaps in student achievement. The Journal of Human Resources 42 (3), 528–554.

Dee, Thomas S., West, Martin R., 2011. The non-cognitive returns to class size. Educational Evaluation and Policy Analysis 33 (1), 23–46.

Ehrenberg, Ronald G., Brewer, Dominic J., 1995. Did teachers' verbal ability and race matter in the 1960s? Coleman revisited. Economics of Education Review 14 (1), 1–21.

Eide, Eric, Goldhaber, Dan, Brewer, Dominic, 2004. The teacher labour market and teacher quality. Oxford Review of Economic Policy 20 (2), 230–244.

Ferguson, Ronald F., 1998. Can schools narrow the black-white test score gap? In: Jencks, Christopher, Phillips, Meredith (Eds.), The Black-White Test Score Gap. Brookings Institution, Washington, D.C., pp. 318–374.

Ferguson, Ronald F., Ladd, Helen F., 1996. How and why money matters: an analysis of Alabama schools. In: Ladd, Helen F. (Ed.), Holding Schools Accountable: Performance-based Reform in Education. Brookings Institution, Washington, D.C., pp. 265–298.

Filmer, Deon, 2007. Educational Attainment and Enrollment around the World. Development Research Group, The World Bank (Peru 2004 data last updated 25 Sep 2007) [accessed 24 March 2012]. Available from econ.worldbank.org/projects/edattain.

Glewwe, Paul, Kremer, Michael, 2006. Schools, teachers, and education outcomes in developing countries. In: Hanushek, Eric A., Welch, Finis (Eds.), Handbook of the Economics of Education, vol. 2. North-Holland, Amsterdam, pp. 945–1017.

Hanushek, Eric, 1971. Teacher characteristics and gains in student achievement: estimation using micro data. The American Economic Review 61 (2), 280–288.

Hanushek, Eric A., 1986. The economics of schooling: production and efficiency in public schools. Journal of Economic Literature 24 (3), 1141–1177.

Hanushek, Eric A., 1992. The trade-off between child quantity and quality. Journal of Political Economy 100 (1), 84–117.

Hanushek, Eric A., 1997. Assessing the effects of school resources on student performance: an update. Educational Evaluation and Policy Analysis 19 (2), 141–164.

Hanushek, Eric A., Rivkin, Steven G., 2006. Teacher quality. In: Hanushek, Eric A., Welch, Finis (Eds.), Handbook of the Economics of Education, vol. 2. North-Holland, Amsterdam, pp. 1051–1078.

Hanushek, Eric A., Rivkin, Steven G., 2010. Generalizations about using value-added measures of teacher quality. The American Economic Review 100 (2), 267–271.

Hanushek, Eric A., Woessmann, Ludger, 2011. The economics of international differences in educational achievement. In: Hanushek, Eric A., Machin, Stephen, Woessmann, Ludger (Eds.), Handbook of the Economics of Education, vol. 3. North-Holland, Amsterdam, pp. 89–200.

Hanushek, Eric A., Woessmann, Ludger, 2012. Schooling, educational achievement, and the Latin American growth puzzle. Journal of Development Economics 99 (2), 497–512.

Harbison, Ralph W., Hanushek, Eric A., 1992. Educational performance of the poor: lessons from rural northeast Brazil. A World Bank Book. Oxford University Press, Oxford.

Hargreaves, Eleanore, Montero, C., Chau, N., Sibli, M., Thanh, T., 2001. Multigrade teaching in Peru, Sri Lanka and Vietnam: an overview. International Journal of Educational Development 21 (6), 499–520.

Kane, Thomas J., Staiger, Douglas O., 2008. Estimating teacher impacts on student achievement: an experimental evaluation. NBER Working Paper 14607. National Bureau of Economic Research, Cambridge, MA.

Laboratorio Latinoamericano de Evaluación de la Calidad de la Educación (LLECE), 2008. Los Aprendizajes de los Estudiantes de América Latina y el Caribe: Primer Reporte de los Resultados del Segundo Estudio Regional Comparativo y Explicativo. Oficina Regional de Educación de la UNESCO para América Latina y el Caribe (OREALC/UNESCO), Santiago, Chile.

Lavy, Victor, 2010. Do differences in school's instruction time explain international achievement gaps in math, science, and reading? Evidence from developed and developing countries. NBER Working Paper 16227. National Bureau of Economic Research, Cambridge, MA.

Lavy, Victor, 2011. What makes an effective teacher? Quasi-experimental evidence. NBER Working Paper 16885. National Bureau of Economic Research, Cambridge, MA.

Luque, Javier, 2008. The Quality of Education in Latin America and the Caribbean: the Case of Peru. Abt Associates, San Isidro.

Metzler, Johannes, Woessmann, Ludger, 2010. The impact of teacher subject knowledge on student achievement: evidence from within-teacher within-student variation. CESifo Working Paper 3111. CESifo, Munich.

Murnane, Richard J., Phillips, Barbara R., 1981. What do effective teachers of inner-city children have in common? Social Science Research 10 (1), 83–100.

Organisation for Economic Co-operation and Development (OECD), 2003. Literacy Skills for the World of Tomorrow: Further Results from PISA 2000. OECD, Paris.

Rivkin, Steven G., Hanushek, Eric A., Kain, John F., 2005. Teachers, schools, and academic achievement. Econometrica 73 (2), 417–458.

Rockoff, Jonah E., 2004. The impact of individual teachers on student achievement: evidence from panel data. The American Economic Review 94 (2), 247–252.

Rockoff, Jonah E., Jacob, Brian A., Kane, Thomas J., Staiger, Douglas O., 2011. Can you recognize an effective teacher when you recruit one? Education Finance and Policy 6 (1), 43–74.

Rothstein, Jesse, 2010. Teacher quality in educational production: tracking, decay, and student achievement. Quarterly Journal of Economics 125 (1), 175–214.

Rowan, Brian, Chiang, Fang-Shen, Miller, Robert J., 1997. Using research on employees' performance to study the effects of teachers on students' achievement. Sociology of Education 70 (4), 256–285.

Schwerdt, Guido, Wuppermann, Amelie C., 2011. Is traditional teaching really all that bad? A within-student between-subject approach. Economics of Education Review 30 (2), 365–379.


Summers, Anita A., Wolfe, Barbara L., 1977. Do schools make a difference? The American Economic Review 67 (4), 639–652.

Tan, Jee-Peng, Lane, Julia, Coustère, Paul, 1997. Putting inputs to work in elementary schools: what can be done in the Philippines? Economic Development and Cultural Change 45 (4), 857–879.

Wayne, Andrew J., Youngs, Peter, 2003. Teacher characteristics and student achievement gains: a review. Review of Educational Research 73 (1), 89–122.

World Bank, 2007. Toward high-quality education in Peru: standards, accountability, and capacity building. A World Bank Country Study. The World Bank, Washington, D.C.

