Teachers and the Gender Gap in Reading Achievement∗
Esteban M. Aucejo†
Arizona State University
Jane Cooley Fruehwirth‡
University of North Carolina
Sean Kelly§
University of Pittsburgh
June 30, 2020
Abstract
Boys persistently lag behind girls in English/language arts. We find that heterogeneity in teachers’ relative boy-specific value-added explains a large proportion of this gap. We exploit multifaceted measures of effective teaching, including popular teacher observation protocols, principal ratings and student perceptions of teaching practices to explain this heterogeneity. We find no evidence of heterogeneous effects of these teacher measures by gender. Instead, we show that gender gaps in student evaluations of teaching practices capture meaningful differences in the quality of instruction boys and girls receive from the same teacher, explaining from a third to all of the value-added gender gap.
Keywords: gender gap, teaching practices, teacher effectiveness
JEL Classification Codes: I2, I20, I21
∗This research was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A170269 to University of North Carolina, Chapel Hill. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education. We thank Robert Bringe for his excellent research assistance and Ken Bollen, Cassie Guarino, Laura Hamilton, Spyros Konstantopoulos, and Lindsay Matsumara for helpful comments.
†Dept of Economics, Arizona State University, CEP & NBER. Esteban.Aucejo@asu.edu
‡Dept of Economics & Carolina Population Center, UNC. jane fruehwirth@unc.edu
§School of Education, University of Pittsburgh. spkelly@pitt.edu
1 Introduction
Boys persistently lag behind girls in reading achievement by as much as 0.29 of a standard
deviation in National Assessment of Educational Progress scores, which is associated with
approximately a year of learning (Loveless, 2015). Reardon et al. (2016) find that the gap
in state standardized tests of reading achievement (or English language arts (ELA)) is on
average 0.23 of a standard deviation or about two-thirds of a year of learning. Reading
skills are essential building blocks for learning, and early struggles with basic literacy skills
set the stage for a process of cumulative disadvantage (Chatterji, 2006; Northrop, 2017;
Senechal and LeFevre, 2002). By the end of high school, verbal skills are shown to be more
important determinants of college attendance than math skills (Aucejo and James, 2016).
Given the importance of teachers to student learning (Koedel et al., 2015; Jackson et al.,
2014; Konstantopoulos, 2014), we study how much teachers contribute to the gender gap in
ELA performance in elementary school.
To this end, we implement a multi-pronged approach that relies on multiple measures
of teacher and student performance. First, we make use of student test scores to recover
teacher gender-specific value-added (VA) estimates. In particular, we explore the extent to
which certain teachers are more effective with boys relative to girls by documenting how
their value-added varies across gender groups using the estimator proposed in Chetty et al.
(2014) broken out by student gender. This analysis informs the overall teacher potential
to close the ELA gap. Second, we exploit rich objective measures of teacher effectiveness,
based on popular protocols designed to assess effective instruction using trained raters (i.e.
CLASS and FFT, along with the ELA-specific protocol, PLATO), to determine whether
boys and girls receive different marginal benefits from having teachers who score highly
on these evaluations. Third, we exploit summative measures of teacher quality/knowledge,
based on principal evaluations of teacher overall quality, teacher training, experience, and
the Content Knowledge for Teaching Assessment of teacher knowledge, to see whether boys
or girls have different marginal benefits from teacher quality/knowledge. Fourth, we exploit
data on student evaluations of teaching practices/effectiveness based on Ferguson’s Tripod
7C’s (Ferguson, 2008) to assess whether boys and girls receive different teaching practices
within the classroom.1 Based on this diverse and rich set of teacher performance measures,
we believe this paper is the first to use a large-scale and multi-perspective database to
robustly interrogate how instructional processes might mediate gendered learning outcomes.
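As a rough illustration of the first step, the sketch below computes a naive teacher-by-gender value-added: scores are residualized on a prior score, and the residuals are averaged within each teacher-gender cell. This is a deliberately simplified stand-in and not the Chetty et al. (2014) estimator used in the paper (it has no shrinkage, no drift adjustment and only one control). The function names and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def ols_residuals(y, x):
    """Residuals from a one-regressor OLS of y on x with an intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def gender_specific_va(rows):
    """rows: (teacher, is_male, prior_score, current_score) tuples.

    Returns the mean residualized score per (teacher, is_male) cell.
    """
    resid = ols_residuals([r[3] for r in rows], [r[2] for r in rows])
    cells = defaultdict(list)
    for (teacher, is_male, _, _), e in zip(rows, resid):
        cells[(teacher, is_male)].append(e)
    return {cell: mean(values) for cell, values in cells.items()}


# Hypothetical toy data: teacher A raises boys' scores, teacher B lowers them.
rows = [
    ("A", True, -1.0, -0.8), ("A", True, 1.0, 1.2),
    ("A", False, -1.0, -1.0), ("A", False, 1.0, 1.0),
    ("B", True, -1.0, -1.2), ("B", True, 1.0, 0.8),
    ("B", False, -1.0, -1.0), ("B", False, 1.0, 1.0),
]
va = gender_specific_va(rows)
# Relative boy-specific value-added: boys' VA minus girls' VA for each teacher.
relative_va = {t: va[(t, True)] - va[(t, False)] for t in ("A", "B")}
print(relative_va)
```

In this toy example the two teachers have the same value-added for girls but differ for boys, which is the kind of heterogeneity the paper's gender-specific estimates are meant to capture.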
In our specifications, we focus on gender gaps in ELA value-added, conditioning both on
prior math and ELA performance (consistent with the literature measuring teacher value-
added) in order to help isolate the effect of the current teacher. We refer to this throughout
the paper as the gender gap in ELA. This gap is 0.08 of a standard deviation in test scores,
and while still sizable, it is only half the size of the raw gender gap in ELA. Even after focusing
on value-added, several identification challenges remain for determining the role of teachers
in explaining this gap. First, if boys (relative to girls) are systematically matched to certain
types of teachers, we may confound teacher effects with student characteristics. Balancing
tests suggest that this is not the case: none of the rich measures of effective teaching predict
student gender. Moreover, we also find that estimates of the achievement gap remain robust
after controlling for school-grade fixed effects, measures of classroom composition and class
fixed effects. The gender gap would not be stable across these regressions if boys were
systematically matched to less effective teachers.2
1Tripod is designed to capture multiple dimensions of teaching, including several aspects of the student-teacher relationship. For instance, teachers and boys may fundamentally struggle to relate to each other in the same ways as teachers and girls do, and therefore boys may be less happy in class or find school-work less interesting.
2It is important to clarify that while a sub-sample of the MET data collects information from teachers that are randomly allocated into classrooms within school-grade (i.e. randomization blocks), our analysis does not rely on it (for most of the analysis) for two main reasons. First, a high compliance “random sample” (which involves a sufficiently large share of students not re-shuffling across classes after the random allocation of teachers) would require reducing our final sample by more than 60%, substantially decreasing the variation in the data. Second, matching based on gender is less concerning given that classes tend to be gender balanced. Therefore, this empirical regularity, combined with the fact that our specifications control for student lagged reading achievement and school-grade fixed effects (or class fixed effects when possible), makes the concern about matching based on gender much less important. Nevertheless, we also present results (corresponding to our main specification) based on the random sample. We find that the key coefficients of interest show similar magnitudes to those obtained from the non-random sample; however, the standard errors become much larger.
Second, measurement error in our teacher-effectiveness-related measures may make it difficult
to detect heterogeneous effects of teachers by student gender. This may be particularly
problematic for the observation protocols (CLASS, FFT and PLATO), where inter-rater
reliability suggests significant measurement error (Kelly et al., 2020). To overcome this problem,
we implement an instrumental variable approach where contemporaneous teacher-related
measures are instrumented with lagged measures based on the previous class the teacher
taught.
Finally, to use student survey responses to analyze whether teaching practices are applied
differently to boys and girls in the same classroom we must deal with the important issue of
confounding unobservable characteristics of the student (e.g., student bias) with actions of
the teacher. For instance, if a student reports that schoolwork is less engaging, this could
be because he/she does not like school regardless of the teacher or it could relate to what
the teacher is doing in the classroom. We separate the teacher effect from the student effect
by instrumenting the student report with the average rating of the teacher for boys and
girls from the teacher’s classroom in the previous year. Absent matching within schools,
these instruments are independent of student unobservable characteristics that might also
determine their survey responses and ELA achievement. Because we are over-identified, the
test of overidentifying restrictions can provide support that matching on unobservables is
not driving our findings.
We find that the heterogeneity in teachers’ relative male-specific value-added is of equal
magnitude to the gender gap, suggesting that some teachers do a particularly
teacher male-specific value-added is 0.08, which is comparable to overall estimates of teacher
value-added in the literature. Despite these large differences in male-specific value-added, we
find no evidence that the marginal benefits of teacher effectiveness vary by gender, whether
we measure teacher effectiveness through observation protocols of effective instruction, prin-
cipal survey ratings, average student survey ratings or teacher aptitude and even after dealing
with potential measurement error. We also find no evidence of different marginal benefits
of teacher gender or experience for boys. While this is bad news for explaining the gender
gap, on the positive side it suggests that popular teacher evaluation protocols may not be
systematically biased toward practices that favor girls over boys.
Finally, we find gender disparities in Tripod evaluations to be a striking feature of the
MET data, consistent with prior research suggesting that boys tend to evaluate teachers
less positively than girls (e.g. Entwisle et al., 1997). However, we break from the literature
in suggesting that these gaps may reflect real differences in experience that translate into
learning, rather than student bias.3 After isolating the teacher-related component of these
gaps, we find that gender gaps in teaching practices within the classroom explain from a third
to all of the ELA gap, depending on the domain of instruction. Moreover, we present evidence
that the two leading practices for explaining the achievement gap, captivate (e.g.,
homework and schoolwork are interesting) and confer (e.g., “my teacher wants me to explain
my answers”), contribute to student engagement as well as ELA.4 Overall, our findings suggest
that teachers could play an important role in closing the gender gap in ELA by applying
practices that enhance boys’ engagement with schooling.
Our study relates to a significant body of research that attempts to explain gender
3For example, Mengel et al. (2019) show that women university instructors receive systematically lower teaching evaluations (from their students) than their male colleagues.
4Appendix Table A.1 provides the questions that make up these domains.
achievement gaps. First, Reardon et al. (2016) documents substantial heterogeneity in gen-
der gaps across school districts, which could suggest a role for school policies in contributing
to literacy gaps. One explanation that has received some support in the literature is that
males learn differently from females (Gurian and Stevens, 2004; Gurian, 2010). This would
suggest that some teaching practices might benefit females more than males and vice versa.
However, we do not find that instructional processes promoted by current teacher observation
and evaluation protocols (i.e. CLASS, FFT and/or PLATO) inadvertently favor females.
Second, our research connects with the literature on gender gaps in socio-emotional skills.
For example, Bertrand and Pan (2013), Figlio et al. (2019) and Aucejo and James (2019) show
that gender differences in non-cognitive skills play an important role in explaining gender
gaps in schooling progression. In a similar vein, Cornwell et al. (2013) concludes that girls
display a more developed attitude toward learning, which is consistent with the lower levels
of boys’ schooling engagement that we find in our data. However, these studies do not
provide an analysis of how to explicitly overcome gender differences in skills. Our results
contribute by focusing on how teachers, through their instruction, could impact boys more
effectively and compensate for their lower levels of school engagement.
Third, a strand of the literature argues that gender interactions between teachers and
students have an important effect on educational outcomes. In particular, Dee (2007) shows
that a large fraction of the gender gap in reading reflects the classroom dynamics associated
with the fact that boys’ reading teachers are mainly females. While we do not recover a
similar pattern in our data, our research contributes by focusing on the teaching practices
that might be used to better engage boys regardless of teacher gender.5 By identifying the
likely specific channels through which some teachers are more successful at teaching boys,
we open up the possibility of training teachers to be more effective with boys rather than
5We do not have many male teachers in our sample, which could explain the differences in findings.
relying solely on the more limited policy of matching boys to male teachers.6
Finally, the literature has also explored the role of teacher bias. If present, bias is perhaps
most readily revealed in grading practices, where teachers have discretion. Research using
blind vs non-blind classroom assessments finds mixed evidence on whether males or females
are favored by teachers. Lavy and Sand (2015) show that teacher bias (measured as blind
(national) vs non-blind (classroom) assessment) helps explain the math/science gender gap
in favor of males. Terrier (2015) argues that females benefit from positive discrimination in math
but not French (using blind vs non-blind assessments). However, given that we are focusing on state
assessments, our analysis does not speak to direct bias in the form of reduced ratings of
students’ academic work. Nevertheless, bias could operate indirectly, and more pervasively,
in ways that detract from learning. For instance, if boys consistently receive less enthusiastic
feedback during class discussions in ELA, they may perceive that they are not good at ELA
tasks and underperform, and subsequently get even less attention from the teacher.7 Our
findings do not directly analyze teacher bias, other than to acknowledge that it is one of
possibly many ways that teachers may differ in their capacity to teach boys. Disparities in
teacher attention within the classroom could be a result of other factors, such as teacher
training and optimal responses to student effort/behavior.8
The rest of the paper proceeds as follows. We first describe the data in Section 2, including
6The literature has also pointed out the role of the cultural environment. Legewie and DiPrete (2012) argue that boys’ underachievement derives from cultural norms that define school achievement, and especially ELA content, as not masculine. As one of Smith and Wilhelm (2002)’s participants put it, “Reading Don’t Fix No Chevys.” The lower levels of male engagement with schooling activities that we find in our database are consistent with this explanation; however, our goal is to identify whether and how teachers can also compensate for those cultural mandates.
7Bassi et al. (2018) show, in the context of the Chilean education system, that teachers show an imbalance in their attention and interactions that favors boys.
8Jackson (2016) finds some evidence that teachers have higher ratings on individual attention in single-sex schools and higher warmth for both males and females, which he interprets as evidence of focus effects of teachers in single-sex settings and could point to a gendered aspect of teaching and relate to our findings in interesting ways.
all our measures of teacher effectiveness. Section 3 further characterizes the gender gap in our
sample and presents teacher value-added estimates by gender. Section 4 studies the presence
of heterogeneous teacher effects. Section 5 analyzes how teaching practices are applied to
females and males within the classroom and their impact on the gender gap. Section 6
explores plausible mechanisms. Finally, Section 7 concludes.
2 Data
We use the Measures of Effective Teaching (MET) Longitudinal Database for its extensive
information about student outcomes, teacher evaluation protocols, student assessment of
teachers and classroom composition. The data come from six large urban public school
districts in the United States over two academic years (2009-2010 and 2010-2011).9 The
data are also linked to administrative records with detailed information about students and
teachers. For students, we have current and prior measures of achievement based on state
standardized test scores and background characteristics–age, race/ethnicity, gender, gifted
status, and English language learner (ELL) status. Students are linked to their teachers.
We also have self-reported measures of engagement, including whether the student is happy
in class, homework completion and effort, as described further in Table A.1.
We focus on elementary students (grades 4 to 5) since they are primarily in self-contained
classrooms and would have more sustained exposure to their given teacher. In addition, we
rely on the second year of the data (2010/11) because the availability of lagged measures
of teacher effectiveness is an important part of our identification strategy. This yields
9The original 6 districts include New York City Department of Education, Charlotte-Mecklenburg Schools, Denver Public Schools, Memphis City Schools, Dallas Independent School District, and Hillsborough County Public Schools, but only 5 have video observation data of teaching practice. Kane and Staiger (2012) provides a detailed description of how schools were selected to participate in the MET project. More importantly, Kane and Staiger (2012) argues that MET teachers are comparable by most measures to their non-MET peers in the district, suggesting that they are representative of the districts included.
a potential sample of 13,552 students. We restrict the sample to include only students
with available information from the Tripod survey, which drops 2734 observations. We lose
an additional 929 observations because of missing ELA or math test scores. We lose an
additional 20 observations because the gender or age of the student is not reported. We also
limit the sample to students that have an ELA teacher observed in the MET study, which
drops an additional 1277 observations. Finally, we lose 3 additional observations for lack
of classmates after these restrictions. This brings the estimation sample to 8589 students.
Randomization of teachers to classrooms was an important part of the MET study, but this
only applies to a significantly smaller part of the sample, about a third to a half, depending
on whether one restricts to high compliance randomization blocks. We choose here not to
focus on the randomization sample because of the loss in power and because we judge it
unnecessary for identification of our question.
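The sample restrictions described above can be checked with simple arithmetic. The short sketch below reproduces the sequence of drops reported in the text and confirms that it arrives at the stated estimation sample, which also equals the sum of the male and female sample sizes reported in Table 1. The variable names and the wording of the drop reasons are ours.

```python
# Sample restrictions as reported in the text of Section 2.
potential_sample = 13552
drops = [
    ("no Tripod survey information", 2734),
    ("missing ELA or math test scores", 929),
    ("gender or age not reported", 20),
    ("no ELA teacher observed in the MET study", 1277),
    ("no classmates after the restrictions", 3),
]

remaining = potential_sample
for reason, dropped in drops:
    remaining -= dropped
    print(f"{reason:<42} -{dropped:>5}  -> {remaining}")

estimation_sample = remaining
# Core sample sizes by gender, taken from the notes to Table 1.
males, females = 4227, 4362
print(estimation_sample, males + females)
```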
2.1 Measuring Teacher Effectiveness
The MET includes rich data on diverse measures of teacher effectiveness, ranging from observation
protocols designed to measure effective instruction to summative assessments of teacher
quality/knowledge, teacher characteristics and student evaluations of teaching practices. The
Content Knowledge for Teaching (CKT) assessment measures the teachers’ ELA knowledge
and specialized content knowledge to teach ELA effectively. Principals are also asked to rate
the teacher’s overall effectiveness on a scale of 1 to 7. Finally, the data also include measures
of teacher experience, which has been shown frequently to matter for student achievement,
but it is only measured based on experience in the district and is therefore very noisy. We
describe the observation-based and student-survey based measures further here.
Teacher Observation Protocol For this analysis we focus on three observation protocols
including Framework for Teaching (FFT), Classroom Assessment Scoring System (CLASS)
and the Protocol for Language Arts Teaching Observation (PLATO). These instruments
provide scores from trained raters on several dimensions of teaching. They are all based
on rigorous research and are aligned with established teaching practices and standards. As
such, we refer to these as measures of effective instruction. Trained raters scored teacher
lessons by watching video recordings and reliability checks were performed at different times
to test and ensure that videos were being appropriately rated.
FFT was designed for use in a variety of academic subjects and grade levels as an in-
strument for general teaching principles. It includes four domains, 2 of which are evaluated
in MET: (1) the classroom environment and (2) instruction. These domains were scored
on a total of eight components (sub-domains) on a four-point scale: unsatisfactory, basic,
proficient, or distinguished (Danielson, 2011).
CLASS is a standardized observational system that, like FFT, is designed for use in a
variety of subjects and grades as an instrument for general teaching principles (Hamre et
al., 2013). CLASS focuses in particular on the quality of teacher-student interactions and
is organized into three domains: (1) emotional support, (2) classroom organization, and (3)
instructional support. Teachers are rated on a 7-point scale labeled simply from low to high.
While the constructs overlap with those measured by FFT, CLASS has an especially strong
focus on emotional support, which it defines at the domain level (Hamre et al., 2013).
PLATO is a classroom observation tool designed to assess fourth to ninth grade ELA in-
struction (Grossman et al., 2013). PLATO focuses on 13 elements of instruction, 8 of which
are included in MET: intellectual challenge, modeling, strategy use and instruction, guided
practice, classroom discourse, text-based instruction, behavior management and time man-
agement. The elements are scored on a 4-point scale (almost no evidence, limited evidence,
evidence with some weaknesses, consistent strong evidence). While there is conceptual over-
lap with other observational protocols, the PLATO elements have their origin specifically in
research on English language arts instruction and are closely linked specifically with literacy
learning.
We create standardized versions of these measures to use in the analysis by averaging
across the different domains and raters. Each teacher was rated by multiple trained raters
and each domain is made up of several subdomains. Averaging over these multiple measures
helps to deal with measurement error, though in Section 4 we further discuss the potential for
measurement error to bias our results and our empirical strategy for testing this.10
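To see why measurement error matters, and why a lagged measure can serve as an instrument as described in the Introduction, consider the following simulation sketch. It assumes (our assumptions, not results from the paper) that a teacher's true quality is constant across years, that the current and lagged observation scores equal true quality plus independent noise, and that the true effect on achievement is 0.5. Under these assumptions OLS on the noisy current score is attenuated toward zero, while the simple IV (Wald) ratio recovers the true effect.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


random.seed(0)
n = 20000          # hypothetical number of observations
beta = 0.5         # assumed true effect of teacher quality on achievement

true_quality = [random.gauss(0, 1) for _ in range(n)]
# Noisy measures: the rating error has the same variance as true quality.
current_score = [t + random.gauss(0, 1) for t in true_quality]
lagged_score = [t + random.gauss(0, 1) for t in true_quality]
achievement = [beta * t + random.gauss(0, 1) for t in true_quality]

# OLS slope of achievement on the noisy current score (attenuated).
ols = cov(achievement, current_score) / cov(current_score, current_score)
# IV estimate using the lagged score as the instrument.
iv = cov(achievement, lagged_score) / cov(current_score, lagged_score)

print(f"OLS: {ols:.3f}  IV: {iv:.3f}  true effect: {beta}")
```

With equal signal and noise variances, the OLS slope is expected to be about half the true effect, whereas the IV ratio is centered on it, because the errors in the two measures are independent by construction.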
Student Survey Data Our student-survey-based measures of teaching practices come
from Ferguson (2008)’s Tripod or 7C’s survey. The survey was designed to measure student
perceptions of classroom instruction in 7 dimensions, including care, control, clarify, chal-
lenge, captivate, confer, and consolidate. Care measures students’ perceptions of whether
they feel encouraged and cared for by the teacher. Control measures student perceptions
about classroom behavior. Clarify measures student perceptions of teaching practices aimed
at helping them better understand classroom material. Challenge measures student percep-
tions of rigor and effort needed in the classroom. Captivate measures whether students find
schoolwork and homework to be interesting/enjoyable. Confer measures how much students
perceive they are encouraged to participate in class. Finally, Consolidate measures whether
students perceive that teachers explain how they can do better and summarize what they
learn each day.
Each dimension of the Tripod survey is associated with a set of statements (anywhere
from 2 to 8 statements for a given dimension). Students rate their agreement on a five-point
scale (1-Totally Untrue to 5-Totally True). Appendix Table A.1 describes the full set
of survey questions for each dimension. We create standardized versions of each of the 7
domains by taking averages across the responses of a given student and then standardizing
them. We also create a composite overall 7C score by averaging across all domains and then
10We also considered each subdomain separately, but focus only on the overall rating as the results were similar across subdomains.
standardizing.11
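The construction of the standardized Tripod measures can be sketched as follows. The example uses hypothetical item-level responses for two of the seven domains, averages the items within each domain for each student, standardizes each domain across students, and builds the composite by averaging the raw domain scores and standardizing the result. Whether the standardization uses the population or the sample standard deviation is not stated in the text; the population version is used here as an assumption.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores (mean 0, population SD 1) for a list of numbers."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# Hypothetical item responses (1 = Totally Untrue ... 5 = Totally True).
responses = {
    "s1": {"captivate": [5, 4], "confer": [4, 4, 5]},
    "s2": {"captivate": [3, 3], "confer": [4, 3, 3]},
    "s3": {"captivate": [2, 3], "confer": [5, 5, 4]},
}
students = sorted(responses)
domains = ["captivate", "confer"]

# Step 1: average the items within each domain for each student.
raw = {d: [mean(responses[s][d]) for s in students] for d in domains}
# Step 2: standardize each domain across students.
z = {d: standardize(raw[d]) for d in domains}
# Step 3: composite = average across domains, then standardize.
composite_raw = [mean(raw[d][i] for d in domains) for i in range(len(students))]
composite = standardize(composite_raw)

for i, s in enumerate(students):
    print(s, {d: round(z[d][i], 2) for d in domains}, round(composite[i], 2))
```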
2.2 Summary Statistics
Table 1 provides summary statistics for student characteristics, test scores and engagement
(Panel A), classroom characteristics (Panel B), and Tripod responses (Panel C). The first
3 columns present the means, standard deviations and number of observations for males,
the next 3 for females and the last 2 columns the difference in means and a test of
whether the boy/girl means are statistically significantly different from each other.12 Girls
have statistically significantly higher test scores for ELA and lagged ELA (i.e. 0.16 of a
standard deviation in test scores) and demonstrate higher engagement with school by all 3
measures.13 That said, we do not see that girls are more likely to be placed in gifted classes
or that they are less likely to be designated English language learners. Most aspects of
student background are similar for boys and girls as might be expected, with the exception
that boys are statistically significantly older on average by .07 of a year (possibly consistent
with higher rates of retention or red-shirting), suggesting that controlling for age may be
important.14
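The missing-value technique mentioned in footnote 14 (replace missing values with 0 and control for an indicator that the value is missing) can be sketched in a few lines. The helper name and the example values are hypothetical; in Table 1 the free or reduced price lunch variable is one control with fewer observations than the core sample.

```python
def add_missing_indicator(values):
    """Replace missing entries (None) with 0 and return a 0/1 missing flag.

    Both returned lists would enter the regression as controls, so that the
    coefficient on the filled variable is identified from observed values only.
    """
    filled = [0.0 if v is None else float(v) for v in values]
    missing = [1 if v is None else 0 for v in values]
    return filled, missing


# Hypothetical FRPL indicator with two missing entries.
frpl = [1, 0, None, 1, None, 0]
frpl_filled, frpl_missing = add_missing_indicator(frpl)
print(frpl_filled)
print(frpl_missing)
```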
Figure 1 presents K-density plots for ELA and math achievement separately for males
11Since many teachers in the former grades taught both math and ELA, these classes were randomly split, with half of the class filling out the survey for math and the other half emphasizing ELA. Because we did not see much evidence that student responses varied by subject and our results were similar, we decided not to distinguish in the results we report here whether a student was reporting for ELA or math.
12Appendix Table A.2 shows analogous summary statistics but for the unrestricted sample (i.e. 13,552 students). A comparison between samples indicates that the distribution of most demographics and gender differences follow very similar patterns.
13This is consistent with previous findings in the literature (Cornwell et al., 2013; Bertrand and Pan, 2013; Figlio et al., 2019; Aucejo and James, 2019).
14Note that some of these control variables have fewer observations due to missing values. In regressions, we will control for this by using the standard technique of replacing missing values with 0’s and controlling for an indicator that they are missing.
and females. Girls outperform boys in ELA across the distribution. By contrast, the math
density plot shows similar performance for males and females across the distribution. We test
for statistical differences in the density for females and males using the Kolmogorov-Smirnov
test and find that they are statistically significantly different for ELA with a p-value below
0.001, but not for math, with a p-value of 0.15.
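For readers unfamiliar with the test used above, the two-sample Kolmogorov-Smirnov statistic is the largest vertical distance between the two empirical distribution functions. The sketch below computes only the statistic, not the p-value, on small made-up samples; it is not a reproduction of the paper's calculation, and the function name is ours.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: max |F_a(x) - F_b(x)| over observed points."""
    a, b = sorted(a), sorted(b)

    def ecdf(sample, x):
        # Share of the (sorted) sample that is less than or equal to x.
        return bisect_right(sample, x) / len(sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)


# Hypothetical standardized scores for two small groups.
group_1 = [1, 2, 3, 4]
group_2 = [3, 4, 5, 6]
d = ks_statistic(group_1, group_2)
print(d)
```

A large statistic indicates that the two distributions differ somewhere along their range, which is why the test can detect a gap "across the distribution" rather than only at the mean.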
Panel B of Table 1 shows that classroom characteristics are quite balanced between males
and females, suggesting that non-random assignment of students to classrooms based on
gender may not be an issue.15 Table 2 shows summary statistics of teacher characteristics and
measures of teacher effectiveness for boys and girls.16 Most importantly, the characteristics
and measures of teacher effectiveness do not differ significantly between boys and girls,
suggesting again that boys and girls do not seem to be systematically assigned to different
teachers or classroom characteristics. Overall, a little under 10% of the students in the sample
have a male teacher. About a third of the teachers are black and only 6% are Hispanic,
despite black and Hispanic students each making up about a third of the population.17
Figure 1: ELA and Math Densities for Males and Females
[Figure: two density panels, "Density for ELA Achievement by Gender" and "Density for Math Achievement by Gender". Horizontal axes: ELA Achievement and Math Achievement (−4 to 4). Vertical axes: Density (0 to .4). Series: Male, Female.]
15The only difference is in percent male, which is by construction given that it excludes the student’s own observation, so will be higher for females. The balancing tests in Table 3 show that boys are not systematically assigned to different types of classrooms.
16Appendix Table A.3 shows analogous summary statistics but for the unrestricted sample (i.e. 13,552 students). We do not find evidence suggesting large differences between samples.
17Appendix Table A.4 shows the correlations between these measures.
Table 1: Student Summary Statistics by Gender

                                       Male               Female           Male–Female
Variable                          Mean    SD     N    Mean    SD     N   Diff.  P-value

Panel A: Student Characteristics
ELA(2009)                         0.01  0.97  4227    0.17  0.93  4362   -0.16     0.00
ELA(2010)                         0.03  0.97  4227    0.19  0.91  4362   -0.16     0.00
Effort                            3.97  1.07  4225    4.15  1.01  4356   -0.18     0.00
Happy                             3.94  1.05  4194    4.10  1.01  4343   -0.15     0.00
Homework Complete                 0.73  0.44  4113    0.81  0.39  4285   -0.08     0.00
Age                               9.41  0.93  4227    9.34  0.91  4362    0.07     0.00
Gifted                            0.09  0.29  4227    0.10  0.30  4362   -0.01     0.37
English Language Learner (ELL)    0.14  0.35  4227    0.13  0.34  4362    0.01     0.06
Free Reduced Price Lunch (FRPL)   0.49  0.50  3020    0.50  0.50  3113   -0.01     0.56
White                             0.25  0.43  4199    0.24  0.43  4305    0.00     0.72
Black                             0.41  0.49  4199    0.41  0.49  4305    0.00     0.76
Hispanic                          0.26  0.44  4199    0.26  0.44  4305    0.00     0.74
Asian                             0.07  0.25  4199    0.06  0.24  4305    0.01     0.29
Race Other                        0.02  0.15  4199    0.02  0.15  4305    0.00     0.44
Grade Level                       4.54  0.50  4227    4.54  0.50  4362    0.01     0.63

Panel B: Class Characteristics
Avg. Lag Math                     0.06  0.51  4227    0.08  0.51  4362   -0.02     0.09
Avg. Lag ELA                      0.06  0.50  4227    0.08  0.50  4362   -0.02     0.07
Avg. Age                          9.40  0.82  4227    9.40  0.82  4362    0.00     0.94
% Male                            0.50  0.10  4227    0.50  0.10  4362   -0.01     0.01
% Black                           0.42  0.36  4199    0.41  0.36  4305    0.01     0.49
% Hispanic                        0.26  0.26  4199    0.25  0.26  4305    0.00     0.93
% Asian                           0.06  0.11  4199    0.06  0.11  4305    0.00     0.99
% Race Other                      0.02  0.04  4199    0.02  0.04  4305    0.00     0.30
% Gifted                          0.09  0.17  4227    0.09  0.17  4362   -0.01     0.10
% ELL                             0.14  0.18  4227    0.14  0.18  4362    0.00     0.29
% FRPL                            0.49  0.31  3020    0.49  0.31  3113    0.00     0.98

Panel C: Student Tripod Responses
Clarify                           4.20  0.58  4227    4.27  0.56  4362   -0.07     0.00
Care                              4.13  0.74  4227    4.23  0.73  4362   -0.10     0.00
Challenge                         4.26  0.70  4227    4.31  0.69  4362   -0.05     0.00
Consolidate                       3.86  0.95  4227    3.89  0.97  4362   -0.03     0.11
Captivate                         3.60  0.84  4227    3.74  0.80  4362   -0.14     0.00
Control                           3.52  0.73  4227    3.53  0.73  4362    0.00     0.76
Confer                            4.22  0.60  4227    4.31  0.57  4362   -0.10     0.00
All 7Cs                           3.97  0.54  4227    4.04  0.53  4362   -0.07     0.00

Notes: The sample sizes 4362 for females and 4227 for males refer to the core sample; some variables in this table have fewer observations, as discussed in Section 2. The last column reports whether the means are statistically significantly different between males and females. The classroom-characteristic variables in Panel B are calculated excluding each individual student. The Tripod survey domains in Panel C were constructed by averaging over the relevant questions. All 7Cs averages over the 7 different domains.
Table 2: Teacher Descriptive Statistics by Gender

                         Male               Female           Male–Female
Variable            Mean    SD     N    Mean    SD     N   Diff.  P-value

7C(2010)            4.01  0.25  4227    4.01  0.25  4362    0.00     0.49
FFT(2010)           2.67  0.25  3541    2.67  0.25  3625    0.00     0.79
CLASS(2010)         4.57  0.36  3541    4.58  0.36  3625   -0.01     0.18
PLATO(2010)         2.70  0.23  3541    2.70  0.23  3625    0.00     0.47
7C(2009)            3.95  0.27  4093    3.95  0.26  4224    0.01     0.30
FFT(2009)           2.66  0.24  3508    2.66  0.24  3598    0.00     0.49
CLASS(2009)         4.57  0.40  3522    4.57  0.40  3611    0.00     0.85
PLATO(2009)         2.66  0.27  3487    2.66  0.26  3576    0.00     0.91
Principal Survey    4.32  1.16  3528    4.36  1.15  3652   -0.04     0.14
CKT Score          -0.02  1.00  3628    0.02  1.00  3702   -0.04     0.13
Years of Exp.       6.49  5.89  2808    6.38  6.01  2899    0.10     0.52
Male                0.09  0.29  4066    0.08  0.28  4202    0.01     0.14
Black               0.32  0.47  4066    0.32  0.47  4202    0.00     0.72
White               0.60  0.49  4066    0.61  0.49  4202   -0.01     0.30
Hispanic            0.06  0.25  4066    0.06  0.24  4202    0.00     0.35

Notes: P-value in the last column tests whether the male and female means are statistically significantly different. The 7C variable is the average student score by class. FFT, CLASS and PLATO are also calculated as averages across all raters and domains. For this analysis, we consider every possible response when calculating this teacher average, so we include both ELA and Math responses in the case where the teacher is instructing both subjects in one class.
Finally, Panel C of Table 1 shows that there are striking raw differences in Tripod survey
responses between boys and girls. Note that while we present raw measures here to
illustrate the magnitudes, in our empirical strategy we standardize these measures for easier
interpretation of effect sizes. In most cases, girls respond more favorably than boys, with
the exception of Control and Consolidate.18 Recall that a rating of 5 indicates the most
positive regard for the teacher in that dimension, 3 is neutral and 1 is the lowest rating. We
see that the averages are above 4 for Clarify, Care, Challenge and Confer, but fall to
between 3.5 and 4 for Captivate, Consolidate and Control. The statistically significant
18 The lack of a gender gap in these measures likely stems from the fact that all the questions aim at the classroom behavior rather than the individual student's behavior or perceptions of the teacher's actions toward the student.
gender gaps range from 0.14 for Captivate to 0.05 for Challenge. These differences could be
a result of teacher actions or unobserved student attributes, a feature we explore further in
Section 5.
2.3 Balancing tests
We provide further evidence that boys are not systematically matched to teachers or class-
rooms in our sample through simple balancing tests in Table 3. We regress each measure of
teaching practice on whether the student is male conditional on school-grade fixed effects.
Previous literature has shown that a potential problem with contemporaneous measures of
teaching practice (either observation protocol or student/principal evaluations) is that they
may be affected directly by classroom composition (Campbell and Ronfeldt, 2018; Steinberg
and Garrett, 2016; Kelly et al., 2020), either because raters are biased, teachers adapt or
measures intrinsically capture some features of the classroom as well as the teacher. Thus,
our analysis will rely more heavily on lagged measures of the practice, in which case the
main endogeneity concern is matching. To this end, the balancing tests report both lagged
and contemporaneous practices. We find that the male student dummy does not predict any
of our measures of teacher effectiveness, quality or practice at the 5% significance level, and
the coefficients are very small in magnitude, whether we use contemporaneous or lagged
measures.
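The balancing regression described above can be sketched in a few lines. This is only an illustration on simulated data, not the code behind Table 3: the names `measure`, `male` and `cell`, the sample sizes and the data-generating process are all assumptions, school-grade fixed effects are absorbed by demeaning within cells, and standard errors (clustered at the school level in the paper) are not computed.

```python
import numpy as np

def demean_by_group(x, groups):
    """Subtract the group mean from each element (absorbs group fixed effects)."""
    x = np.asarray(x, dtype=float)
    _, inverse = np.unique(groups, return_inverse=True)
    sums = np.bincount(inverse, weights=x)
    counts = np.bincount(inverse)
    return x - (sums / counts)[inverse]

def balancing_coefficient(measure, male, cell):
    """Slope from regressing a teacher measure on the male dummy with cell fixed effects."""
    y = demean_by_group(measure, cell)
    m = demean_by_group(male, cell)
    return float(m @ y / (m @ m))

rng = np.random.default_rng(0)
n = 5000
cell = rng.integers(0, 50, size=n)           # hypothetical school-grade cells
male = rng.integers(0, 2, size=n)            # gender assigned independently of teachers
cell_effect = rng.normal(size=50)[cell]      # teacher quality differs across cells
measure = cell_effect + rng.normal(size=n)   # standardized teacher measure (simulated)

coef = balancing_coefficient(measure, male, cell)
print(round(coef, 3))
```

Because gender is drawn independently of the teacher measure in this simulation, the coefficient should be close to zero, which is the pattern a balanced sample would produce.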
3 Gender Gap in ELA
In this section, we further develop evidence on the extent to which teachers may help to explain
the gender gaps in ELA. We do this first by showing how the gap changes with various
controls that are not directly based on the teacher. Then, we provide an estimate of the
heterogeneity in teacher effectiveness for boys and girls.
Table 3: Balancing Tests

          Clarify   Care      Challenge  Consolidate  Captivate  Control   Confer    FFT       CLASS     PLATO

Panel A: Contemporaneous Teacher Measures
Male      0.006     0.000     0.006      0.008        -0.011     -0.017    -0.006    -0.005    -0.007    -0.007
          (0.012)   (0.011)   (0.010)    (0.010)      (0.012)    (0.011)   (0.012)   (0.009)   (0.008)   (0.008)
N         8589      8589      8589       8589         8589       8589      8589      7166      7166      7166

Panel B: Lagged Teacher Measures
Male      0.003     0.009     0.006      0.005        0.013      0.020*    -0.010    0.014     -0.003    -0.003
          (0.010)   (0.010)   (0.011)    (0.011)      (0.012)    (0.010)   (0.010)   (0.010)   (0.011)   (0.010)
N         8317      8317      8317       8317         8317       8317      8317      7106      7133      7063

          PSVY      CKT       Male       Black        White      Hispanic  Yrs. of Exp.  φMj       φj

Panel C: Contemporaneous and Time-Invariant Teacher Measures
Male      -0.008    -0.017    0.010      0.004        -0.009     0.004     0.028         -0.000    0.001     -
          (0.010)   (0.015)   (0.009)    (0.005)      (0.006)    (0.005)   (0.070)       (0.001)   (0.001)   -
N         7180      7330      8268       8268         8268       8268      5707          8329      8379      -

Notes: *** denotes significance at the 1%, ** at the 5% and * at the 10% levels. Standard errors are clustered at the school level. Each cell corresponds to a separate regression of a teacher measure on the student male dummy controlling for school-grade fixed effects. Panels A and C include contemporaneous teacher measures, while Panel B relies on lagged measures. All three panels report coefficients on the male indicator for the different dependent variables in each column. PSVY is the principal survey-based teacher rating, CKT refers to the teacher aptitude rating assigned by observers, years of experience refers to years in the current district and φMj and φj are value-added variables discussed in Section 3.2. The sample size is limited only by each dependent variable and is shown in the bottom row of each panel.
3.1 Conditional Gender Gap
Table 4 presents a set of OLS regressions in which the dependent variable is the ELA score and
the key independent variable is the male indicator. Across the columns, we add different sets
of controls, including prior achievement, student background characteristics, classroom
composition, and school-grade or classroom fixed effects (depending on the specification). Column
(1) shows that the raw gender gap in ELA is 0.16 of a standard deviation. Column (2)
controls for lagged ELA and math scores to show how controlling for prior performance
affects the overall gap. Column (3) adds other student characteristics, including English
language learner, race/ethnicity, age, gifted, free/reduced price lunch status to determine
whether other student characteristics help explain the gap. Column (4) adds in controls for
school-grade fixed effects to see whether school/grade level variation explains some of the
gap. Column (5) adds in controls for classroom composition, the classroom peer averages of
all the individual controls, including lagged ELA and math. This is our main specification
throughout. Column (6) controls for class fixed effects rather than classroom composition
to capture any unobservables at the class level that may explain the achievement gap.
Interestingly, these results show that controlling for lagged achievement explains about half of
the raw gap. After that, the gap is stable. The fact that classroom composition measures,
school-grade fixed effects and classroom fixed effects have virtually no additional explanatory
power matches the intuition that males are not being systematically matched to higher or
lower “quality” schools or classrooms once we account for lagged test scores.19 Importantly,
the results also suggest that teacher-based explanations of the achievement gap must involve
boys and girls within the same classroom not receiving the same benefits from their teacher.
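The statement that lagged achievement explains about half of the raw gap can be checked directly from the male coefficients in Table 4. The short calculation below is only an arithmetic illustration; the dictionary name and the definition of "share explained" as one minus the ratio of the conditional to the raw gap are our own shorthand.

```python
# Male coefficients from Table 4, columns (1) to (6).
male_coef = {1: -0.160, 2: -0.084, 3: -0.083, 4: -0.077, 5: -0.075, 6: -0.081}

raw_gap = male_coef[1]

# Share of the raw gap that disappears once the controls of each column are added.
share_explained = {col: 1 - b / raw_gap for col, b in male_coef.items()}

for col, share in share_explained.items():
    print(f"column ({col}): {share:.1%} of the raw gap explained")
```

Column (2) accounts for a little under half of the raw gap, and the share moves only a few percentage points across columns (3) to (6), which is the stability referred to in the text.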
19 The size of the ELA gender gap remains quite stable when comparing specifications with only class fixed effects or with only school-grade fixed effects and no other controls. Importantly, this means that conditioning on prior test scores is not needed to get this evidence of stability. Again, these results make sense when there is no within-school matching of students to classrooms, either based on gender or prior achievement.
Table 4: The Conditional Gender Gap in ELA (N=8589)

                              (1)         (2)         (3)         (4)         (5)         (6)
Male                          -0.160***   -0.084***   -0.083***   -0.077***   -0.075***   -0.081***
                              (0.022)     (0.015)     (0.015)     (0.014)     (0.014)     (0.014)
ELA_{t−1}                                 0.557***    0.526***    0.513***    0.510***    0.515***
                                          (0.013)     (0.013)     (0.013)     (0.013)     (0.013)
Math_{t−1}                                0.283***    0.252***    0.239***    0.239***    0.229***
                                          (0.010)     (0.011)     (0.009)     (0.009)     (0.009)
Avg Peer Math_{t−1}                                                           0.181**
                                                                              (0.073)
Avg Peer ELA_{t−1}                                                            -0.076
                                                                              (0.077)
% Male                                                                        0.122*
                                                                              (0.070)

ELA_{t−1}, Math_{t−1}                     X           X           X           X           X
Additional Student Controls                           X           X           X           X
Class Controls                                                                X
School-Grade FE                                                   X           X
Class FE                                                                                  X

R2                            0.007       0.633       0.641       0.576       0.578       0.559

Notes: *** denotes significance at the 1%, ** at the 5% and * at the 10% levels. Standard errors are clustered at the school level. Student controls include lagged ELA and math, age, race, grade level, English Language Learner (ELL) status, gifted status, free and reduced price lunch (FRPL) status. To deal with missing values for race and free/reduced price lunch, we replace missing with 0's and control for indicator variables for missing. Classroom controls in column (5) include average lagged peer achievement (both math and ELA), average age, %male, %black, %hispanic, %race other, %gifted, %ELL, and %FRPL.
3.2 Teacher Male-Specific Value-Added
To further assess whether teachers are a plausible explanation for the gap, we next
estimate male-specific teacher value-added
to get a sense of the extent to which within-class gender disparities in teacher value-added
contribute to the gap. Let i index students, c = c(i, t) classrooms and j = j(c(i, t)) teachers,
where teachers are assumed to be assigned to a given classroom. Yit denotes a student’s ELA
achievement on the end of grade assessment and Mi an indicator that the student is male.
Let φj denote the contribution of a teacher j to i’s value-added. If the student is male, the
teacher also makes contribution φMj , i.e.,
$$Y_{it} = \beta_0 + \beta_1 M_i + X_{it}\beta_2 + \phi_j + \phi^M_j M_i + \nu_{it}. \tag{1}$$
Note that if φMj = 0, then males and females receive the same benefits of a given teacher.
Xit includes math and reading prior achievement, dummies for race, gifted, English language
learner, free-reduced price lunch, and the corresponding classroom averages along with per-
cent male.20
We estimate the value-added in several steps, following Chetty et al. (2014) and expanding
to recover the differential male effect, using the logic of a difference-in-difference estimator.
First, we estimate equation (1) controlling for teacher fixed effects, which yields estimates
of β0, β1, β2.21 Second, we solve for φj by taking averages over all females in a given class:
$$\phi_j + \hat{\nu}^F_{jt} = Y^F_{jt} - \beta_0 - X^F_{jt}\beta_2 \equiv \tilde{Y}^F_{jt},$$
20 Because free/reduced-price lunch status is missing in some cases, we replace missing values with 0 and control for a missing indicator so that these observations are not dropped from our regression.
21 Controlling for teacher fixed effects helps eliminate bias from matching of students to teachers.
where the F superscript denotes that classroom averages are taken over the females in the
classroom. This gives us an approximation of $\phi_j$ with error $\hat{\nu}^F_{jt}$. Third, to deal with noise, we
apply a standard shrinkage estimator (denoted by superscript S), resulting in $\tilde{Y}^{FS}_{jt}$. Fourth, we
regress $\tilde{Y}^{FS}_{jt}$ on $\tilde{Y}^{FS}_{jt-1}$ with coefficient $\delta$, which gives us $\phi_j = \delta \tilde{Y}^{FS}_{jt-1}$, assuming
$\mathrm{Cov}(\hat{\nu}_{jt}, \hat{\nu}_{jt-1}) = 0$. This is a standard assumption in value-added models and is based on the rationale that Xit
provides sufficient controls for selection so that what remains in the residual after removing
the shared teacher component is just random measurement error.
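The last three steps (class averages of residualized scores, shrinkage, and projection on the lagged value) can be sketched on simulated data. This is a stylized illustration, not the estimation code: the sizes and variances are made up, residualized scores are generated directly rather than obtained from equation (1), and the shrinkage rule shown is one common empirical-Bayes form, which is an assumption because the text does not specify the estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
n_teachers, n_girls = 400, 12          # hypothetical numbers of teachers and girls per class
sigma_phi, sigma_nu = 0.10, 0.50       # assumed SDs of teacher effects and student noise

phi = rng.normal(0.0, sigma_phi, n_teachers)   # true teacher effect (unobserved in practice)

def class_mean_residual(phi):
    """Average residualized score of the girls in each teacher's class for one year."""
    noise = rng.normal(0.0, sigma_nu, (phi.size, n_girls))
    return (phi[:, None] + noise).mean(axis=1)

y_prev = class_mean_residual(phi)   # year t-1
y_curr = class_mean_residual(phi)   # year t

def shrink(class_means, student_noise_var, class_size):
    """Empirical-Bayes style shrinkage of class means toward the grand mean."""
    noise_var = student_noise_var / class_size
    signal_var = max(class_means.var(ddof=1) - noise_var, 0.0)
    reliability = signal_var / (signal_var + noise_var)
    grand_mean = class_means.mean()
    return grand_mean + reliability * (class_means - grand_mean)

s_prev = shrink(y_prev, sigma_nu**2, n_girls)
s_curr = shrink(y_curr, sigma_nu**2, n_girls)

# Regress the current shrunken mean on its lag; delta scales the lag into a forecast.
x = s_prev - s_prev.mean()
delta = float(x @ (s_curr - s_curr.mean()) / (x @ x))
phi_hat = delta * s_prev

print(f"delta = {delta:.2f}, SD of estimated value-added = {phi_hat.std(ddof=1):.3f}")
```

In this setup the estimated value-added is less dispersed than the raw class means, since both the shrinkage step and the projection coefficient pull noisy class averages toward zero.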
We follow the same procedure as above for the male sample. We first solve for the residualized
test score:
$$\phi_j + \phi^M_j + \hat{\nu}^M_{jt} = Y^M_{jt} - \beta_0 - \beta_1 - X^M_{jt}\beta_2 \equiv \tilde{Y}^M_{jt}.$$
From the same steps as above, applying shrinkage, etc., we recover $\phi_j + \phi^M_j = \delta^M \tilde{Y}^{MS}_{jt-1}$, where
$\delta^M$ was estimated from the projection of $\tilde{Y}^{MS}_{jt}$ on $\tilde{Y}^{MS}_{jt-1}$.
As a final step, we estimate the relative male-specific value-added, $\phi^M_j$, by taking differences
between the male and female value-added estimates. While Chetty et al. (2014), Kane
et al. (2013) and others make a compelling case that the above procedure is sufficient to recover
value-added even in the presence of sorting, the recovery of male-specific value-added
has the additional advantage that it is based on a within-class comparison and so is arguably
even more robust to sorting concerns.
Table 5 shows that the standard deviation of the relative male-specific value-added is similar
in magnitude to that of the overall value-added, 0.08 of a standard deviation. We also explore a
generalization by estimating equation (1) separately for males and females. This approach
gives slightly higher estimates of the standard deviation in male-specific value-added. Finally,
we contrast estimates when we do not apply shrinkage. Again, estimates are slightly larger.
Overall, these findings suggest that teachers likely have significant capacity to close
Table 5: Teacher Value-Added Components (N=448)

                                      Mean     SD      Min      Max
Restricted parameter estimates
  Shrunk
    φMj                               0.01    0.08    -0.28    0.24
    φj                               -0.01    0.08    -0.23    0.32
  Not Shrunk
    φMj                               0.01    0.10    -0.34    0.28
    φj                               -0.01    0.09    -0.27    0.40
Unrestricted parameter estimates
  Shrunk
    φMj                               0.01    0.10    -0.37    0.27
    φj                                0.00    0.10    -0.30    0.42
  Not Shrunk
    φMj                               0.01    0.12    -0.46    0.33
    φj                                0.00    0.12    -0.36    0.53

Notes: φMj and φj are value-added variables. They are constructed using the process discussed in Section 3.2. In the regressions that created the value-added measures used to construct φMj and φj, the following controls were included: lagged ELA and math, race, English Language Learner (ELL) status, gifted status, free and reduced price lunch (FRPL) status, year and indicator variables showing if FRPL status was missing. We also controlled for the following classroom composition variables: average lagged peer achievement (both math and ELA), %male, %black, %hispanic, %race other, %gifted, %ELL, and %FRPL. Unrestricted parameter estimates are when equation (1) is estimated separately for males and females, whereas the restricted model restricts the parameter estimates to be the same for males and females.
the male/female ELA gap and that some teachers are better at doing this.
Given that $\phi^M_j \neq 0$, it could be (1) that the teacher input is the same for males and
females, but males have a different marginal benefit of that input, i.e., $\phi^M_j = \rho\phi_j$. Alternatively,
it could be that (2) males and females within the same classroom experience different
inputs, i.e., they feel more cared for or more engaged by what the teacher is doing. We
test these alternative hypotheses below. First, we exploit our rich teacher/classroom-based
measures of teacher effectiveness to test (1). We use the individual-specific Tripod measures
of practices to test (2).
4 Heterogeneous Teacher Effects
Value-added estimates reveal that some teachers are more effective at teaching boys relative
to girls, suggesting that differences in teacher effectiveness could explain the achievement gap.
A natural next step is to consider whether popular teacher evaluation protocols or other
measures of teacher aptitude and quality, as measured in MET data, have heterogeneous
returns across males and females. Let Pj represent these different measures of effective
teaching, which may or may not be time-varying. We estimate
$$Y_{it} = \beta_0 + \beta_1 M_i + X_{it}\beta_2 + M_i P_j \beta_3 + P_j \beta_4 + \beta_{sg} + \varepsilon_{it}, \tag{2}$$
where βsg denotes the school, s = s(i, t), by grade, g = g(i, t), fixed effects. The fixed effects
are important to control for differences in end of grade assessments across schools and grades
in our sample. They could play a secondary role of helping to control for matching, though
evidence presented in Section 2.3 suggests that matching of males to teachers/classrooms at
the school-grade level is not a likely confound. Our key parameter of interest is β3. Note
that because Pj is standardized to have mean 0, β1 will remain the same, continuing to
capture the achievement gap in the average classroom when we include the interactions with
our measure of teacher effectiveness. Thus, the relevant focus will be on the magnitude of
the interaction. If β3 is high relative to β1, it suggests that there is significant potential
for heterogeneity in teacher effectiveness to explain the achievement gap. While Pj could
represent a vector of teaching practices/qualities, we focus on estimating equations with
each teacher measure separately to give the best chance of finding statistically significant
interactions.
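The role of standardizing the teacher measure in equation (2) can be illustrated with a small simulation. Everything here is hypothetical (sample size, coefficients, the raw rating scale with mean 3), and controls and fixed effects are omitted; the point is only that with a mean-zero measure the coefficient on the male dummy stays at the gap in the average classroom, while on the raw scale it becomes the gap at a rating of zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40000
male = rng.integers(0, 2, n).astype(float)
p_raw = rng.normal(3.0, 0.5, n)                    # hypothetical raw teacher rating
p_std = (p_raw - p_raw.mean()) / p_raw.std()       # standardized: mean 0, SD 1

# Assumed data-generating process: a gap of -0.08 at the average teacher,
# and boys gain an extra 0.10 per SD of the teacher measure.
y = -0.08 * male + 0.03 * p_std + 0.10 * male * p_std + rng.normal(0, 1, n)

def ols(y, *columns):
    """OLS coefficients with an intercept prepended."""
    X = np.column_stack([np.ones_like(y), *columns])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_no_inter = ols(y, male, p_std)                 # no interaction term
b_std = ols(y, male, p_std, male * p_std)        # interaction with standardized measure
b_raw = ols(y, male, p_raw, male * p_raw)        # interaction with raw-scale measure

print("Male coef, no interaction:", round(b_no_inter[1], 3))
print("Male coef, standardized P:", round(b_std[1], 3))
print("Male coef, raw-scale P:   ", round(b_raw[1], 3))
```

The first two coefficients nearly coincide, whereas the raw-scale specification gives a very different male coefficient even though the fitted model is the same.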
Our teacher measures include the standardized average ratings from each of the video
observation protocols (FFT, CLASS, PLATO), the standardized average of the 7Cs student
evaluations (this specification does not exploit individual student-level variation in Tripod
as we do in Section 5), the principal survey teacher rating, CKT and teacher characteristics,
including experience and whether the teacher is male. We focus our analysis on contemporaneous
measures of effective teaching, but we show robustness to potential bias from
measurement error and/or matching, as discussed below.
4.1 Results
The top panels of Tables 6 and 7 show estimates of equation (2) for each measure of teacher
effectiveness and teacher characteristics separately. We find that across the board β3 is not
statistically significantly different from 0 and that the point estimates are small. This is
true whether we include the practice measures individually or in the same regression (not
shown).22 Interestingly, we also do not find that boys benefit more from having male teachers,
though the lack of significance may be driven by the small number of male teachers in our
sample.
22 Bassi et al. (2016) also show that, in the context of the Chilean education system, teachers with higher scores in CLASS do not have a differential impact on boys' and girls' reading performance.
Table 6: Heterogeneous Effects of Teachers (N=8589)

                         FFT         CLASS       PLATO       Teacher     Principal   Years of    Teacher
                                                             Aptitude    Evaluation  Experience  Male
                                                             (CKT)
                         (1)         (2)         (3)         (4)         (5)         (6)         (7)

Panel A
Male×Teacher Measure     -0.001      0.006       -0.010      -0.000      -0.002      -0.001      0.031
                         (0.013)     (0.014)     (0.013)     (0.013)     (0.015)     (0.003)     (0.040)
Male                     -0.075***   -0.075***   -0.075***   -0.075***   -0.074***   -0.075***   -0.075***
                         (0.014)     (0.014)     (0.014)     (0.014)     (0.014)     (0.014)     (0.014)
Teacher Measure          0.031**     0.025       0.027*      -0.001      0.059***    0.004       -0.003
                         (0.013)     (0.016)     (0.016)     (0.010)     (0.015)     (0.002)     (0.034)

Panel B
Male×Teacher Measure     0.000       0.003       -0.027      -           -           -           -
                         (0.033)     (0.028)     (0.052)
Male                     -0.075***   -0.073***   -0.075***   -           -           -           -
                         (0.014)     (0.014)     (0.014)
Teacher Measure          0.103**     0.083**     -0.006      -           -           -           -
                         (0.048)     (0.036)     (0.053)
First Stage F†           16.665      25.463      13.604

Notes: *** denotes significance at the 1%, ** at the 5% and * at the 10% levels. Standard errors are clustered at the school level. Each column refers to a separate regression with ELA as the dependent variable and the teacher measure listed in the header. All regressions control for school-grade fixed effects and student lagged achievement (ELA and math), age, gender, race, English Language Learner (ELL) status, gifted status, free and reduced price lunch (FRPL) status, and indicator variables showing if any of the previous variables were missing (including the teacher practice). Classroom controls are also included: average lagged peer achievement (both math and ELA), average age, %male, %black, %hispanic, %race other, %gifted, %ELL, and %FRPL. † Reports the Kleibergen-Paap rk Wald statistic for a weak instrument test.
Table 7: Heterogeneous Effects of Teachers (N=8589)

                         7Cs         Clarify     Care        Challenge   Consolidate Captivate   Control     Confer
                         (1)         (2)         (3)         (4)         (5)         (6)         (7)         (8)

Panel A
Male×Teacher Measure     0.001       0.001       0.007       -0.009      0.002       -0.002      0.002       0.004
                         (0.011)     (0.012)     (0.012)     (0.013)     (0.012)     (0.012)     (0.012)     (0.012)
Male                     -0.075***   -0.076***   -0.075***   -0.076***   -0.075***   -0.075***   -0.074***   -0.075***
                         (0.014)     (0.014)     (0.014)     (0.014)     (0.014)     (0.014)     (0.014)     (0.014)
Teacher Measure          0.030*      0.028*      0.023       0.040***    0.006       0.019       0.033**     0.024*
                         (0.016)     (0.016)     (0.016)     (0.014)     (0.017)     (0.015)     (0.017)     (0.013)

Panel B
Male×Teacher Measure     -0.017      -0.004      0.003       -0.037      -0.015      -0.025      -0.012      -0.006
                         (0.025)     (0.026)     (0.021)     (0.035)     (0.027)     (0.028)     (0.028)     (0.029)
Male                     -0.076***   -0.077***   -0.076***   -0.080***   -0.078***   -0.072***   -0.070***   -0.076***
                         (0.014)     (0.014)     (0.014)     (0.014)     (0.015)     (0.015)     (0.015)     (0.014)
Teacher Measure          0.126***    0.080**     0.072**     0.173***    0.098       0.196**     0.137**     0.100***
                         (0.037)     (0.038)     (0.028)     (0.054)     (0.064)     (0.086)     (0.056)     (0.034)
First Stage F†           11.351      11.407      27.519      9.413       7.181       6.074       3.594       12.532

Notes: *** denotes significance at the 1%, ** at the 5% and * at the 10% levels. Standard errors are clustered at the school level. Each column refers to a separate regression with ELA as the dependent variable and the teacher measure listed in the header. All regressions control for school-grade fixed effects and student lagged achievement (ELA and math), age, gender, race, English Language Learner (ELL) status, gifted status, free and reduced price lunch (FRPL) status, and indicator variables showing if any of the previous variables were missing (including the teacher practice). Classroom controls are also included: average lagged peer achievement (both math and ELA), average age, %male, %black, %hispanic, %race other, %gifted, %ELL, and %FRPL. † Reports the Kleibergen-Paap rk Wald statistic for a weak instrument test.
One reason we may fail to uncover significant heterogeneity in the effects of teachers for
males is because of measurement error. For instance, Kelly et al. (2020) and other studies
discuss the somewhat low inter-rater reliability for these protocols. Furthermore, existing
research suggests that FFT and CLASS scales are correlated with characteristics of the
classroom (Campbell and Ronfeldt, 2018; Kelly et al., 2020; Garrett and Steinberg, 2015),
so that measurement error may not be random. This could be a result of either measurement
bias among raters that is created by confounding teacher and classroom characteristics or
teacher adaptation in how they teach according to classroom characteristics. A similar
concern will apply to student-based 7C’s ratings. We address these types of measurement
error in Panel B of Tables 6 and 7 by instrumenting contemporaneous measures of teacher
effectiveness (Pjt and PjtMi) with lagged measures (Pjt−1 and Pjt−1Mi).23 Again, even
after accounting for measurement error, there is no evidence of heterogeneity in teacher
effectiveness by gender.24
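The logic of instrumenting a noisy contemporaneous measure with its lag can be illustrated with a stylized simulation. The setup is an assumption throughout: a latent practice is measured twice with independent classical error, there is no true gender interaction, observations are treated as independent, and controls and fixed effects are left out. The helper `two_stage_least_squares` is written here for illustration and is not part of any library.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40000
male = rng.integers(0, 2, n).astype(float)
p_true = rng.normal(0, 1, n)                 # latent teacher practice (unobserved)
p_now = p_true + rng.normal(0, 1, n)         # contemporaneous rating, measured with error
p_lag = p_true + rng.normal(0, 1, n)         # lagged rating with independent error

# Assumed truth: practice raises scores by 0.10 for everyone, no gender interaction.
y = -0.08 * male + 0.10 * p_true + rng.normal(0, 1, n)

def two_stage_least_squares(y, exog, endog, instruments):
    """2SLS: project regressors on instruments plus exogenous columns, then run OLS."""
    Z = np.column_stack([exog, instruments])
    X = np.column_stack([exog, endog])
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]

exog = np.column_stack([np.ones(n), male])
endog = np.column_stack([p_now, p_now * male])     # measure and its male interaction
instr = np.column_stack([p_lag, p_lag * male])     # lagged measure and its interaction

ols = np.linalg.lstsq(np.column_stack([exog, endog]), y, rcond=None)[0]
iv = two_stage_least_squares(y, exog, endog, instr)

# Coefficient order: intercept, male, practice, male x practice.
print("OLS practice effect:", round(ols[2], 3))
print("IV  practice effect:", round(iv[2], 3))
print("IV  interaction:    ", round(iv[3], 3))
```

Under these assumptions OLS is attenuated toward zero while the instrumented estimate is close to the assumed effect, and the interaction stays near zero in both, mirroring the comparison between Panels A and B.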
Another potential reason that the estimates of β3 could be biased is if boys are matched
to teachers who are better-suited to teaching boys. Our balancing tests reveal no evidence
of matching based on observable measures of teacher effectiveness. Furthermore, we believe
if anything this matching would bias our estimates of β3 away from 0. However, we check
robustness by controlling for matching based on class fixed effects.25 These results, reported
in Appendix Tables A.5 and A.6 again show that there is no evidence of heterogeneity in
23 Note that we could also use multiple contemporaneous measures as potential instruments. We do not pursue this route because of evidence that contemporaneous measures may be biased by aspects of the current classroom, as discussed above.
24 Note that similar results apply if we replace contemporaneous with lagged measures of teacher effectiveness, i.e., the reduced form of Panel B. We also experimented with principal component scores, instead of averages of the different domains, but the lack of heterogeneous effects remains robust.
25 We repeated this analysis on the subsample where teachers were randomly allocated into classrooms within randomization blocks (i.e., approximately within school-grade level). Reassuringly, we also find small point estimates that are not statistically significantly different from 0.
teacher effectiveness by gender.26
Overall, this evidence indicates that the effects of teaching practices do not differ by
gender, whether we use observation protocol-based measures, student survey or principal-based
measures, or teacher knowledge. We further find no evidence that boys' learning in ELA
is better supported by teachers with more experience or those who are male. The good news
is that the practices emphasized by the observation-based protocols as effective instruction
do not appear to favor girls at the expense of boys. This is important because these protocols
are increasingly being used by districts in high-stakes settings. Yet, the puzzle of how teachers
contribute to the gender ELA gap remains.
5 Teacher Differentiation and Student Evaluations
The remaining potential explanation for the heterogeneity in teacher effectiveness by gender
revealed by our value-added estimates has to do with teacher differentiation within the class-
room. Teacher differentiation could occur from teachers targeting boys and girls differently
within the classroom or indirectly through the teaching practice being received differently
by boys and girls. For instance, boys may be called on more to answer questions than girls
or may find different teacher actions to be caring or clarifying. Given the evidence that
boys learn in different ways than girls (Gurian and Stevens, 2004), some within-classroom
differentiation of instruction between them may even be an important aspect of teacher
effectiveness. Importantly, none of the observation protocols is designed to capture
differentiation, but we can begin to explore aspects of this differentiation using Tripod’s student
evaluations of teaching practice. These evaluations are an unusually rich measure for large-scale
data, offering some sense of differentiation of teacher inputs within the classroom.
26 Instruments are stronger in this case because we are only instrumenting for the interaction of the measure of teacher effectiveness with the male dummy; the level effect of the teacher is absorbed by the class fixed effects.
First, we perform a similar analysis to the conditional ELA gap in Table 4, but with
student-level 7C’s as the dependent variable. We do this to see whether our controls for
student and classroom characteristics explain the gender gap, which would eliminate the
student-level variation in 7C’s as a likely explanation for the ELA gender gap. Each row of
Table 8 corresponds to treating a different domain of the 7C’s as the outcome variable and
the cells report the coefficient on the male dummy from separate regressions with different
controls. Column (1) is the raw gap; column (2) conditions on lagged performance. Column
(3) controls for other student characteristics, and column (4) for school-grade fixed effects
to deal with any differences at the school-grade level or potential matching to schools. To
deal with potential matching at the classroom/teacher level, column (5) adds controls for
classroom characteristics and column (6) for classroom fixed effects.
Interestingly, unlike ELA, for most of the 7C’s the gap is not explained at all by lagged
ELA and math performance. However, like ELA, the gender gap stays constant with the
addition of other student controls, school-grade fixed effects, classroom controls and
classroom fixed effects. An exception to this is Challenge, in which case the gap drops from
-0.07 to -0.03 after controlling for school-grade fixed effects and class characteristics and is
no longer statistically significantly different from zero. Thus, despite significant raw gaps in
Challenge, the lack of conditional gender gap suggests it is unlikely to explain the gender
gap in ELA. Interestingly, the gender gap in Control is consistently zero across specifica-
tions, likely because it measures students’ perceptions of the classroom behavior rather than
anything that would signal differentiation. Consolidate also exhibits a comparatively small
gap throughout, suggesting it is an unlikely explanation. The persistent gaps in Clarify, Care,
Captivate and Confer make these the most likely candidate explanations for the ELA gender
gap. That said, the gaps in these measures could be due to (1) males perceiving different
degrees of support, care or engagement than females, perhaps through teacher differentiation
of curriculum, teacher bias or differential student response to inputs, or (2) males simply
evaluating teachers on a different scale. The first explanation could help to explain the ELA
achievement gap, whereas the latter would fit more with the observation of several studies
suggesting that male/female gaps in teacher ratings are not “productive” but rather evidence
of student bias (MacNell et al., 2015). We explore this further below.
5.1 Do the Tripod Gaps Matter for Achievement
To test this differentiation hypothesis, suppose there is some student-specific component of
teacher effectiveness, τij, which could be either uni- or multi-dimensional such that
$$Y_{it} = \beta_0 + \beta_1 M_i + X_{it}\beta_2 + \tau_{ij}\beta_3 + \nu_{it}.$$
To the extent that τij explains the gender gap in ELA, we expect that controlling for it would
shrink the value of β1 toward 0. Suppose that τij is captured at least in part by the student
evaluations. Let Eijk denote the evaluation of a teacher j by a student i on a question k,
i.e.,
$$E_{ijk} = \tau_{ij} + e_{ijk}, \tag{3}$$
where eijk captures measurement error, student characteristics or non-productive aspects of
the student perception of teachers.
Plugging in for each student’s average evaluation of teacher j in a given dimension (or
dimensions) as a proxy for τij, we have
$$Y_{it} = \beta_0 + \beta_1 M_i + X_{it}\beta_2 + E_{ij}\beta_3 + \nu_{it} - \beta_3 e_{ij}. \tag{4}$$
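The step to equation (4) is a direct substitution. As a short derivation, assume the student's average evaluation satisfies $E_{ij} = \tau_{ij} + e_{ij}$, i.e., equation (3) averaged over the questions k within a dimension, with $e_{ij}$ denoting the averaged error (this averaging notation is our shorthand):

```latex
\begin{align*}
Y_{it} &= \beta_0 + \beta_1 M_i + X_{it}\beta_2 + \tau_{ij}\beta_3 + \nu_{it} \\
       &= \beta_0 + \beta_1 M_i + X_{it}\beta_2 + (E_{ij} - e_{ij})\beta_3 + \nu_{it} \\
       &= \beta_0 + \beta_1 M_i + X_{it}\beta_2 + E_{ij}\beta_3 + \nu_{it} - \beta_3 e_{ij}.
\end{align*}
```

The composite error contains $e_{ij}$, which is also a component of the regressor $E_{ij}$, so the regressor and the error are correlated by construction whenever $\beta_3 \neq 0$.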
We cannot be certain that the measurement error is random in this case, so it is difficult
to sign the bias in β3 associated with this measurement error. The individual evaluation
of the teacher could be correlated with achievement for reasons not related directly to the
Table 8: Conditional Gender Gap in 7C’s Student Evaluations (N=8589)

Response                  (1)         (2)         (3)         (4)         (5)         (6)

Clarify
  Male                    -0.122***   -0.121***   -0.119***   -0.120***   -0.114***   -0.125***
                          (0.021)     (0.021)     (0.021)     (0.022)     (0.022)     (0.022)
Care
  Male                    -0.137***   -0.137***   -0.134***   -0.136***   -0.132***   -0.142***
                          (0.023)     (0.023)     (0.023)     (0.023)     (0.024)     (0.023)
Challenge
  Male                    -0.069***   -0.051**    -0.051**    -0.042**    -0.035      -0.048**
                          (0.020)     (0.020)     (0.020)     (0.020)     (0.021)     (0.020)
Consolidate
  Male                    -0.033*     -0.037**    -0.037**    -0.043**    -0.036*     -0.051***
                          (0.019)     (0.019)     (0.018)     (0.019)     (0.020)     (0.020)
Captivate
  Male                    -0.154***   -0.172***   -0.163***   -0.169***   -0.170***   -0.171***
                          (0.019)     (0.019)     (0.019)     (0.020)     (0.021)     (0.020)
Control
  Male                    -0.007      -0.006      -0.016      0.013       -0.001      0.021
                          (0.022)     (0.022)     (0.023)     (0.020)     (0.020)     (0.019)
Confer
  Male                    -0.153***   -0.148***   -0.144***   -0.148***   -0.144***   -0.152***
                          (0.020)     (0.020)     (0.020)     (0.021)     (0.021)     (0.021)
All 7Cs
  Male                    -0.131***   -0.132***   -0.130***   -0.127***   -0.124***   -0.132***
                          (0.020)     (0.021)     (0.021)     (0.022)     (0.022)     (0.022)

ELA_{t−1}, Math_{t−1}                 X           X           X           X           X
Other Student                                     X           X           X           X
Class Controls                                                            X
School-Grade FE                                               X           X
Class FE                                                                              X

Notes: *** denotes significance at the 1%, ** at the 5% and * at the 10% levels. Standard errors are clustered at the school level. Although this table only shows the coefficient estimates for the male variable, other controls were included. Starting with column (3), all individual student controls were included. The controls include lagged ELA and math, age, race, grade level, English Language Learner (ELL) status, gifted status, free and reduced price lunch (FRPL) status, and indicator variables showing if any of the previous variables were missing (we impute to maintain consistent sample size). In column (5) additional classroom characteristic controls were included. These include average lagged peer achievement (both math and ELA), average age, %male, %black, %hispanic, %race other, %gifted, %ELL, and %FRPL. These classroom characteristic variables were calculated for each student excluding themselves.
teaching practice. For instance, engaged students may rate teachers more highly, suggesting
the possible concern of reverse causality.
To address these concerns, we instrument $E_{ij}$ with the lag of teacher j’s average evaluation
for her/his t−1 class for females and males, $E^F_{jt,t-1}$ and $E^M_{jt,t-1}$.27 These instruments are arguably
independent of unobservables at the student level that contribute to either achievement or
certain types of evaluations. The main reason our instrumental variable strategy might fail
to address the endogeneity concern is if there is matching of students of certain types to
teachers with certain practices or effectiveness, but again recall the balancing tests suggest
that this is not a concern. Because we are over-identified, we can also test whether the
instruments pass the test of over-identifying restrictions, which would also be unlikely if
matching were relevant.
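As a hedged illustration of the identification logic described above, the following Python sketch simulates data in which an unobserved student trait raises both the student's evaluation and achievement, and then compares OLS with two-stage least squares using two instruments that stand in for the lagged female and male class averages. Everything in it is an assumption made for illustration: the data-generating process, the coefficient values, the variable names, and the helper `two_stage_least_squares` are hypothetical and are not the authors' specification, data, or code.

```python
import numpy as np


def two_stage_least_squares(y, endog, instruments):
    """Return the 2SLS slope on one endogenous regressor (intercept included)."""
    n = len(y)
    const = np.ones((n, 1))
    # First stage: project the endogenous evaluation on the instruments.
    Z = np.column_stack([const, instruments])
    first_stage, *_ = np.linalg.lstsq(Z, endog, rcond=None)
    endog_hat = Z @ first_stage
    # Second stage: regress the outcome on the fitted evaluation.
    X_hat = np.column_stack([const, endog_hat])
    beta, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
    return beta[1]


rng = np.random.default_rng(0)
n = 20_000
true_effect = 0.3  # assumed effect of the evaluation on achievement

# Hypothetical instruments: lagged class-average evaluations by girls and boys.
lag_eval_f = rng.normal(size=n)
lag_eval_m = 0.6 * lag_eval_f + 0.8 * rng.normal(size=n)

# Unobserved student trait (e.g. engagement) that drives both variables.
unobserved = rng.normal(size=n)

evaluation = (0.5 * lag_eval_f + 0.3 * lag_eval_m
              + unobserved + 0.5 * rng.normal(size=n))
achievement = true_effect * evaluation + 0.5 * unobserved + rng.normal(size=n)

# OLS slope is contaminated by the unobserved trait; 2SLS is not.
ols_estimate = np.polyfit(evaluation, achievement, 1)[0]
iv_estimate = two_stage_least_squares(
    achievement, evaluation, np.column_stack([lag_eval_f, lag_eval_m])
)

print(f"OLS:  {ols_estimate:.3f}")
print(f"2SLS: {iv_estimate:.3f}")
```

In this simulated setting OLS overstates the effect because the unobserved trait enters both the evaluation and achievement, whereas the instrumented estimate should land near the assumed value of 0.3. The sketch omits the controls, fixed effects, clustered standard errors, and diagnostic tests used in the paper.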
5.2 Results
Table 9 shows estimates of equation (4), first without any control for teaching practice (column
1), then with the overall 7C’s (column 2), and then with each of the 7C’s that shows a
statistically significant raw gender gap: Clarify, Care, Captivate, Challenge and
Confer (columns 3 to 7). These specifications follow the instrumental variable strategy
described above. Comparing columns (1) and (2), we see that the overall 7C’s explains a little
less than half of the gap, 0.03 of a standard deviation. A 1 standard deviation increase in
overall 7C’s increases achievement by 0.27 of a standard deviation. Weak instruments do
not appear to be a problem and we pass the test of over-identifying restrictions, providing
supportive evidence that these results are not driven by matching or other unobservable
characteristics of the current student population. Among the five domains of the 7C’s that
27 We have also explored instrumenting $E_{ij}$ with $E_{jt,t-1}$ (i.e., the average lagged evaluation, not gender-specific). Results are almost identical in terms of explaining the gender gap; if anything they become stronger, as we discuss in the following subsection.
we explore, Captivate and Confer explain the largest proportion of the gap. Column (5)
shows that Captivate explains all of the achievement gap and column (7) that Confer
explains about half.28 Increasing Captivate by one standard deviation raises ELA achievement
by 0.41 of a standard deviation; increasing Confer by one standard deviation raises it by a smaller
amount, 0.28.29
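The "share of the gap explained" statements above follow from comparing the male coefficient with and without a given teaching-practice control. A minimal Python sketch of that arithmetic, using the male coefficients reported in Table 9, is given below; the function name `share_explained` and the dictionary layout are illustrative assumptions rather than anything defined in the paper.

```python
def share_explained(baseline_gap: float, conditional_gap: float) -> float:
    """Fraction of the baseline gap that disappears once a control is added."""
    return (baseline_gap - conditional_gap) / baseline_gap


# Male coefficients by column, as reported in Table 9.
male_coef = {
    "No 7C": -0.071,
    "7Cs": -0.041,
    "Clarify": -0.050,
    "Care": -0.052,
    "Captivate": -0.005,
    "Challenge": -0.059,
    "Confer": -0.036,
    "Captivate+Confer": -0.008,
}

baseline = male_coef["No 7C"]
shares = {
    name: share_explained(baseline, coef)
    for name, coef in male_coef.items()
    if name != "No 7C"
}

for name, share in shares.items():
    print(f"{name:<17} {share:6.1%}")
```

These are point estimates only: a remaining male coefficient that is small but imprecisely estimated (as in the Captivate column) is what underlies the statement that a practice explains "all" of the gap.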
To account for correlations in our measures, we include in column (8) our two leading
explanatory factors, Confer and Captivate, in the same regression. Combined, these measures
explain all of the gender gap. In this case, the marginal effects of Captivate and Confer are
comparable in magnitude, about 0.21 each, and jointly significant at the 1% level.
To test whether these findings are robust to controlling for other measures of teacher
effectiveness, Appendix Table A.7 repeats the analysis in Table 9, adding only
controls for CLASS, FFT, PLATO, principal survey ratings, years of experience,
teacher gender and CKT. Findings for Captivate and Confer remain robust; they remain
jointly significant at the 0.03 level.
As a final robustness check, we explore in Table 10 the extent to which the gender gap
in Tripod explains gender differences in SAT9 performance. SAT9 is a useful comparison
because it is a low-stakes English/language arts test, in that students were only tested for
the purpose of the MET study. It differs substantively by including a writing component. As
shown in column (1), the conditional gap is also much larger, 0.29 of a standard deviation.
This may be partly because of the writing element, which has been shown to have larger
male/female gaps (Schwabe et al., 2015), but also likely because we are not able to control
28 If we instead instrument $E_{ij}$ with $E_{jt,t-1}$ (i.e., the average lagged evaluation, not gender-specific), the equivalent male coefficients in columns (5) and (7) would be 0.023 (positive but not statistically significant) and -0.035, respectively.
29 We also experimented with principal component scores (instead of averages) of the different Tripod questions corresponding to each of the 7Cs, but results do not change substantially; if anything, the specification with all 7C’s shows that the male coefficient becomes slightly smaller and not statistically significantly different from 0.
for lagged SAT9 in this sample and instead control for lagged ELA. The overall 7C’s (column
2) explain 0.05 of a standard deviation, slightly more than for ELA, but significantly less in
terms of proportion of the gap. Controlling for Captivate alone (column 5) explains 0.12 of
a standard deviation of the gap (almost half), whereas Confer (column 7) explains 0.06 of
a standard deviation. Combined, Confer and Captivate again explain just under half of the
gap and are jointly significant at the 0.001 level.30
To get a sense of the magnitude of the contributions of Confer and Captivate to the
overall ELA gap by fourth grade, we consider how much boys would gain if their average
Confer and Captivate increased to the level of girls. Assuming the same marginal effect at
different grade levels and taking the persistence of 0.52 estimated in column (8) of Table 9,
we find that the average boy would gain 0.11 of a standard deviation in ELA, which comes
close to closing the raw gap, 0.16.
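The text does not spell out the formula behind this back-of-the-envelope figure. One way such a calculation could be structured, offered here only as a sketch under stated assumptions, is to let a constant per-year gain decay geometrically at the persistence rate and sum it over the years of exposure. The five-year horizon is taken from the paper's conclusion; the geometric-decay structure, the function name `cumulative_multiplier`, and the implied per-year gain derived from it are assumptions, not quantities reported by the authors.

```python
def cumulative_multiplier(persistence: float, years: int) -> float:
    """Sum of persistence**k for k = 0..years-1.

    A gain received k years ago is assumed to retain persistence**k of its
    original size, so a constant annual gain g accumulates to g * multiplier.
    """
    return sum(persistence ** k for k in range(years))


persistence = 0.52      # reported in the text (column (8) of Table 9)
years = 5               # "over 5 years", as stated in the conclusion
reported_total = 0.11   # cumulative gain reported in the text

multiplier = cumulative_multiplier(persistence, years)

# Per-year gain that would be consistent with the reported total,
# if (and only if) the assumed geometric structure holds.
implied_annual_gain = reported_total / multiplier

print(f"multiplier: {multiplier:.3f}")
print(f"implied annual gain: {implied_annual_gain:.3f}")
```

Under these assumptions the multiplier is roughly 2, so the reported cumulative gain would correspond to a per-year gain of a little over 0.05 of a standard deviation.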
6 Mechanisms Exploration
One natural way that Captivate and Confer may operate to close the achievement gap is
through student engagement. Thus, we explore the effects of Confer and Captivate on our
measures of student engagement, including self-reported effort, homework completion and
happiness in class.31 Recall from Table 1 that boys have statistically significantly lower levels
of engagement than girls in all dimensions we measure.
30 Appendix Table A.8 repeats the analysis in Table 9 but based on a subsample in which teachers were randomly allocated to classrooms within randomization blocks (i.e., approximately within school-grade level). After further constraining this subsample to classrooms with more than 50% student compliance (i.e., students remaining in their original classrooms), we ended up with 3072 observations. Results show that while the standard errors have increased substantially (mainly due to a weak first stage), the pattern that Captivate and Confer play a key role in explaining the gender gap remains.
31 We do not have access to teacher assessments of student engagement. The particular questions that refer to these domains are explained in Appendix Table A.1.
Table 9: Gender Differentiation: Student Evaluation-based Practices and ELA (N=8,326)

              No 7C      7Cs        Clarify    Care       Captivate  Challenge  Confer     Captivate+Confer
              (1)        (2)        (3)        (4)        (5)        (6)        (7)        (8)
7C                       0.273***   0.200**    0.161***   0.404***   0.411***   0.274***   -
                         (0.080)    (0.079)    (0.062)    (0.145)    (0.120)    (0.093)
Captivate                                                                                  0.209
                                                                                           (0.155)
Confer                                                                                     0.211**
                                                                                           (0.107)
Male          -0.071***  -0.041**   -0.050***  -0.052***  -0.005     -0.059***  -0.036*    -0.008
              (0.015)    (0.018)    (0.017)    (0.016)    (0.027)    (0.016)    (0.020)    (0.026)
F-Stat†                  10.773     11.141     27.282     14.567     12.471     10.973     6.859
J P-Value‡               0.439      0.600      0.366      0.305      0.479      0.646      0.871
Joint Significance of Captivate and Confer
P-Value⋆                                                                                   0.002

Notes: *** denotes significance at the 1%, ** at the 5% and * at the 10% levels. Standard errors are clustered at the school level. Each column includes a separate regression with ELA as the dependent variable and the 7C’s measure indicated in the header as a key independent variable. The regressions instrument the student report of the relevant 7C domain with the average of that measure for boys and girls who had the same teacher in the prior year. All regressions control for school-grade fixed effects and student lagged achievement (ELA and math), age, gender, race, English Language Learner (ELL) status, gifted status, free and reduced price lunch (FRPL) status, and indicator variables showing if any of the previous variables were missing (including the teacher practice). Classroom characteristic controls include average lagged peer achievement (both math and ELA), average age, %male, %black, %hispanic, %race other, %gifted, %ELL, and %FRPL. ⋆ Reports the P-value from the joint test of significance for Confer and Captivate. † Reports the Kleibergen-Paap rk Wald statistic for a weak instrument test. ‡ Reports the P-value from Hansen’s J statistic test of overidentifying restrictions.
Table 10: Gender Differentiation: Student Evaluation-based Practices and SAT-9 (N=7779)

              No 7C      7Cs        Clarify    Care       Captivate  Challenge  Confer     Captivate+Confer
              (1)        (2)        (3)        (4)        (5)        (6)        (7)        (8)
7C                       0.537***   0.578***   0.314***   0.779***   0.362**    0.495***   -
                         (0.166)    (0.175)    (0.103)    (0.247)    (0.178)    (0.159)
Captivate                                                                                  0.378
                                                                                           (0.240)
Confer                                                                                     0.397**
                                                                                           (0.185)
Male          -0.291***  -0.237***  -0.229***  -0.257***  -0.168***  -0.284***  -0.226***  -0.179***
              (0.020)    (0.027)    (0.031)    (0.023)    (0.048)    (0.020)    (0.030)    (0.042)
F-Stat†                  9.968      10.369     24.339     13.891     12.280     11.127     10.636
J P-Value‡               0.463      0.494      0.240      0.428      0.594      0.839      0.860
Joint Significance of Captivate and Confer
P-Value⋆                                                                                   0.001

Notes: *** denotes significance at the 1%, ** at the 5% and * at the 10% levels. Standard errors are clustered at the school level. Each column includes a separate regression of SAT9 as the dependent variable and the 7C’s measure indicated in the header as a key independent variable. The regressions instrument the student report of the relevant 7C domain with the average of that measure for boys and girls who had the same teacher in the prior year. All regressions control for school-grade fixed effects and student lagged achievement (ELA and math), age, gender, race, English Language Learner (ELL) status, gifted status, free and reduced price lunch (FRPL) status, and indicator variables showing if any of the previous variables were missing (including the teacher practice). Classroom characteristic controls include average lagged peer achievement (both math and ELA), average age, %male, %black, %hispanic, %race other, %gifted, %ELL, and %FRPL. ⋆ Reports the P-value from the joint test of significance for Confer and Captivate. † Reports the Kleibergen-Paap rk Wald statistic for a weak instrument test. ‡ Reports the P-value from Hansen’s J statistic test of overidentifying restrictions.
To characterize this gap in engagement, we study whether it can be explained by a
large set of student and classroom controls, and to what extent exposure to teachers
with higher levels of Captivate and Confer could help close this gap. Table 11
shows results for nine specifications: three different models for each of
the three engagement proxies used as dependent variables. Columns (1), (4) and (7)
present OLS baseline estimates where we regress homework completion, the index of student
effort, and the index of student happiness on an indicator for gender. Columns (2), (5)
and (8) add a large set of student and classroom controls to see how much of the gap
they explain. Finally, columns (3), (6) and (9) show
IV specifications where we explore the effect of our key teaching practices, instrumenting
Captivate and Confer as in the previous section. Surprisingly, we find that our rich set of
student and classroom controls explains little of the male/female engagement gap. However,
after adding Captivate and Confer, gender disparities are reduced by (almost) half in the case
of homework completion. The gap is no longer statistically significantly different from 0 for
effort and happiness. Interestingly, these results also show that Captivate plays a dominant
role for homework completion and effort, whereas Confer dominates for whether students
report being happy in class.
We finally consider the extent to which our engagement variables correlate with ELA
performance. Column (1) of Table 12 reports, for comparison purposes, the baseline gender
gap in ELA. Column (2) adds to our baseline specification all the engagement measures.
Finally, column (3) corresponds to an IV specification where we further include Captivate
and Confer as controls but we instrument for them in a similar fashion as in previous specifi-
cations. Results indicate that the engagement proxies are positively and jointly significantly
correlated with test score performance. The only measure that is not statistically significant
by itself is the index for happiness in class; however, this is explained by its correlation with
Table 11: Gender Mechanisms: Teaching Practice and Student Engagement (N=7911)

                   Homework Complete                Effort                           Happy
                   (1)        (2)        (3)        (4)        (5)        (6)        (7)        (8)        (9)
Captivate                                0.166*                           0.548***                         0.392**
                                         (0.094)                          (0.179)                          (0.185)
Confer                                   0.046                            0.017                            0.713***
                                         (0.069)                          (0.206)                          (0.166)
Male               -0.080***  -0.079***  -0.047***  -0.143***  -0.130***  -0.042     -0.122***  -0.126***  0.020
                   (0.010)    (0.011)    (0.015)    (0.024)    (0.036)    (0.023)    (0.024)    (0.027)    (0.030)
Student control    No         Yes        Yes        No         Yes        Yes        No         Yes        Yes
Classroom control  No         Yes        Yes        No         Yes        Yes        No         Yes        Yes
P-Value⋆                                 0.015                            0.000                            0.000
F-Stat†                                  8.993                            8.993                            8.993
J P-value‡                               0.473                            0.656                            0.151

Notes: *** denotes significance at the 1%, ** at the 5% and * at the 10% levels. Standard errors are clustered at the school level. Each column includes a separate regression of student engagement on a male dummy and a different set of controls. All regressions control for school-grade fixed effects. Columns 2, 3, 5, 6, 8, 9 control for student and classroom characteristics, including student lagged achievement (ELA and math), age, gender, race, English Language Learner (ELL) status, gifted status, free and reduced price lunch (FRPL) status, and indicator variables showing if any of the previous variables were missing (including the teacher practice), average lagged peer achievement (both math and ELA), average age, %male, %black, %hispanic, %race other, %gifted, %ELL, and %FRPL. ⋆ Reports the P-value from the joint test of significance for Confer and Captivate. Columns 3, 6, and 9 instrument for the student report of the teaching practice with a measure of the teaching practice based on the average of the teacher’s prior classroom broken out by gender. † Reports the Kleibergen-Paap rk Wald statistic for a weak instrument test. ‡ Reports the P-value from Hansen’s J statistic test of overidentifying restrictions.
Table 12: Gender Gap in ELA with Engagement and Teaching Practice Controls (N=8236)

              (1)        (2)        (3)
Captivate                           0.267
                                    (0.197)
Confer                              0.389***
                                    (0.148)
Male          -0.071***  -0.055***  -0.005
              (0.015)    (0.015)    (0.028)
HW Complete              0.062***   -0.058
                         (0.016)    (0.058)
Effort                   0.058***   0.037**
                         (0.008)    (0.017)
Happy                    0.010      -0.248**
                         (0.007)    (0.110)
Test for Joint Significance of Student Engagement
P-Value                  0.000      0.000
Test for Joint Significance of Captivate and Confer
P-Value                             0.025
KP F-Statistic†                     4.673
Test overidentifying restrictions
P-value‡                            0.718

Notes: *** denotes significance at the 1%, ** at the 5% and * at the 10% levels. Standard errors are clustered at the school level. All regressions control for school-grade fixed effects, student lagged achievement (ELA and math), age, gender, race, English Language Learner (ELL) status, gifted status, free and reduced price lunch (FRPL) status, indicator variables showing if any of the previous variables were missing (including the teacher practice), average lagged peer achievement (both math and ELA), average age, %male, %black, %hispanic, %race other, %gifted, %ELL, and %FRPL. Column 3 instruments for the student report of the teaching practice with a measure of the teaching practice based on the average of the teacher’s prior classroom broken out by gender. † Reports the Kleibergen-Paap rk Wald statistic for a weak instrument test. ‡ Reports the P-value from Hansen’s J statistic test of overidentifying restrictions.
the effort measures.32 Moreover, our estimates show that student engagement measures can
explain an important share of the gender gap in ELA (23%).33 Finally, column (3) indicates
that the teaching practices explain a significant part of the relationship between engagement
and ELA, which we interpret as evidence that the practices may work partly through
student engagement.34 That said, the engagement measures do not have the power to ex-
plain the relationship between teaching practices and ELA. This is not surprising given the
likely unobserved aspects of engagement and measurement error. Further exploration would
benefit from richer measures of engagement and behavior to develop a better sense of the
key mediators underlying the relationship between these teaching practices and ELA.
7 Conclusion
We find that in our sample there is considerable heterogeneity in teacher value-added for boys
and girls, between 0.08 and 0.10 of a standard deviation, enough to explain the conditional
achievement gap. That said, we find little evidence that this heterogeneity in value-added
measures of effectiveness is explained by heterogeneous returns to measures related to teacher
effectiveness, including video observation protocols, principal-survey and student survey-
based measures, teaching knowledge, experience or gender. These results are robust to
adjustments for measurement error and do not appear to be driven by matching. On the
positive side, observation protocols designed to measure effective instruction–FFT, CLASS
and PLATO–do not appear to be unfairly biased toward practices that favor girl achievement
at the expense of boy achievement or vice versa. Combined with evidence that boys may
32 If the two effort variables are not included, happiness becomes positive and statistically significant. Appendix Table A.9 reports similar specifications as in Table 12 but including each of the engagement variables separately.
33 Importantly, we do not make a claim of causation here; we only say that our engagement measures are related to ELA performance.
34 Interestingly, in column (3) we see that homework completion is no longer significant, the effort coefficient shrinks substantially, and happiness changes sign.
learn differently from girls (Gurian, 2010), this is good news given that these protocols are
growing in importance in many districts for both teacher development and accountability.
On the other hand, we find that the gender gap in one student survey-based 7C’s measure
of teaching practice, Captivate, which captures the extent to which students find school
and homework engaging, fully explains the gender ELA gap on its own. Confer, which measures
the student’s perception of the teacher’s encouragement of student discussion, also explains
about half the ELA gender gap. Combined, these two measures seem to matter similarly
for producing ELA value-added and explain all of the conditional ELA gap. We rule out that
this is explained by reverse causality or student unobserved attributes by using teachers’
prior Tripod scores as instrumental variables. We interpret this as evidence of meaningful
differentiation within the classroom. Back-of-the-envelope calculations suggest that raising
the amount of Captivate and Confer that boys receive to that of girls would increase boys’
achievement by 4th grade (over 5 years) by 0.11 of a standard deviation in ELA. This is about
two-thirds of the overall ELA gap for 4th and 5th graders in our sample, 0.16. Finally, we
show that Captivate and Confer operate (in part) by increasing boys’ engagement in schooling
activities.
Our findings have several important implications for practice. First, though boys are
lagging behind girls in ELA, some teachers appear to be highly effective in improving boys’
ELA outcomes. Second, observational measures that evaluate teachers based on a set of
well-accepted practices do not appear to be biased toward practices that favor girls. More
work would be useful to determine if the practices favored by these protocols, which generally
favor a student-centered approach to teaching, are unevenly applied to boys and girls in the
classroom in ways that are simply not captured by the protocol. Third, student surveys
capture useful information about teacher differentiation that can help explain achievement.
Our results indicate the need for caution in interpreting heterogeneity in student reports as
evidence of student bias; rather, gender gaps appear to capture meaningful differences in the
learning that is occurring in the classroom. Finally, our findings on Captivate and Confer
suggest that focusing on practices that move the needle on boys’ interest in school/homework
and create an environment where they feel welcome to interrogate ideas will be most fruitful
in narrowing achievement gaps.
References
Aucejo, Esteban and Jonathan James, “The Path to College Education: The Role of
Math and Verbal Skills,” Technical Report 1602, California Polytechnic State University,
Department of Economics 2016.
Aucejo, Esteban M and Jonathan James, “Catching up to girls: Understanding the
gender imbalance in educational attainment within race,” Journal of Applied Econometrics,
2019, 34 (4), 502–525.
Bassi, Marina, Costas Meghir, and Ana Reynoso, “Education quality and teaching
practices,” Technical Report, National Bureau of Economic Research 2016.
Bassi, Marina, Mercedes Mateo Diaz, Rae Lesser Blumberg, and Ana Reynoso, “Failing to
notice? Uneven teachers’ attention to boys and girls in the classroom,” IZA Journal of
Labor Economics, November 2018, 7 (1), 9.
Bertrand, Marianne and Jessica Pan, “The trouble with boys: Social influences and
the gender gap in disruptive behavior,” American Economic Journal: Applied Economics,
2013, 5 (1), 32–64.
Campbell, Shanyce L. and Matthew Ronfeldt, “Observational Evaluation of Teachers:
Measuring More Than We Bargained for?,” American Educational Research Journal, 2018,
55 (6), 1233–1267.
Chatterji, Madhabi, “Reading achievement gaps, correlates, and moderators of early reading
achievement: Evidence from the Early Childhood Longitudinal Study (ECLS) kindergarten
to first grade sample,” Journal of Educational Psychology, 2006, 98 (3), 489–507.
Chetty, Raj, John N Friedman, and Jonah E Rockoff, “Measuring the impacts of
teachers I: Evaluating bias in teacher value-added estimates,” The American Economic
Review, 2014, 104 (9), 2593–2632.
Cornwell, Christopher, David B. Mustard, and Jessica Van Parys, “Noncognitive
Skills and the Gender Disparities in Test Scores and Teacher Assessments: Evidence from
Primary School,” Journal of Human Resources, January 2013, 48 (1), 236–264.
Danielson, Charlotte, The framework for teaching evaluation instrument 2011. Published:
The Danielson Group.
Dee, Thomas S, “Teachers and the gender gaps in student achievement,” Journal of Human
Resources, 2007, 42 (3), 528–554.
Entwisle, Doris R., Karl L Alexander, and Linda Steffel Olson, Children, schools,
and inequality, Social inequality series, Boulder, Colo.: Westview Press, 1997.
Ferguson, R, “The tripod project framework,” The Tripod Project, 2008.
Figlio, David, Krzysztof Karbownik, Jeffrey Roth, Melanie Wasserman et al.,
“Family disadvantage and the gender gap in behavioral and educational outcomes,” American
Economic Journal: Applied Economics, 2019, 11 (3), 338–81.
Garrett, Rachel and Matthew P Steinberg, “Examining teacher effectiveness using
classroom observation scores: Evidence from the randomization of teachers to students,”
Educational Evaluation and Policy Analysis, 2015, 37 (2), 224–242.
Grossman, Pam, Susanna Loeb, Julie Cohen, and James Wyckoff, “Measure for
Measure: The Relationship between Measures of Instructional Practice in Middle School
English Language Arts and Teachers’ Value-Added Scores,” American Journal of Education,
May 2013, 119 (3), 445–470.
Gurian, Michael, Boys and Girls Learn Differently! A Guide for Teachers and Parents:
Revised 10th Anniversary Edition, John Wiley & Sons, October 2010.
Gurian, Michael and Kathy Stevens, “With Boys and Girls in Mind,” Educational Leadership, 2004, 62
(3), 21–26.
Hamre, Bridget K., Robert C. Pianta, Jason T. Downer, Jamie DeCoster, Andrew J.
Mashburn, Stephanie M. Jones, Joshua L. Brown, Elise Cappella,
Marc Atkins, Susan E. Rivers, Marc A. Brackett, and Aki Hamagami, “Teaching
through Interactions: Testing a Developmental Framework of Teacher Effectiveness in
over 4,000 Classrooms,” The Elementary School Journal, 2013, 113 (4), 461–487.
Jackson, C. Kirabo, “The Effect of Single-Sex Education on Test Scores, School Completion,
Arrests, and Teen Motherhood: Evidence from School Transitions,” Technical
Report w22222, National Bureau of Economic Research, Cambridge, MA May 2016.
Jackson, C. Kirabo, Jonah E. Rockoff, and Douglas O. Staiger, “Teacher Effects and Teacher-Related
Policies,” Annual Review of Economics, 2014, 6 (1), 801–825.
Kane, Thomas J and Douglas O Staiger, “Gathering Feedback for Teaching: Combining
High-Quality Observations with Student Surveys and Achievement Gains. Research Paper.
MET Project.,” Bill & Melinda Gates Foundation, 2012.
Kane, Thomas J, Daniel F McCaffrey, Trey Miller, and Douglas O Staiger, “Have we identified
effective teachers? Validating measures of effective teaching using random assignment,”
in “Research Paper. MET Project. Bill & Melinda Gates Foundation” Citeseer 2013.
Kelly, Sean, Robert Bringe, Esteban Aucejo, and Jane Cooley Fruehwirth, “Using
global observation protocols to inform research on teaching effectiveness and school
improvement: Strengths and emerging limitations,” Education Policy Analysis Archives,
2020, 28, 62.
Koedel, Cory, Kata Mihaly, and Jonah E. Rockoff, “Value-added modeling: A review,”
Economics of Education Review, 2015, 47, 180–195.
Konstantopoulos, Spyros, “Teacher Effects, Value-Added Models and Accountability,”
Teachers College Record, 2014, 116 (1).
Lavy, Victor and Edith Sand, “On The Origins of Gender Human Capital Gaps: Short
and Long Term Consequences of Teachers’ Stereotypical Biases,” Technical Report 20909,
National Bureau of Economic Research, Inc January 2015.
Legewie, Joscha and Thomas A. DiPrete, “School Context and the Gender Gap in
Educational Achievement,” American Sociological Review, June 2012, 77 (3), 463–485.
Loveless, Tom, “Girls, boys, and reading,” November 2015.
MacNell, Lillian, Adam Driscoll, and Andrea N. Hunt, “What’s in a Name: Exposing
Gender Bias in Student Ratings of Teaching,” Innovative Higher Education, August 2015,
40 (4), 291–303.
Mengel, Friederike, Jan Sauermann, and Ulf Zölitz, “Gender bias in teaching evaluations,”
Journal of the European Economic Association, 2019, 17 (2), 535–566.
Northrop, L., “Breaking the Cycle: Cumulative Disadvantage in Literacy,” Reading Research
Quarterly, 2017, 52 (4), 391–396.
Reardon, Sean, Erin Fahle, Demetra Kalogrides, Anne Podolsky, and Rosalia
Zarate, Geographic Variation of District-Level Gender Achievement Gaps within the
United States, Society for Research on Educational Effectiveness, 2016.
Schwabe, Franziska, Nele McElvany, and Matthias Trendtel, “The school age gender
gap in reading achievement: Examining the influences of item format and intrinsic reading
motivation,” Reading Research Quarterly, 2015, 50 (2), 219–232.
Senechal, Monique and Jo-Anne LeFevre, “Parental Involvement in the Development
of Children’s Reading Skill: A Five-Year Longitudinal Study,” Child Development, 2002,
73 (2), 445–460.
Smith, Michael and Jeffrey D. Wilhelm, Reading Don’t Fix No Chevys: Literacy in
the Lives of Young Men, 1st ed., Portsmouth, NH: Heinemann, March 2002.
Steinberg, Matthew P and Rachel Garrett, “Classroom composition and measured
teacher performance: What do teacher observation scores really measure?,” Educational
Evaluation and Policy Analysis, 2016, 38 (2), 293–317.
Terrier, Camille, “Giving a Little Help to Girls? Evidence on Grade Discrimination and
its Effect on Students’ Achievement,” Technical Report dp1341, Centre for Economic
Performance, LSE March 2015.
A Appendix Tables
Table A.1: Student Survey Questions: Engagement and Tripod 7Cs

Dimension    Example Question Prompts
Care         My teacher in this class makes me feel that he/she really cares about me.
             The teacher in this class encourages me to do my best.
             My teacher gives us time to explain our ideas.
             My teacher seems to know if something is bothering me.
             If I am sad or angry, my teacher helps me feel better.
             My teacher is nice to me when I ask questions.
             I like the way my teacher treats me when I need help.
Control      Our class stays busy and does not waste time.
             Students behave so badly in this class that it slows down our learning.
             Everybody knows what they should be doing and learning in this class.
             My classmates behave the way my teacher wants them to.
Clarify      If you don’t understand something, my teacher explains it another way.
             My teacher has several good ways to explain each topic that we cover in this class.
             My teacher explains difficult things clearly.
             This class is neat, everything has a place and things are easy to find.
             My teacher explains things in very orderly ways.
             In this class, we learn to correct our mistakes.
             My teacher knows when the class understands, and when we do not.
             I understand what I am supposed to be learning in this class.
Challenge    My teacher pushes us to think hard about things we read.
             In this class, my teacher accepts nothing less than our full effort.
             In this class we have to think hard about the writing we do.
             My teacher pushes everybody to work hard.
Captivate    We have interesting homework.
             Homework helps me learn.
             School work is not very enjoyable.
             School work is interesting.
Confer       My teacher wants me to explain my answers.
             When he/she is teaching us, my teacher asks us whether we understand.
             My teacher tells us what we are learning and why.
             My teacher asks questions to be sure we are following along when he/she is teaching.
             My teacher checks to make sure we understand what he/she is teaching us.
             My teacher wants us to share our thoughts.
             Students speak up and share their ideas about class work.
Consolidate  When my teacher marks my work, he/she writes on my papers to help me understand how to do better.
             My teacher takes the time to summarize what we learn each day.
Effort       I have pushed myself hard to understand my lessons in this class.
             When doing schoolwork for this class, I try to learn as much as I can and I don’t worry about how long it takes.
Happy        Being in this class makes me feel sad or angry (reverse-coded).
             This class is a happy place for me to be.
Homework     When homework is assigned in this class, how much of it do you usually complete?

Effort and Happy items have the scale no/never (1); mostly not (2); maybe/sometimes (3); mostly yes (4);yes, always (5). Homework complete gets a value of 1 if students responded all and 0 otherwise. Tripodquestions have a 5-point scale (1-totally untrue to 5-totally true).
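The coding rules in the note above can be sketched in a few lines of Python. This is only an illustration under assumptions: the paper states that one Happy item is reverse-coded, that the scale runs from 1 to 5, and that homework completion equals 1 for the response "all"; combining items by a simple average, the function names, and the example responses are hypothetical choices made here.

```python
from statistics import mean

SCALE_MAX = 5  # items are answered on a 1-5 scale


def reverse_code(response: int) -> int:
    """Flip a 1-5 response so that higher always means more positive."""
    return SCALE_MAX + 1 - response


def happy_index(sad_or_angry: int, happy_place: int) -> float:
    """Average the two Happy items after reverse-coding the negative one.

    Averaging is an assumption; the note does not say how items are combined.
    """
    return mean([reverse_code(sad_or_angry), happy_place])


def homework_complete(answer: str) -> int:
    """Return 1 if the student reports completing all homework, else 0."""
    return 1 if answer.strip().lower() == "all" else 0


# Hypothetical student: rarely sad or angry (2), always happy in class (5).
print(happy_index(sad_or_angry=2, happy_place=5))  # 4.5
print(homework_complete("All"))                    # 1
print(homework_complete("most"))                   # 0
```

Any response label other than "all" (such as "most" in the example) is a placeholder, since the note does not list the remaining homework response options.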
Table A.2: Student Summary Statistics by Gender (Unrestricted Sample)
                                          Male                Female          Male-Female
                                    Mean     SD      N   Mean     SD      N   Mean P-value
Panel A: Student Characteristics
ELA(2009)                          -0.04   0.98   5928   0.13   0.93   6047  -0.17   0.00
ELA(2010)                          -0.04   0.98   6290   0.13   0.92   6437  -0.17   0.00
Effort                              3.96   1.08   5375   4.13   1.03   5450  -0.17   0.00
Happy                               3.94   1.05   5305   4.11   1.00   5416  -0.17   0.00
Homework Complete                   0.73   0.44   5203   0.81   0.39   5340  -0.08   0.00
Age                                 9.42   0.93   6607   9.34   0.90   6637   0.08   0.00
Gifted                              0.08   0.27   6611   0.08   0.27   6638   0.00   0.46
English Language Learner (ELL)      0.15   0.35   6611   0.13   0.34   6637   0.01   0.02
Free Reduced Price Lunch (FRPL)     0.48   0.50   4919   0.48   0.50   4911   0.00   0.90
White                               0.24   0.43   6558   0.24   0.43   6558   0.00   0.92
Black                               0.43   0.50   6558   0.44   0.50   6558   0.00   0.78
Hispanic                            0.25   0.43   6558   0.24   0.43   6558   0.00   0.58
Asian                               0.05   0.23   6558   0.05   0.23   6558   0.00   0.94
Race Other                          0.02   0.15   6558   0.02   0.15   6558   0.00   0.65
Grade Level                         4.54   0.50   6611   4.54   0.50   6638   0.00   0.97
Panel B: Class Characteristics
Avg. Lag Math                       0.04   0.51   5998   0.05   0.51   6109  -0.01   0.15
Avg. Lag ELA                        0.04   0.50   5928   0.06   0.51   6047  -0.02   0.08
Avg. Age                            9.39   0.81   5783   9.39   0.82   5806  -0.01   0.57
% Male                              0.52   0.10   5787   0.48   0.10   5807   0.04   0.00
% Black                             0.45   0.37   5750   0.45   0.37   5743   0.00   0.62
% Hispanic                          0.24   0.26   5750   0.24   0.26   5743   0.00   0.32
% Asian                             0.06   0.10   5750   0.06   0.11   5743   0.00   0.51
% Race Other                        0.02   0.04   5750   0.02   0.04   5743   0.00   0.19
% Gifted                            0.08   0.16   5787   0.09   0.17   5807   0.00   0.12
% ELL                               0.14   0.18   5787   0.14   0.18   5807   0.01   0.09
% FRPL                              0.47   0.31   4180   0.47   0.31   4174   0.00   0.57
Panel C: Student Tripod Responses
Clarify                             4.20   0.58   5396   4.27   0.56   5464  -0.07   0.00
Care                                4.12   0.75   5395   4.23   0.73   5465  -0.11   0.00
Challenge                           4.23   0.71   5392   4.29   0.70   5461  -0.06   0.00
Consolidate                         3.86   0.95   5337   3.89   0.97   5437  -0.03   0.09
Captivate                           3.60   0.86   5390   3.73   0.81   5462  -0.13   0.00
Control                             3.52   0.72   5382   3.53   0.73   5459  -0.01   0.59
Confer                              4.21   0.61   5383   4.31   0.57   5459  -0.10   0.00
All 7Cs                             3.96   0.54   5396   4.03   0.53   5465  -0.07   0.00
Notes: The sample sizes for males and females refer to the unrestricted sample where we only limit the sample to students in the second wave of the survey and only in grades 4-5; some variables in this table have fewer observations. The last column reports whether the means are statistically significantly different between males and females. The classroom-characteristic variables in Panel B are calculated excluding each individual student. The Tripod survey domains in Panel C were constructed by averaging over the relevant questions. All 7Cs averages over the 7 different domains.
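The leave-one-out construction of the Panel B classroom characteristics can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name, the treatment of missing values (skipped) and the handling of single-student classes (returned as None) are assumptions.

```python
from collections import defaultdict


def leave_one_out_means(class_ids, values):
    """For each student, average `values` over classmates, excluding the student.

    Missing values (None) are skipped; a student with no classmates that have
    a value gets None. Both choices are assumptions of this sketch.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for c, v in zip(class_ids, values):
        if v is not None:
            totals[c] += v
            counts[c] += 1

    result = []
    for c, v in zip(class_ids, values):
        own_value = 0.0 if v is None else v
        own_count = 0 if v is None else 1
        n_peers = counts[c] - own_count
        if n_peers > 0:
            result.append((totals[c] - own_value) / n_peers)
        else:
            result.append(None)
    return result


# Hypothetical lagged ELA scores for two classes.
classes = ["a", "a", "a", "b"]
lag_ela = [1.0, 2.0, 3.0, 4.0]
print(leave_one_out_means(classes, lag_ela))
```

Excluding the student's own value keeps the peer average from being mechanically correlated with the student's own characteristic.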
Table A.3: Teacher Descriptive Statistics by Student Gender (Unrestricted Sample)
                             Male                Female          Male-Female
                       Mean     SD      N   Mean     SD      N   Mean P-value
7C(2010)               4.01   0.25   5539   4.01   0.25   5538   0.00   0.71
FFT(2010)              2.66   0.25   4839   2.67   0.25   4750   0.00   0.47
CLASS(2010)            4.58   0.36   4839   4.59   0.36   4750  -0.01   0.08
PLATO(2010)            2.70   0.23   4839   2.69   0.23   4750   0.00   0.46
7C(2009)               3.96   0.26   5597   3.95   0.27   5605   0.01   0.07
FFT(2009)              2.66   0.24   4793   2.65   0.24   4717   0.00   0.65
CLASS(2009)            4.57   0.41   4810   4.57   0.41   4732   0.00   0.91
PLATO(2009)            2.65   0.27   4763   2.65   0.27   4688   0.00   0.74
Principal Survey       4.32   1.13   4679   4.34   1.14   4699  -0.03   0.24
CKT Score             -0.04   0.99   4676  -0.01   0.99   4666  -0.03   0.16
Years of Exp.          6.26   5.79   3694   6.23   5.92   3662   0.03   0.85
Male                   0.10   0.30   5326   0.09   0.28   5343   0.01   0.12
Black                  0.34   0.48   5326   0.34   0.47   5343   0.01   0.40
White                  0.58   0.49   5326   0.60   0.49   5343  -0.02   0.10
Hispanic               0.06   0.24   5326   0.06   0.23   5343   0.01   0.12
Notes: The sample sizes for males and females refer to the unrestricted sample where we only limit the sample to students in the second wave of the survey and only in grades 4-5. P-value in the last column tests whether the male and female means are statistically significantly different. The 7C variable is the average student score by class. FFT, CLASS and PLATO are also calculated as averages across all raters and domains. For this analysis, we consider every possible response when calculating this teacher average, so we include both ELA and Math responses in the case where the teacher is instructing both subjects in one class.
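A difference-in-means test of the kind reported in the last column can be approximated from the summary statistics alone. The sketch below uses an unequal-variance (Welch) statistic with a normal approximation for the two-sided p-value. The notes do not state which test the authors use, so this choice is an assumption, and because the tabulated means are rounded and the sketch ignores any clustering, it should not be expected to reproduce the reported p-values.

```python
import math


def welch_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample test of equal means from summary statistics.

    Returns the Welch t statistic and a two-sided p-value based on the
    normal approximation, which is reasonable for samples in the thousands.
    """
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    t = (mean1 - mean2) / se
    p = math.erfc(abs(t) / math.sqrt(2))  # equals 2 * (1 - Phi(|t|))
    return t, p


# CLASS(2010) row of the table: males versus females (rounded inputs).
t_stat, p_value = welch_from_summary(4.58, 0.36, 4839, 4.59, 0.36, 4750)
print(round(t_stat, 2), round(p_value, 2))
```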
Table A.4: Correlations Between Measures of Teacher Effectiveness
                   7Cs        FFT      CLASS      PLATO       PSVY        CKT  Yrs. Exp.     T.Male    T.Black    T.White T.Hispanic      φ^M_j        φ_j
7Cs                  1
FFT               0.14          1
CLASS             0.12       0.64          1
PLATO             0.10       0.47       0.41          1
PSVY              0.16       0.32       0.35       0.21          1
CKT              -0.10       0.07       0.14       0.13       0.17          1
Yrs. Exp.        -0.07       0.05       0.03      -0.13       0.11       0.13          1
T.Male           -0.09      -0.10      -0.05      -0.10      -0.13      -0.05       0.03          1
T.Black           0.05      -0.14      -0.14      -0.18      -0.13      -0.40      -0.17      -0.02          1
T.White          -0.08       0.15       0.16       0.18       0.09       0.32       0.11       0.02      -0.85          1
T.Hispanic        0.04      -0.03      -0.05       0.00       0.03       0.10       0.07      -0.02      -0.18      -0.31          1
φ^M_j             0.01       0.02       0.08       0.06       0.07       0.03       0.02       0.01       0.06      -0.02      -0.08          1
φ_j               0.18       0.06       0.07       0.06       0.26      -0.02      -0.06      -0.10      -0.08       0.04       0.07      -0.35          1
Obs.               464
Notes: Table shows correlations between teacher practices. We use the sample of elementary school teachers that have students with Tripod survey responses. 7C is the composite of the 7Cs which averages over the individual 7C variables. The 7C variable at the teacher level is the average student score by teacher (so this variable only differs by teacher). For this analysis, we consider every possible response when calculating this teacher average, so we include both ELA and Math responses in the case where the teacher is instructing both subjects in one class. FFT, CLASS and PLATO are also composites over individual components. PSVY is the principal survey responses for rating teachers, CKT refers to the teacher aptitude rating assigned by observers, Yrs. Exp. refers to years of experience in the current district and Teacher Male is the gender of the teacher. φ^M_j and φ_j are value-added variables discussed in section 3.2.
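A lower-triangular correlation table of this form can be built from teacher-level measures as in the following sketch. It uses the Pearson correlation with pairwise deletion of missing values; whether the authors use pairwise or listwise deletion is not stated, so that choice, along with the function names and the example data, is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation over pairs where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)


def lower_triangle(columns):
    """Map (row name, column name) to a correlation for the lower triangle."""
    names = list(columns)
    return {
        (names[i], names[j]): pearson(columns[names[i]], columns[names[j]])
        for i in range(len(names))
        for j in range(i + 1)
    }


# Hypothetical teacher-level scores, one entry per teacher.
measures = {
    "7Cs": [3.9, 4.1, 4.0, 4.3],
    "FFT": [2.5, 2.8, None, 2.9],
}
for pair, r in lower_triangle(measures).items():
    print(pair, round(r, 2))
```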
Table A.5: Heterogeneous Effects of Teachers Controlling for Class Fixed Effects
Columns: (1) FFT; (2) CLASS; (3) PLATO; (4) Teacher Aptitude (CKT); (5) Principal Evaluation; (6) Years of Experience; (7) Teacher Male
                             (1)        (2)        (3)        (4)        (5)        (6)        (7)
Panel A
Male × Teacher Measure     0.001      0.006     -0.011      0.002     -0.002     -0.001      0.019
                         (0.013)    (0.014)    (0.013)    (0.013)    (0.015)    (0.003)    (0.041)
Male                      -0.080***  -0.080***  -0.081***  -0.081***  -0.081***  -0.081***  -0.081***
                         (0.014)    (0.014)    (0.014)    (0.014)    (0.014)    (0.014)    (0.014)
Panel B
Male × Teacher Measure     0.002      0.006     -0.040          -          -          -          -
                         (0.033)    (0.028)    (0.053)
Male                      -0.081***  -0.081***  -0.081***       -          -          -          -
                         (0.014)    (0.014)    (0.014)
First Stage F†            76.117     90.432     27.155          -          -          -          -
Notes: *** denotes significance at the 1%, ** at the 5% and * at the 10% levels. Standard errors are clustered at the school level. Each column refers to a separate regression with ELA as the dependent variable and the teacher measure listed in the header. All regressions control for class fixed effects. The controls include student lagged achievement (ELA and math), age, gender, race, English Language Learner (ELL) status, gifted status, free and reduced price lunch (FRPL) status, and indicator variables showing if any of the previous variables were missing (including the teacher practice). Classroom controls are also included: average lagged peer achievement (both math and ELA), average age, % male, % black, % hispanic, % race other, % gifted, % ELL, and % FRPL. † Reports the Kleibergen-Paap rk Wald statistic for a weak instrument test.
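The class fixed effects specification behind these columns can be sketched with a within-class transformation. The code below demeans ELA, the male indicator and the male-by-measure interaction within class and solves the two-regressor least-squares problem directly. It is a stripped-down illustration: the student and classroom controls, the clustered standard errors and the instrumented variant in Panel B are all omitted, and the function names and data are hypothetical. In this sketch the teacher measure is constant within a class, so its main effect is absorbed by the class fixed effects and only the interaction is estimated.

```python
from collections import defaultdict


def demean_within(groups, values):
    """Subtract the group mean from each value (the within transformation)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for g, v in zip(groups, values):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for g, v in zip(groups, values)]


def class_fe_interaction(class_ids, ela, male, teacher_measure):
    """OLS of ELA on Male and Male x Measure with class fixed effects.

    Returns (coefficient on Male, coefficient on Male x Measure).
    """
    interaction = [m * t for m, t in zip(male, teacher_measure)]
    y = demean_within(class_ids, ela)
    x1 = demean_within(class_ids, male)
    x2 = demean_within(class_ids, interaction)

    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))

    det = s11 * s22 - s12 ** 2  # zero if the measure does not vary across classes
    b_male = (s22 * s1y - s12 * s2y) / det
    b_interaction = (s11 * s2y - s12 * s1y) / det
    return b_male, b_interaction


# Hypothetical data: two classes whose teachers score 1 and 3 on some measure.
class_ids = ["a", "a", "a", "a", "b", "b", "b", "b"]
male = [1, 0, 1, 0, 1, 0, 1, 0]
measure = [1, 1, 1, 1, 3, 3, 3, 3]
ela = [9.75, 10.0, 9.75, 10.0, 20.25, 20.0, 20.25, 20.0]
print(class_fe_interaction(class_ids, ela, male, measure))
```

A positive coefficient on the interaction would indicate that boys gain relatively more than girls from teachers who score higher on the measure.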
Table A.6: Heterogeneous Effects of Teachers Controlling for Class Fixed Effects
Columns: (1) 7Cs; (2) Clarify; (3) Care; (4) Challenge; (5) Consolidate; (6) Captivate; (7) Control; (8) Confer
                             (1)        (2)        (3)        (4)        (5)        (6)        (7)        (8)
Panel A
Male × Teacher Measure     0.001      0.000      0.007     -0.009      0.001     -0.003      0.005      0.003
                         (0.011)    (0.012)    (0.012)    (0.013)    (0.013)    (0.012)    (0.012)    (0.012)
Male                      -0.081***  -0.081***  -0.081***  -0.081***  -0.081***  -0.081***  -0.081***  -0.081***
                         (0.014)    (0.014)    (0.014)    (0.014)    (0.014)    (0.014)    (0.014)    (0.014)
Panel B
Male × Teacher Measure    -0.013     -0.005      0.003     -0.028     -0.014     -0.027     -0.004     -0.004
                         (0.025)    (0.026)    (0.022)    (0.034)    (0.027)    (0.027)    (0.028)    (0.030)
Male                      -0.080***  -0.081***  -0.081***  -0.081***  -0.080***  -0.080***  -0.081***  -0.081***
                         (0.014)    (0.014)    (0.014)    (0.014)    (0.014)    (0.014)    (0.014)    (0.014)
First Stage F†            82.913     79.215    107.615     51.095     91.064     92.075     73.507     48.571
Notes: *** denotes significance at the 1%, ** at the 5% and * at the 10% levels. Standard errors are clustered at the school level. Each column refers to a separate regression with ELA as the dependent variable and the teacher measure listed in the header. All regressions control for class fixed effects. The controls include student lagged achievement (ELA and math), age, gender, race, English Language Learner (ELL) status, gifted status, free and reduced price lunch (FRPL) status, and indicator variables showing if any of the previous variables were missing (including the teacher practice). Classroom controls are also included: average lagged peer achievement (both math and ELA), average age, % male, % black, % hispanic, % race other, % gifted, % ELL, and % FRPL. † Reports the Kleibergen-Paap rk Wald statistic for a weak instrument test.
Table A.7: Gender Differentiation: Student Evaluation-based Practices and ELA with Teacher Controls (N=8236)
Columns: (1) No 7C; (2) 7Cs; (3) Clarify; (4) Care; (5) Captivate; (6) Challenge; (7) Confer; (8) Captivate + Confer
7C                        0.110  0.118**  0.335**  0.341***  0.201**  -
                          (0.078)  (0.059)  (0.132)  (0.126)  (0.100)
Captivate                 0.196
                          (0.143)
Confer                    0.148
                          (0.114)
Male                      -0.071***  -0.070***  -0.058***  -0.055***  -0.015  -0.061***  -0.045**  -0.018
                          (0.015)  (0.015)  (0.016)  (0.016)  (0.025)  (0.016)  (0.020)  (0.024)
First stage F†            9.972  23.934  16.886  11.010  10.325  9.146
J P-value‡                0.943  0.982  0.643  0.387  0.809  0.993
Joint significance of Captivate and Confer P-Value⋆    0.018
Notes: *** denotes significance at the 1%, ** at the 5% and * at the 10% levels. Standard errors are clustered at the school level. Each column includes a separate regression of SAT9 as the dependent variable and the 7C’s measure indicated in the header as a key independent variable. The regressions instrument the student report of the relevant 7C domain with the average of that measure for boys and girls who had the same teacher in the prior year. All regressions control for school-grade fixed effects and student lagged achievement (ELA and math), age, gender, race, English Language Learner (ELL) status, gifted status, free and reduced price lunch (FRPL) status, and indicator variables showing if any of the previous variables were missing (including the teacher practice). Classroom characteristic controls include average lagged peer achievement (both math and ELA), average age, % male, % black, % hispanic, % race other, % gifted, % ELL, and % FRPL. Teacher controls include FFT, CLASS, PLATO, principal evaluation, content knowledge and years of experience. ⋆ Reports the P-value from the joint test of significance for confer and captivate. † Reports the Kleibergen-Paap rk Wald statistic for a weak instrument test. ‡ Reports the P-value from Hansen’s J statistic test of overidentifying restrictions.
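The instrumenting strategy described in the notes can be illustrated with a deliberately simplified sketch. It builds a prior-year average of student reports by teacher and student gender, then computes a just-identified instrumental variables slope with no controls. Two assumptions should be flagged: matching each student to the prior-year average for students of the same gender is one reading of the notes, and the reported specifications include controls and fixed effects and, given the overidentification test, use more instruments than this single-instrument sketch. All names and data are hypothetical.

```python
from collections import defaultdict


def prior_year_instrument(prior_rows, current_keys):
    """Average prior-year reports by (teacher, gender).

    prior_rows: iterable of (teacher, gender, report) from the prior year.
    current_keys: list of (teacher, gender) for current students.
    Returns one instrument value per current student, or None if the teacher
    had no prior-year students of that gender.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for teacher, gender, report in prior_rows:
        sums[(teacher, gender)] += report
        counts[(teacher, gender)] += 1
    return [sums[k] / counts[k] if counts.get(k) else None for k in current_keys]


def iv_slope(y, x, z):
    """Just-identified IV estimate: cov(z, y) / cov(z, x)."""
    n = len(y)
    mean_y, mean_x, mean_z = sum(y) / n, sum(x) / n, sum(z) / n
    czy = sum((zi - mean_z) * (yi - mean_y) for zi, yi in zip(z, y))
    czx = sum((zi - mean_z) * (xi - mean_x) for zi, xi in zip(z, x))
    return czy / czx


# Hypothetical prior-year Captivate reports for one teacher.
prior = [("t1", "M", 4.0), ("t1", "M", 3.0), ("t1", "F", 5.0)]
current = [("t1", "M"), ("t1", "F"), ("t2", "M")]
print(prior_year_instrument(prior, current))
```

Using reports from a different cohort means that the instrument varies with the teacher's practice toward boys and girls but not with the current student's own idiosyncratic perception, which is the stated rationale for instrumenting.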
Table A.8: Gender Differentiation: Student Evaluation-based Practices and ELA (N=3072)
Columns: (1) No 7C; (2) 7Cs; (3) Clarify; (4) Care; (5) Captivate; (6) Challenge; (7) Confer; (8) Captivate + Confer
                             (1)        (2)        (3)        (4)        (5)        (6)        (7)        (8)
7C                                    0.374     -0.025      0.028      0.907      0.398**    0.369
                                    (0.277)    (0.452)    (0.169)    (0.674)    (0.202)    (0.382)
Captivate                                                                                                0.167
                                                                                                       (0.310)
Confer                                                                                                   0.243
                                                                                                       (0.339)
Male                      -0.071***  -0.027     -0.074     -0.067**    0.091     -0.066***  -0.020      0.002
                         (0.024)    (0.044)    (0.062)    (0.028)    (0.118)    (0.025)    (0.058)    (0.071)
F-Stat†                               0.928      0.476      5.232      1.245      4.734      0.991      0.565
Hansen J‡                             0.087      0.044      0.057      0.423      0.139      0.496      2.364
J P-Value‡                            0.768      0.834      0.812      0.515      0.710      0.481      0.307
Joint Significance of Captivate and Confer P-Value⋆
                                                                                                         0.625
Notes: *** denotes significance at the 1%, ** at the 5% and * at the 10% levels. Standard errors are clustered at the randomization-block level. Each column includes a separate regression with ELA as the dependent variable and the 7C’s measure indicated in the header as a key independent variable. The regressions instrument the student report of the relevant 7C domain with the average of that measure for boys and girls who had the same teacher in the prior year. All regressions control for randomization block fixed effects and student lagged achievement (ELA and math), age, gender, race, English Language Learner (ELL) status, gifted status, free and reduced price lunch (FRPL) status, and indicator variables showing if any of the previous variables were missing (including the teacher practice). Classroom characteristic controls include average lagged peer achievement (both math and ELA), average age, % male, % black, % hispanic, % race other, % gifted, % ELL, and % FRPL. ⋆ Reports the P-value from the joint test of significance for confer and captivate. † Reports the Kleibergen-Paap rk Wald statistic for a weak instrument test. ‡ Reports the P-value from Hansen’s J statistic test of overidentifying restrictions.
Table A.9: ELA Gender Gap with Student Engagement Controls (N=8236)
                             (1)        (2)        (3)        (4)        (5)        (6)        (7)        (8)        (9)
Captivate                                          0.212                 0.196                 0.282                 0.267
                                                 (0.160)               (0.166)               (0.182)               (0.197)
Confer                                             0.211**               0.211**               0.396***              0.389***
                                                 (0.107)               (0.106)               (0.144)               (0.148)
Male                      -0.071***  -0.063***  -0.012     -0.062***  -0.009     -0.067***  -0.004     -0.055***  -0.005
                         (0.015)    (0.015)    (0.024)    (0.015)    (0.025)    (0.015)    (0.030)    (0.015)    (0.028)
HW Complete                           0.085***  -0.060                                                  0.062***  -0.058
                                    (0.017)    (0.057)                                                (0.016)    (0.058)
Effort                                                      0.065***   0.012                            0.058***   0.037**
                                                          (0.008)    (0.027)                          (0.008)    (0.017)
Happy                                                                             0.026***  -0.253**    0.010     -0.248**
                                                                                (0.007)    (0.107)    (0.007)    (0.110)
Joint significance of captivate and confer P-Value⋆
                                                   0.005                 0.006                 0.014                 0.025
Joint significance of engagement controls P-Value⋆
                                                                                                        0.000      0.000
First stage F†                                     8.876                 8.757                 5.114                 4.673
J P-value‡                                         0.849                 0.870                 0.706                 0.718
Notes: *** denotes significance at the 1%, ** at the 5% and * at the 10% levels. Standard errors are clustered at the school level. Each column represents a different regression of ELA on a set of controls. All regressions control for school-grade fixed effects and student lagged achievement (ELA and math), age, gender, race, English Language Learner (ELL) status, gifted status, free and reduced price lunch (FRPL) status, and indicator variables showing if any of the previous variables were missing (including the teacher practice), average lagged peer achievement (both math and ELA), average age, % male, % black, % hispanic, % race other, % gifted, % ELL, and % FRPL. Odd columns instrument the student report of the relevant 7C domain with the average of that measure for boys and girls who had the same teacher in the prior year. ⋆ Reports the P-value from the joint test of significance for Confer and Captivate. There is a similar test for the joint significance of the three effort variables. † Reports the Kleibergen-Paap rk Wald statistic for a weak instrument test. ‡ Reports the P-value from Hansen’s J statistic test of overidentifying restrictions.