The Econometric Evaluation of Policy Design

Part III: Selection Models and the MTE

Edward Vytlacil, Yale University

Renmin University, March 2019

Lectures primarily drawing upon:

Heckman and Vytlacil (2001b), “Local Instrumental Variables”

Heckman, Vytlacil and Urzua (2006), “Understanding Instrumental Variables in Models with Essential Heterogeneity”

while also drawing on other work to a lesser extent.

Imbens and Angrist (1994) establish that IV can identify an interpretable parameter in the model with essential heterogeneity. Their parameter is instrument dependent, and whether it is interesting depends on the context.

Their parameter is a discrete approximation to the marginal gain parameter of Bjorklund and Moffitt (1987).

Their assumptions are equivalent to imposing a nonparametric selection model (Vytlacil, 2002).

These observations motivate the Heckman and Vytlacil MTE framework.

Selection Models

Heckman, Vytlacil and co-authors

Impose Nonparametric Selection Model

Goals:

Unify literature with a common set of underlying parameters interpretable across studies.

To understand relationship between selection and treatment effect heterogeneity.

Consider strategies other than linear IV, and parameters other than LATE.

Selection Model, Assumptions.

The Marginal Treatment Effect

Identification Analysis

Estimation Analysis

Interpreting the Instrumental Variables Estimand

Testing for Essential Heterogeneity

Applications

Will continue to cover additional topics related to the MTE framework in the next lecture.

Model for outcomes

Y1 = µ1(X, U1),   (1.1)

Y0 = µ0(X, U0).

X are observed and (U1, U0) are unobserved by the analyst.

Allow X to be statistically dependent on (U0, U1).

∆ = Y1 − Y0 = µ1(X, U1) − µ0(X, U0)   (Treatment Effect)

Model for outcomes

A special case that links our analysis to standard models in econometrics:

Y1 = Xβ1 + U1 and

Y0 = Xβ0 + U0; so

∆ = X (β1 − β0) + (U1 − U0).

In the case of separable outcomes, heterogeneity in ∆ arises because in general U1 ≠ U0 and people differ in their X.

We will not require linearity or any parametric form on µ1(X), µ0(X), but linearity is useful in practice for estimation, to increase precision and allow for high-dimensional X.

Conditioning on Observed Covariates

For purposes of identification, will conduct analysis conditional on Xi.

For now, I will suppress conditioning on Xi, leaving implicit conditioning on Xi.

We will make conditioning on Xi explicit when discussing estimation issues.

Threshold Crossing Model for Di

Heckman-Vytlacil impose:

Di = 1 [D∗i > 0] ,

D∗i = µD(Zi)− Vi ,

Zi ⊥⊥ Vi ,

Zi observed random vector,

Vi continuous, unobserved random variable.

µD(Zi) − Vi can be interpreted as a net utility for a person with characteristics (Zi, Vi).

Additive Separability in Threshold Crossing Model for D

D = 1 [µD(Z )− V > 0] .

Separability between V and Z in latent index of choice equation is conventional.

Plays a critical role in the properties of instrumental variable estimators in models with essential heterogeneity.

Wider class of latent index models will have a representation in this form (Vytlacil, 2006).

Vytlacil (2002) shows that the independence and monotonicity assumptions of Imbens and Angrist (1994) are equivalent to this model with Zi ⊥⊥ Vi.

Propensity Score, Normalization

Define P(z) as the propensity score:

P(z) = Pr(Di = 1 | Zi = z) = Pr(µD(z) > Vi) = FV(µD(z)).

As normalization, can then rewrite model as:

Di = 1 [µD (Zi)− Vi ≥ 0]

= 1 [FV (µD (Zi)) ≥ FV (Vi)]

= 1 [P (Zi) ≥ UDi ] ,

with UDi ≡ FV (Vi) ∼ Unif[0, 1].
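The normalization above can be checked numerically. The following is a minimal sketch, assuming V is standard normal and a hypothetical index µD(z) = 0.5z (neither choice comes from the lecture). It confirms that the index rule and the normalized rule give the same decisions and that UD = FV(V) behaves like a Unif[0, 1] draw.

```python
import random
from statistics import NormalDist

# Hypothetical setup (not from the lecture): V ~ N(0, 1) and mu_D(z) = 0.5 * z.
std_normal = NormalDist()          # F_V is the standard normal cdf here
rng = random.Random(0)


def mu_D(z):
    return 0.5 * z


agree = True
u_draws = []
for _ in range(20_000):
    z = rng.uniform(-3.0, 3.0)     # instrument, drawn independently of V
    v = rng.gauss(0.0, 1.0)        # unobserved resistance to treatment
    d_index = mu_D(z) - v >= 0     # D = 1[mu_D(Z) - V >= 0]
    p = std_normal.cdf(mu_D(z))    # P(Z) = F_V(mu_D(Z))
    u_d = std_normal.cdf(v)        # U_D = F_V(V)
    d_normalized = p >= u_d        # D = 1[P(Z) >= U_D]
    agree = agree and (d_index == d_normalized)
    u_draws.append(u_d)

mean_u = sum(u_draws) / len(u_draws)
share_below_quarter = sum(u < 0.25 for u in u_draws) / len(u_draws)
print(agree, round(mean_u, 3), round(share_below_quarter, 3))
```

Because FV is strictly increasing, applying it to both sides of the inequality leaves the treatment decision unchanged; only the scale of the unobservable changes.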


Threshold Crossing Model as Reduced Form for Roy Model

Special Case: Generalized Roy model:

D = 1[Y1 − Y0 − C > 0],

Outcomes: Y1 = µ1(X ) + U1,

Y0 = µ0(X ) + U0,

Costs: C = µC (W ) + UC ,

Z = (X ,W ) .

Implying the following reduced form:

D = 1[µD(Z) − V ≥ 0],

µD(Z) = µ1(X) − µ0(X) − µC(W),

V = −(U1 − U0 − UC).
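As a small illustration of the reduced form, the sketch below simulates a Generalized Roy model and compares the structural choice rule with the threshold-crossing rule. The functional forms for µ1, µ0, µC and the normal errors are hypothetical, chosen only for illustration. A strict inequality is used in both rules; the tie case has probability zero with continuous errors.

```python
import random

rng = random.Random(1)


# Hypothetical functional forms, chosen only for illustration.
def mu_1(x):
    return 1.0 + 0.5 * x


def mu_0(x):
    return 0.2 * x


def mu_C(w):
    return 0.8 * w


def mu_D(x, w):
    # mu_D(Z) = mu_1(X) - mu_0(X) - mu_C(W), with Z = (X, W)
    return mu_1(x) - mu_0(x) - mu_C(w)


mismatches = 0
for _ in range(10_000):
    x, w = rng.gauss(0, 1), rng.gauss(0, 1)
    u1, u0, uc = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
    y1, y0, c = mu_1(x) + u1, mu_0(x) + u0, mu_C(w) + uc
    d_structural = y1 - y0 - c > 0      # D = 1[Y1 - Y0 - C > 0]
    v = -(u1 - u0 - uc)                 # V = -(U1 - U0 - UC)
    d_reduced = mu_D(x, w) - v > 0      # D = 1[mu_D(Z) - V > 0]
    mismatches += d_structural != d_reduced

print("decisions that differ:", mismatches)
```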


Threshold Crossing Model Nests Parametric Examples: Probit Model

Special Case: D determined by Probit Model

D = 1 [µD(Z )− V > 0].

Probit Model special case with

µD(Z ) = γZ ,

V ∼ N(0, 1).

⇒ P(z) = Φ(γz), with Φ(·) the standard normal cdf,

⇒ D = 1[γZ − V ≥ 0] = 1[Φ(γZ) − Φ(V) ≥ 0].

If Yj = Xβj + Uj and (V, U0, U1) are jointly normal, then this is the classic Heckman normal selection model (see, e.g., Heckman (1978)).

If Yj = 1[Xβj + Uj ≥ 0] and (V, U0, U1) are jointly normal, then this is the classic bivariate probit model with structural shift.

Threshold Crossing Model Nests Parametric Examples: Logit Model

Logit Model special case with

µD(Z ) = γZ ,

V ∼ standard logistic .

⇒ P(z) = Λ(γz), with Λ(·) the standard logistic cdf,

⇒ D = 1[γZ − V ≥ 0] = 1[Λ(γZ) − Λ(V) ≥ 0].

We will not impose any parametric functional form or distributional assumption for identification theory, but such assumptions may be useful in practice to increase precision and allow for multidimensional Z.

See, e.g.,

Heckman, Tobias, and Vytlacil (2003)

Aakvik, Heckman and Vytlacil (2005).

Key, unifying parameter: Marginal Treatment Effect (MTE):

MTE(u) = E(Y1 − Y0 | UD = u).

MTE and the local average treatment effect (LATE) parameter are closely related (MTE can be seen as limit of LATE parameters).

Generalizes the marginal gain parameter of Bjorklund and Moffitt (1987).

How MTE(u) varies with u uncovers relationship between selection and treatment effect heterogeneity.

Provides average effect at given quantile of unobserved resistance to treatment: since D = 1[P(Z) ≥ UD], a low u corresponds to a high unobserved desire for treatment.

Figure: MTE for Effect of DI on Employment
From Maestas, Mullen and Strand (2013, AER)

Excerpt from the paper page shown on the slide:

the predicted probability of SSDI receipt. Specifically, we regress initial allowance decisions on indicators for type of impairment, age group, decision month, and DDS, as well as a measure of average prior earnings, and construct the residual, Z, which by construction is orthogonal to the case mix controls and varies systematically only with EXALLOW. Then we estimate a probit of ultimate SSDI receipt on the residualized Z. This is our measure of the predicted probability of SSDI receipt, P(Z). Next we estimate a local quadratic regression of employment on predicted SSDI receipt and compute the numerical derivative of this function to estimate ∂E[y]/∂P(Z).

Figure 7 shows the MTE as a function of unobserved severity, where severity is reverse ordered and measured in percentiles (see definition of u in Section IVA), along with boot-strapped 95 percent confidence intervals. Applicants on the margin for an examiner with a predicted SSDI receipt rate of 65 percent (the mean rate) are in the sixty-fifth percentile of the unobserved (reverse) severity distribution. That is, they have an impairment that is less severe than 65 percent of applicants, and more severe than 35 percent of applicants. Since we estimate that 57 percent of applicants are always takers (that is, they would receive SSDI benefits regardless of initial examiner assignment), the MTE is not identified for applicants on the margin of SSDI receipt rates less than 57 percent. Similarly, the MTE is not identified for applicants on the margin of SSDI receipt rates greater than 80 percent (= 57 + 23, the fraction of marginal applicants). As a result, we are only able to trace the MTE for applicants between the fifty-seventh and eightieth percentiles of the unobserved (reverse) severity distribution (or the twentieth to forty-third percentiles of the actual unobserved severity distribution s). The estimates become imprecise at the more extreme ends of the distribution since there are relatively small numbers of examiners with margins at these points.

[Figure 7. Marginal Treatment Effect on Employment. Horizontal axis: percentile of (reverse) unobserved severity distribution, 0.55 to 0.8. Notes: Ninety-five percent confidence intervals shown with dashed lines. Bandwidth is 0.084. Source: DIODS data for 2005 and 2006.]

Source: Maestas, Mullen and Strand, “Does Disability Insurance Receipt Discourage Work? Using Examiner Assignment to Estimate Causal Effects of SSDI Receipt.” (2013, AER)

Figure: MTE for Effect of Norwegian VR Training on Employment
From Aakvik, Heckman and Vytlacil (2005, JOE)

Excerpt from the paper page shown on the slide:

7.3. Heterogeneity in observables

The estimated treatment effect vary substantially with observed characteristics. For example, the variance of E(∆|X) is 0.0064 (standard error = 0.08), compared to its mean of −0.014. The variance of E(∆|X, D = 1) is 0.0085 (standard error = 0.092) compared to its mean of −0.11. The degree to which the treatment effect varies with observable characteristics can also be seen by studying the marginal effect of each observable characteristic on the expected treatment effect. The marginal effects on the treatment parameters are reported in Table 4. For example, being older, having lower pre-program income, having lower spouse’s income, and having young children are all associated with a larger treatment effect for all definitions of mean treatment effects. We develop this point further after we analyze distributional treatment parameters.

7.4. Estimated distributional treatment parameters

The distributional treatment effect parameters capture an additional type of treatment effect heterogeneity beyond that previously discussed for mean treatment effects. We now report estimates of the distributional treatment parameters. Table 5 reports the distributional versions of ATE, TT, and MTE evaluated at selected values of UD. We find that if a random applicant is assigned to training, with probability 0.225 the applicant benefits from the training, that is, will be employed after receiving the training but would have been unemployed without the training. However, with probability 0.24 the applicant will be hurt by receiving the training,

[Fig. 1. Estimated marginal treatment effect.]

Source: Aakvik, Heckman and Vytlacil, “Estimating treatment effects for discrete outcomes when responses to treatment vary: an application to Norwegian vocational rehabilitation programs.” (2005, JOE)

Figure: MTE for Effect of Additional Child on Years of Schooling of First Born
From Brinch, Mogstad, and Wiswall (JPE, 2017)

[Figure 5: MTE estimates with same-sex instrument. Horizontal axis: p, from 0.2 to 0.7. Vertical axis: Y1 − Y0 | p.]

Note: This figure displays the MTE estimates from the semiparametric generalized Roy model based on Assumptions 1 and 2, with “Same sex, first and second” as instrument. We construct P(Z) using the parameter estimates from the logit model with average derivatives reported in Table 2. We use the same specification for the covariates as reported in Table 2. The MTE estimates are based on double residual regression separately for the treated and non-treated, using local quadratic regression with rectangular kernel and bandwidth of 0.055. The 95 percent confidence interval is computed from a non-parametric bootstrap with 100 bootstrap replications. The y-axis measures the value of the MTE in years of schooling, whereas the x-axis represents the unobserved component of parents' net gain from having 3 or more children rather than 2 children. A high value of p means that a family is less likely to have 3 or more children.

Source: “Beyond LATE with a Discrete Instrument: Heterogeneity in the Quantity-Quality Interaction of Children”, by Brinch, Mogstad, and Wiswall (JPE, 2017).

Connection: LATE and MTE

LATE and MTE are closely connected.

For (z, z′) such that P(z) > P(z′), LATE is:

\[
\begin{aligned}
E(Y_1 - Y_0 \mid D(z) = 1, D(z') = 0) &= E(Y_1 - Y_0 \mid P(z') < U_D \le P(z)) \\
&= \frac{1}{P(z) - P(z')} \int_{P(z')}^{P(z)} \Delta^{\mathrm{MTE}}(u)\, du,
\end{aligned}
\]

using that UD ∼ Unif[0, 1] and

D(z) = 1 ⇔ UD ≤ P(z),

D(z′) = 0 ⇔ UD > P(z′).
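The relationship between LATE and MTE can be sketched numerically. The MTE curve below, MTE(u) = 1 − 2u, is a hypothetical example and not an estimate from any of the cited papers; `integrate` and `late` are illustrative helper names.

```python
def integrate(f, a, b, n=1000):
    """Midpoint-rule approximation to the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def mte(u):
    # Hypothetical MTE curve: gains fall as resistance to treatment u rises.
    return 1.0 - 2.0 * u


def late(p_low, p_high):
    """Average of MTE(u) over P(z') < u <= P(z), where p_low = P(z') < p_high = P(z)."""
    return integrate(mte, p_low, p_high) / (p_high - p_low)


wide = late(0.2, 0.6)       # compliers between two well-separated propensity scores
narrow = late(0.49, 0.51)   # a narrow interval around u = 0.5
print(wide, narrow, mte(0.5))
```

As the two propensity scores approach each other, the LATE approaches the MTE at that point, which is the sense in which MTE is a limit of LATE parameters.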


Marginal Treatment Effect (MTE):

MTE (u) = E (Y1 − Y0 | UD = u).

Many treatment effect parameters can be represented as weighted averages of MTE.

Broad class of policy counterfactuals can be represented as weighted averages of MTE.

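Table: A. Treatment effects and IV estimands as weighted averages of MTE

\[
\begin{aligned}
\mathrm{ATE} &= E(Y_1 - Y_0) = \int_0^1 \mathrm{MTE}(u_D)\, du_D \\
\mathrm{TT} &= E(Y_1 - Y_0 \mid D = 1) = \int_0^1 \mathrm{MTE}(u_D)\, \omega_{\mathrm{TT}}(u_D)\, du_D \\
\mathrm{TUT} &= E(Y_1 - Y_0 \mid D = 0) = \int_0^1 \mathrm{MTE}(u_D)\, \omega_{\mathrm{TUT}}(u_D)\, du_D \\
\mathrm{PRTE} &= E(Y_{a'}) - E(Y_a) = \int_0^1 \mathrm{MTE}(u_D)\, \omega_{\mathrm{PRTE}}(u_D)\, du_D \\
\mathrm{IV}_J &= \int_0^1 \mathrm{MTE}(u_D)\, \omega^{J}_{\mathrm{IV}}(u_D)\, du_D, \quad \text{given instrument } J(Z)
\end{aligned}
\]

B. Weights

\[
\begin{aligned}
\omega_{\mathrm{ATE}}(u_D) &= 1 \\
\omega_{\mathrm{TT}}(u_D) &= \frac{1 - F_{P(Z)}(u_D)}{E(P(Z))} \\
\omega_{\mathrm{TUT}}(u_D) &= \frac{F_{P(Z)}(u_D)}{E(1 - P(Z))} \\
\omega_{\mathrm{PRTE}}(u_D) &= \frac{F_{P_{a'}}(u_D) - F_{P_a}(u_D)}{\Delta P}, \quad \text{where } \Delta P = E(P_a) - E(P_{a'}) \\
\omega^{J}_{\mathrm{IV}}(u_D) &= \frac{E\big(J(Z) - E(J(Z)) \mid P(Z) > u_D\big) \cdot \Pr(P(Z) > u_D)}{\mathrm{Cov}(J(Z), D)}
\end{aligned}
\]

The weights in the table all integrate to one.

Weights for ATE, TT, and TUT will be nonnegative.

Weights for IV will be nonnegative if J(Z) is a monotonic function of P(Z), but need not be otherwise.

Policy Relevant Treatment Effect (PRTE) is effect of policy counterfactual that changes incentives for treatment with resulting change in distribution of fitted probabilities from FPa to FPa′.

Source: Heckman and Vytlacil (2005).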
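The ATE, TT and TUT weights can be evaluated directly once a distribution for P(Z) and an MTE curve are given. The sketch below is illustrative only: the four-point distribution of P(Z) and the linear MTE are hypothetical inputs, and the code shows the weighted-average representation, not an identification or estimation procedure.

```python
def integrate(f, n=1000):
    """Midpoint rule on [0, 1]."""
    h = 1.0 / n
    return h * sum(f((i + 0.5) * h) for i in range(n))


# Hypothetical inputs: four equally likely propensity-score values and a linear MTE.
p_values = [0.2, 0.4, 0.6, 0.8]


def mte(u):
    return 1.0 - 2.0 * u


def cdf_p(u):
    """F_P(u) = Pr(P(Z) <= u) for the discrete distribution above."""
    return sum(p <= u for p in p_values) / len(p_values)


mean_p = sum(p_values) / len(p_values)


def w_ate(u):
    return 1.0


def w_tt(u):
    return (1.0 - cdf_p(u)) / mean_p


def w_tut(u):
    return cdf_p(u) / (1.0 - mean_p)


results = {}
for name, w in [("ATE", w_ate), ("TT", w_tt), ("TUT", w_tut)]:
    total_weight = integrate(w)                       # should be close to one
    parameter = integrate(lambda u: mte(u) * w(u))    # weighted average of MTE
    results[name] = (total_weight, parameter)
    print(name, round(total_weight, 6), round(parameter, 6))
```

With a declining MTE, TT puts more weight on low values of uD and exceeds ATE, while TUT puts more weight on high values and falls below it.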

Figure 4: Weights on MTE for Alternative Parameters (Hypothetical Example)

[Plot omitted: weights ω(uD) and the MTE plotted against uD on [0, 1].]

Source: Heckman and Vytlacil (2005)

Figure: Weights on MTE for Empirical Example (Brinch et al.)

Weights on MTE for Alternative Parameters; Weights on MTE for Alternative Instruments; Weights on MTE for Alternative Instruments, P(Z)

[Figure 3: Weight of MTE for treatment effects parameters and instruments. Horizontal axis in each panel: p, from 0 to 1.
(a) ATT, ATUT, and ATE: weights for ATT, weights for ATUT, weights for ATE.
(b) IV with Z− as instrument: both instruments, samesex only, twins only.
(c) IV with P(Z) as instrument: both instruments, samesex only, twins only.]

Note: The upper panel graphs MTE weights associated with the average treatment effect on the treated (ATT), the average treatment effect (ATE), and the average treatment effect on the untreated (ATUT). The middle panel (Z− as instrument) and lower panel (P(Z) as instrument) graph MTE weights associated with the IV estimates presented in Table 3. To compute the weights, we use the weight formulas described in the Appendix. The y-axis measures the density of the distribution of weights, whereas the x-axis represents the unobserved component of parents' net gain from having 3 or more children rather than 2 children. A high value of p means that a family is less likely to have 3 or more children.

Source: “Beyond LATE with a Discrete Instrument: Heterogeneity in the Quantity-Quality Interaction of Children”, by Brinch, Mogstad, and Wiswall (JPE, 2017).

Identification of the MTE

Different parameters can be seen as different weighted averages of MTE; IV is also a weighted average of MTE.

If we can identify MTE, we can:

1 Integrate MTE to obtain other parameters of interest

2 Understand connection between selection into treatment and individual effects.

How to identify MTE?

Identification of the MTE (cont’d)

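\[
\begin{aligned}
E(Y \mid P(Z) = p) &= E(D Y_1 + (1 - D) Y_0 \mid P(Z) = p) \\
&= E(Y_0) + E(D (Y_1 - Y_0) \mid P(Z) = p) \\
&= E(Y_0) + E(Y_1 - Y_0 \mid D = 1, P(Z) = p) \cdot \Pr(D = 1 \mid P(Z) = p) \\
&= E(Y_0) + E(Y_1 - Y_0 \mid U_D \le p) \cdot p \\
&= E(Y_0) + \int_0^p E(Y_1 - Y_0 \mid U_D = u)\, du.
\end{aligned}
\]

\[
\Rightarrow \underbrace{\frac{\partial}{\partial p} E(Y \mid P(Z) = p)}_{\text{LIV}} = \underbrace{E(Y_1 - Y_0 \mid U_D = p)}_{\text{MTE}}
\]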

Identification of the MTE (cont’d)

LIV (Local Instrumental Variables) identifies MTE

\[
\underbrace{\frac{\partial}{\partial p} E(Y \mid P(Z) = p)}_{\text{LIV}} = \underbrace{E(Y_1 - Y_0 \mid U_D = p)}_{\text{MTE}}. \tag{2.1}
\]

Suppose P(Z) is continuous (requires at least one component of Z to be continuous). Then ∆MTE(u) is identified by LIV for u ∈ Supp(P(Z)).

The greater the variation in P(Z), the greater the range over which MTE is identified.

Using MTE for Identification of Treatment Effects

\[
\text{Treatment Parameter}(j) = \int_0^1 \Delta^{\mathrm{MTE}}(u)\, \omega_j(u)\, du.
\]

Identification using this relationship requires identification of ∆MTE(u) for u such that ωj(u) ≠ 0.

We identify ∆MTE(u) for u ∈ Supp(P(Z)) (supposing P(Z) continuous).

Thus, to integrate MTE to identify treatment parameter, require Supp(P(Z)) ⊇ {u : ωj(u) ≠ 0}.

Strong requirement for traditional treatment parameters, typically “identification at infinity” requirement.

Using MTE for Identification of Treatment Effects (cont’d)

To integrate MTE to identify treatment parameter, require Supp(P(Z)) ⊇ {u : ωj(u) ≠ 0}. For example:

For ATE, need Supp(P(Z)) = [0, 1],
For TT, need Supp(P(Z)) = [0, pu],
For TUT, need Supp(P(Z)) = [pl, 1].

Even with identification at infinity, estimation of traditional parameters involves estimation on thin sets, slow rate of convergence.

Same issue as Andrews and Schafgans (1998).
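The support requirements above can be written as a simple check. This is a sketch under the assumption that Supp(P(Z)) is an interval [p_min, p_max]; the function name and the tolerance argument are hypothetical.

```python
def point_identified_by_integrating_mte(parameter, p_min, p_max, tol=1e-12):
    """Check the support condition, assuming Supp(P(Z)) is the interval [p_min, p_max]."""
    reaches_zero = p_min <= tol
    reaches_one = p_max >= 1.0 - tol
    if parameter == "ATE":      # weights are nonzero on all of [0, 1]
        return reaches_zero and reaches_one
    if parameter == "TT":       # weights are nonzero on [0, p_max]
        return reaches_zero
    if parameter == "TUT":      # weights are nonzero on [p_min, 1]
        return reaches_one
    raise ValueError(f"unknown parameter: {parameter}")


# Example: support [0, 0.8] is enough for TT but not for ATE or TUT.
for name in ("ATE", "TT", "TUT"):
    print(name, point_identified_by_integrating_mte(name, 0.0, 0.8))
```

In applied work the estimated propensity score rarely reaches 0 or 1, which is why the lecture describes these conditions as strong.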


Can identify without integrating MTE to obtain parameter under slightly weaker conditions, but still require identification at infinity without imposing more structure.

For example, for ATE, require Supp(P(Z)) ⊇ {0, 1}, instead of Supp(P(Z)) = [0, 1].

Can follow bounding/partial-identification approach if support of P(Z) does not allow point identification.

Partial Identification

Partial identification analysis for traditional treatment parameters developed by Heckman and Vytlacil (2001).

Suppose potential outcomes are bounded, Pr[yl ≤ Yj ≤ yu] = 1, j = 0, 1.

They develop sharp bounds on traditional treatment parameters.

Width of bounds on ATE depends linearly on distance of maximum propensity score from one and minimum propensity score from zero.

Width of bounds on TT depends linearly on distance of minimum propensity score from zero.

Relation to Balke and Pearl (1997)?

See also Mogstad, Santos and Torgovitsky (2018).
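The linear dependence of the ATE bound width on the propensity-score extremes can be illustrated with a short sketch. It assumes bounds of the following form: E(Y1) is bounded using the treated at the maximum propensity score and E(Y0) using the untreated at the minimum propensity score, with the unobserved conditional means replaced by yl or yu. This is one way to write bounds consistent with the description above; the function name, argument names and input numbers are hypothetical.

```python
def ate_bounds(p_min, p_max, mean_y_treated_at_pmax, mean_y_untreated_at_pmin,
               y_low, y_high):
    """Bounds on ATE = E(Y1) - E(Y0) when outcomes lie in [y_low, y_high]."""
    # E(Y1) = p_max * E(Y | D=1, P=p_max) + (1 - p_max) * (unobserved mean in [y_low, y_high])
    ey1_low = p_max * mean_y_treated_at_pmax + (1 - p_max) * y_low
    ey1_high = p_max * mean_y_treated_at_pmax + (1 - p_max) * y_high
    # E(Y0) = (1 - p_min) * E(Y | D=0, P=p_min) + p_min * (unobserved mean in [y_low, y_high])
    ey0_low = (1 - p_min) * mean_y_untreated_at_pmin + p_min * y_low
    ey0_high = (1 - p_min) * mean_y_untreated_at_pmin + p_min * y_high
    return ey1_low - ey0_high, ey1_high - ey0_low


# Hypothetical inputs for a binary outcome (y_low = 0, y_high = 1).
lower, upper = ate_bounds(0.1, 0.9, 0.7, 0.4, 0.0, 1.0)
width = upper - lower
print(lower, upper, width)
```

In this form the width equals (yu − yl) · [(1 − p_max) + p_min], so it shrinks to zero only when the propensity score reaches both 0 and 1.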

Partial Identification of MTE

Heckman, Li, Oka and Vytlacil (2017): “Identification of Treatment Effects Under Discrete Variation in the Propensity Score.”

Suppose that

potential outcomes are bounded: Pr[yl ≤ Yj ≤ yu] = 1, j = 0, 1,
know a priori that the MTE is a (weakly) monotonic function, know the direction of the monotonicity.

Develop sharp bounds on MTE for any given support of the propensity score, including the propensity score being a discrete random variable.

Bounds are nontrivial, even if the distribution of the propensity score is degenerate.

Without imposing monotonicity on MTE, and without imposing other assumptions, the sharp bounds on MTE can be trivial.

Without large support, can still:

1 Bound conventional parameters as discussed above.

2 Understand treatment effect for some groups of individuals, and understand part of the connection between selection and individual effects, by examining MTE over identified values.

3 We will show that one can still nonparametrically identify average effect for those on margin of indifference, and effect of marginal policy changes, without large support requirements.

Alternative Parameters of Interest: AMTE, MPRTE.

Carneiro, Heckman and Vytlacil (2010 ECMA, 2011 AER) consider following parameters:

Average Marginal Treatment Effect (AMTE):

E(Y1 − Y0 | P(Z) = UD).

Marginal Policy Relevant Treatment Effect (MPRTE)

Consider policy counterfactuals that change incentives for treatment in particular direction. Consider limit of such policy counterfactuals for infinitesimal change in incentives.

The AMTE parameter is not uniquely defined (Borel Paradox).

CHV show that alternative definitions of AMTE are equivalent to alternative versions of MPRTE corresponding to infinitesimal policy changes in alternative directions.

Change Parameter of Interest: AMTE, MPRTE

CHV show that

AMTE and MPRTE parameters can be written as weighted averages of MTE, with weights that depend on particular definition of AMTE, equivalently, depend on direction of marginal policy change.

Identification of AMTE, MPRTE parameters depends on having a continuous instrument, but does not otherwise depend on support of P(Z). No need for identification at infinity.

Nonparametric estimation of AMTE, MPRTE parameters fundamentally easier than estimation of traditional treatment parameters, can be consistently estimated at √N-rate.

Estimation

Possible parameters of interest:

1 The MTE function itself, E [Y1 − Y0 | UD = u].

2 Some parameter that is a functional of MTE,

1 Traditional parameters, e.g., ATE, TT, etc.
2 Non-traditional parameters, e.g., AMTE, MPRTE, etc.
3 Probability limit of IV, interpret.

First consider estimation of MTE.

Nonparametric Estimation of MTE

\[
\underbrace{\frac{\partial}{\partial p} E(Y \mid P(Z) = p)}_{\text{LIV}} = \underbrace{E(Y_1 - Y_0 \mid U_D = p)}_{\text{MTE}}
\]

Suppose no need to condition on other covariates for instruments to be valid (or at least no other continuous covariates).

Suppose at least one element of Z is continuous, and resulting P(Z) is continuous.

Then can nonparametrically estimate ∂/∂p E(Y | P(Z) = p), for example, through local polynomial regression of Y on P(Z), with P(Z) estimated in a first step.
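A minimal sketch of this estimator on simulated data follows. Several simplifications are assumptions of the sketch rather than part of the lecture: the propensity score is treated as known instead of estimated in a first step, the data-generating process has MTE(u) = 1 − 2u, the kernel is rectangular, and the bandwidth is fixed at 0.2 with no data-driven choice. The helpers `simulate`, `solve3` and `liv` are hypothetical names.

```python
import random

rng = random.Random(2)


def simulate(n):
    """Hypothetical design with MTE(u) = 1 - 2u and P(Z) uniform on (0, 1)."""
    data = []
    for _ in range(n):
        p = rng.random()                    # propensity score, treated as known
        u_d = rng.random()                  # U_D ~ Unif[0, 1], independent of Z
        d = 1 if u_d <= p else 0            # D = 1[P(Z) >= U_D]
        y0 = rng.gauss(0.0, 0.1)
        y1 = y0 + 1.0 - 2.0 * u_d           # gain Y1 - Y0 depends on U_D
        data.append((p, d * y1 + (1 - d) * y0))
    return data


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (m[r][3] - sum(m[r][c] * x[c] for c in range(r + 1, 3))) / m[r][r]
    return x


def liv(data, p0, h):
    """Slope at p0 from a local quadratic fit of Y on (P - p0), rectangular kernel."""
    s = [0.0] * 5      # s[k] = sum of (P - p0)^k over the window
    t = [0.0] * 3      # t[k] = sum of (P - p0)^k * Y over the window
    for p, y in data:
        dp = p - p0
        if abs(dp) <= h:
            for k in range(5):
                s[k] += dp ** k
            for k in range(3):
                t[k] += dp ** k * y
    normal_matrix = [[s[i + j] for j in range(3)] for i in range(3)]
    return solve3(normal_matrix, t)[1]   # coefficient on (P - p0) = d/dp E(Y | P = p0)


data = simulate(40_000)
estimates = {p0: liv(data, p0, h=0.2) for p0 in (0.3, 0.5, 0.7)}
for p0, est in estimates.items():
    print(p0, round(est, 2), "true MTE:", round(1.0 - 2.0 * p0, 2))
```

The estimate at each evaluation point uses only observations with P(Z) near that point, which is why MTE can only be traced out over the support of P(Z).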


Explosion of recent analysis using as instruments judges’/administrators’ proclivity to assign treatment.

Resulting instruments are

plausibly exogenous without conditioning on additional covariates,
approximately continuous.

Concerns with estimation, inference?

Such papers often nonparametrically estimate ∂/∂p E(Y | P(Z) = p).

Often need to condition on X for Z to be plausibly exogenous:

\[
\underbrace{\frac{\partial}{\partial p} E(Y \mid P(Z) = p, X = x)}_{\text{LIV}} = \underbrace{E(Y_1 - Y_0 \mid X = x, U_D = p)}_{\text{MTE}}
\]

Z understood to possibly include some or all elements of X.

Z | X varies due to elements of Z that are not elements of X; those elements are the instruments.

In theory, can still nonparametrically estimate ∂/∂p E(Y | P(Z) = p, X). However, . . .

Problem: Curse of Dimensionality

If X contains continuous elements, especially multiple continuous elements, pointwise estimation of E(Y | P(Z) = p, X = x) will be very poor.

Formally: very slow rate of convergence. Expect large bias and high imprecision in finite samples. Expect asymptotics to be poor guide.

Pointwise estimation of derivative of E(Y | P(Z) = p, X = x) should be even more difficult.

All of above problems, but more so.

Additional Issue: Often Z discrete, or discrete variation conditional on X, in which case cannot nonparametrically identify (much less estimate) ∂/∂p E(Y | P(Z) = p, X = x).

Estimation of MTE

Options if continuous X and/or Z discrete:

1 Impose semiparametric structure.

For example, impose linear regression model on Y1, Y0, resulting in semiparametric, partially linear regression model for E[Y | X, P(Z)].

2 Impose parametric model on (D, Y0, Y1) | (X, Z).

For example, classical Heckman selection model with joint normality.

3 Impose parametric functional form restrictions directly on E(Y1 | X = x, UD = u), E(Y0 | X = x, UD = u).

E(Y1 | X = x, UD = u), E(Y0 | X = x, UD = u) called “Marginal Treatment Response” (MTR) functions. See Brinch, Mogstad, and Wiswall (JPE, 2017)

Nonparametric Estimation of Treatment Parameters through MTE

Estimation of Other Treatment Parameters as a Functional of MTE:
Additional Problem: Support Problem, Irregular Estimation

To estimate MTE nonparametrically for all evaluation points, need support of P(Z) conditional on X to be full unit interval.

Requires extremely powerful instrument.

To integrate up MTE to traditional parameters, require MTE over broad support.

Traditional treatment parameters are “non-smooth” functions of MTE, expect slower than √N estimation.

Nonparametric Estimation of MTE, Treatment Parameters

Realistically, would need extremely large samples and extremely strong instruments to have nonparametric estimation of MTE and of traditional treatment parameters to be feasible, even if X is low dimensional.

What is feasible?

Estimation of average effect for those on margin of indifference, and effect of marginal policy changes, fundamentally easier than for traditional parameters.
Can estimate IV, interpret.
Can follow bounding approach.
Can incorporate some parametric functional form restrictions, follow semiparametric or parametric estimation approaches.

Semiparametric Estimation of MTE

When need to condition on covariates for validity of instruments, most common method is to follow Heckman, Vytlacil and Urzua (2006) and Carneiro, Heckman and Vytlacil (2011), and impose:

Y1 = Xβ1 + U1,

Y0 = Xβ0 + U0,

⇒ Y = Xβ0 + DX(β1 − β0) + D(U1 − U0) + U0.

Semiparametric Estimation of MTE

Assume (X, Z) ⊥⊥ (UD, U0, U1).

Y = Xβ0 + DX(β1 − β0) + D(U1 − U0) + U0,

⇒ E(Y | X, P(Z)) = Xβ0 + P(Z)X(β1 − β0) + K(P(Z)),

K(P(Z)) = E(D(U1 − U0) | P(Z))

= P(Z) E(U1 − U0 | D = 1, P(Z))

= P(Z) E(U1 − U0 | UD ≤ P(Z)).
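One further step connects K to the MTE. The following short derivation is a sketch that uses only UD ∼ Unif[0, 1] and the independence assumption stated above; the lowercase k denotes the derivative of K.

```latex
\begin{aligned}
K(p) &= p \, E(U_1 - U_0 \mid U_D \le p)
      = \int_0^p E(U_1 - U_0 \mid U_D = u)\, du, \\
k(p) &\equiv K'(p) = E(U_1 - U_0 \mid U_D = p), \\
\frac{\partial}{\partial p} E(Y \mid X = x, P(Z) = p)
     &= x(\beta_1 - \beta_0) + k(p)
      = E(Y_1 - Y_0 \mid X = x, U_D = p).
\end{aligned}
```

The first line holds because, for uniform UD, the conditional mean given UD ≤ p is the average of the pointwise conditional means over [0, p].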


Semiparametric Estimation of MTE (cont’d)

E (Y | X ,P(Z )) = Xβ0 + P(Z )X (β1 − β0) + K (P(Z )),

K (P(Z )) = P(Z )E (U1 − U0|UD ≤ P(Z )).

Thus, imposing linear model on potential outcomes results in a partially linear model for the observed outcome.

If we impose joint normality assumptions on error terms, or another joint parametric distributional assumption on error terms, then K is a known function (possibly up to finite-dimensional parameter vector), and we have a standard nonlinear parametric regression model.

Without imposing parametric distributional assumption on error terms, K(·) is an unknown, nonparametric function.

E(Y | X, P(Z)) = Xβ0 + P(Z)X(β1 − β0) + K(P(Z)),

K(·) unknown function, suggests semiparametric multistep estimation strategy.

1 Estimate P(Z) in first step, either parametrically or semi/nonparametrically (using, e.g., Ichimura 1993; Klein and Spady, 1993; Ahn, Ichimura and Powell 2004).

Most applications use a parametric model for P(Z), a probit or a logit.

2 Flexibly estimate E(Y | X, P(Z)) using estimated P(Z), for example, using:

Partial linear regression/nonparametric double residual regression techniques, as in Robinson (1988), or

Regress Y on X, P(Z)X, and a series in P(Z), adapting Das, Newey and Vella (2003), Newey, Powell and Vella (1999).

Note dimension reduction.

See also Cattaneo, Jansson, and Ma (2019), “Two-Step Estimation and Inference with Possibly Many Included Covariates,” (Review of Economic Studies, forthcoming)

Implications of Semiparametric Model

\[
E[Y \mid X = x, P(Z) = p] = x\beta_0 + p \cdot x(\beta_1 - \beta_0) + K(p),
\]

\[
E(Y_1 - Y_0 \mid X = x, U_D = p) = \frac{\partial}{\partial p} E(Y \mid X = x, P(Z) = p) = x(\beta_1 - \beta_0) + k(p),
\]

where k(p) = K′(p).

With partially linear restriction, X shifts MTE function up and down by a constant, does not change curvature.

Extrapolation based on Semiparametric Model

Nonparametrically, identify E[Y | X = x, P(Z) = p] for (x, p) ∈ Supp(X, P(Z)).

Exploiting semiparametric, partially linear structure, identify E[Y | X = x, P(Z) = p] for (x, p) ∈ Supp(X) × Supp(P(Z)).

Extrapolation based on Semiparametric Model (cont’d)

Supp(P(Z)) typically much larger than Supp(P(Z) | X).

P(Z) may be continuous even when P(Z) | X is discrete.

Semiparametric, partially linear regression thus allows considerable extrapolation, somewhat controversial.

Example: college attendance on wages for high school graduates

Table 2: Definitions of the Variables Used in the Empirical Analysis

Variable: Definition

Y: Log Wage in 1991 (average of all non-missing wages between 1989 and 1993)

S = 1: If ever Enrolled in College by 1991; zero otherwise

X: AFQT (a), Mother’s Education, Number of Siblings, Average Log Earnings 1979-2000 in County of Residence at 17, Average Unemployment 1979-2000 in State of Residence at 17, Urban Residence at 14, Cohort Dummies, Years of Experience in 1991, Average Local Log Earnings in 1991, Local Unemployment in 1991.

Z\X (b): Presence of a College at Age 14 (Card, 1993, Cameron and Taber, 2004), Local Earnings at 17 (Cameron and Heckman, 1998, Cameron and Taber, 2004), Local Unemployment at 17 (Cameron and Heckman, 1998), Local Tuition in Public 4 Year Colleges at 17 (Kane and Rouse, 1995).

Note: (a) We use a measure of this score corrected for the effect of schooling attained by the participant at the date of the test, since at the date the test was taken, in 1981, different individuals have different amounts of schooling and the effect of schooling on AFQT scores is important. We use a version of the nonparametric method developed in Hansen, Heckman, et al. (2004). We perform this correction for all demographic groups in the population and then standardize the AFQT to have mean 0 and variance 1. See Table A-2. (b) The papers in parentheses are papers that previously used these instruments.

Source: Carneiro, Heckman and Vytlacil (2009)

Figure 2: Support of P Conditional on X

[Plot omitted: surviving axis ticks run from 0.2 to 0.9.]

P is the estimated probability of going to college. It is estimated from a logit regression of college attendance on corrected AFQT, mother’s education, number of siblings, urban residence at 14, permanent earnings in the county of residence at 17, permanent unemployment in the state of residence at 17, cohort dummies, a dummy variable indicating the presence of a college in the county of residence at age 14, average log earnings in the county of residence at age 17, average state unemployment in the state of residence at age 17 (see Table 3). X corresponds to an index of variables in the outcome equation.

E(Y | X, P) as a function of P for average X

[Plot omitted. Horizontal axis: P, from 0 to 1. Vertical axis: E(Y|P).]

E(Y1 − Y0 | X, UD) estimated using locally quadratic regression (averaged over X)

[Plot omitted. Horizontal axis: UD, from 0 to 1.]

Parametric Estimation of MTE

Alternatively, can impose parametric model on (D, Y0, Y1) | (X, Z) for estimation:

Much less data intensive, reasonably precise estimation feasible with smaller sample sizes.

Naturally provides extrapolation outside of support, can estimate MTE over full unit interval and estimate all treatment parameters.

Negative: less flexible, parametric structure might be incorrect.

Parametric Examples:

Assume

Yj = Xβj + Uj , j = 0, 1,

D = 1 [Z ′γ − V > 0]

with (U0, U1, V) ⊥⊥ (X, Z), (U0, U1, V) ∼ Fθ, distribution known up to finite-dimensional unknown parameter vector θ.

Again have

E[Y | X = x, P(Z) = p] = xβ0 + p · x(β1 − β0) + Kθ(p),

E(Y1 − Y0 | X = x, UD = p) = x(β1 − β0) + kθ(p),

but with Kθ(·) and kθ(·) now known functions (up to finite-dimensional parameters θ).
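For the jointly normal case, Kθ and kθ have closed forms. The sketch below assumes the normalization Var(V) = 1, so that UD = Φ(V) and E(U1 − U0 | V = v) = Cov(U1 − U0, V) · v. The argument `cov_gain_v` stands for Cov(U1 − U0, V) and `x_index` for x(β1 − β0); both names and the numerical values are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()


def k_theta(p, cov_gain_v):
    """k_theta(p) = E(U1 - U0 | U_D = p) under joint normality with Var(V) = 1."""
    return cov_gain_v * std_normal.inv_cdf(p)


def K_theta(p, cov_gain_v):
    """K_theta(p) = integral of k_theta over [0, p] = -Cov(U1 - U0, V) * phi(Phi^{-1}(p))."""
    return -cov_gain_v * std_normal.pdf(std_normal.inv_cdf(p))


def mte(x_index, p, cov_gain_v):
    """MTE(x, p) = x(beta1 - beta0) + k_theta(p); x_index stands for x(beta1 - beta0)."""
    return x_index + k_theta(p, cov_gain_v)


# Hypothetical parameter values: a negative covariance gives an MTE that declines in p.
cov_gain_v = -0.5
for p in (0.1, 0.5, 0.9):
    print(p, round(mte(0.2, p, cov_gain_v), 3))

# Numerical check that k_theta is the derivative of K_theta at p = 0.3.
eps = 1e-5
numeric_slope = (K_theta(0.3 + eps, cov_gain_v) - K_theta(0.3 - eps, cov_gain_v)) / (2 * eps)
print(round(numeric_slope, 4), round(k_theta(0.3, cov_gain_v), 4))
```

Because the functional form is known, the MTE is defined for every p in (0, 1), which is the extrapolation outside the support of P(Z) that the parametric approach provides.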


Parametric Examples (cont’d)

Assume

Yj = Xβj + Uj , j = 0, 1,

D = 1 [Z ′γ − V > 0]

with (U0, U1, V) ⊥⊥ (X, Z), (U0, U1, V) ∼ Fθ, distribution known up to finite-dimensional unknown parameter vector θ.

If (V, U0, U1) joint normal, then classic Heckman (1978) normal selection model.

Used for estimation of MTE by, e.g., Heckman, Vytlacil and Urzua (2006), Carneiro, Heckman and Vytlacil (2011).
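To make the joint-normal case concrete, the sketch below writes out kθ(p) and Kθ(p) under stated assumptions: V is normalized to N(0, 1), UD = Φ(V), E[U0] = E[U1] = 0, and sigma_1v, sigma_0v denote Cov(U1, V) and Cov(U0, V) with V entering the index as −V. If the covariances are instead defined with respect to −V, the sign in front of Φ^{-1} flips. The function names are illustrative and not taken from any of the cited papers.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # assumption: V ~ N(0, 1), so U_D = Phi(V)

def k_theta(p, sigma_1v, sigma_0v):
    """k(p) = E[U1 - U0 | U_D = p] = (sigma_1V - sigma_0V) * Phi^{-1}(p), for 0 < p < 1."""
    return (sigma_1v - sigma_0v) * STD_NORMAL.inv_cdf(p)

def K_theta(p, sigma_1v, sigma_0v):
    """K(p) = integral of k(u) over [0, p] = -(sigma_1V - sigma_0V) * phi(Phi^{-1}(p))."""
    return -(sigma_1v - sigma_0v) * STD_NORMAL.pdf(STD_NORMAL.inv_cdf(p))

def mte(x_gain, p, sigma_1v, sigma_0v):
    """MTE(x, p) = x(beta1 - beta0) + k(p); x_gain stands in for x(beta1 - beta0)."""
    return x_gain + k_theta(p, sigma_1v, sigma_0v)

# Illustrative (made-up) parameter values.
for p in (0.1, 0.5, 0.9):
    print(p, round(mte(0.1, p, 0.2, -0.1), 4))
```

Because Φ^{-1}(p) diverges as p approaches 0 or 1, the implied MTE is unbounded at the endpoints, so any evaluation has to stay strictly inside the unit interval.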

63 / 72

Assume

Yj = Xβj + Uj , j = 0, 1,

D = 1 [Z ′γ − V > 0]

with (U0, U1, V) ⊥⊥ (X, Z), (U0, U1, V) ∼ Fθ, parametric distribution up to unknown parameter vector θ.

Tobias, Heckman and Vytlacil (2003) estimate MTE while considering other parametric distributions for (V, U0, U1).

For example, Student-t distributions (with ν degrees of freedom) instead of joint normal.

Adapts results from Lee (1982, 1983).

64 / 72

Assume

Yj = 1 [Xβj + Uj ≥ 0] , j = 0, 1,

D = 1 [Z ′γ − V > 0]

with (U0, U1, V) ⊥⊥ (X, Z), (U0, U1, V) ∼ Fθ, parametric distribution up to unknown parameter vector θ.

If (V, U0, U1) joint normal, then bivariate probit model with structural shift.

Developed for MTE by Aakvik, Heckman and Vytlacil (2005). They also consider factor-model generalizations of joint normality.

65 / 72

Example: MTE for Effect of Vocational Rehabilitation on Employment

MTE for Effect of Vocational Rehabilitation on Employment

Source: Aakvik, Heckman and Vytlacil (2005)

66 / 72

Example: Effect of Vocational Rehabilitation on Employment

7.5. Cream-skimming: the relationship between selection into the program and outcomes

A central question in the analysis of a program like VR is whether those who benefit the most from it are those most likely to participate in it. We have already noted that ATE is greater than TT; i.e., that randomly selected persons benefit more from the program than those who participate in it. This suggests that the combinations of UD and Z values that promote program participation are perversely associated with the observed and unobserved factors associated with gains from the program.

In order to determine the extent of cream-skimming on both observables and unobservables, it is necessary to relate Δ (as defined by the various means and distributional parameter analogues) to ZβD and UD. We have estimated relationships among Δ and (Xβ1, Xβ0, U1, U0), however. So the problem is how to go from the relationships we have estimated to determine the relationships between gains and ZβD and UD.

Given the factor structure model, we can easily determine how variation in UD affects U1 and U0 (see Eq. (12)). By virtue of independence assumption (iii), the factor relationship does not depend on values of ZβD, Xβ1 and Xβ0. We have used this relationship in computing Fig. 1 and in inferring that selection into the program


Table 5

Mean and distributional treatment parameters

Parameter        Mean effect (standard error)       Pr[Δ = 1 | ·]   Pr[Δ = 0 | ·]   Pr[Δ = −1 | ·]
ATE              E(Δ) = −0.014 (0.08)               0.225           0.532           0.240
TT               E(Δ | D = 1) = −0.110 (0.09)       0.178           0.534           0.288
MTE, UD = 2      E(Δ | UD = 2) = 0.224 (0.17)       0.350           0.524           0.126
MTE, UD = 0      E(Δ | UD = 0) = −0.014 (0.07)      0.219           0.549           0.233
MTE, UD = −2     E(Δ | UD = −2) = −0.255 (0.16)     0.119           0.508           0.373

The last three columns give the distributional version of each parameter, conditioning on the same event as the mean in that row.

A. Aakvik et al. / Journal of Econometrics 125 (2005) 15–51, p. 40

Source: Aakvik, Heckman and Vytlacil (2005)

67 / 72

Example: Effect of Year of College on Wages (Parametric)

MTE for Effect of Year of College on Wages

2767 CARNEIRO ET AL.: ESTIMATING MARGINAL RETURNS TO EDUCATION VOL. 101 NO. 6

uS). Individuals choose the schooling sector in which they have comparative advantage. The magnitude of the heterogeneity in returns on which agents select is substantial: returns can vary from −15.6 percent (for high uS persons, who would lose from attending college) to 28.8 percent per year of college (for low uS persons).16 The magnitude of total heterogeneity is likely to be even higher since the MTE is the average gain at that quantile of desire to attend college. In general, there will be a distribution of returns centered at each value of the MTE. Furthermore, once we account for variation in X and its impact on returns through X(δ1 − δ0), we observe returns as low as −31.56 percent and as high as 51.02 percent.

Using the weights presented in online Appendix Table A-1B, we can construct the standard treatment parameters from the MTE. We present the results in the first column of Table 5 (standard errors are bootstrapped). These include marginal returns to the three different policies considered in Table 1 (MPRTE), which are all

16 One unattractive feature of the normal model is that (for our estimates of σ1V and σ0V) MTE(x, 0) = +∞ and MTE(x, 1) = −∞. In order to get finite values at the extremes of the normal MTE, we restrict the support of uS to be between 0.0001 and 0.9999.

(Figure: MTE on the vertical axis, plotted against uS on the horizontal axis from 0 to 1, with 90 percent confidence bands.)

Figure 1. MTE Estimated from a Normal Selection Model

Notes: To estimate the function plotted here, we estimate a parametric normal selection model by maximum likelihood. The figure is computed using the following formula:

ΔMTE(x, uS) = μ1(x) − μ0(x) − (σ1V − σ0V) Φ^{−1}(uS),

where σ1V and σ0V are the covariances between the unobservables of the college and high school equation and the unobservable in the selection equation; and X includes experience, current average earnings in the county of residence, current average unemployment in the state of residence, AFQT, mother’s education, number of siblings, urban residence at 14, permanent local earnings in the county of residence at 17, permanent unemployment in the state of residence at 17, and cohort dummies. We plot 90 percent confidence bands.

Source: Carneiro, Heckman and Vytlacil (2011)

68 / 72

Example: Effect of Year of College on Wages (Semi-Parametric)

MTE for Effect of Year of College on Wages


mean values in the sample. As above, we annualize the MTE. Our estimates show that, in agreement with the normal model, E(U1 − U0 | US = uS) is declining in uS, i.e., students with high values of uS have lower returns than those with low values of uS.

Even though the semiparametric estimate of the MTE has larger standard errors than the estimate based on the normal model, we still reject the hypothesis that its slope is zero. We have already discussed the rejection of the hypothesis that MTE is constant in uS, based on the test results reported in Table 4, panel A. But we can also directly test whether the semiparametric MTE is constant in uS or not. We evaluate the MTE at 26 points, equally spaced between 0 and 1 (with intervals of 0.04). We construct pairs of nonoverlapping adjacent intervals (0–0.04, 0.08–0.12, 0.16–0.20, 0.24–0.28, …), and we take the mean of the MTE for each pair. These are LATEs defined over different sections of the MTE. We compare adjacent LATEs. Table 4, panel B, reports the outcome of these comparisons. For example, the first column reports that

E(Y1 − Y0 | X = x̄, 0 ≤ US ≤ 0.04) − E(Y1 − Y0 | X = x̄, 0.08 ≤ US ≤ 0.12) = 0.0689.

(Figure: E(Y1 − Y0 | X, uS) on the vertical axis, plotted against uS on the horizontal axis from 0 to 1.)

Figure 4. E(Y1 − Y0 | X, uS) with 90 Percent Confidence Interval: Locally Quadratic Regression Estimates

Notes: To estimate the function plotted here, we first use a partially linear regression of log wages on polynomials in X, interactions of polynomials in X and P, and K(P), a locally quadratic function of P (where P is the predicted probability of attending college), with a bandwidth of 0.32; X includes experience, current average earnings in the county of residence, current average unemployment in the state of residence, AFQT, mother’s education, number of siblings, urban residence at 14, permanent local earnings in the county of residence at 17, permanent unemployment in the state of residence at 17, and cohort dummies. The figure is generated by evaluating the derivative of (9) at the average value of X. Ninety percent standard error bands are obtained using the bootstrap (250 replications).

69 / 72

Example: Effect of Year of College on Wages

Effect of Year of College on Wages

2768 THE AMERICAN ECONOMIC REVIEW OCTOBER 2011

below the return to the average student (TT = E(β | S = 1)), the average person (ATE = E(β)), and the IV estimate. But it is not clear if these estimates are reliable, given the strong normality assumption used to generate them. We next corroborate these estimates of marginal returns using a more robust semiparametric approach.

C. Estimating the MTE and Marginal Policy Effects Using Local Instrumental Variables

An alternative and more robust approach for estimating the MTE estimates E(Y | X, P(Z) = p) semiparametrically and then computes its derivative with respect to p, as shown in the analysis of equations (5) and (6). If all we are willing to assume is that (U0, U1, V) is independent of Z given X, then it is only possible to estimate the MTE over the support of P conditional on X. Figure 2 plots f(P | X), the density of P given X (P is estimated by a logit). Since X is multidimensional, we use an index of X (X[δ1 − δ0]). It is striking how small the support of P is for each value of the X index. It is not possible to estimate MTE over the full unit interval, and as a consequence, it is not possible to estimate conventional treatment parameters such as the average treatment effect (E(β)) or the effect of treatment on the treated (E(β | S = 1)). It is still possible, however, to estimate MPRTE, since this parameter only puts positive weight over sections of the MTE that are identified within the support of f(P | X).

Empirically, it is very difficult to apply the procedure described in Section I while conditioning on X nonparametrically. We first proceed by invoking the stronger assumption that (X, Z) is independent of (U0, U1, US). We relax it below. Under this

Table 5: Returns to a Year of College

Model                                        Normal              Semiparametric
ATE = E(β)                                   0.0670 (0.0378)     Not identified
TT = E(β | S = 1)                            0.1433 (0.0346)     Not identified
TUT = E(β | S = 0)                           −0.0066 (0.0707)    Not identified
MPRTE
  Policy perturbation     Metric
  Zαk = Zk + α            |Zγ − V| < e       0.0662 (0.0373)     0.0802 (0.0424)
  Pα = P + α              |P − U| < e        0.0637 (0.0379)     0.0865 (0.0455)
  Pα = (1 + α)P           |P/U − 1| < e      0.0363 (0.0569)     0.0148 (0.0589)
Linear IV (using P(Z) as the instrument)     0.0951 (0.0386)
OLS                                          0.0836 (0.0068)

Notes: This table presents estimates of various returns to college, for the semiparametric and the normal selection models: average treatment effect (ATE), treatment on the treated (TT), treatment on the untreated (TUT), and different versions of the marginal policy relevant treatment effect (MPRTE). The linear IV estimate uses P as the instrument. Standard errors (in parentheses) are bootstrapped (250 replications). See online Appendix Table A-1 for the exact definitions of the weights. See Table 1 for the weights for MPRTE. For more discussion of MPRTE, see Carneiro, Heckman, and Vytlacil (2010).

Source: Carneiro, Heckman and Vytlacil (2011)

70 / 72

Parametric Estimation of MTE, Structure on MTR

Alternative Parametric Approach: Brinch, Mogstad, and Wiswall (JPE, 2017), “Beyond LATE with a Discrete Instrument”

Place parametric structure directly on E(Y1 | U), E(Y0 | U) (MTRs).

Can identify model with linear E(Y1 | U), E(Y0 | U), linear MTE with only a binary instrument.

The greater the variation in Z, the richer the parametric model on E(Y1 | U), E(Y0 | U) that can be estimated.

Can test external validity of LATE even with binary Z.

See also Kowalski (2016).

71 / 72

Parametric Estimation of MTE, Structure on MTR (cont’d)

Suppose Yj = Xβj + Uj, j = 0, 1, with (U0, U1, UD) ⊥⊥ (X, Z), and define

k0(p) = E[U0 | UD = p]

k1(p) = E[U1 | UD = p]

k(p) = k1(p) − k0(p)

so that

E[Y0 | X, UD = p] = Xβ0 + k0(p)

E[Y1 | X, UD = p] = Xβ1 + k1(p)

MTE(X, p) = X(β1 − β0) + k(p)

Brinch et al. (2017) impose parametric structure directly on k1(p), k0(p).

72 / 72

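\[ E[Y_0 \mid D = 0, X, P] = X\beta_0 + K_0(P) \]

\[ E[Y_1 \mid D = 1, X, P] = X\beta_1 + K_1(P) \]

where, since D = 1 exactly when UD ≤ P(Z),

\[ K_0(p) \equiv E[U_0 \mid U_D > p] = \frac{1}{1-p} \int_p^1 k_0(u)\, du \]

\[ K_1(p) \equiv E[U_1 \mid U_D \le p] = \frac{1}{p} \int_0^p k_1(u)\, du \]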

73 / 72

General idea:

1 impose parametric structure on k0(P), k1(P),

2 estimate parameters by estimating E[Y0 | D = 0, X, P], E[Y1 | D = 1, X, P], using parametric form implied by parametric restrictions on k0(P), k1(P).

3 use resulting parameters to estimate MTE, functionals of MTE.

74 / 72

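For example, suppose no X covariates, and assume:

\[ k_0(p) = \alpha_0 \cdot p - \tfrac{1}{2}\alpha_0 \]

\[ k_1(p) = \alpha_1 \cdot p - \tfrac{1}{2}\alpha_1 \]

which implies

\[ K_0(p) = \tfrac{1}{2}\alpha_0\, p \]

\[ K_1(p) = \tfrac{1}{2}\alpha_1 (p - 1) \]

and thus

\[ E[Y_0 \mid D = 0, P = p] = \mu_0 + \tfrac{1}{2}\alpha_0\, p \]

\[ E[Y_1 \mid D = 1, P = p] = \mu_1 + \tfrac{1}{2}\alpha_1 (p - 1) \]

Can estimate α0, α1, then use to form MTE. Can do so even with Z binary.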
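As a rough illustration of why a binary instrument suffices in this linear example, the sketch below solves for (µ0, α0) and (µ1, α1) from the two conditional means available at each of the two propensity-score values, then forms MTE(u) = (µ1 − µ0) + (α1 − α0)(u − 1/2). It works with population moments rather than estimates, and the function names are hypothetical, not taken from Brinch, Mogstad, and Wiswall.

```python
def solve_two_point(p_lo, p_hi, m_lo, m_hi, shift):
    """Solve m = mu + 0.5 * alpha * (p - shift) at two propensity values for (mu, alpha)."""
    alpha = 2.0 * (m_hi - m_lo) / (p_hi - p_lo)
    mu = m_lo - 0.5 * alpha * (p_lo - shift)
    return mu, alpha

def linear_mte_from_binary_z(p_lo, p_hi, y0_means, y1_means):
    """y0_means = (E[Y | D=0, P=p_lo], E[Y | D=0, P=p_hi]); y1_means likewise for D=1.

    Returns a function u -> MTE(u) under the linear k0, k1 assumed in the example.
    """
    # E[Y0 | D=0, P=p] = mu0 + 0.5 * alpha0 * p
    mu0, a0 = solve_two_point(p_lo, p_hi, *y0_means, shift=0.0)
    # E[Y1 | D=1, P=p] = mu1 + 0.5 * alpha1 * (p - 1)
    mu1, a1 = solve_two_point(p_lo, p_hi, *y1_means, shift=1.0)
    return lambda u: (mu1 - mu0) + (a1 - a0) * (u - 0.5)

# Made-up population moments generated from mu0=1, alpha0=0.4, mu1=2, alpha1=-0.6.
p_lo, p_hi = 0.3, 0.7
y0_means = (1 + 0.2 * p_lo, 1 + 0.2 * p_hi)
y1_means = (2 - 0.3 * (p_lo - 1), 2 - 0.3 * (p_hi - 1))
mte = linear_mte_from_binary_z(p_lo, p_hi, y0_means, y1_means)
print(mte(0.0), mte(0.5), mte(1.0))
```

With sample data the four conditional means would be replaced by sample averages, and standard errors would need to account for estimation of P(Z).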

75 / 72

What Does Linear IV Estimate?

Consider J(Z ) as an instrument, a scalar function of Z .

\[ \Delta^{IV}_J = \frac{\mathrm{Cov}(Y, J(Z))}{\mathrm{Cov}(D, J(Z))}. \]

How to express as a weighted average of MTE?

76 / 72

\[ \Delta^{IV}_J = \int_0^1 \Delta^{MTE}(u)\, \omega^J_{IV}(u)\, du \tag{4.1} \]

\[ \omega^J_{IV}(u) = \frac{E\left(J(Z) - E[J(Z)] \mid P(Z) > u\right) \Pr\left(P(Z) > u\right)}{\mathrm{Cov}(J(Z), D)}. \tag{4.2} \]

J(Z) and P(Z) do not have to be continuous random variables.

Functional forms of P(Z) and J(Z) are general.

The weights are always non-negative if J(Z) is monotonic in the scalar P(Z).

77 / 72

The possibility of negative weights arises only when J(Z) is not a monotonic function of P(Z).

This may arise, e.g., when there are two or more instruments, and the analyst computes estimates with only one instrument or a combination of the Z instruments that is not a monotonic function of P(Z) so that J(Z) and P(Z) are not perfectly dependent.

78 / 72

If use P(Z ) as the instrument, J(Z ) = P(Z ), then

The weights are everywhere non-negative.

Weighting function is maximal for u = E(P(Z) | X = x) and minimal for u = 0, 1.

IV weights MTE more where density of P(Z ) is higher.
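The claims on this slide can be checked with a short derivation. The sketch below assumes, for simplicity, that P = P(Z) has a density f_P on [0, 1] and leaves conditioning on X implicit; it specializes the IV weight to J(Z) = P(Z).

```latex
% Assumption: P = P(Z) has density f_P on [0,1]; conditioning on X is implicit.
\begin{align*}
\mathrm{Cov}(P, D) &= E\left[P \, E(D \mid Z)\right] - E[P]^2 = \mathrm{Var}(P), \\
\omega^{P}_{IV}(u) &= \frac{E\left(P - E[P] \mid P > u\right)\Pr(P > u)}{\mathrm{Var}(P)}
  = \frac{1}{\mathrm{Var}(P)} \int_u^1 \left(t - E[P]\right) f_P(t)\, dt, \\
\frac{d}{du}\, \omega^{P}_{IV}(u) &= -\frac{\left(u - E[P]\right) f_P(u)}{\mathrm{Var}(P)}.
\end{align*}
```

The derivative is non-negative for u < E[P] and non-positive for u > E[P], so the weight peaks at u = E[P]. Since the weight equals zero at u = 0 and u = 1, it is non-negative everywhere, and it changes fastest where f_P is large.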

79 / 72

The weights can be constructed from data on (J ,P ,D).

Weights on ∆MTE(u) generating ∆IV are different from the weights corresponding to ∆TT(u), different from weights corresponding to other standard treatment parameters.

IV gives one weighted version of MTE, conventional treatment parameters give other weighted versions.

80 / 72

Discrete instruments J (Z )

Discrete Case

Support of the distribution of P(Z) contains a finite number of values p1 < p2 < · · · < pK.

Support of the instrument J(Z) is also discrete, taking I distinct values, j1 < j2 < · · · < jI.

E(J(Z) | P(Z) ≥ u) is constant in u for u within any (pℓ, pℓ+1) interval, and Pr(P(Z) ≥ u) is constant in u for u within any (pℓ, pℓ+1) interval.

81 / 72

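\[ \Delta^{IV}_J = \int E(Y_1 - Y_0 \mid U_D = u)\, \omega^J_{IV}(u)\, du \tag{4.3} \]

\[ = \sum_{\ell=1}^{K-1} \lambda_\ell \int_{p_\ell}^{p_{\ell+1}} E(Y_1 - Y_0 \mid U_D = u)\, \frac{1}{p_{\ell+1} - p_\ell}\, du \]

\[ = \sum_{\ell=1}^{K-1} \Delta^{LATE}(p_\ell, p_{\ell+1})\, \lambda_\ell, \]

where

\[ \lambda_\ell = (p_{\ell+1} - p_\ell)\, \frac{\sum_{i=1}^{I} \left(j_i - E(J)\right) \sum_{t > \ell}^{K} \Pr[J = j_i, P = p_t]}{\mathrm{Cov}(J(Z), D)}. \tag{4.4} \]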
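The weights in (4.4) can be computed directly from the joint distribution of (J, P). The sketch below is a minimal illustration under the assumption that this joint probability mass function is known (in practice it would be estimated); it uses Cov(J, D) = Cov(J, P), which holds because E[D | Z] = P(Z). The function name and the example distributions are made up for illustration.

```python
def iv_weights(pmf):
    """Weights lambda_l on LATE(p_l, p_{l+1}) from equation (4.4).

    pmf maps (j, p) -> Pr[J = j, P = p] for a discrete instrument J and
    discrete propensity score P. Returns (sorted p values, list of lambdas).
    """
    p_vals = sorted({p for (_, p) in pmf})
    e_j = sum(j * pr for (j, _), pr in pmf.items())
    e_p = sum(p * pr for (_, p), pr in pmf.items())
    # Cov(J, D) = Cov(J, P) because E[D | Z] = P(Z).
    cov_jd = sum((j - e_j) * (p - e_p) * pr for (j, p), pr in pmf.items())
    lambdas = []
    for l in range(len(p_vals) - 1):
        # Sum over support points with P strictly above p_l.
        upper = sum((j - e_j) * pr for (j, p), pr in pmf.items() if p > p_vals[l])
        lambdas.append((p_vals[l + 1] - p_vals[l]) * upper / cov_jd)
    return p_vals, lambdas

# Case 1: J = P (monotone), so all weights should be non-negative.
monotone = {(0.2, 0.2): 0.3, (0.5, 0.5): 0.4, (0.8, 0.8): 0.3}
# Case 2: J is not monotone in P, which can produce a negative weight.
non_monotone = {(1, 0.2): 0.3, (0, 0.5): 0.4, (2, 0.8): 0.3}
print(iv_weights(monotone)[1])
print(iv_weights(non_monotone)[1])
```

In both cases the weights sum to one; only the monotone case guarantees that each weight is non-negative.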

82 / 72

Generalizes the expression presented by Imbens and Angrist (1994) and Yitzhaki (1989).

Their analysis of the case of vector Z only considers the case where J(Z) and P(Z) are perfectly dependent because J(Z) is a monotonic function of P(Z).

The weights can be positive or negative for any ℓ but they must sum to 1 over the ℓ.

83 / 72

The central role of the propensity score

For the IV weight to be correctly constructed and interpreted, we need to know the correct model for P(Z).

IV depends on:

1 the choice of the instrument J(Z),

2 its dependence with P(Z),

3 the specification of the propensity score (i.e., what variables go into Z).

“Structural” LATE or MTE identified by P(Z ).

Can derive all other instruments in terms of this.

84 / 72

Comparing IV and OLS

In comparison to IV, what is plim of OLS?

Y = Y0 + D(Y1 − Y0).

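\begin{align*}
E(Y \mid D = 1) - E(Y \mid D = 0)
 &= \underbrace{E(Y_1 - Y_0 \mid D = 1)}_{\text{TT}} + \underbrace{E(Y_0 \mid D = 1) - E(Y_0 \mid D = 0)}_{\text{Selection Bias}} \\
 &= \underbrace{E(Y_1 - Y_0)}_{\text{ATE}} + \underbrace{\{E(Y_1 - Y_0 \mid D = 1) - E(Y_1 - Y_0)\}}_{\text{Sorting Gain}} + \underbrace{\{E(Y_0 \mid D = 1) - E(Y_0 \mid D = 0)\}}_{\text{Selection Bias}}
\end{align*}

= ATE + Sorting Gain + Selection Bias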
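The decomposition is an accounting identity, which a small simulation can illustrate. The sketch below uses an entirely hypothetical data-generating process in which both potential outcomes are visible to the simulator (never the case with real data) and in which people with larger gains and larger Y0 are more likely to take treatment.

```python
import random
from statistics import mean

def ols_decomposition(y0, y1, d):
    """Split the treated/untreated mean difference into ATE, sorting gain, selection bias."""
    gains = [a - b for a, b in zip(y1, y0)]
    treated = [i for i, di in enumerate(d) if di == 1]
    untreated = [i for i, di in enumerate(d) if di == 0]
    naive = mean(y1[i] for i in treated) - mean(y0[i] for i in untreated)
    ate = mean(gains)
    tt = mean(gains[i] for i in treated)
    selection_bias = mean(y0[i] for i in treated) - mean(y0[i] for i in untreated)
    return {"naive": naive, "ATE": ate, "TT": tt,
            "sorting_gain": tt - ate, "selection_bias": selection_bias}

def simulate(n=5000, seed=0):
    """Hypothetical DGP: selection on both the gain and the untreated outcome."""
    rng = random.Random(seed)
    y0, y1, d = [], [], []
    for _ in range(n):
        u0, gain, v = rng.gauss(0, 1), rng.gauss(0.5, 1), rng.gauss(0, 1)
        y0.append(u0)
        y1.append(u0 + gain)
        d.append(1 if 0.8 * gain + 0.5 * u0 - v > 0 else 0)
    return y0, y1, d

result = ols_decomposition(*simulate())
print(result)
```

In this design both the sorting gain and the selection bias are positive, so the naive comparison overstates both ATE and TT.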

85 / 72

Comparing IV and OLS (cont’d)

If ATE is a parameter of interest, OLS suffers from both the Sorting Gain and Selection Bias.

If TT is parameter of interest, OLS suffers from Selection Bias.

Using IV removes Selection Bias, but changes the parameter being estimated (neither ATE nor TT in general).

86 / 72

When will MTE be a constant?

Important question: Does ∆MTE(u) = E(Y1 − Y0 | UD = u) vary with u?

If E(Y1 − Y0 | UD = u) does not vary with u: “standard case.” Implies:

ATE = TT = LATE = policy counterfactuals = plim IV.

87 / 72

When will E (Y1 − Y0 | UD = u) not vary with u?

1 If Y1 = Y0 + β for some constant β.

2 More generally, if Y1 − Y0 is mean independent of UD, so treatment effect heterogeneity is allowed but individuals do not act upon their own idiosyncratic effect.

If Y1 − Y0 is not mean independent of UD, so treatment effect heterogeneity is present and individuals do act upon their own idiosyncratic effect, MTE will vary with u, and treatment parameters will differ.

88 / 72

ATE is only identified in limit sets (P(z) = 1 and P(z′) = 0).

TT requires a limit set that sets P(z) = 1 for each X.

“Identification at infinity” stalks IV and control function enterprise.

We can test to see if these complications are needed.

89 / 72

Testing for essential heterogeneity

Since ∂/∂p E(Y | P(Z) = p) = E(Y1 − Y0 | UD = p), we have:

Y1 − Y0 ⊥⊥ D ⇒ E(Y1 − Y0 | UD = u) = E(Y1 − Y0)

⇒ ∂/∂p E(Y | P(Z) = p) = E(Y1 − Y0)

⇒ E(Y | P(Z) = p) is linear in p.

Y1 − Y0 ⊥⊥ D ⇒ E(Y | P(Z) = p) = a + bp,

where b = ∆MTE = ∆ATE = ∆TT.

90 / 72

Testing for essential heterogeneity (cont’d)

Y1 − Y0 ⊥⊥ D ⇒ E (Y | P(Z ) = p) = a + bp

If can’t reject E(Y | P(Z) = p) is linear in p, can’t reject either Y1 − Y0 is constant or Y1 − Y0 ⊥⊥ D. No essential heterogeneity, analysis simplifies tremendously.

If E(Y | P(Z) = p) is nonlinear in p, then evidence of essential heterogeneity: the returns to treatment vary in the population, and individuals act upon it.

91 / 72

Testing for essential heterogeneity (cont’d)

Test E(Y | P(Z) = p) = a + bp as a test of essential heterogeneity?

Simple testing strategy from Carneiro, Heckman and Vytlacil (2011):

Regress Y on polynomial in P(Z), test higher order terms of polynomial are jointly zero.

Not omnibus test, but valid test with power in some directions.

Sequential version of test developed in Heckman, Schmierer, and Urzua (2010).
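The polynomial test can be sketched in a few lines. The example below is a simplified illustration, not the procedure of the cited papers: it ignores X covariates, treats P(Z) as known rather than estimated, and reports only the classical F statistic for the joint hypothesis that the coefficients on p², …, p^degree are zero (a p-value would require the F distribution, and the papers use bootstrap-based inference). All function names are hypothetical.

```python
import random

def ols_rss(X, y):
    """Residual sum of squares from regressing y on the columns of X."""
    k = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(row[a] * row[b] for row in X) for b in range(k)]
         + [sum(row[a] * yi for row, yi in zip(X, y))] for a in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [x - f * z for x, z in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

def linearity_f_stat(p, y, degree=3):
    """F statistic for H0: coefficients on p^2, ..., p^degree are jointly zero."""
    n, q = len(y), degree - 1
    rss_r = ols_rss([[1.0, pi] for pi in p], y)                              # linear in p
    rss_u = ols_rss([[pi ** d for d in range(degree + 1)] for pi in p], y)   # polynomial in p
    return ((rss_r - rss_u) / q) / (rss_u / (n - degree - 1))

# Made-up data: one outcome linear in p, one with curvature.
rng = random.Random(0)
p = [(i + 0.5) / 400 for i in range(400)]
y_linear = [1.0 + 0.5 * pi + rng.gauss(0, 0.05) for pi in p]
y_curved = [1.0 + 0.5 * pi - 2.0 * pi ** 2 + rng.gauss(0, 0.05) for pi in p]
print(linearity_f_stat(p, y_linear), linearity_f_stat(p, y_curved))
```

A large statistic for the curved outcome is the sample analogue of rejecting linearity of E(Y | P(Z) = p), and hence of finding evidence for essential heterogeneity.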

92 / 72

Example: Effect of Year of College on Wages

Table 4A - Test of Linearity of E(Y | X, P = p) using polynomials in P (a)

Degree of polynomial                          2        3        4        5
p-value of joint test of nonlinear terms      0.013    0.018    0.032    0.028
Adjusted critical value                       0.026
Outcome of test                               Reject

Table 4B - Test of Equality of LATEs (H0: LATE1(U_S^{L1}, U_S^{H1}) − LATE2(U_S^{L2}, U_S^{H2}) = 0) - Baseline Model (b)

Ranges of US for LATE1    Ranges of US for LATE2    Difference in LATEs    P-value
(0, 0.04)                 (0.08, 0.12)              0.0689                 0.0240
(0.08, 0.12)              (0.16, 0.20)              0.0629                 0.0280
(0.16, 0.20)              (0.24, 0.28)              0.0577                 0.0280
(0.24, 0.28)              (0.32, 0.36)              0.0531                 0.0320
(0.32, 0.36)              (0.40, 0.44)              0.0492                 0.0320
(0.40, 0.44)              (0.48, 0.52)              0.0459                 0.0520
(0.48, 0.52)              (0.56, 0.60)              0.0431                 0.0520
(0.56, 0.60)              (0.64, 0.68)              0.0408                 0.0760
(0.64, 0.68)              (0.72, 0.76)              0.0385                 0.0960
(0.72, 0.76)              (0.80, 0.84)              0.0364                 0.1320
(0.80, 0.84)              (0.88, 0.92)              0.0339                 0.1800
(0.88, 0.92)              (0.96, 1)                 0.0311                 0.2400
Joint P-value                                                              0.0520

Notes: (a) The size of the test is controlled using a critical value constructed by the bootstrap method of Romano and Wolf (2005). (b) In order to compute the numbers in this table we construct groups of values of US and average the MTE within these groups, by computing E(Y1 − Y0 | X = x, U_S^L ≤ US ≤ U_S^H), where U_S^L and U_S^H are the lowest and highest values of US for a given group. Then we compare the average MTE across adjacent groups and test whether the difference is equal to zero (using the bootstrap with 250 replications).
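The grouping behind Table 4B can be illustrated as follows. The sketch takes an MTE function as given (here a hypothetical linear MTE, not the estimated one), evaluates it on 26 equally spaced points, averages it over consecutive pairs of points, and differences adjacent group averages. The bootstrap inference used in the paper is not reproduced.

```python
def adjacent_late_differences(mte, n_points=26):
    """Average the MTE over pairs of grid points and difference adjacent averages.

    Returns (groups, diffs), where groups are the (low, high) grid pairs
    and diffs[g] = LATE(group g) - LATE(group g + 1).
    """
    step = 1.0 / (n_points - 1)                 # 0.04 when n_points = 26
    grid = [i * step for i in range(n_points)]
    # Consecutive pairs: (0, 0.04), (0.08, 0.12), ..., (0.96, 1).
    groups = [(grid[i], grid[i + 1]) for i in range(0, n_points - 1, 2)]
    lates = [(mte(lo) + mte(hi)) / 2 for lo, hi in groups]
    diffs = [lates[g] - lates[g + 1] for g in range(len(lates) - 1)]
    return groups, diffs

# Hypothetical MTE that declines linearly in u.
groups, diffs = adjacent_late_differences(lambda u: 0.3 - 0.5 * u)
print(len(groups), [round(x, 4) for x in diffs])
```

With a linear MTE every difference equals the slope times the 0.08 spacing between group midpoints; the declining differences in Table 4B instead point to an MTE that flattens as uS rises.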

93 / 72

Application: Foster Care and Adult Crime

Application: Doyle (2008), “Child Protection and Adult Crime: Using Investigator Assignment to Estimate Causal Effects of Foster Care” (Journal of Political Economy)

Foster Care in U.S.

U.S. spends $20 billion per year on foster care.

800,000 children per year spend some time in foster care.

20% of U.S. prison population under age 30 had been in foster care.

Investigate the effect of foster care on later adult crime, including “at the margin.”

94 / 72

Data from Illinois, linking Illinois State Police data to child abuse investigation data.

Cases first investigated 1990-2003, children aged 4-16.

Crime data 2000-2005, ages 18-31 in 2005.

23,254 observations.

16% of cases result in foster care placement.

26% of children are later arrested as adults by 2005.

95 / 72

Table: Summary Statistics, Foster Care

CHILD PROTECTION AND ADULT CRIME 753

TABLE 1: Summary Statistics

Variable                          Mean     Standard Deviation
Foster care placement             .16      .36
Race:
  White                           .71      .46
  African American                .25      .43
  Hispanic                        .03      .18
Initial reporter:
  Physician                       .07      .25
  School                          .17      .38
  Police                          .21      .41
  Family                          .18      .38
  Neighbor                        .07      .25
  Other government                .14      .35
  Anonymous                       .12      .33
  Other reporter                  .03      .17
Age at report                     11.0     3.1
Sex: boy                          .50      .50
Allegation:
  Lack of supervision             .26      .44
  Environmental neglect           .11      .31
  Other neglect                   .06      .24
  Substantial risk of harm        .35      .48
  Physical abuse                  .20      .40
  Other abuse                     .02      .16
Observations                      23,254

Note: The statistics pertain to children investigated outside of Cook County between July 1, 1990, and June 30, 2003, and who were at least 18 years old in 2005.

not at risk for an adult arrest. This results in a sample of children between the ages of 4 and 16 at the time of the initial child protection report: children who would be between the ages of 18 and 31 in 2005. Thus, the results focus on older, poorer children than the population of children who are investigated for abuse or neglect. In addition, sexual abuse cases (8 percent of the total) are excluded, since these cases do not enter into the rotational assignment of investigators.

The analysis sample includes over 23,000 children. To better understand the types of allegations, reporters, and child characteristics in the child protection system, table 1 reports summary statistics: 16 percent of the children investigated were eventually placed in foster care (approximately 10 percent of investigated children are placed in foster care in the United States as a whole, and the higher placement rate here largely reflects the restriction of the sample to children who received Public Assistance at some point prior to the abuse report); 71 percent of the investigated children are white, compared to 87 percent of the population aged 5–14 in 2000 in Illinois outside of Cook County (figure from the U.S. Census of Population); the reporters of the abuse or neglect are typically school officials, police, and family members; and


Source: Doyle (2008, JPE),“Child Protection and Adult Crime: Using Investigator Assignment to Estimate Causal Effects of Foster Care”

96 / 72


Children suspected of abuse reported to Illinois Department of Children and Family Services (DCFS) by physicians, educators, police, family members.

Once reported, allegation assigned to “case manager”.

Doyle argues assignment is essentially randomized (except for special cases, including alleged sexual abuse).

Case managers investigate, decide whether charges unsubstantiated or to bring case to judge.

Differences across case managers in fraction of cases that result in foster care placement.

98 / 72

Doyle uses placement rate of assigned case manager as an instrument:

He argues that case managers are essentially randomly assigned, argues for instrument exogeneity.

Differences across case managers in fraction of cases that result in foster care placement, argues for instrument relevance.

In particular, constructs JIVE type instrument:

For each case, compute assigned case worker’s placement differential, defined in a “leave one out” manner to not include the particular case.
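A leave-one-out instrument of this kind can be sketched as below. This is a simplified illustration and an assumption about the construction, not Doyle's exact definition: the differential is taken relative to the leave-one-out placement rate of the whole sample, whereas the paper's version may be computed within finer cells (for example, investigation teams and years). The function name and the toy data are made up.

```python
from collections import defaultdict

def leave_one_out_differential(cases):
    """cases: list of (manager_id, placed) with placed in {0, 1}.

    For each case, returns the assigned manager's placement rate over their
    OTHER cases minus the overall placement rate over all OTHER cases,
    or None when the manager has no other case.
    """
    total_n = len(cases)
    total_placed = sum(placed for _, placed in cases)
    n_by_mgr = defaultdict(int)
    placed_by_mgr = defaultdict(int)
    for mgr, placed in cases:
        n_by_mgr[mgr] += 1
        placed_by_mgr[mgr] += placed
    out = []
    for mgr, placed in cases:
        if n_by_mgr[mgr] < 2:
            out.append(None)  # no other cases to average over
            continue
        own_rate = (placed_by_mgr[mgr] - placed) / (n_by_mgr[mgr] - 1)
        overall_rate = (total_placed - placed) / (total_n - 1)
        out.append(own_rate - overall_rate)
    return out

# Toy data: manager "a" places more often than manager "b".
cases = [("a", 1), ("a", 0), ("a", 1), ("b", 0), ("b", 0), ("b", 1)]
print(leave_one_out_differential(cases))
```

Excluding the case's own outcome keeps the instrument from being mechanically correlated with that case's placement decision, which is the point of the leave-one-out construction.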

99 / 72

Figure: Placement Differential

Fig. 1. Foster care placement (actual and predicted) and arrest indicator vs. case manager placement differential. Local linear estimates, evaluated at each percentile of the case manager placement differential. Pilot bandwidth chosen by cross-validation is 0.034 for the actual and predicted placement rates. For the arrest rate the bandwidth is 0.056.


100 / 72

Table: Determinants of Case Worker Placement Differential

TABLE 3: Case Manager Assignment and Foster Care Placement

Dependent Variable: Foster Care Placement

                                        Coefficient   Standard Error   Coefficient   Standard Error
                                        (1)           (2)              (3)           (4)
Case manager placement differential     .229          .036**           .233          .035**
Race:
  White                                                                −.002         .029
  African American                                                     .093          .029**
  Hispanic                                                             −.030         .031
Initial reporter:
  Physician                                                            .043          .018*
  School                                                               .025          .015
  Police                                                               .073          .016**
  Family                                                               .016          .015
  Neighbor                                                             −.013         .016
  Other government                                                     .084          .016**
  Anonymous                                                            .002          .016
Age at report:
  Age 6                                                                −.027         .018
  Age 7                                                                .001          .016
  Age 8                                                                .008          .017
  Age 9                                                                .014          .017
  Age 10                                                               .016          .017
  Age 11                                                               .016          .017
  Age 12                                                               .020          .017
  Age 13                                                               .020          .018
  Age 14                                                               .016          .017
  Age 15                                                               −.007         .018
  Age 16                                                               −.017         .018
Sex: boy                                                               −.016         .005**
Allegation:
  Physical abuse                                                       −.172         .015**
  Substantial risk                                                     −.180         .015**
  Other abuse                                                          −.162         .019**
  Lack of supervision                                                  −.152         .015**
  Environmental neglect                                                −.188         .016**
Mean of dependent variable              .16
Observations                            23,254

Note: Models are estimated by OLS. Data are for school-aged children outside of Cook County. Standard errors are clustered at the case manager level. All models include year indicators.

* Significant at 5 percent. ** Significant at 1 percent.

Crime Outcomes

To compare crime outcomes, Y, empirical models for child i investigated by case manager c in subteam j during year t are of the form

\[ Y_{icj} = \alpha_0 + \alpha_1 R_{icj} + \alpha_2 X_i + \sum_k \delta_k 1(t_i = k) + \varepsilon_{icj}. \tag{7} \]

This model is estimated separately for each outcome by OLS and two-

Source: Doyle (2008, JPE),“Child Protection and Adult Crime: Using Investigator Assignment to Estimate Causal Effects of Foster Care” 101 / 72


Results: OLS, TSLS

760 JOURNAL OF POLITICAL ECONOMY

TABLE 4: Foster Care Placement and Crime Outcomes: 2000–2005

                              OLS        OLS        2SLS       2SLS       LIML       LIML
                              (1)        (2)        (3)        (4)        (5)        (6)

A. Dependent Variable: Arrested
Foster care placement         .075       .060       .388       .391       .226       .217
                              (.008)**   (.008)**   (.189)*    (.182)*    (.113)*    (.111)*
Mean of dependent variable    .260
Full controls                 No         Yes        No         Yes        No         Yes
Observations                  23,254     23,254     23,254     23,254     22,691     22,632

B. Dependent Variable: Sentence of Guilty/Withheld
Foster care placement         .045       .039       .403       .405       .236       .241
                              (.007)**   (.007)**   (.160)*    (.154)**   (.092)**   (.092)**

C. Dependent Variable: Sentenced to Prison
Foster care placement         .035       .031       .219       .225       .176       .176
                              (.005)**   (.005)**   (.104)*    (.102)*    (.070)*    (.070)**

Note: Data are for children investigated for abuse or neglect outside of Cook County. Standard errors are clustered at the case manager level. All models include year indicators. Columns 1 and 2 report estimates from models estimated by OLS. Columns 3 and 4 report estimates from models estimated by 2SLS, with the case manager removal differential as the excluded instrument. Columns 5 and 6 report estimates from models estimated by LIML, with individual investigator indicators as the excluded instruments and subteam by year fixed effects. The LIML models use a sample limited to investigators with at least five investigations in the analysis sample.

stage least squares (2SLS), with the case manager placement differential,, used as an instrument for the indicator for removal, .Z Ricj icj

Similarly, the following model is estimated by LIML:

Y p a � a R � a X � q 1(T p k) � � , (8)�icj 0 1 icj 2 i k i icjk

where is a subteam indicator to estimate the model within the poolTi

of investigators who could have been assigned to a child, and case man-ager indicators are used as excluded instruments. The LIML estimatesare restricted to the sample of investigators with at least five investiga-tions in the analysis sample.

Table 4 reports the crime outcome results. As discussed above, the mean arrest rate in the sample is 26 percent. In terms of the OLS results, children who were placed in foster care have higher crime outcomes,

103 / 72

Results: Subgroups

TABLE 5
Results across Child Characteristics

Subgroup columns: Sex (1, 2); Allegation/Reporter (3, 4); Race (5, 6); Location, Matched by Name and Date of Birth (7, 8)

                                      Boy       Girl      Abuse     Neglect   White     African   Non-Cook  Cook
                                                                                        American
                                      (1)       (2)       (3)       (4)       (5)       (6)       (7)       (8)

A. Dependent Variable: Placed in Foster Care
Case manager placement differential   .211      .302      .276      .24       .213      .367      .256      .295
                                      (.052)**  (.047)**  (.047)**  (.058)**  (.040)**  (.081)**  (.040)**  (.058)
Relative to overall first stage       .92       1.32      1.21      1.05      .93       1.60      1.12      1.29

B. Dependent Variable: Arrested, 2000–2005
Foster care placement:
  OLS                                 .062      .060      .051      .070      .059      .057      .079      .061
                                      (.012)**  (.011)**  (.012)**  (.013)**  (.011)**  (.010)**  (.009)    (.008)
  2SLS                                .221      .509      .385      .389      .541      .591      .235      .201
                                      (.298)    (.187)**  (.203)    (.297)    (.263)*   (.249)*   (.210)    (.111)
Mean of dependent variable            .305      .215      .261      .259      .249      .248      .344      .403
Observations                          11,673    11,581    13,149    10,105    16,402    17,459    23,899    22,357

Note: Panel A reports the first-stage coefficients for each subgroup and the relative first stage to the overall first-stage coefficient reported in table 3, with the exception of cols. 7 and 8, which report the first-stage coefficient in each set of counties relative to the first stage from a pooled sample of all counties. The OLS and 2SLS cells report coefficients on foster care placement, with the second set instrumented by the case manager placement differential. Standard errors clustered at the case manager level are reported. In cols. 7 and 8, the match using name and date of birth allows the use of individuals with missing social security numbers, resulting in larger sample sizes.
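
The "Relative to overall first stage" row can be checked with simple arithmetic: dividing each subgroup coefficient by its reported ratio gives the baseline first-stage coefficient that the ratio implies. The sketch below does this with the numbers from Panel A. The dictionary name `panel_a` and the helper `implied_baseline` are hypothetical, and the implied baseline is only a back-of-the-envelope value subject to rounding in the table, not a figure taken from Doyle's Table 3.

```python
# Subgroup first-stage coefficient and its ratio to the baseline, copied from Table 5, Panel A.
panel_a = {
    "Boy": (0.211, 0.92),
    "Girl": (0.302, 1.32),
    "Abuse": (0.276, 1.21),
    "Neglect": (0.24, 1.05),
    "White": (0.213, 0.93),
    "African American": (0.367, 1.60),
    "Non-Cook": (0.256, 1.12),
    "Cook": (0.295, 1.29),
}


def implied_baseline(coefficient, ratio):
    """Baseline first-stage coefficient implied by a subgroup coefficient and its ratio."""
    return coefficient / ratio


implied = {name: implied_baseline(c, r) for name, (c, r) in panel_a.items()}

if __name__ == "__main__":
    for name, value in implied.items():
        print(f"{name:18s} {value:.4f}")
```

The implied values agree closely across columns, which suggests the rows were transcribed consistently. Columns 7 and 8 are stated to use a pooled-sample baseline, so their agreement with columns 1 to 6 should not be over-interpreted.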

104 / 72

Figure: Adult Crime on Fitted Probability of Foster Care (Doyle, 2008)

Fig. 2. A, Arrested vs. P(Placement | Z). B, Arrested marginal treatment effects. Local quadratic and associated derivative estimates, evaluated at each percentile of P(Placement | Z): the predicted placement from a probit model that includes only the case manager placement differential. Dashed lines report 5–95 percent bootstrapped confidence intervals. Pilot bandwidth chosen by cross-validation is 0.031.
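
The caption describes a local quadratic regression of the outcome on the fitted placement probability, with the derivative of that fit read as the marginal treatment effect. A minimal sketch of that idea on simulated data follows. The selection rule D = 1{U_D < P(Z)}, the linear MTE curve, the Epanechnikov kernel, the bandwidth, and the names `simulate_mte_data` and `local_quadratic` are all assumptions chosen for illustration. The sketch is not Doyle's code and omits the probit first stage, cross-validated bandwidth, and bootstrap intervals.

```python
import numpy as np


def simulate_mte_data(n=100_000, seed=0):
    """Simulate (p, y) where, by construction, MTE(u) = 0.5 - u (hypothetical design)."""
    rng = np.random.default_rng(seed)
    p = rng.uniform(0.05, 0.95, n)     # propensity score P(Z), taken as known here
    u_d = rng.uniform(size=n)          # unobserved resistance to treatment
    d = (u_d < p).astype(float)        # assumed selection rule D = 1{U_D < P(Z)}
    y0 = rng.normal(scale=0.1, size=n)
    y1 = y0 + 0.5 - u_d                # individual gain declines in u_d
    y = d * y1 + (1 - d) * y0
    return p, y


def local_quadratic(p, y, p0, bandwidth):
    """Kernel-weighted quadratic fit of y on (p - p0); returns (level, derivative) at p0."""
    u = (p - p0) / bandwidth
    weights = np.where(np.abs(u) < 1, 0.75 * (1 - u**2), 0.0)  # Epanechnikov kernel
    keep = weights > 0
    x = p[keep] - p0
    design = np.column_stack([np.ones_like(x), x, x**2])
    root_w = np.sqrt(weights[keep])
    coef, *_ = np.linalg.lstsq(design * root_w[:, None], y[keep] * root_w, rcond=None)
    return float(coef[0]), float(coef[1])  # coef[1] estimates dE[Y | P = p]/dp at p0


if __name__ == "__main__":
    p, y = simulate_mte_data()
    for p0 in (0.3, 0.5, 0.7):
        _, mte_hat = local_quadratic(p, y, p0, bandwidth=0.3)
        print(f"p = {p0:.1f}: estimated MTE {mte_hat:.3f}, true MTE {0.5 - p0:.3f}")
```

In this design E[Y | P = p] is exactly quadratic in p, so the local quadratic fit has no smoothing bias. With real data the bandwidth choice matters, which is why the caption reports a cross-validated pilot bandwidth.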

VI. Conclusion

Foster care placement is a far-reaching intervention in the lives of children who are at high risk of arrests and incarceration as adults. The analysis here uses the effective randomization of families to child protection investigators to estimate causal effects of foster care placement on crime outcomes. The results suggest that children placed in care have two to three times higher arrest, conviction, and imprisonment rates than children who remained at home. The point estimates are large and relatively imprecisely estimated, however, which suggests some

105 / 72

Figure: MTE, Foster Care on Adult Crime (Doyle, 2008)

Fig. 2. A, Arrested vs. P(Placement | Z). B, Arrested marginal treatment effects. Local quadratic and associated derivative estimates, evaluated at each percentile of P(Placement | Z): the predicted placement from a probit model that includes only the case manager placement differential. Dashed lines report 5–95 percent bootstrapped confidence intervals. Pilot bandwidth chosen by cross-validation is 0.031.

106 / 72

Application: Quantity-Quality

Alternative Parametric Approach: Brinch, Mogstad, and Wiswall (2017, JPE), “Beyond LATE with a Discrete Instrument: Heterogeneity in the Quantity-Quality Interaction of Children”

Examine the effect of having two or more siblings rather than one sibling on the years of schooling of the first born.

Use administrative data from Statistics Norway.

Entire resident population of Norway, children aged at least 25 in 2000.

Sample of 514,049 children.

Use two instruments: the first two children are of the same sex, and the second birth resulted in twins.

107 / 72

Table: Sample Statistics, Brinch, Mogstad, and Wiswall (JPE, 2017)

Table 1: Descriptive Statistics

                                   Mean      Std. Dev.
Outcome:
  Years of schooling               12.3      2.7
Instruments:
  Same sex, 1st and 2nd child      0.501     0.5
  Twins at second birth            0.0096    0.097
Endogenous regressor:
  At least three children          0.5021    0.5
Covariates:
  Female                           0.47      0.50
  Age in 2000                      39.5      9.2
  Mother's age at first birth      24.0      4.2
  Father's age at first birth      26.8      4.5
  Mother's years of schooling      10.0      1.4
  Father's years of schooling      10.1      2.6

Note: Descriptive statistics are for 514,049 children. All children are first born with at least one sibling. Twins at first birth are excluded from the sample. All children, parents and siblings are aged between 16 and 74 years at some point between 1986 and 2000.

108 / 72

Table: Brinch, Mogstad, and Wiswall (JPE, 2017)

Table 2: Fertility decision model - Average Derivatives

                                   Average effect   (std. err.)
Covariates:
  Age in 2000                       0.0163          (0.0011)
  Mother's age at first birth      -0.0161          (0.0013)
  Father's age at first birth       0.0007          (0.0008)
  Mother's years of schooling       0.0030          (0.0016)
  Father's years of schooling      -0.0038          (0.0019)
  Female                           -0.0016          (0.0018)
Instruments:
  Same sex, first and second        0.0567          (0.0012)
  Twins at 2nd parity               0.5179          (0.0007)

Note: This table reports the average partial effect (average treatment effect for binary variables) from a logit model for the probability of being in a family with 2 or more siblings rather than 1 sibling. The model is specified in the following way: We use a third order polynomial in “Age in 2000”, “Mother's age at first birth”, “Father's age at first birth”, “Mother's years of schooling” and “Father's years of schooling”; We include interactions between the first order terms of all covariates; “Same sex, first and second” enters the model without interaction terms; “Twins at 2nd parity” is interacted with all covariates (including higher order terms and interactions) to ensure that the model is consistent with the fact that there are no never takers with twins. Standard errors in parentheses are computed by nonparametric bootstrap with 100 bootstrap replications.
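
The note above distinguishes the average partial effect of a continuous covariate from the average treatment effect of a binary one. The sketch below shows how the two are typically computed from a fitted logit: the first averages the derivative p(1 - p)β over the sample, the second averages the difference in predicted probabilities with the binary variable switched on and off. The simulated data, the coefficient values, and the names `simulate_logit_data`, `fit_logit`, and `average_effects` are assumptions for illustration. The specification in the paper (third-order polynomials, interactions, bootstrap standard errors) is much richer and is not reproduced here.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def simulate_logit_data(n=200_000, seed=0):
    """Columns: constant, one continuous covariate, one binary instrument (all hypothetical)."""
    rng = np.random.default_rng(seed)
    x = np.column_stack([
        np.ones(n),
        rng.normal(size=n),                       # standardized continuous covariate
        rng.integers(0, 2, size=n).astype(float), # binary variable, e.g. a same-sex indicator
    ])
    true_beta = np.array([0.0, 0.3, 0.25])
    d = (rng.uniform(size=n) < sigmoid(x @ true_beta)).astype(float)
    return x, d


def fit_logit(x, d, iterations=25):
    """Logit maximum likelihood by Newton-Raphson; x must contain a constant column."""
    beta = np.zeros(x.shape[1])
    for _ in range(iterations):
        p = sigmoid(x @ beta)
        gradient = x.T @ (d - p)
        hessian = (x * (p * (1 - p))[:, None]).T @ x
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta


def average_effects(x, beta, continuous_col, binary_col):
    """Return (average partial effect of a continuous column, average effect of a binary column)."""
    p = sigmoid(x @ beta)
    ape_continuous = float(np.mean(p * (1 - p)) * beta[continuous_col])
    x_on, x_off = x.copy(), x.copy()
    x_on[:, binary_col] = 1.0
    x_off[:, binary_col] = 0.0
    ate_binary = float(np.mean(sigmoid(x_on @ beta) - sigmoid(x_off @ beta)))
    return ape_continuous, ate_binary


if __name__ == "__main__":
    x, d = simulate_logit_data()
    beta_hat = fit_logit(x, d)
    print(average_effects(x, beta_hat, continuous_col=1, binary_col=2))
```

The derivative formula applies only when the covariate enters linearly. With polynomial terms and interactions, as in the paper's specification, the partial effect has to account for every term in which the covariate appears.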

Table 3: OLS and IV estimates

                           Z− as instrument   P(Z) as instrument
IV:
  Same-sex instrument       0.174             -0.208
                           (0.115)            (0.104)
  Twins instrument          0.050             -0.060
                           (0.063)            (0.063)
  Both instruments          0.076             -0.015
                           (0.055)            (0.054)

OLS                        -0.052
                           (0.007)

Note: This table reports OLS and IV estimates of the effect of family size on the educational attainment of first born children. The first column (Z− as instrument) uses the first stage equation (16). The second column (P(Z) as instrument) uses the first stage equation (17). We construct P(Z) using the parameter estimates from the logit model with average derivatives reported in Table 2. The second stage is given by equation (15). We use the same specification for the covariates as reported in Table 2. The first row uses the “Same sex, first and second” instrument, the second row uses the “Twins at 2nd parity” instrument, and the third row uses both instruments. The OLS estimate is reported in the fourth row. Standard errors in parentheses are heteroskedasticity-robust.

109 / 72