Strengthening The Regression Discontinuity Design Using Additional Design Elements: A Within-Study Comparison
Abstract
The sharp regression discontinuity design (RDD) has three key weaknesses when compared with the randomized clinical trial (RCT). The RDD has lower statistical power than an equivalent RCT. Treatment effect estimates from an RDD are usually more dependent on modeling assumptions than in an equivalent RCT. And an RDD produces an estimate of average treatment effects among the narrow sub-population with assignment scores near the cut-off score, while a typical RCT produces an estimate of the treatment effect in the overall study population. In this paper, we explore one method for strengthening the RDD along these three dimensions. The method involves adding an untreated comparison group to the standard RDD. In the example we consider, the comparison group is constructed from pretest values of the outcome variable. The additional data provide a check on functional form assumptions, improve statistical precision, and support extrapolation beyond the cut-off sub-population. We designed a within study comparison to assess the performance of the method relative to standard RDD and RCT designs. The within study comparison involves (1) a standard posttest-only RDD; (2) a pretest-supplemented RDD; and (3) an RCT that serves as a benchmark. The designs are replicated using data from three different states, and we also consider three different assignment cut-offs so that we can study the performance of the methods under different conditions. Our results show that, relative to the posttest-only RDD, adding the pretest makes functional form assumptions more transparent and produces statistically more efficient estimates. Relative to the RCT benchmark, both versions of the RDD show no substantial bias at the cutoff. Most importantly, the pretest-supplemented RDD makes it possible to estimate causal effects in the region beyond the cut-off, and these estimates are very similar to the benchmark RCT estimates.
This indicates that the supplemented design can be used to support more general causal inferences. Several other types of supplemented RDDs are conceptually possible and are discussed.
Introduction
A carefully executed regression discontinuity design (RDD) is now widely considered a sound basis for
causal inference. But the status of RDD as a trustworthy method has evolved over time. The design was
introduced in Thistlethwaite and Campbell (1960). Goldberger (1972a, b) formally showed that the RDD
produces unbiased parameter estimates, but is less statistically efficient than a comparable randomized
clinical trial (RCT). More recent work has clarified the assumptions that support parametric and non-parametric identification in the RDD (Hahn, Todd, and van der Klaauw, 2001; Lee, 2008), and has
examined the statistical properties of common estimators (Porter 1998, 2003; Lee and Card, 2008;
Schochet 2009). A growing literature compares RDD treatment effect estimates to benchmark estimates
from an RCT. These ‘within study comparisons’ provide empirical evidence that an RDD produces causal
estimates that are similar to those from an RCT on the same topic (Cook and Wong, 2008; Green et al., 2009; Shadish et al., 2011).
Advancements in the technical literature have improved our understanding of RDD assumptions and
approaches to data analysis. But the basic elements of the design have not changed. An RDD requires an
outcome variable, a binary treatment, a continuous assignment variable, and a cutoff based treatment
assignment rule. The assignment rule is the crucial detail: in a successful RDD, individuals with
assignment scores on one side of the cutoff receive treatment and individuals on the other side of the
cutoff receive the control condition. The RDD is called “sharp” when all people receive the treatment
intended for them. When compliance is partial, the RDD is called “fuzzy”. This paper deals only with
sharp RDD studies.
At a conceptual level, the analysis of an RDD is not complicated. Researchers estimate treatment effects
by comparing mean outcomes among people with assignment scores just below and just above the
cutoff. The difference between these two conditional means is called a “regression discontinuity”
because it can be understood as a discontinuity in the regression function that links average outcomes
across sub-populations defined by different scores on the assignment variable. In the absence of a
treatment effect, the regression function would be smooth near the cutoff. A sudden break or
‘discontinuity’ at the cutoff is evidence of a treatment effect. The size of the discontinuity measures the
magnitude of the effect.
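The conceptual estimate described above, a difference in mean outcomes within a narrow band on either side of the cutoff, can be sketched on simulated data. Everything in this snippet is invented for illustration and is not from the original study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated sharp RDD: assignment score x, cutoff at 50, true effect of 5
# (all values invented for illustration).
n = 5000
x = rng.uniform(0, 100, n)
treated = x >= 50.0
y = 0.1 * x + 5.0 * treated + rng.normal(0, 1.0, n)

def discontinuity_estimate(x, y, cutoff, bandwidth):
    """Difference in mean outcomes just above versus just below the cutoff."""
    below = (x >= cutoff - bandwidth) & (x < cutoff)
    above = (x >= cutoff) & (x < cutoff + bandwidth)
    return y[above].mean() - y[below].mean()

effect = discontinuity_estimate(x, y, cutoff=50.0, bandwidth=2.0)  # near 5
```

In practice, researchers typically replace the raw local means with local linear regressions, which reduce bias at the cutoff boundary; the estimators actually used in this paper are described in the Methods section.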
RDD has at least three important limitations relative to an RCT. The first involves the amount of
statistical modeling required to identify and estimate causal effects. In an RCT, treatment effects are
non-parametrically identified: assumptions about the underlying statistical model are not required and
the connection between the research design and the statistical tools used to perform the analysis is
quite close1. RDD treatment effects are non-parametrically identified only in very large samples. In
practice, researchers proceed by specifying the parametric functional form of the regression and
allowing for an intercept shift at the cut-off value (Lee and Card, 2008). Since choosing the wrong
functional form can lead to biased treatment effect estimates, good applied RDD papers seek to
estimate the regression function flexibly and to evaluate how sensitive the results are to alternative
specifications. There is considerable value in new methods that can reduce the RDD's dependence on functional form assumptions, either by providing additional sensitivity analyses or by offering some way to partially validate those assumptions.
A second limitation is that treatment effect estimates are less statistically precise in an RDD than in a
comparable RCT. Lack of precision means that the statistical power of key hypothesis tests is lower in
1 Of course, analysts often employ parametric regression models in the analysis of experimental data either to
improve the statistical precision of the treatment effect estimates or to adjust for chance imbalances in observable covariates. But this additional modeling is usually not central to the study's findings.
an RDD (Goldberger, 1972b; Schochet, 2009). In parametric approaches to RDD, the efficiency loss is due
to multicollinearity that arises because assignment scores are highly correlated with individual
treatment status. Non-parametric RDD estimates may also have lower power, in part because they
employ a bandwidth that decreases the study’s effective sample size. Power is a secondary concern in
RDD studies with very large sample sizes, but it may be a central issue when investigators design RDD
studies prospectively and collect their own data directly from respondents. Adding more cases may then
prove costly and even diminish the value of RDD relative to other research designs that may provide a weaker foundation for causal inference but higher statistical power (Schochet, 2009).
A third limitation concerns the generality of RDD results. An RCT produces estimates of the average
treatment effect across all members of the study population. In contrast, RDD estimates are only
informative about average treatment effects among members of the narrow sub-population located
immediately around the cutoff. If a treatment is given to students who score above the 75th percentile
on some test, then the RDD study will produce results generalizable only to students scoring near the
75th percentile. Most social science theories and policy questions are not concerned with such narrow
sub-populations. But treatment effect estimates for broader sub-populations, such as all students or all students in the upper quartile of the distribution, require extrapolations beyond the cutoff. The extrapolations
usually are not trustworthy because no theoretical basis exists for assuming that the functional form of
the regression is stable beyond the range of the observed data. The crux of the problem is that we do
not know what the treatment group functional form would have looked like above the cutoff in the
absence of treatment. The standard practice is conservative. It limits inference to the sub-population in
the immediate neighborhood of the cut-off. The approach is methodologically admirable in that it sticks
to the assumptions supported by the research design. But it courts irrelevancy for many policy and
scientific questions. In our view, the narrow applicability and limited generalizability of causal results is
the most serious weakness of standard RDD as a practical method for program evaluation and policy
analysis.
In this paper, we explore an RDD variant that can improve on all three limitations. The approach involves
supplementing the conventional posttest-only RDD with an additional design element that can aid
causal inference. In the application considered here the additional design element is a pretest measure
of the outcome variable. We refer to the conventional RDD as a “posttest RDD” because it only requires
posttest information; and we refer to the supplemented design as a “pretest RDD” even though it makes
use of both pretest and posttest outcome data. The key idea is that the pretest data provides
information about what the regression function linking outcomes and assignment scores would have
“looked like” before the treatment was available. The method is not foolproof. The regression function
may have changed over time, for example. But the main assumptions of the design are partly testable
because the pretest and post-test untreated regression functions can be directly compared below the
cutoff score to observe whether they differ. Minor changes such as intercept shifts are easily
accommodated. But very dissimilar functional forms below the cutoff cast doubt on the results of any
pretest-supplemented RDD. Even when functional forms are similar below the cutoff, causal
extrapolation beyond the cutoff requires the untestable assumption that the two functions would have
remained similar above the cutoff.
The core of this paper is an empirical evaluation of the performance of the pretest and posttest RD
designs relative to each other and also relative to a benchmark RCT. To assess performance we
constructed a within study comparison. The method of within study comparison was developed by
LaLonde (1986). He examined whether various econometric adjustments for selection bias could
reproduce the results of a job-training RCT. Since then, researchers have used the method to study the
performance of RDD (Shadish et al., 2011), different forms of intact group and individual case matching
(Cook, Shadish & Wong, 2008; Bifulco, 2012), and alternative strategies of covariate selection (Cook &
Steiner, 2010). The implementation details of specific within study comparisons vary, but the
basic idea is always to test the validity of a non-experimental method by comparing its estimates to a
trustworthy benchmark from an RCT. Methods for conducting a high quality within study comparison
have evolved over time; Cook et al. (2008) describe best practices and we follow them in this paper.
Our within study comparison is based on data from the Cash and Counseling Demonstration Experiment
(Dale & Brown, 2007). In the original study, disabled Medicaid beneficiaries in Arkansas, Florida and New
Jersey were randomly assigned either to obtaining home and community based health services through
Medicaid (the control group), or to receiving a spending account that they could use to procure home
and community based services directly (the treatment group). In our analysis, the outcome variable is a
person’s Medicaid expenditures in the 12 months after the study began. We used baseline age as the
assignment variable. To construct pretest RDDs and posttest RDDs from the RCT data, we systematically
retained and excluded treatment or control observations. Specifically, we sorted the treatment group
and control group cases from the RCT by age (the assignment variable). Then for the RD designs we
defined a cut-off age and systematically deleted control group cases on one side of the cutoff and
treatment cases on the other. For replication purposes, we chose three different age cutoff values – 35,
50 and 70. And we analyzed the data separately for three different states – Florida, New Jersey and
Arkansas. In total, we examined 9 different posttest-only RDDs and 9 different pretest RDDs. We
compared each RDD estimate to the corresponding RCT estimate at the cutoff age under analysis.
And in the pretest RDD we used the comparison data to extrapolate beyond the cut-off to compute an
estimate of the average treatment effect for everyone older than the cutoff. We compared these above
the cut-off averages to RCT benchmarks as well. Estimates of average treatment effects above the cut-off depend on extrapolations, and the accuracy of these estimates represents a test of the ability of the pretest RDD to improve the generality of the standard design. The average effect above the cutoff
corresponds to the Average Treatment Effect on the Treated (ATT) parameter, which is often the target
parameter in applied program evaluation research. In all the analyses to be reported, standard errors for
the RCT and RDD estimates were computed using the bootstrap. Technical details about design and
analysis are presented later.
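As a concrete illustration of the bootstrap procedure mentioned above, the following sketch computes a bootstrap standard error for a simple treatment-control mean difference. The data here are simulated stand-ins, not the Cash and Counseling records:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical treatment/control outcomes (stand-ins for annual expenditures).
y_treat = rng.normal(8000.0, 3000.0, 400)
y_ctrl = rng.normal(6400.0, 3000.0, 400)

def bootstrap_se(y1, y0, n_boot=1000, seed=2):
    """Bootstrap standard error of the mean difference: resample each arm
    with replacement and take the standard deviation of the replicates."""
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        t = rng.choice(y1, size=y1.size, replace=True)
        c = rng.choice(y0, size=y0.size, replace=True)
        diffs[b] = t.mean() - c.mean()
    return diffs.std(ddof=1)

se = bootstrap_se(y_treat, y_ctrl)
```

The same resampling logic extends to the RDD estimators by recomputing the full estimation pipeline within each bootstrap replicate.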
The results of our analysis indicate that the pretest RDD can shore up all three key weaknesses of the
RDD. The pretest RDD improved statistical precision relative to the posttest RDD, although the RCT
remained the most precise. Similar functional forms were observed for the pretest and posttest RDDs on
the untreated side of the assignment variable as well as for the pretest RDD and the RCT on both sides
of the cutoff. This provides support for the pretest RDD’s key assumption that the difference in
functional forms in the two time periods is well approximated by a simple intercept shift and nothing
more. Extrapolation beyond the cutoff was also quite successful: the pretest RDD produced estimates of
the average treatment effect above the cutoff that were very close to the RCT benchmarks. Taken
together, these results indicate that supplementing the standard posttest RDD with pretest outcome
data can increase statistical power, improve the credibility of functional form assumptions, and generate
unbiased causal inference at and beyond the cutoff.
We start the body of the paper with a short description of the Cash and Counseling Demonstration RCT.
Details of the within study research design are in the second section, and estimation methods for each
of the research designs are described in the third section. The results are in the fourth section.
Conclusions and a short discussion end the paper.
Experimental Data
The Cash and Counseling Demonstration and Evaluation is described in detail elsewhere (Dale and
Brown, 2007a; Brown & Dale, 2007; Doty, Mahoney & Simon-Rusinowitz, 2007). The treatment
condition was a “consumer-directed budget” program, which allowed disabled Medicaid beneficiaries to
procure their own home and community based support services using a Medicaid-financed spending
allowance. The control group received home and community based support services under the status
quo Medicaid program. That is, in the control group, a Medicaid agency procured services for clients
from Medicaid certified providers; in the treatment group, people found their own services and paid for
them using the allotted budget. People were randomly assigned to treatment and control arms of the
study, and treatment group subjects could choose to use a consumer directed budget or not2. The new
program was meant to be budget-neutral, and the new personal allowance for support services was set
equal to the amount the agency would otherwise allocate to controls.
Study participants were disabled elderly and non-elderly adult Medicaid beneficiaries who agreed to
participate and lived in Arkansas, New Jersey, or Florida from 1999 to 2003. State Medicaid agencies
operated the demonstration. But Mathematica Policy Research (MPR) was responsible for random
assignment, data collection, analysis, and evaluation. The study employed a rolling enrollment design in
which new enrollees completed a baseline survey and then were randomly assigned to treatment or
control status, after which the state agency was informed of the assignments. The program resulted in
higher levels of patient satisfaction, small improvements on selected health outcomes (Carlson, Foster,
Dale and Brown, 2007), and higher post-treatment Medicaid expenditures (Dale & Brown, 2007b). The
increase in Medicaid expenditures is interesting because the treatment program was intended to be
budget neutral. The increase in spending occurred because the status quo Medicaid agency often failed
to successfully procure the home and community based services that program recipients were entitled
to receive. The treatment condition put people in charge of their own budgets and they were more
successful at actually procuring their goods and services. For instance, treatment group members were
2 People who chose the Cash and Counseling option were also assigned a counselor who would help them develop
a spending plan, provide advice about hiring workers, and would also monitor the person’s use of the allowance and general welfare. All of the subjects in the study (treatment and control) expressed interest in using the consumer directed budgeting option.
more likely to receive any services than members of the control group. They also received a larger
fraction of the services they were authorized to receive than members of the control group. In essence,
Medicaid expenditures were higher in the treatment group because the agency serving the control
group members often did not manage to actually spend the allotted funds.
Our methodological study used a small number of measures from the original study. We retained
information on age at baseline, state of residence, and randomly assigned treatment status. We also
created a measure of annual Medicaid expenditures by adding up 6 categories of monthly expenditures
across the 12 months before random assignment (pretest) and after it (posttest). The categories were:
Inpatient Expenditures, Diagnosis Related Group Expenditures, Skilled Nursing Expenditures, Personal
Assistance Services Expenditures, Home Health Services Expenditures, and Other Services Expenditures3.
Throughout, we refer to this 6-item index as “Medicaid expenditures” and it is the sole outcome variable
analyzed.
Table 1 gives summary statistics from the three states. In the RCT, Arkansas had 1004 participants in
each of the treatment and control arms, Florida had 906 control and 907 treatment participants, and
New Jersey had 869 control and 861 treatment group members. In Arkansas, the average participant
was 70 years old at baseline, compared to 55 in Florida and 62 in New Jersey. Within each state, average
pretest expenditures were similar in the treatment and control groups. But the level of spending varied
across states. The average person in Arkansas had expenditures of $6,400 in the pretest year compared
to $14,300 in Florida and $18,500 in New Jersey. Mean posttest expenditures were consistently higher
3 The claims data included a small number of cases with very high levels of expenditures that could be either real or data entry errors. To reduce concerns that these outliers would skew our regression estimates we top coded the pretest and posttest Medicaid expenditures variables at the 99th percentile of the pooled distribution of posttest expenditures, which was equal to $78,273. The top coding procedure affected 89 posttest observations and 79 pretest observations.
in the treatment groups. Intent to treat (ITT) estimates of the mean difference between the treatment
and control groups were about $1860 (p < .01) in Arkansas, $1856 (p = .01) in Florida, and $1200 (p =
.09) in New Jersey. These simple estimates show that the Cash and Counseling treatment consistently
led to higher average Medicaid expenditures.
Within Study Research Design
To implement the within study comparison, we created 21 different subsets of the original RCT data.
The first 3 subsets consist of the state-specific RCT treatment and control groups; sample sizes and basic
descriptive statistics for these data are in Table 1. Only posttest expenditure data are included in the
RCT subsets.
The next 9 subsets represent state-specific posttest RDDs based on age cutoffs of 35, 50, and 70. No
pretest expenditure data are involved here. To create these posttest RDD subsets, we removed from the
experimental data all of the treatment group members younger than the relevant age cutoff, and also all
of the control group members at least as old as the cutoff. Table 2 shows the resulting sample sizes for
the 9 posttest RDD subsets. Notice that the number of observations below the cutoff increases with the
cutoff age. At 35, there are many more observations above the cutoff than below; at age 50,
observations are somewhat more balanced; and at age 70 there are actually more observations below
the cutoff than above it in New Jersey and Florida but not Arkansas. In addition to creating variation in
the number of observations on each side of the cutoff, the different cut-offs determine how much
extrapolation is required to compute average effects for everyone above the particular age cutoff under
analysis. For example, when the cutoff is set at age 35, estimating the average effect among everyone
older than 35 would require an extrapolation from age 36 to 90.
Next, we used Medicaid Expenditures over the year prior to randomization to create 9 pretest RDD data
subsets based on the same cutoff values and states. With the pretest and posttest RDD subsets in hand,
we created a “long-form” dataset by stacking the pretest and posttest RDD data, and defined an
indicator variable to identify which observations were from which time period. Stacking the data in this
way results in twice as many observations as in the posttest RDD because each participant is now observed twice.
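The subsetting and stacking steps can be sketched as follows. The data are hypothetical, but the rule mirrors the one described above: keep treated cases at or above the cutoff and control cases below it, then stack pretest and posttest observations with a period indicator:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical RCT data: age (the assignment variable) and a randomized
# treatment indicator.
n = 2000
age = rng.integers(18, 91, n)
treat = rng.integers(0, 2, n).astype(bool)

def make_posttest_rdd_subset(age, treat, cutoff):
    """Keep control cases below the cutoff and treated cases at or above it,
    so treatment status becomes a deterministic function of age."""
    keep = (treat & (age >= cutoff)) | (~treat & (age < cutoff))
    return age[keep], treat[keep]

age_rdd, treat_rdd = make_posttest_rdd_subset(age, treat, cutoff=50)

# Long-form pretest RDD data: stack pretest and posttest observations and
# add an indicator for the time period (pretest outcomes are all untreated).
y_pre = rng.normal(6000.0, 2000.0, age_rdd.size)   # hypothetical pretest spending
y_post = rng.normal(7000.0, 2000.0, age_rdd.size)  # hypothetical posttest spending
age_long = np.concatenate([age_rdd, age_rdd])
y_long = np.concatenate([y_pre, y_post])
post = np.repeat([False, True], age_rdd.size)
```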
These subsetting procedures resulted in three research designs (an RCT, a posttest RDD, and a pretest
RDD). The posttest RDD and pretest RDD were replicated with three different age cutoff values (35, 50
and 70). And all of the designs were replicated for the three states: Arkansas, New Jersey, and Florida.
The core task of our study is to construct estimates of the same causal parameters using each of these
subsets/research designs. Interpreting the estimates from the RCT data as a benchmark provides a way
to measure the performance of the posttest and pretest RDD estimates.
Methods
Implementing the within study comparison requires i) defining treatment effects of interest, ii)
specifying estimators for each effect in each design, and iii) developing measures of performance that
can be used to judge the strengths and weaknesses of each design. In this section, we describe our
approach to each of these tasks in sequence.
Parameters of Interest
To understand the treatment effects of interest in our analysis, start by letting $i$ index individuals and $t \in \{0, 1\}$ index the pretest and posttest time periods. $A_i$ is a person's (time-invariant) age at baseline, and $t = 0$ identifies observations made during the pretest time period. We adopt a potential outcomes framework in which $Y_{it}(1)$ denotes the $i$th person's treated outcome at time $t$, and $Y_{it}(0)$ denotes the person's untreated outcome at time $t$. $D_{it} = 1$ if the person has received the treatment at time $t$ and $D_{it} = 0$ if the person has not received the treatment at time $t$. The person's realized outcome at time $t$ is $Y_{it} = D_{it} Y_{it}(1) + (1 - D_{it}) Y_{it}(0)$. In the Cash and Counseling data, a person is treated if she has the option to control her own Medicaid financed home care budget, and no one received the treatment in the pretest time period. This means that $D_{i0} = 0$ for all $i$. The outcome variable $Y_{it}$ represents the person's Medicaid Expenditures in period $t$.
To estimate treatment effects at the conventional RDD cutoff and also beyond it, we define treatment effects conditional on specific ages and age ranges. To this end, write the average treatment effect in the post-treatment time period for people who are, say, 70 years old as: $ATE(70) = E[Y_{i1}(1) - Y_{i1}(0) \mid A_i = 70]$. Suppose that the cutoff value in a regression discontinuity design is set at age 50. Then, in our notation, $ATE(50)$ is the average treatment effect in the cutoff sub-population for that particular RDD. In a conventional RDD, inference is limited to the average treatment effect at the cutoff. But it is also useful to describe average treatment effects across a range of ages by weighting the age-specific treatment effects by the relative frequency distribution of ages in the age group of interest. For example, when the cutoff is set at age 50 and the maximum age in the study population is $A_{max}$, the average treatment effect across all people above the RDD cutoff is:

$ATE_{50+} = \sum_{a=50}^{A_{max}} w_a \, ATE(a)$,

where $w_a$ is the relative frequency of age $a$ among people at or above the cutoff. In a sharp RDD, $ATE_{50+}$ represents the average treatment effect above the cutoff, which might also be called the average treatment effect on the treated (ATT): $ATT = ATE_{50+}$. Where possible we will compare estimates of the average treatment effect at the RDD cutoff and the average treatment effect among all observations above the cutoff. Note too that each of these treatment effect parameters is defined separately for the three states in the analysis: state identifiers are suppressed to reduce notational clutter.
Estimation
To estimate the quantities of interest described above, we used regression methods that account for
unknown functional forms either with kernel weighting or a polynomial series in the age variable. These
are the two most common methods used in the modern RDD literature and so our work is consistent
with existing best practices. That said, the implementation of flexible models based on kernel weighting
or polynomial approximations meant that we could not specify a single polynomial model or a single
bandwidth for all the designs and states in the analysis. Instead, we specified a method of selecting
polynomial specifications and bandwidth parameters that was applied uniformly across the designs. In
what follows, we describe the general approach to estimation employed with the RCT, posttest RDD,
and pretest RDD. Then we explain the model selection algorithm used to guide our choice of smoothing
parameters like bandwidths and polynomial series lengths. The exact bandwidths and polynomial
specifications employed in the analysis are reported in the Appendix.
Estimation in the RCT For the three state-specific benchmark data sets, we estimated age specific
treatment effects using two methods. First, we estimated local linear regressions of Medicaid
Expenditures on age separately for the treatment and control groups. Then we computed age-specific
treatment effects as point-wise differences in treatment and control regression functions for each age.
To calculate average treatment effects above the cutoff we weighted these age-specific differences
according to the relative frequency distribution of ages in the full state sample.
Since many applied researchers prefer to work with flexible polynomial specifications rather than kernel
based regressions, we also estimated OLS regressions of Medicaid Expenditures on a polynomial series
in age, a treatment group indicator, and interactions between the polynomial series and the treatment indicator for each
state. Treatment effect estimates were computed using the coefficients on the treatment indicator and
the appropriate interaction terms. Average treatment effects above the cutoff were taken as weighted
averages of age-specific differences with weights equal to the relative frequency of each age in the state
sample.
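The polynomial approach can be sketched on simulated RCT data. All coefficients here are invented, not estimates from the study; the snippet shows how age-specific effects are read off the treatment-interaction terms and then frequency-weighted:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated RCT (stand-in for one state's data) with an age-varying effect.
n = 4000
age = rng.uniform(18, 90, n)
treat = rng.integers(0, 2, n).astype(bool)
true_effect = 1000.0 + 20.0 * age              # invented: effect rises with age
y = 100.0 * age + true_effect * treat + rng.normal(0, 500.0, n)

K = 2  # order of the polynomial series

def poly_design(age, treat):
    """Polynomial in age, plus interactions of the polynomial with treatment."""
    a = np.column_stack([age**k for k in range(K + 1)])  # includes the intercept
    return np.hstack([a, treat[:, None] * a])

beta, *_ = np.linalg.lstsq(poly_design(age, treat), y, rcond=None)

def effect_at(a):
    """Age-specific effect from the treatment-interaction coefficients."""
    return sum(beta[K + 1 + k] * a**k for k in range(K + 1))

# Average effect, weighting each age by its relative frequency in the sample.
ate = effect_at(age).mean()
```

Averaging `effect_at` over the observed ages implements the frequency weighting described above, since each age contributes in proportion to how often it appears in the sample.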
Estimation in the Posttest RDD We also estimated treatment effects in the posttest RDDs using two
methods. First we estimated treatment effects at the cutoff using local linear regressions applied
separately to the data from above and below the cutoff in each state. Treatment effects at the cutoff
were calculated using the difference in estimates of mean Medicaid expenditures at the cutoff.
In the second method, we pooled data from above and below the cutoff and estimated OLS regressions
of Medicaid Expenditures on a polynomial in age, a dummy variable set to 1 for observations above the
cutoff, and interactions between the age polynomial series and the above the cutoff dummy variable. In
these posttest RDD analyses we computed treatment effects only at the cutoff. We did not make
extrapolations based on the functional form implied by the polynomial regression coefficients because
this is almost never done in practice due to the tendency of polynomial series estimates to have good
within-sample fits but very poor out of sample properties.
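The pooled polynomial regression for the posttest RDD can be sketched as follows on simulated data. Centering age at the cutoff is a common convenience (an assumption of this sketch, not a detail stated in the text): it makes the above-cutoff dummy coefficient equal to the treatment effect at the cutoff:

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated sharp posttest RDD: cutoff at age 50, constant effect of 2000.
n = 3000
age = rng.uniform(18, 90, n)
above = age >= 50.0
y = 100.0 * age + 2000.0 * above + rng.normal(0, 500.0, n)

K = 2
ac = age - 50.0  # center age at the cutoff
X = np.column_stack([ac**k for k in range(K + 1)]
                    + [above * ac**k for k in range(K + 1)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# With centered age, the above-cutoff dummy coefficient is the effect at the cutoff.
effect_at_cutoff = beta[K + 1]
```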
Estimation in the Pretest RDD The pretest RDD combines both the pretest and posttest RDD data. The
key idea underlying this design is that information about the relationship between the assignment
variable and the pretest outcomes can be used to produce more trustworthy extrapolations beyond the
assignment cutoff in the posttest time period. Putting the idea into practice requires a model of the
untreated outcome variable in the pretest and posttest periods that can account for simple non-equivalencies between the two periods. We consider models in which the pretest and posttest untreated outcome regression functions are parallel, in the sense that they differ by a constant across all ages. We work with the following model for the untreated potential outcomes:

$Y_{it}(0) = \delta T_{it} + g(A_i) + \epsilon_{it}$

where $T_{it} = 1$ for posttest observations and $T_{it} = 0$ for pretest observations. In this model, $\delta$ represents the fixed difference in conditional mean outcomes across the pre and post-test periods, and $g(\cdot)$ is an unknown smooth function that is constant across the two periods. We assume that $E[\epsilon_{it} \mid A_i, t] = 0$. One approach to estimating this model involves approximating the unknown smooth function using a polynomial series. For instance, one might specify a Kth order polynomial series and estimate model parameters using OLS regression. Then the specification for a chosen K is:

$Y_{it}(0) = \delta T_{it} + \sum_{k=0}^{K} \beta_k A_i^k + \epsilon_{it}$

The equation can be estimated by applying OLS to all of the untreated cases in the sample. The key point is that the untreated sample includes pretest Medicaid expenditures from the full range of ages and also the posttest Medicaid expenditures of people under the design's age cutoff. In this setting, $\hat{\delta} + \sum_{k=0}^{K} \hat{\beta}_k a^k$ represents an estimate of $E[Y_{i1}(0) \mid A_i = a]$. The idea here is that extrapolations beyond the cutoff are now made with what might be called "partial" empirical support. Rather than extrapolate outside the range of the data, extrapolations are made to the posttest outcomes on the support of the pretest data and under the maintained assumption that the estimate of $\delta$ is sufficient to
account for any between period non-equivalence. This method provides estimates of the untreated
outcome function. To form estimates of treatment effects, we still need estimates of the treated
outcome function. An obvious strategy is to estimate polynomial regressions of expenditures on age
using the posttest data from sample members who are above the age cutoff. Then treatment effects can
be computed at the cutoff using differences between the fitted values of the treated and untreated outcome models. Average treatment effects among all
observations above the cutoff can be formed by computing age-specific treatment effects for each age
above the cutoff and then forming a weighted average of these differences based on the relative
frequency of the ages above the cutoff.
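To make the mechanics concrete, here is a minimal sketch of the pooled polynomial approach just described; the data layout (arrays `age`, `spend`, `post`) and all function names are illustrative assumptions, not the authors’ actual code:

```python
import numpy as np

def fit_untreated(age, spend, post, K):
    """OLS fit of spend = delta*post + sum_k beta_k * age^k on untreated cases
    (pretest cases from all ages plus posttest cases below the cutoff)."""
    X = np.column_stack([post] + [age ** k for k in range(K + 1)])
    coef, *_ = np.linalg.lstsq(X, spend, rcond=None)
    return coef[0], coef[1:]  # delta, beta

def predict_untreated_posttest(age, delta, beta):
    """Extrapolated untreated posttest mean: delta + sum_k beta_k * age^k."""
    powers = np.column_stack([age ** k for k in range(len(beta))])
    return delta + powers @ beta

def att_above_cutoff(treated_age, treated_spend, delta, beta, K):
    """Frequency-weighted average of age-specific effects above the cutoff."""
    # Treated outcome model: polynomial OLS on posttest data above the cutoff.
    Xt = np.column_stack([treated_age ** k for k in range(K + 1)])
    bt, *_ = np.linalg.lstsq(Xt, treated_spend, rcond=None)
    ages, counts = np.unique(treated_age, return_counts=True)
    effects = []
    for a in ages:
        treated_fit = sum(bt[k] * a ** k for k in range(K + 1))
        untreated_fit = predict_untreated_posttest(np.array([a]), delta, beta)[0]
        effects.append(treated_fit - untreated_fit)
    # Weight age-specific effects by the relative frequency of each age.
    return np.average(effects, weights=counts)
```

In this sketch the same pooled regression yields both the period effect and the smooth age function, so the extrapolation beyond the cutoff inherits the “partial” empirical support described above.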
We also implemented the pretest RDD model by estimating a version of Robinson’s (1988) partial linear
regression model. The model exploits the same assumptions that g is a smooth function and that
E[ε_it | A_i, Post_t] = 0, but it also requires a support condition so that the pretest indicator in the
parametric component of the model is not a deterministic function of the assignment variable. Formally,
the requirement is that 0 < P(Post_t = 1 | A_i) < 1 on the estimation sample. The support condition fails by definition in the full sample
because there are no untreated RDD observations above the cutoff. Our solution is to estimate the
parametric period effect using only data that fall on the common support of the different time periods.
In practice this means that we estimate the model using only the below the cutoff data in the two time
periods. Then, with estimates of δ in hand, we estimate the non-parametric component using the full
sample of observations both above and below the cutoff value. We calculated treatment effects at the
cutoff using differences in the predicted values from local linear regressions among treated observations
from the posttest time period and predicted values from the partially linear model. And we constructed
average treatment effects above the cutoff by taking age-specific differences between the predicted
values from the two models and weighting them by the relative frequency of each age in each state
sample.
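The two-step logic can be sketched as follows, simplified by treating age as discrete so conditional means can be taken within age cells (the paper uses local linear regressions instead); all names are illustrative assumptions:

```python
import numpy as np

def estimate_delta(age, y, post):
    """Step 1 (Robinson, 1988 style): partial out E[.|age] within age cells,
    then OLS of residualized y on residualized post. Call this with only
    the common-support (below-the-cutoff) data from both periods."""
    y_res = np.empty_like(y, dtype=float)
    p_res = np.empty_like(y, dtype=float)
    for a in np.unique(age):
        m = age == a
        y_res[m] = y[m] - y[m].mean()
        p_res[m] = post[m] - post[m].mean()
    return float(p_res @ y_res) / float(p_res @ p_res)

def estimate_g(age, y, post, delta):
    """Step 2: remove the estimated period effect, then estimate the smooth
    function by age-cell means over the full untreated sample (a stand-in
    for the local linear regressions used in the paper)."""
    adj = y - delta * post
    return {a: adj[age == a].mean() for a in np.unique(age)}
```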
Procedures for Choosing Smoothing Parameters Each of the methods described above depends on
assumptions about the degree of smoothing across different age groups. The local linear regressions
depend on a kernel function and a bandwidth parameter, and the OLS polynomial series regressions
require specifying an appropriate polynomial function. The goal of these flexible modeling approaches is
to allow the data to determine the specification rather than to impose a specification that is based on
theoretical reasoning alone. However, there is always some arbitrariness associated with selecting these
smoothing parameters, and it seems wise to define a model selection protocol in advance of the data
analysis.
To this end, we selected bandwidth parameters by using least squares cross-validation to evaluate a grid
of candidate bandwidths ranging from 1 year to 90 years in width. We then inspected the function
produced by using the bandwidth that minimized the cross-validation statistic4. When visual inspection
revealed that the bandwidth chosen by the cross validation exercise led to a function that appeared very
under-smoothed, we increased the bandwidth slightly to achieve a more regular function. Details about
the selected bandwidth for each of the research designs in the paper are reported in the appendix.
To choose a polynomial functional form, we used least squares cross-validation to evaluate a set of
candidate models that included linear, quadratic, cubic, and quartic polynomial functions, and also
models that fully interacted the polynomial terms with a treatment group indicator variable. In the
within study comparisons, we always worked with the polynomial models that minimized the cross-
validation function. That is, we conducted the cross validation for each of the candidate specifications
and chose the specification that produced the smallest mean square out of sample prediction errors.
The specific functional forms that were used in each part of the within study comparison are reported in
the appendix.
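The polynomial selection step might be sketched as below, assuming leave-one-out least squares cross-validation over candidate degrees one through four; function names are illustrative:

```python
import numpy as np

def loo_cv_score(x, y, degree):
    """Mean squared out-of-sample prediction error for a polynomial of the
    given degree, predicting each observation when it is left out."""
    n = len(x)
    errs = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        coef = np.polyfit(x[mask], y[mask], degree)
        errs[i] = (y[i] - np.polyval(coef, x[i])) ** 2
    return errs.mean()

def choose_degree(x, y, candidates=(1, 2, 3, 4)):
    """Return the candidate degree that minimizes the cross-validation score."""
    scores = {d: loo_cv_score(x, y, d) for d in candidates}
    return min(scores, key=scores.get)
```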
Estimating Standard Errors To ensure comparability across the different designs, we used a non-
parametric bootstrap to estimate standard errors for all treatment effect estimates. We always used
500 bootstrap replications. Point estimates were re-calculated for each replicate, and the standard
deviation of the point estimates across the 500 replicates was used as the bootstrap estimate of the
standard error of the estimate. Bandwidths, polynomial functional forms, and relative frequency
weights for computing above the cutoff averages were fixed across bootstrap replicates. In the pretest
RDD designs we resampled individual participants rather than individual observations in order to
account for within-person dependencies in the error structure.
4 The cross validation statistic we worked with is the standard mean squared out of sample prediction error formed by predicting the value of each observation when it is left out of the estimation.
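The person-level (clustered) bootstrap can be sketched as follows; the `estimator` callback and the flat data layout are assumptions made for illustration:

```python
import numpy as np

def bootstrap_se(person_ids, estimator, data, reps=500, seed=12345):
    """Resample whole persons with replacement, re-estimate on each replicate,
    and return the standard deviation of the point estimates."""
    rng = np.random.default_rng(seed)
    ids = np.unique(person_ids)
    estimates = np.empty(reps)
    for r in range(reps):
        drawn = rng.choice(ids, size=len(ids), replace=True)
        # Keep every row (pretest and posttest) for each drawn person so that
        # within-person dependence is preserved in the replicate sample.
        rows = np.concatenate([np.flatnonzero(person_ids == pid) for pid in drawn])
        estimates[r] = estimator(data[rows])
    return estimates.std(ddof=1)
```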
Measuring Performance
Comparing point estimates and standard errors from the different designs is one approach to measuring
performance. We also examined two other measures of the performance of the different estimators
that clarify particular strengths and weaknesses. First, we considered a measure of the standardized bias
of the quasi-experimental point estimate. Let θ̂_Q be the point estimate of a given parameter produced
by quasi-experimental estimator Q. And let θ̂_RCT be the point estimate of the same parameter
produced by the RCT. Finally, let σ_RCT be the standard deviation of posttest Medicaid expenditures
observed in the RCT. The standardized bias measure that we worked with is
B = (θ̂_Q − θ̂_RCT) / σ_RCT.
Essentially, B measures the magnitude of the bias in a particular quasi-experimental estimate in
standard deviation units. We computed this measure of standardized bias for each parameter estimated
with each cutoff age, state and research design. The metric offers a uniform account of the size of the
bias across different causal parameters and different research designs.
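As a minimal illustration of the measure just defined (the function name is ours):

```python
def standardized_bias(est_qe, est_rct, sd_rct):
    """Bias of a quasi-experimental estimate, expressed in RCT standard
    deviation units of posttest expenditures."""
    return (est_qe - est_rct) / sd_rct
```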
The second measure of performance combines bias and variance estimates in a mean squared error
framework. To compute the mean square error statistic, we centered the point estimates from each
bootstrap replicate around the experimental benchmark. Then we squared these deviations and
computed the average of the squared deviations across the 500 bootstrap replicates. Formally, the
statistic we work with is MSE(θ̂_Q) = (1/500)·Σ_{b=1}^{500} (θ̂_Q^(b) − θ̂_RCT)², where b indexes the bootstrap
replicates. To keep the scale of the statistic interpretable in dollar terms, we report the square root of
the MSE statistics in the tables of results. The advantage of this MSE measure is that it incorporates
information about the performance of the design in terms of both point bias and statistical precision.
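A sketch of the computation, assuming the bootstrap point estimates are already in hand:

```python
import numpy as np

def rmse_vs_benchmark(bootstrap_estimates, rct_benchmark):
    """Center bootstrap point estimates on the RCT benchmark, square, average,
    and take the square root so the result is in dollar terms."""
    dev = np.asarray(bootstrap_estimates, dtype=float) - rct_benchmark
    return float(np.sqrt(np.mean(dev ** 2)))
```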
Results
RCT Benchmarks Figures 1–3 plot estimates of average Medicaid expenditures in the treatment and
control groups from each state across all ages. The parametric estimates are from OLS regressions and
the non-parametric estimates are from local linear regressions estimated separately for each treatment
and control group. The polynomial specifications and bandwidths were chosen using the procedure
described earlier in the paper. In each state, the two estimation approaches yield very similar results
even though the state-specific functional forms vary -- the expenditure-age function is linear in Florida,
highly non-linear in New Jersey, and somewhat non-linear in Arkansas.
Figure 4 shows local linear regression estimates of pretest and posttest Medicaid expenditures in each
state’s control group. The expenditure-age relationship is quite similar at each time period. More
specifically, the estimated functions are “most parallel” in regions with a large number of observations
and are “least parallel” for the youngest age groups where observations are noticeably rarer because
young disabled persons are rare in both the Cash and Counseling Demonstration and in society at large.
The posttest regression is usually higher than the pretest regression. But the key assumption in a
pretest RDD is that the difference between the pretest and posttest expenditure-age regression
functions is well described by a simple intercept shift and nothing more. So simple visual inspection of
the RCT control group data in Figure 4 indicates it is reasonable to assume that the pretest and posttest
untreated outcomes have similar functional forms along nearly all of the assignment variable
distribution.
Table 4 reports state RCT estimates of treatment effects at and above the three age cutoffs. The
estimates based on parametric and non-parametric regressions are very similar in all three states at
each cutoff age. There is only one discrepancy. Figure 2 shows that, in New Jersey at age 70, average
spending in the control group is always lower in the nonparametric regressions but is not so in the
polynomial model. But this one discrepancy aside, it is otherwise clear that the causal results are not
much affected by the estimation method.
Table 4 also shows that the age-specific RCT treatment effects have quite large standard errors. This is
to be expected since the RCT was not designed to estimate treatment effects in narrow one-year age
brackets around a cutoff. Fortunately, standard errors are much smaller for average effects across
all participants older than the ages we use as cutoffs in the RDD analysis, especially for the younger age
cutoffs that inevitably entail more cases on the treated side of the assignment variable. On average,
standard errors above the cutoff are about one-third of those at the cutoff.
Posttest RDD. Table 5 reports results from the within study comparison of the RCT and posttest RDD
estimates at each age cutoff5. The standard errors for the cut-off treatment effect estimates are about
three times larger for the posttest RDD than the RCT. As for bias, the results in table 5 show that
performance of the RDD estimators varies by age cutoff. At 35, both the parametric and nonparametric
analyses produce similar point estimates. But both are biased relative to the RCT benchmarks. For
example, the local linear regression estimates are biased by .45 standard deviations in Arkansas, by .20
SDs in New Jersey, though by only .06 in Florida. The root mean square error statistics (RMSE) also show
that the RDD did not perform very well with the cutoff at age 35. At the older cutoffs the balance of cases
on each side of the cutoff is better, and the posttest RDD performed noticeably better. Local linear
regressions with the age 50 cutoff resulted in bias of .03 SDs in New Jersey, .06 in Florida and .32 in
Arkansas. At age 70, the bias estimates from the local linear regressions were .06 SDs in Arkansas, .01 in
New Jersey, and .08 in Florida. Considering that the RCT estimates are noisily estimated, the RCT and
posttest RDD estimates for the two older age cutoffs are strikingly similar at the cutoff. These results
provide another demonstration of the lack of bias in standard RDD, though only at one point -- the
cutoff (Cook et al., 2008; Green et al., 2009; Shadish et al., 2011).
5 Note that we do not present any estimates of the average treatment effect across all ages above the cutoff for the posttest RDD analysis because extrapolation beyond the cut-off is not possible with non-parametric methods and is not usually considered trustworthy with polynomial models either. This lack of generalizability is – of course – a key weakness of the standard posttest RDD.
Pretest RDD The pretest RDD allows us to estimate the average treatment effect not just at the cutoff
but also above it. Estimates of treatment effects at the cutoff are in Table 6 and can be directly
compared to the posttest RDD results in Table 5. At the imbalanced age 35 cutoff where the posttest
RDD fared poorly, the pretest RDD shows bias estimates of .01 SDs in Arkansas and .15 in both New
Jersey and Florida, which compares quite favorably to the .45, .20 and .06 standardized biases in the
posttest RDD analysis. With cutoffs at ages 50 and 70, all the pretest RDD estimates of bias relative to
the RCT were below .10 SDs except for Arkansas at age 50. So in terms of bias at the cutoff, the pretest
RDD does as well as the posttest RDD when cases are balanced and somewhat better when they are not.
The RMSE statistic penalizes an estimator for both bias and high sampling variability. Using it provides
clear and consistent support (see Table 5) for the superiority of the pretest RDD over the posttest RDD
at the cutoff. However, as the same table illustrates, neither set of RMSE statistics is as impressive as the
corresponding RMSE statistics in the benchmark RCT.
Crucial to our causal generalization goal is estimating the average treatment effect among all those
scoring above the assignment cutoff in the RDD. Table 7 presents the relevant results and shows that the
pretest RDD estimators performed well. The flexible partially linear model performed best. Across all
three age cutoffs, its estimates were biased by less than .10 SDs in Arkansas and Florida. But in New
Jersey there is a complicated treatment by age interaction (see Figure 2) and standardized bias was .19
at age 35, .13 at 50, and .12 at 70. With respect to biases in estimates of the average treatment effect
among people above the cut-off, the pretest RDD does very well in Arkansas and Florida, where the bias
never reaches even .10 SDs. It also does quite well in New Jersey, where the bias is still less than .20 SDs.
Evaluation of the total impact of extrapolating causal connections also requires consideration of the
standard errors for the above-the-cutoff average effects. They are in Table 7 and are higher in the
pretest RDD than the RCT benchmark. For the partially linear model, the RMSE estimates are about 3 to
5 times higher. However, this difference decreases as the length of the required extrapolation
decreases. Extrapolation is greatest at age 35, and then the RMSE is from 3.5 to 5.6 times higher in the
pretest RDD depending on the state. But for the age 70 cutoff, where the required extrapolation is least,
the RMSE is only 1.5 to 3.1 times higher in the pretest RDD compared to the RCT. Not surprisingly
perhaps, smaller extrapolations entail lower variance. Even so, the best pretest RDD is not as good as
the RCT when both bias and sampling error are considered together.
Conclusions
This paper provides further empirical evidence that the standard posttest RDD is a dependable way of
estimating causal effects for the narrow subpopulation of cases immediately around the cutoff. In a
series of nine within study comparisons based on three states and three age cutoff values, the posttest
RDD provided treatment effect estimates at the cutoff that were generally close to those from the RCT
at this same point. The degree of correspondence improved to near perfect (considering the inevitability
of some sampling variability in the RCT) with older age cutoffs where the number of cases was more
balanced across the cutoff and where the functional form of the underlying regression was more linear
in the region of the cutoff. This empirical comparability between RCT and RDD estimates at the cutoff is
consistent with the results of other within study comparisons (Cook & Wong, 2008; Green et al, 2009;
Shadish et al., 2011).
Despite the strong performance of the posttest RDD, the paper also indicates that adding a pretest RDD
function ameliorates key weaknesses of the standard RDD. Consider statistical power first. Standard
errors of estimated treatment effects were consistently lower when pretest data were incorporated into
RDD analyses. So were RMSE statistics based on joint changes in bias and variance. This result is not
surprising in light of the gain in sample size due to supplementing the posttest data with pretest data.
Consider next confidence in assumptions about functional forms. Causal inference in the standard
posttest RDD depends on how well functional form assumptions are met. Modern non-parametric and
semi-parametric methods mitigate this problem, but they only really solve the problem when studies
have atypically large sample sizes. The present study showed that a pretest RDD offers the opportunity
to observe the untreated regression function in an earlier time period and to compare it to the posttest
regression function along the untreated part of the assignment variable – see Figure 4. Strong
correspondences between the functional forms suggest that pretest information can be used as a partial
validation of what the untreated potential outcomes regression function would have been in the treated
part of the assignment variable had there been no treatment – the missing counterfactual that bedevils
all posttest RDD studies. In the data we presented, the pretest and posttest functional forms were very
similar in the absence of treatment. Had they not been, we would probably not have proceeded with
the analysis. Use of a pretest RDD does not guarantee knowledge of the missing untreated outcomes on
the treated side of the cutoff. A case has to be made for each new application after examining the
untreated functional forms; the encouraging correspondences shown here do not constitute evidence
that pretest RDDs will work in all situations. Even functional forms that are comparable in the untreated
pretest and posttest data do not guarantee they would have continued to be similar in the treated part
of the assignment variable. But without the pretest RDD the analyst would have no evidence at all by
which to judge the suitability of functional form estimates. The pretest provides strong evidence about
comparability of functional forms along at least part of the assignment variable.
The third limitation of the posttest RDD is that treatment effect estimates are limited to the narrow and
often irrelevant sub-population immediately around the cutoff. Our results regarding
extrapolation/generalization are a novel contribution to the literature on quasi-experimental methods.
We used a simple set of tools to make extrapolations that were informed by the pretest functional form
estimates. These extrapolations performed well across different states and different age cutoff values.
In particular, when we used the pretest RDD to estimate the ATT parameter along the entire assignment
variable above the cutoff, the standardized bias was very low. That
is, it was successful in generalizing treatment effects to the entire sub-population above the cutoff
rather than just to those at the cutoff -- a task that the conventional posttest RDD cannot accomplish
credibly. So the pretest RDD can help researchers make progress on the difficult problem of generalizing
beyond the sub-population of persons immediately around the cutoff. This has to be done with care, of
course, and success depends on obtaining comparable functional forms in the untreated part of the
assignment variable.
Our results indicate that the standard posttest RDD can be improved by combining it with pretest
outcome data on the same persons who provide posttest outcome data. In RDD work, design elements
(Corrin & Cook, 1998) other than this pretest may also improve functional form estimation, statistical power, and
causal generalization. One possibility involves using repeated cross-sectional samples rather than
longitudinal data, as in Lohr’s study of how the introduction of Medicaid affected the number of doctor
visits (reported in Cook & Campbell, 1979). In that study, household income was the assignment
variable, an income threshold adjusted for family size was the cutoff, and the number of doctor visits in
the year after the introduction of Medicaid was the outcome variable. The supplemental RDD design
element was doctor visits the year before the introduction of Medicaid for a nationally representative
sample of families, independent of the next year’s posttest sample. Thus, there were two national cohorts
instead of longitudinal pretest and posttest data on the same persons, as in the application examined
here. Lohr demonstrated that functional forms were very similar in the untreated part of the assignment
variable at both pretest and posttest, suggesting that this intact cohort-based design supplement would
also perform well beyond the cutoff.
Other RDD supplements that could be considered include contemporaneous but non-equivalent
comparison groups in which the treatment is not offered to some persons. These might come from a
different geographical area or from an institution where the treatment was not available. Depending on the
context, one can imagine supplementing an RDD with data from another city, state, school, or
workplace; perhaps matching on pre-treatment covariates to make the regression functions more
comparable. In the early stages of this study, we explored using non-equivalent comparison groups and
constructed them by pairing the RCT control groups from one state with the non-equivalent comparison
group for another state. For instance, we explored supplementing the standard posttest RDD in
Arkansas with the untreated control group data from New Jersey. We soon decided this strategy was
not viable because, as Figure 4 makes abundantly clear, the regression functions are very different
across the three states. The aim is to achieve RDD supplements that are especially likely to be valid
because they manifest functional forms in the untreated part of the assignment variable that are similar
to those found in the posttest data.
Another kind of supplement to the basic RDD is what Cook and Campbell (1979) have called “non-
equivalent dependent variables” – variables that should be affected by the most plausible alternative
interpretations operating at the cutoff but that are not related to treatment. It is now standard in RDD
analyses to examine variables other than the outcome to show that they do not change at the cutoff.
But at issue with “non-equivalent dependent variables” is the requirement that they should change at
the cutoff if a specific alternative interpretation holds at that cutoff. An example of this comes from
Ludwig and Miller (2007) who showed that spending on other poverty programs did not differentially
occur at the 300th poorest county cutoff – the cutoff for counties to get help in writing their grant
applications for Head Start funds.
Our analysis shows that the standard RDD is a strong design but that a supplement to it can increase (i)
statistical power, (ii) confidence in functional form assumptions and (iii) causal generalization away from
the cutoff value. The application we presented required an RCT in order to demonstrate such unbiased
generalization, and no such RCT will be available to RDD researchers using some kind of a design
supplement. But even so, they will be able to empirically examine how similar untreated functional
forms are below the cutoff. If they correspond closely, a shot at causal generalization is warranted; if
they do not, causal generalization entails considerably greater risk.
Table 1: Descriptive statistics for the variables and samples forming the within-study comparisons.
Arkansas Florida New Jersey
Variable Control Treatment Control Treatment Control Treatment
Pretest Medicaid Expenditures $6,358 $6,439 $14,300 $14,377 $18,779 $18,215
Posttest Medicaid Expenditures $7,583 $9,443 $18,088 $19,944 $20,100 $21,299
Mean Age 70 70 55 55 62 63
N 1,004 1,004 906 907 869 861
Table 2: Sample Sizes in Nine Constructed Posttest-Only Regression Discontinuity Designs
State Age Cutoff Below The Cutoff Above the Cutoff Total
Arkansas 35 59 944 1003
Florida 35 296 609 905
New Jersey 35 106 770 876
Arkansas 50 143 868 1011
Florida 50 417 496 913
New Jersey 50 224 650 874
Arkansas 70 361 623 984
Florida 70 555 359 914
New Jersey 70 491 387 878
Table 3: Sample Sizes in Nine Constructed Pretest Regression Discontinuity Designs
State Age Cutoff Below The Cutoff Above the Cutoff Total
Arkansas 35 118 1888 2006
Florida 35 592 1218 1810
New Jersey 35 212 1540 1752
Arkansas 50 286 1736 2022
Florida 50 834 992 1826
New Jersey 50 448 1300 1748
Arkansas 70 722 1246 1968
Florida 70 1110 718 1828
New Jersey 70 982 774 1756
Table 4: Estimated Treatment Effects At and Above the Cutoff Value In The RCT (Benchmark) Data
State Model Estimation Cutoff | At the Cutoff: Point Estimate, Bootstrap Bias, SE, RMSE | Above the Cutoff: Point Estimate, Bootstrap Bias, SE, RMSE
Arkansas Benchmark Parametric 35 | 2980 16 1334 1334 | 1703 14 258 258
New Jersey Benchmark Parametric 35 | 3202 -69 1467 1469 | 622 30 717 718
Florida Benchmark Parametric 35 | 3329 35 1590 1590 | 788 29 728 729
Arkansas Benchmark LLR 35 | 2738 39 1330 1331 | 1772 1 256 256
New Jersey Benchmark LLR 35 | 3529 48 1594 1594 | 755 -3 724 724
Florida Benchmark LLR 35 | 3456 3 1139 1139 | 741 -17 562 562
Arkansas Benchmark Parametric 50 | 1467 24 616 616 | 1671 14 250 250
New Jersey Benchmark Parametric 50 | 822 10 1155 1155 | 387 41 726 728
Florida Benchmark Parametric 50 | 2034 32 978 978 | 357 28 795 795
Arkansas Benchmark LLR 50 | 1547 73 884 887 | 1752 -5 239 239
New Jersey Benchmark LLR 50 | 611 13 1235 1235 | 529 -8 727 727
Florida Benchmark LLR 50 | 2191 -9 774 774 | 224 -21 604 604
Arkansas Benchmark Parametric 70 | 1134 21 410 410 | 1872 10 249 249
New Jersey Benchmark Parametric 70 | -84 52 901 902 | 562 43 785 786
Florida Benchmark Parametric 70 | 739 29 732 732 | 0 27 901 901
Arkansas Benchmark LLR 70 | 1294 -1 335 335 | 1934 -17 237 238
New Jersey Benchmark LLR 70 | 466 -20 788 788 | 679 -16 843 843
Florida Benchmark LLR 70 | 625 -20 568 569 | -180 -23 677 677
Table 5: Performance of the Posttest-Only RDD at the Cutoff
State Model Estimation Cutoff | Point Estimate | Bootstrap Bias | SE | Bias in SD Units vs Polynomial Benchmark | Bias in SD Units vs LLR Benchmark | RMSE vs Polynomial Benchmark | RMSE vs LLR Benchmark
Arkansas RDD Parametric 35 6694 60 2583 0.49 0.52 4573 4775
New Jersey RDD Parametric 35 8173 -114 2338 0.24 0.22 5391 5098
Florida RDD Parametric 35 890 103 3715 -0.12 -0.13 4388 4458
Arkansas RDD LLR 35 6121 -161 3563 0.42 0.45 4645 4804
New Jersey RDD LLR 35 7712 158 5010 0.22 0.20 6848 6629
Florida RDD LLR 35 2342 23 4159 -0.05 -0.06 4269 4299
Arkansas RDD Parametric 50 1786 -58 1267 0.04 0.03 1293 1280
New Jersey RDD Parametric 50 -1356 -157 3676 -0.10 -0.09 4355 4245
Florida RDD Parametric 50 6969 79 3359 0.25 0.24 6035 5905
Arkansas RDD LLR 50 3966 3 2300 0.33 0.32 3398 3340
New Jersey RDD LLR 50 -69 -134 3689 -0.04 -0.03 3829 3778
Florida RDD LLR 50 3284 136 4705 0.06 0.06 4905 4863
Arkansas RDD Parametric 70 919 37 894 -0.03 -0.05 911 956
New Jersey RDD Parametric 70 832 -117 2322 0.04 0.02 2455 2335
Florida RDD Parametric 70 -717 -84 1785 -0.07 -0.07 2357 2284
Arkansas RDD LLR 70 830 -4 857 -0.04 -0.06 910 976
New Jersey RDD LLR 70 578 -57 2377 0.03 0.01 2453 2378
Florida RDD LLR 70 2207 -228 2642 0.07 0.08 2918 2968
Table 6: Performance at the Cutoff when the Pretest RDD is added to the Posttest-Only RDD
State Model Estimation Cutoff | Point Estimate | Bootstrap Bias | SE | Bias in SD Units vs Polynomial Benchmark | Bias in SD Units vs LLR Benchmark | RMSE vs Polynomial Benchmark | RMSE vs LLR Benchmark
Arkansas Pretest RDD Parametric 35 2423 -88 2949 -0.07 -0.04 3019 2822
New Jersey Pretest RDD Parametric 35 7583 105 2120 0.21 0.19 4962 2176
Florida Pretest RDD Parametric 35 -969 96 3313 -0.22 -0.22 5351 3407
Arkansas Pretest RDD LLR 35 2825 75 2114 -0.02 0.01 2116 2150
New Jersey Pretest RDD LLR 35 6724 -49 1634 0.17 0.15 3839 1546
Florida Pretest RDD LLR 35 424 152 3226 -0.15 -0.15 4241 3338
Arkansas Pretest RDD Parametric 50 8970 0.3 941 0.99 0.98 7563 869
New Jersey Pretest RDD Parametric 50 3128 -36 2292 0.11 0.12 3226 2243
Florida Pretest RDD Parametric 50 6309 -8 2743 0.22 0.21 5073 2744
Arkansas Pretest RDD LLR 50 5367 -56 2018 0.52 0.51 4341 1889
New Jersey Pretest RDD LLR 50 1678 106 2662 0.04 0.05 2830 2695
Florida Pretest RDD LLR 50 1333 -124 3943 -0.04 -0.04 4028 3746
Arkansas Pretest RDD Parametric 70 1611 5 536 0.06 0.04 721 542
New Jersey Pretest RDD Parametric 70 766 -57 1937 0.04 0.01 2093 1900
Florida Pretest RDD Parametric 70 -2555 -140 1387 -0.17 -0.16 3703 1267
Arkansas Pretest RDD LLR 70 1834 11 521 0.09 0.07 881 532
New Jersey Pretest RDD LLR 70 2187 31 1645 0.11 0.08 2829 1676
Florida Pretest RDD LLR 70 316 -27 2223 -0.02 -0.02 2268 2196
Table 7: Performance above the cutoff when the Pretest RDD is added to the Posttest-Only RDD
State Model Estimation Cutoff | Point Estimate | Bootstrap Bias | SE | Bias in SD Units vs Polynomial Benchmark | Bias in SD Units vs LLR Benchmark | RMSE vs Polynomial Benchmark | RMSE vs LLR Benchmark
Arkansas Pretest RDD Parametric 35 3103 -83 1312 0.19 0.18 1859 1811
New Jersey Pretest RDD Parametric 35 4736 7 1394 0.20 0.19 4351 4225
Florida Pretest RDD Parametric 35 -1369 -31 841 -0.11 -0.11 2344 2301
Arkansas Pretest RDD LLR 35 2271 -9 1096 0.08 0.07 1230 1201
New Jersey Pretest RDD LLR 35 4661 -25 1181 0.19 0.19 4184 4056
Florida Pretest RDD LLR 35 -1110 21 704 -0.10 -0.09 2004 1961
Arkansas Pretest RDD Parametric 50 2508 -13 877 0.11 0.10 1203 1149
New Jersey Pretest RDD Parametric 50 2347 73 1176 0.09 0.09 2349 2227
Florida Pretest RDD Parametric 50 -1446 31 738 -0.09 -0.08 1920 1798
Arkansas Pretest RDD LLR 50 2155 44 709 0.06 0.05 884 838
New Jersey Pretest RDD LLR 50 3224 15 1002 0.14 0.13 3023 2890
Florida Pretest RDD LLR 50 -1360 -24 704 -0.09 -0.08 1878 1755
Arkansas Pretest RDD Parametric 70 2529 18 469 0.09 0.08 822 771
New Jersey Pretest RDD Parametric 70 3540 -5 933 0.14 0.14 3116 3004
Florida Pretest RDD Parametric 70 -1028 -130 891 -0.05 -0.04 1461 1323
Arkansas Pretest RDD LLR 70 2237 5 449 0.05 0.04 582 544
New Jersey Pretest RDD LLR 70 3114 56 886 0.12 0.12 2754 2644
Florida Pretest RDD LLR 70 -970 32 640 -0.05 -0.04 1136 992
Figure 1: Average Posttest Medicaid Expenditures By Age For Treatment and Control Participants In Arkansas (RCT Benchmark)
[Figure panel: Arkansas: Parametric vs Non-parametric. Vertical axis: expenditures, $5,000–$25,000; horizontal axis: age, 20–100.]
Figure 2: Average Posttest Medicaid Expenditures By Age For Treatment And Control Participants In New Jersey (RCT Benchmark)
[Figure panel: New Jersey: Parametric vs Non-parametric. Vertical axis: expenditures, $16,000–$24,000; horizontal axis: age, 20–100.]
Figure 3: Average Posttest Medicaid Expenditures By Age For Treatment And Control Participants In Florida (RCT Benchmark)
[Figure panel: Florida: Parametric vs Non-parametric. Vertical axis: expenditures, $10,000–$30,000; horizontal axis: age, 20–100.]
Figure 4: Local Linear Regression Estimates of Mean Pretest and Posttest Medicaid Expenditures By Age In The Experimental Control Groups
[Figure: three panels (Arkansas, New Jersey, Florida) plotting local linear regression fits of pretest and posttest control-group Medicaid expenditures (y-axis) by age (x-axis).]
Appendix 1: Selection of Bandwidths and Polynomial Series Lengths
Overview
The models we used to compute treatment effects required selection of smoothing parameters. The
local linear regressions and partially linear regressions required a bandwidth parameter that sets the
scale of the Epanechnikov weighting function. And the polynomial series models required a choice of
functional form for the polynomial series used to approximate the unknown underlying function.
For both types of models, we used leave-one-out least squares cross validation to guide our choice of
the smoothing parameters. The participants in the Cash and Counseling Demonstration ranged in age
from 18 to 90+. We used cross validation to evaluate a grid of candidate bandwidths from 1 year to 90
years in 1 year increments. For the polynomial models we considered a grid of possible functions for the
treatment and control data: linear, quadratic, cubic, and quartic. In the posttest RDD we expanded the
grid to include functions that were fully interacted with the cutoff indicator variable.
When choosing a bandwidth for the local linear and partially linear regressions, we often found that
cross validation led to an under-smoothed function: the function appeared excessively jagged when
graphed. In these cases, we increased the bandwidth slightly until the estimated conditional mean
function was smooth. This appendix describes the bandwidths and functional forms used for each model
in our analysis.
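To make the procedure concrete, here is a minimal sketch of leave-one-out cross validation over a bandwidth grid for an Epanechnikov-kernel local linear regression. This is our own illustrative Python, not the authors' code: the function names and the small bandwidth grid are assumptions, and the kernel and smoother follow the textbook definitions.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on |u| < 1, zero outside."""
    return np.where(np.abs(u) < 1.0, 0.75 * (1.0 - u**2), 0.0)

def llr_fit(x, y, x0, h):
    """Local linear regression estimate of E[y | x = x0] with bandwidth h."""
    w = epanechnikov((x - x0) / h)
    if np.count_nonzero(w) < 2:
        return np.nan  # not enough data in the window
    sw = np.sqrt(w)
    X = np.column_stack([np.ones_like(x), x - x0])
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta[0]  # the intercept is the fitted value at x0

def loo_cv_bandwidth(x, y, grid):
    """Choose the bandwidth minimizing leave-one-out squared prediction error."""
    best_h, best_err = None, np.inf
    for h in grid:
        errs = []
        for i in range(len(x)):
            mask = np.arange(len(x)) != i
            pred = llr_fit(x[mask], y[mask], x[i], h)
            if not np.isnan(pred):
                errs.append((y[i] - pred) ** 2)
        if errs and np.mean(errs) < best_err:
            best_h, best_err = h, np.mean(errs)
    return best_h
```

In practice the selected bandwidth would then be inspected visually and enlarged if the fitted function looked jagged, as described above.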
RCT Benchmarks
Bandwidth Selection. To estimate the age-specific and above-the-cutoff average treatment effects using
the benchmark RCTs, we estimated local linear regressions of Medicaid expenditures on age separately
for the treatment and control groups in each state.
In Arkansas, the cross validation exercise implied an optimal bandwidth of 11 years for both the
treatment and control groups. But this bandwidth led to a very under-smoothed function and so we
increased the bandwidth to 20 years, which produced a smooth regression but did not change the basic
shape of the regression very much.
In New Jersey, the cross validation led to bandwidths of 13 years for the control group and 9 years for
the treatment group. These were both under-smoothed. In the analysis we worked with a bandwidth of
25 years for the treatment and control groups in the New Jersey data.
In Florida, the cross validation led to a bandwidth of 90 years for the treatment and control groups. With
such a large bandwidth the functions are almost perfectly linear. To ensure that the functions were not
over smoothed, we experimented with smaller bandwidths. The underlying function remains very linear
even with much smaller bandwidths and so we worked with the bandwidth of 90 years in our analysis.
Polynomial Selection. To choose functional forms for the polynomial series version of the benchmark
model, we used cross validation to select a model from the following set of 8 candidates: i) an intercept
shift for the treatment group and a linear function of age; ii) an intercept shift for the treatment group
and a quadratic function of age; iii) an intercept shift for the treatment group and a cubic function of
age; iv) an intercept shift for the treatment group and a quartic function of age; v) an intercept shift for
the treatment group, a linear function of age, and the interaction of the linear age term with the
treatment indicator; vi) an intercept shift for the treatment group, a quadratic function of age, and the
interactions of the quadratic age terms with the treatment indicator; vii) an intercept shift for the
treatment group, a cubic function of age, and the interactions of the cubic age terms with the treatment
indicator; and viii) an intercept shift for the treatment group, a quartic function of age, and the
interactions of the quartic age terms with the treatment indicator. The cross validation exercise led us to
work with the interacted quadratic models in Arkansas and New Jersey, and the interacted linear model
in Florida.
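The search over the eight candidate models can be sketched as follows, assuming ordinary least squares fits and the closed-form leave-one-out residual formula for linear models. The function names and the internal standardization of age are our own illustrative choices, not the authors' code.

```python
import numpy as np

def design(a, treat, degree, interact):
    """Design matrix: intercept, treatment dummy, polynomial in (standardized)
    age, and optionally the polynomial fully interacted with the dummy."""
    cols = [np.ones_like(a), treat.astype(float)]
    cols += [a**d for d in range(1, degree + 1)]
    if interact:
        cols += [treat * a**d for d in range(1, degree + 1)]
    return np.column_stack(cols)

def loo_mse(X, y):
    """Closed-form leave-one-out MSE for least squares via the hat matrix."""
    H = X @ np.linalg.pinv(X.T @ X) @ X.T
    resid = y - H @ y
    return np.mean((resid / (1.0 - np.diag(H))) ** 2)

def select_polynomial(age, treat, y):
    """Search degrees 1-4, with and without treatment interactions (the eight
    candidates), for the smallest leave-one-out MSE."""
    a = (age - age.mean()) / age.std()  # standardize for numerical stability
    best = None
    for degree in (1, 2, 3, 4):
        for interact in (False, True):
            mse = loo_mse(design(a, treat, degree, interact), y)
            if best is None or mse < best[0]:
                best = (mse, degree, interact)
    return best[1], best[2]  # (degree, interacted?)
```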
Posttest RDD
Bandwidth Selection. In the posttest RDD analysis, we selected bandwidths for each state and cutoff
separately, allowing for different bandwidths below and above the cutoff. We used cross validation
followed by visual inspection to choose the bandwidths for the analysis.
In Arkansas, for the age 70 cutoff design, the cross validation led to bandwidths of 90 years both above
and below the cutoff. For the age 50 design, the cross validation led to bandwidths of 90 years above
the cutoff and 2 years below the cutoff. The 2 year bandwidth led to under-smoothing below the cutoff,
so in the empirical work we used a bandwidth of 90 years above the cutoff and 20 years below the
cutoff. Finally, in the age 35 design, we used the cross validation bandwidths of 90 years above the
cutoff and 19 years below the cutoff.
In New Jersey, for the age 70 cutoff, the cross validation bandwidths were 13 years above the cutoff and
9 years below the cutoff. These were under-smoothed, and in the empirical work we used a bandwidth
of 19 years above the cutoff and 13 years below the cutoff. In the age 50 design, we used the cross
validation bandwidths of 90 years above the cutoff and 15 years below the cutoff. Finally, in the age 35
cutoff design, we used the cross validation bandwidths of 14 years above the cutoff and 90 years below
the cutoff.
In Florida, for the age 70 cutoff value, we used the cross validation bandwidths of 90 years above the
cutoff and 11 years below the cutoff. For the age 50 design, we used the cross validation bandwidth of
90 years above the cutoff. The cross validation bandwidth was 4 years below the cutoff; this was
under-smoothed, so we used a bandwidth of 11 years below the cutoff. For the age 35 design, we used
the cross validation bandwidths of 90 years above the cutoff and 17 years below the cutoff.
Polynomial Selection. We used cross validation to choose the polynomial specifications, following the
same basic method used for the benchmark analysis.
In Arkansas, we used a quadratic model with an intercept shift at the cutoff for the age 35 design, a
linear model with an intercept shift for the age 50 design, and a linear model with an intercept shift for
the age 70 design.
In New Jersey, we used a linear model with an intercept shift for the age 35 design, a quartic model with
an intercept shift for the age 50 design, and a quartic model with an intercept shift for the age 70
design.
In Florida, we used a quartic model with an intercept shift for the age 35 design, a cubic model with an
intercept shift for the age 50 design, and an interacted linear model for the age 70 design.
Adding the Pretest RDD
Bandwidth Selection: Partially Linear Model. The partially linear model approach to the pretest RDD
required four bandwidth parameters. First, we needed a bandwidth for the treated observations above
the cutoff, and for this we used the same bandwidth used above the cutoff in the posttest RDD. Then,
for the untreated pretest and posttest observations, we needed a bandwidth for the first step pretest
indicator function, a bandwidth for the first step expenditure function, and a bandwidth for the residual
function. In each case, we selected the bandwidths by first using cross validation and then adjusting the
bandwidth manually to compensate for any under-smoothing. Here we discuss each bandwidth and
design in sequence.
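The first-step portion of this partially linear estimation can be sketched as a Robinson-style two-step residual regression: kernel-regress the period indicator and expenditures on age, then regress one set of residuals on the other. This is an illustrative simplification under our own naming; it uses only the two first-step bandwidths and omits the third (residual function) bandwidth used to recover the age profile.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on |u| < 1, zero outside."""
    return np.where(np.abs(u) < 1.0, 0.75 * (1.0 - u**2), 0.0)

def llr_at(x, y, x0, h):
    """Local linear estimate of the conditional mean at x0."""
    w = epanechnikov((x - x0) / h)
    sw = np.sqrt(w)
    X = np.column_stack([np.ones_like(x), x - x0])
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta[0]

def partially_linear_shift(age, y, indicator, h_ind, h_y):
    """Estimate delta in y = delta * indicator + g(age) + e:
    kernel-regress the indicator and the outcome on age, then regress the
    outcome residuals on the indicator residuals."""
    ind_hat = np.array([llr_at(age, indicator, a, h_ind) for a in age])
    y_hat = np.array([llr_at(age, y, a, h_y) for a in age])
    r_ind, r_y = indicator - ind_hat, y - y_hat
    return float(r_ind @ r_y / (r_ind @ r_ind))
```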
We used a pretest indicator bandwidth of 90 years for all three age cutoff designs in all three states.
This was the cross validation selection, and it produced an essentially linear function that held up under
sensitivity analysis.
For the Medicaid expenditures bandwidth in Arkansas, we used a bandwidth of 90 years for the age 35
and age 50 cutoffs, and a bandwidth of 21 years for the age 70 design. In New Jersey, the cross
validation led to a bandwidth of 1 year for all three age cutoffs, but this led to under-smoothed
functions, so we worked with a bandwidth of 10 years for the age 35 and age 50 designs and 13 years
for the age 70 design. In Florida, we used a bandwidth of 90 years for the age 50 and age 70 designs.
For the age 35 design, the cross validation parameter was 2 years, but this was very under-smoothed,
and we increased the bandwidth to 20 years for the analysis.
Finally, for the residualized expenditures bandwidth, in Arkansas we used the cross validation
bandwidths of 13 years for the age 35 design, 90 years for the age 50 design, and 11 years for the age
70 design. In New Jersey, for the age 35 design, the cross validation bandwidth was 9 years, but this
was under-smoothed, and we used a bandwidth of 18 years in the empirical analysis. The cross
validation minimum was 7 years for the New Jersey age 50 design, but this was under-smoothed, and
we increased it to 15 years for the empirical work. For the New Jersey age 70 design, we used the cross
validation selection of 13 years. Finally, in Florida we used the cross validation minimizing bandwidths
of 66 years for the age 35 design, 66 years for the age 50 design, and 67 years for the age 70 design.
Polynomial Selection when the Pretest RDD is Added. To estimate the pretest RDD, we first regressed
expenditures on a polynomial series in age among the treated (above the cutoff, posttest period)
samples for each state. Then we regressed expenditures on a polynomial in age and a posttest dummy
variable using the untreated (pretest data for all ages, posttest data from below the cutoff) samples for
each state. We used cross validation to select the length of the polynomial series for each model.
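Under this setup, the two regressions and the extrapolated effects beyond the cutoff can be sketched as follows. This is illustrative Python under our own naming; as in the model described above, the counterfactual treats the pretest-posttest difference as an intercept shift.

```python
import numpy as np

def poly_design(age, degree):
    """Polynomial series in age, including the constant term."""
    return np.column_stack([age**d for d in range(degree + 1)])

def pretest_rdd_effects(age_t, y_t, age_u, y_u, post_u, eval_ages,
                        deg_t=1, deg_u=1):
    """Polynomial version of the pretest-supplemented RDD:
    - treated model: posttest outcomes above the cutoff, polynomial in age;
    - untreated model: pretest outcomes at all ages plus posttest outcomes
      below the cutoff, polynomial in age plus a posttest dummy.
    The effect at each evaluation age above the cutoff is the treated fit
    minus the untreated model's posttest counterfactual."""
    bt, *_ = np.linalg.lstsq(poly_design(age_t, deg_t), y_t, rcond=None)
    Xu = np.column_stack([poly_design(age_u, deg_u), post_u.astype(float)])
    bu, *_ = np.linalg.lstsq(Xu, y_u, rcond=None)
    treated = poly_design(eval_ages, deg_t) @ bt
    counterfactual = poly_design(eval_ages, deg_u) @ bu[:-1] + bu[-1]
    return treated - counterfactual
```

The polynomial degrees `deg_t` and `deg_u` would be set to the cross validation choices listed below for each state and design.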
In Arkansas, for the untreated samples, we used a quartic model in the age 35 design, a linear model for
the age 50 design, and a quartic model for the age 70 design. For the treated samples in Arkansas, we
used a quartic model for the age 35 design, a cubic model for the age 50 design, and a linear model for
the age 70 design.
In New Jersey, for the untreated samples, we used a quartic for all three designs. For the treated
samples in New Jersey, we used a linear model for the age 35 design, a quadratic model for the age 50
design, and a quartic model for the age 70 design.
In Florida, for the untreated samples, we used a linear model in the age 35 design, a quadratic model for
the age 50 design, and a linear model for the age 70 design. For the treated samples in Florida, we used
a cubic model for the age 35 design, a quadratic model for the age 50 design, and a linear model for the
age 70 design.