
Strengthening The Regression Discontinuity Design Using Additional Design Elements: A Within-Study Comparison

Abstract

The sharp regression discontinuity design (RDD) has three key weaknesses when compared with the randomized clinical trial (RCT). The RDD has lower statistical power than an equivalent RCT. Treatment effect estimates from an RDD are usually more dependent on modeling assumptions than those from an equivalent RCT. And an RDD produces an estimate of average treatment effects only for the narrow sub-population with assignment scores near the cut-off score, while a typical RCT produces an estimate of the treatment effect in the overall study population. In this paper, we explore one method for strengthening the RDD along these three dimensions. The method involves adding an “untreated” comparison group to the standard RDD. In the example we consider, the comparison is constructed from pretest values of the outcome variable. The additional data provide a check on functional form assumptions, improve statistical precision, and support extrapolation beyond the cut-off sub-population. We designed a within-study comparison to study the performance of the method relative to standard RDD and RCT designs. The within-study comparison involves (1) a standard posttest-only RDD; (2) a pretest-supplemented RDD; and (3) an RCT that serves as a benchmark. The designs are replicated using data from three different states, and we also consider three different assignment cut-offs so that we can study the performance of the methods under different conditions. Our results show that, relative to the posttest-only RDD, adding the pretest makes functional form assumptions more transparent and produces statistically more efficient estimates. Relative to the RCT benchmark, both versions of the RDD show no substantial bias at the cutoff. Most importantly, the pretest-supplemented RDD makes it possible to estimate causal effects in the region beyond the cut-off, and these estimates are very similar to the benchmark RCT estimates. This indicates that the supplemented design can be used to support more general causal inferences. Several other types of supplemented RDDs are conceptually possible and are discussed.


Strengthening The Regression Discontinuity Design Using Additional Design Elements: A Within-Study Comparison

Introduction

A carefully executed regression discontinuity design (RDD) is now widely considered a sound basis for

causal inference. But the status of RDD as a trustworthy method has evolved over time. The design was

introduced in Thistlethwaite and Campbell (1960). Goldberger (1972a, b) formally showed that the RDD

produces unbiased parameter estimates, but is less statistically efficient than a comparable randomized

clinical trial (RCT). More recent work has clarified the assumptions that support parametric and non-

parametric identification in the RDD (Hahn, Todd, and Van der Klaauw 2001; Lee 2008), and has

examined the statistical properties of common estimators (Porter 1998, 2003; Lee and Card, 2008;

Schochet 2009). A growing literature compares RDD treatment effect estimates to benchmark estimates

from an RCT. These ‘within study comparisons’ provide empirical evidence that an RDD produces causal

estimates that are similar to those from an RCT on the same topic (Cook and Wong 2008; Green et al

2009; Shadish et al, 2011).

Advancements in the technical literature have improved our understanding of RDD assumptions and

approaches to data analysis. But the basic elements of the design have not changed. An RDD requires an

outcome variable, a binary treatment, a continuous assignment variable, and a cutoff based treatment

assignment rule. The assignment rule is the crucial detail: in a successful RDD, individuals with

assignment scores on one side of the cutoff receive treatment and individuals on the other side of the

cutoff receive the control condition. The RDD is called “sharp” when all people receive the treatment

intended for them. When compliance is partial, the RDD is called “fuzzy”. This paper deals only with

sharp RDD studies.


At a conceptual level, the analysis of an RDD is not complicated. Researchers estimate treatment effects

by comparing mean outcomes among people with assignment scores just below and just above the

cutoff. The difference between these two conditional means is called a “regression discontinuity”

because it can be understood as a discontinuity in the regression function that links average outcomes

across sub-populations defined by different scores on the assignment variable. In the absence of a

treatment effect, the regression function would be smooth near the cutoff. A sudden break or

‘discontinuity’ at the cutoff is evidence of a treatment effect. The size of the discontinuity measures the

magnitude of the effect.

RDD has at least three important limitations relative to an RCT. The first involves the amount of

statistical modeling required to identify and estimate causal effects. In an RCT, treatment effects are

non-parametrically identified: assumptions about the underlying statistical model are not required and

the connection between the research design and the statistical tools used to perform the analysis is

quite close1. RDD treatment effects are non-parametrically identified only in very large samples. In

practice, researchers proceed by specifying the parametric functional form of the regression and

allowing for an intercept shift at the cut-off value (Lee and Card, 2008). Since choosing the wrong

functional form can lead to biased treatment effect estimates, good applied RDD papers seek to

estimate the regression function flexibly and to evaluate how sensitive the results are to alternative

specifications. There is considerable value in new methods that can reduce the RDD's conceptual

dependency on functional form assumptions either by providing additional sensitivity analysis or by

offering some way to partially validate functional form assumptions.

A second limitation is that treatment effect estimates are less statistically precise in an RDD than in a

comparable RCT. Lack of precision means that the statistical power of key hypothesis tests is lower in

1 Of course, analysts often employ parametric regression models in the analysis of experimental data either to

improve the statistical precision of the treatment effect estimates or to adjust for chance imbalances in observable covariates. But this additional modeling is usually not central to the study's findings.


an RDD (Goldberger, 1972b; Schochet, 2009). In parametric approaches to RDD, the efficiency loss is due

to multicollinearity that arises because assignment scores are highly correlated with individual

treatment status. Non-parametric RDD estimates may also have lower power, in part because they

employ a bandwidth that decreases the study’s effective sample size. Power is a secondary concern in

RDD studies with very large sample sizes, but it may be a central issue when investigators design RDD

studies prospectively and collect their own data directly from respondents. Adding more cases may then

prove costly and even diminish the value of RDD relative to other research designs that may

provide a weaker foundation for causal inference but higher statistical power (Schochet, 2009).

A third limitation concerns the generality of RDD results. An RCT produces estimates of the average

treatment effect across all members of the study population. In contrast, RDD estimates are only

informative about average treatment effects among members of the narrow sub-population located

immediately around the cutoff. If a treatment is given to students who score above the 75th percentile

on some test, then the RDD study will produce results generalizable only to students scoring near the

75th percentile. Most social science theories and policy questions are not concerned with such narrow

sub-populations. But treatment estimates for broader sub-populations -- like all students, or all students

in the upper quartile of the distribution – require extrapolations beyond the cutoff. The extrapolations

usually are not trustworthy because no theoretical basis exists for assuming that the functional form of

the regression is stable beyond the range of the observed data. The crux of the problem is that we do

not know what the treatment group functional form would have looked like above the cutoff in the

absence of treatment. The standard practice is conservative. It limits inference to the sub-population in

the immediate neighborhood of the cut-off. The approach is methodologically admirable in that it sticks

to the assumptions supported by the research design. But it courts irrelevancy for many policy and

scientific questions. In our view, the narrow applicability and limited generalizability of causal results is


the most serious weakness of standard RDD as a practical method for program evaluation and policy

analysis.

In this paper, we explore an RDD variant that can improve on all three limitations. The approach involves

supplementing the conventional posttest-only RDD with an additional design element that can aid

causal inference. In the application considered here the additional design element is a pretest measure

of the outcome variable. We refer to the conventional RDD as a “posttest RDD” because it only requires

posttest information; and we refer to the supplemented design as a “pretest RDD” even though it makes

use of both pretest and posttest outcome data. The key idea is that the pretest data provides

information about what the regression function linking outcomes and assignment scores would have

“looked like” before the treatment was available. The method is not foolproof. The regression function

may have changed over time, for example. But the main assumptions of the design are partly testable

because the pretest and post-test untreated regression functions can be directly compared below the

cutoff score to observe whether they differ. Minor changes such as intercept shifts are easily

accommodated. But very dissimilar functional forms below the cutoff cast doubt on the results of any

pretest-supplemented RDD. Even when functional forms are similar below the cutoff, causal

extrapolation beyond the cutoff requires the untestable assumption that the two functions would have

remained similar above the cutoff.

The core of this paper is an empirical evaluation of the performance of the pretest and posttest RD

designs relative to each other and also relative to a benchmark RCT. To assess performance we

constructed a within study comparison. The method of within study comparison was developed by

LaLonde (1986). He examined whether various econometric adjustments for selection bias could

reproduce the results of a job-training RCT. Since then, researchers have used the method to study the

performance of RDD (Shadish et al., 2011), different forms of intact group and individual case matching


(Cook, Shadish & Wong, 2008; Bifulco, 2012), and alternative strategies of covariate selection (Cook &

Steiner, 2010). The implementation details of specific within-study comparison methods vary, but the

basic idea is always to test the validity of a non-experimental method by comparing its estimates to a

trustworthy benchmark from an RCT. Methods for conducting a high quality within study comparison

have evolved over time; Cook et al. (2008) describe best practices and we follow them in this paper.

Our within study comparison is based on data from the Cash and Counseling Demonstration Experiment

(Dale & Brown, 2007). In the original study, disabled Medicaid beneficiaries in Arkansas, Florida and New

Jersey were randomly assigned either to obtaining home and community based health services through

Medicaid (the control group), or to receiving a spending account that they could use to procure home

and community based services directly (the treatment group). In our analysis, the outcome variable is a

person’s Medicaid expenditures in the 12 months after the study began. We used baseline age as the

assignment variable. To construct pretest RDDs and posttest RDDs from the RCT data, we systematically

retained and excluded treatment or control observations. Specifically, we sorted the treatment group

and control group cases from the RCT by age (the assignment variable). Then for the RD designs we

defined a cut-off age and systematically deleted control group cases on one side of the cutoff and

treatment cases on the other. For replication purposes, we chose three different age cutoff values – 35,

50 and 70. And we analyzed the data separately for three different states – Florida, New Jersey and

Arkansas. In total, we examined 9 different posttest-only RDDs and 9 different pretest RDDs. We

compared each RDD estimate to the corresponding RCT estimate at the cutoff age under analysis.

And in the pretest RDD we used the comparison data to extrapolate beyond the cut-off to compute an

estimate of the average treatment effect for everyone older than the cutoff. We compared these above

the cut-off averages to RCT benchmarks as well. Estimates of average treatment effects above the cut-

off depend on extrapolations, and the accuracy of these estimates represents a test of the ability of the

pretest RDD to improve the generality of the standard design. The average effect above the cutoff


corresponds to the Average Treatment Effect on the Treated (ATT) parameter, which is often the target

parameter in applied program evaluation research. In all the analyses to be reported, standard errors for

the RCT and RDD estimates were computed using the bootstrap. Technical details about design and

analysis are presented later.

The results of our analysis indicate that the pretest RDD can shore up all three key weaknesses of the

RDD. The pretest RDD improved statistical precision relative to the posttest RDD, although the RCT

remained the most precise. Similar functional forms were observed for the pretest and posttest RDDs on

the untreated side of the assignment variable as well as for the pretest RDD and the RCT on both sides

of the cutoff. This provides support for the pretest RDD’s key assumption that the difference in

functional forms in the two time periods is well approximated by a simple intercept shift and nothing

more. Extrapolation beyond the cutoff was also quite successful: the pretest RDD produced estimates of

the average treatment effect above the cutoff that were very close to the RCT benchmarks. Taken

together, these results indicate that supplementing the standard posttest RDD with pretest outcome

data can increase statistical power, improve the credibility of functional form assumptions, and generate

unbiased causal inference at and beyond the cutoff.

We start the body of the paper with a short description of the Cash and Counseling Demonstration RCT.

Details of the within study research design are in the second section, and estimation methods for each

of the research designs are described in the third section. The results are in the fourth section.

Conclusions and a short discussion end the paper.

Experimental Data

The Cash and Counseling Demonstration and Evaluation is described in detail elsewhere (Dale and

Brown, 2007a; Brown & Dale, 2007; Doty, Mahoney & Simon-Rusinowitz, 2007). The treatment

condition was a “consumer-directed budget” program, which allowed disabled Medicaid beneficiaries to


procure their own home and community based support services using a Medicaid-financed spending

allowance. The control group received home and community based support services under the status

quo Medicaid program. That is, in the control group, a Medicaid agency procured services for clients

from Medicaid certified providers; in the treatment group, people found their own services and paid for

them using the allotted budget. People were randomly assigned to treatment and control arms of the

study, and treatment group subjects could choose to use a consumer directed budget or not2. The new

program was meant to be budget-neutral, and the new personal allowance for support services was set

equal to the amount the agency would otherwise allocate to controls.

Study participants were disabled elderly and non-elderly adult Medicaid beneficiaries who agreed to

participate and lived in Arkansas, New Jersey, or Florida from 1999 to 2003. State Medicaid agencies

operated the demonstration. But Mathematica Policy Research (MPR) was responsible for random

assignment, data collection, analysis, and evaluation. The study employed a rolling enrollment design in

which new enrollees completed a baseline survey and then were randomly assigned to treatment or

control status, after which the state agency was informed of the assignments. The program resulted in

higher levels of patient satisfaction, small improvements on selected health outcomes (Carlson, Foster,

Dale and Brown, 2007), and higher post-treatment Medicaid expenditures (Dale & Brown, 2007b). The

increase in Medicaid expenditures is interesting because the treatment program was intended to be

budget neutral. The increase in spending occurred because the status quo Medicaid agency often failed

to successfully procure the home and community based services that program recipients were entitled

to receive. The treatment condition put people in charge of their own budgets and they were more

successful at actually procuring their goods and services. For instance, treatment group members were

2 People who chose the Cash and Counseling option were also assigned a counselor who would help them develop

a spending plan, provide advice about hiring workers, and would also monitor the person’s use of the allowance and general welfare. All of the subjects in the study (treatment and control) expressed interest in using the consumer directed budgeting option.


more likely to receive any services than members of the control group. They also received a larger

fraction of the services they were authorized to receive than members of the control group. In essence,

Medicaid expenditures were higher in the treatment group because the agency serving the control

group members often did not manage to actually spend the allotted funds.

Our methodological study used a small number of measures from the original study. We retained

information on age at baseline, state of residence, and randomly assigned treatment status. We also

created a measure of annual Medicaid expenditures by adding up 6 categories of monthly expenditures

across the 12 months before random assignment (pretest) and after it (posttest). The categories were:

Inpatient Expenditures, Diagnosis Related Group Expenditures, Skilled Nursing Expenditures, Personal

Assistance Services Expenditures, Home Health Services Expenditures, and Other Services Expenditures3.

Throughout, we refer to this 6-item index as “Medicaid expenditures” and it is the sole outcome variable

analyzed.

Table 1 gives summary statistics from the three states. In the RCT, Arkansas had 1004 participants in

each of the treatment and control arms, Florida had 906 control and 907 treatment participants, and

New Jersey had 869 control and 861 treatment group members. In Arkansas, the average participant

was 70 years old at baseline, compared to 55 in Florida and 62 in New Jersey. Within each state, average

pretest expenditures were similar in the treatment and control groups. But the level of spending varied

across states. The average person in Arkansas had expenditures of $6,400 in the pretest year compared

to $14,300 in Florida and $18,500 in New Jersey. Mean posttest expenditures were consistently higher

3 The claims data included a small number of cases with very high levels of expenditures that could be either real

or data entry errors. To reduce concerns that these outliers would skew our regression estimates, we top-coded the pretest and posttest Medicaid expenditures variable at the 99th percentile of the pooled distribution of posttest expenditures, which was equal to $78,273. The top-coding procedure affected 89 posttest observations and 79 pretest observations.


in the treatment groups. Intent to treat (ITT) estimates of the mean difference between the treatment

and control groups were about $1860 (p < .01) in Arkansas, $1856 (p = .01) in Florida, and $1200 (p =

.09) in New Jersey. These simple estimates show that the Cash and Counseling treatment consistently

led to higher average Medicaid expenditures.

Within Study Research Design

To implement the within study comparison, we created 21 different subsets of the original RCT data.

The first 3 subsets consist of the state-specific RCT treatment and control groups; sample sizes and basic

descriptive statistics for these data are in Table 1. Only posttest expenditure data are included in the

RCT subsets.

The next 9 subsets represent state-specific posttest RDDs based on age cutoffs of 35, 50, and 70. No

pretest expenditure data are involved here. To create these posttest RDD subsets, we removed from the

experimental data all of the treatment group members younger than the relevant age cutoff, and also all

of the control group members at least as old as the cutoff. Table 2 shows the resulting sample sizes for

the 9 posttest RDD subsets. Notice that the number of observations below the cutoff increases with the

cutoff age. At 35, there are many more observations above the cutoff than below; at age 50,

observations are somewhat more balanced; and at age 70 there are actually more observations below

the cutoff than above it in New Jersey and Florida but not Arkansas. In addition to creating variation in

the number of observations on each side of the cutoff, the different cut-offs determine how much

extrapolation is required to compute average effects for everyone above the particular age cutoff under

analysis. For example, when the cutoff is set at age 35, estimating the average effect among everyone

older than 35 would require an extrapolation from age 36 to 90.

Next, we used Medicaid Expenditures over the year prior to randomization to create 9 pretest RDD data

subsets based on the same cutoff values and states. With the pretest and posttest RDD subsets in hand,


we created a “long-form” dataset by stacking the pretest and posttest RDD data, and defined an

indicator variable to identify which observations were from which time period. Stacking the data in this

way results in twice as many observations as the posttest RDD because each participant is now

observed twice.
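
To make this subsetting and stacking concrete, the short sketch below shows one way the procedure could be coded. It is purely illustrative: the data frame rct and its columns (id, age, treat, pre_y, post_y) are hypothetical stand-ins for the study's actual files, not names used in the original analysis.

import pandas as pd

def make_posttest_rdd(rct: pd.DataFrame, cutoff: int) -> pd.DataFrame:
    """Keep treated cases at/above the age cutoff and control cases below it."""
    keep = ((rct["age"] >= cutoff) & (rct["treat"] == 1)) | \
           ((rct["age"] < cutoff) & (rct["treat"] == 0))
    rdd = rct.loc[keep].copy()
    rdd["period"] = 1              # posttest observations only
    rdd["y"] = rdd["post_y"]
    return rdd[["id", "age", "treat", "period", "y"]]

def make_pretest_rdd(rct: pd.DataFrame, cutoff: int) -> pd.DataFrame:
    """Stack the posttest RDD with the same participants' pretest outcomes."""
    post = make_posttest_rdd(rct, cutoff)
    pre = rct.loc[rct["id"].isin(post["id"])].copy()
    pre["period"] = 0              # pretest: no one is treated
    pre["treat"] = 0
    pre["y"] = pre["pre_y"]
    return pd.concat([post, pre[post.columns]], ignore_index=True)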

These subsetting procedures resulted in three research designs (an RCT, a posttest RDD, and a pretest

RDD). The posttest RDD and pretest RDD were replicated with three different age cutoff values (35, 50

and 70). And all of the designs were replicated for the three states: Arkansas, New Jersey, and Florida.

The core task of our study is to construct estimates of the same causal parameters using each of these

subsets/research designs. Interpreting the estimates from the RCT data as a benchmark provides a way

to measure the performance of the posttest and pretest RDD estimates.

Methods

Implementing the within study comparison requires i) defining treatment effects of interest, ii)

specifying estimators for each effect in each design, and iii) developing measures of performance that

can be used to judge the strengths and weaknesses of each design. In this section, we describe our

approach to each of these tasks in sequence.

Parameters of Interest

To understand the treatment effects of interest in our analysis, start by letting $i$ index individuals and $t \in \{0, 1\}$ index the pretest and posttest time periods. $A_i$ is a person's (time invariant) age at baseline, and $t = 0$ identifies observations made during the pretest time period. We adopt a potential outcomes framework in which $Y_{it}(1)$ denotes the ith person's treated outcome at time $t$, and $Y_{it}(0)$ denotes the person's untreated outcome at time $t$. $D_{it} = 1$ if the person has received the treatment at time $t$ and $D_{it} = 0$ if the person has not received the treatment at time $t$. The person's realized outcome at time $t$ is $Y_{it} = D_{it} Y_{it}(1) + (1 - D_{it}) Y_{it}(0)$. In the Cash and Counseling data, a person is treated if she has the option to control her own Medicaid financed home care budget, and no one received the treatment in the pretest time period. This means that $D_{i0} = 0$ for everyone. The outcome variable -- $Y_{it}$ -- represents the person's Medicaid Expenditures in period $t$.

To estimate treatment effects at the conventional RDD cutoff and also beyond it, we define treatment effects conditional on specific ages and age ranges. To this end, write the average treatment effect in the post-treatment time period for people who are, say, 70 years old as $\tau(70) = E[Y_{i1}(1) - Y_{i1}(0) \mid A_i = 70]$. Suppose that the cutoff value in a regression discontinuity design is set at age 50. Then, in our notation, $\tau(50)$ is the average treatment effect in the cutoff sub-population for that particular RDD. In a conventional RDD, inference is limited to the average treatment effect at the cutoff. But it is also useful to describe average treatment effects across a range of ages by weighting the age-specific treatment effects by the relative frequency distribution of ages in the age group of interest. For example, when the cutoff is set at age 50 and the maximum age in the study population is $\bar{A}$, the average treatment effect across all people above the RDD cutoff is:

$\tau_{50+} = \sum_{a=50}^{\bar{A}} w_a \, \tau(a)$, where $w_a = \Pr(A_i = a \mid A_i \geq 50)$.

In a sharp RDD, $\tau_{50+}$ represents the average treatment effect above the cutoff, which might also be called the average treatment effect on the treated (ATT): $\tau_{50+} = E[Y_{i1}(1) - Y_{i1}(0) \mid A_i \geq 50]$. Where possible we will compare estimates of the average treatment effect at the RDD cutoff and the average treatment effect among all observations above the cutoff. Note too that each of these treatment effect parameters is defined separately for the three states in the analysis; state identifiers are suppressed to reduce notational clutter.

Estimation


To estimate the quantities of interest described above, we used regression methods that account for

unknown functional forms either with kernel weighting or a polynomial series in the age variable. These

are the two most common methods used in the modern RDD literature and so our work is consistent

with existing best practices. That said, the implementation of flexible models based on kernel weighting

or polynomial approximations meant that we could not specify a single polynomial model or a single

bandwidth for all the designs and states in the analysis. Instead, we specified a method of selecting

polynomial specifications and bandwidth parameters that was applied uniformly across the designs. In

what follows, we describe the general approach to estimation employed with the RCT, posttest RDD,

and pretest RDD. Then we explain the model selection algorithm used to guide our choice of smoothing

parameters like bandwidths and polynomial series lengths. The exact bandwidths and polynomial

specifications employed in the analysis are reported in the Appendix.

Estimation in the RCT. For the three state-specific benchmark data sets, we estimated age-specific

treatment effects using two methods. First, we estimated local linear regressions of Medicaid

Expenditures on age separately for the treatment and control groups. Then we computed age-specific

treatment effects as point-wise differences in treatment and control regression functions for each age.

To calculate average treatment effects above the cutoff we weighted these age-specific differences

according to the relative frequency distribution of ages in the full state sample.
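
As a concrete illustration of this kernel-based approach, the sketch below implements a Gaussian-kernel local linear smoother and uses it to form age-specific effects and their frequency-weighted average. The kernel choice, function names, and array arguments are our own illustrative assumptions, not the study's code.

import numpy as np

def local_linear(x, y, grid, bandwidth):
    """Kernel-weighted local linear regression evaluated at each grid point."""
    fits = []
    for g in grid:
        w = np.sqrt(np.exp(-0.5 * ((x - g) / bandwidth) ** 2))  # Gaussian weights
        X = np.column_stack([np.ones_like(x), x - g])
        beta, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
        fits.append(beta[0])                 # intercept = fitted value at g
    return np.array(fits)

def rct_effects(age, y, treat, bandwidth):
    """Age-specific effects and their frequency-weighted average in the RCT."""
    grid = np.unique(age)
    mu1 = local_linear(age[treat == 1], y[treat == 1], grid, bandwidth)
    mu0 = local_linear(age[treat == 0], y[treat == 0], grid, bandwidth)
    effects = mu1 - mu0
    weights = np.array([(age == g).mean() for g in grid])
    return effects, float(np.sum(weights * effects) / weights.sum())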

Since many applied researchers prefer to work with flexible polynomial specifications rather than kernel

based regressions, we also estimated OLS regressions of Medicaid Expenditures on a polynomial series

in age, a treatment group indicator, and interactions between the polynomial series and the treatment indicator for each

state. Treatment effect estimates were computed using the coefficients on the treatment indicator and

the appropriate interaction terms. Average treatment effects above the cutoff were taken as weighted


averages of age-specific differences with weights equal to the relative frequency of each age in the state

sample.

Estimation in the Posttest RDD. We also estimated treatment effects in the posttest RDDs using two

methods. First we estimated treatment effects at the cutoff using local linear regressions applied

separately to the data from above and below the cutoff in each state. Treatment effects at the cutoff

were calculated using the difference in estimates of mean Medicaid expenditures at the cutoff.

In the second method, we pooled data from above and below the cutoff and estimated OLS regressions

of Medicaid Expenditures on a polynomial in age, a dummy variable set to 1 for observations above the

cutoff, and interactions between the age polynomial series and the above the cutoff dummy variable. In

these posttest RDD analyses we computed treatment effects only at the cutoff. We did not make

extrapolations based on the functional form implied by the polynomial regression coefficients because

this is almost never done in practice due to the tendency of polynomial series estimates to have good

within-sample fits but very poor out of sample properties.
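
A minimal sketch of the pooled polynomial version of this estimator appears below, using the statsmodels formula interface. The quadratic specification and the column names are illustrative placeholders, not the specifications actually chosen by the cross-validation procedure described later.

import statsmodels.formula.api as smf

def posttest_rdd_effect(df, cutoff):
    """Pooled polynomial RDD regression; the effect is the jump at the cutoff."""
    d = df.copy()
    d["above"] = (d["age"] >= cutoff).astype(int)
    d["c_age"] = d["age"] - cutoff              # center age at the cutoff
    fit = smf.ols("y ~ above * (c_age + I(c_age ** 2))", data=d).fit()
    # With age centered at the cutoff, the coefficient on `above` is the
    # estimated discontinuity (treatment effect) at the cutoff.
    return fit.params["above"], fit.bse["above"]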

Estimation in the Pretest RDD. The pretest RDD combines both the pretest and posttest RDD data. The

key idea underlying this design is that information about the relationship between the assignment

variable and the pretest outcomes can be used to produce more trustworthy extrapolations beyond the

assignment cutoff in the posttest time period. Putting the idea into practice requires a model of the

untreated outcome variable in the pretest and posttest periods that can account for simple non-

equivalencies between the two periods. We consider models in which pretest and posttest untreated

outcome regression functions are parallel in the sense that pretest and posttest untreated outcome

regression functions differ by a constant across all ages. We work with the following model for the

untreated potential outcomes:

$Y_{it}(0) = \delta P_{it} + f(A_i) + \varepsilon_{it}$,

where $P_{it}$ is an indicator equal to one for pretest observations ($t = 0$). In this model, $\delta$ represents the fixed difference in conditional mean outcomes across the pre and posttest periods, and $f(\cdot)$ is an unknown smooth function that is constant across the two periods. We assume that $E[\varepsilon_{it} \mid A_i, P_{it}] = 0$. One approach to estimating this model involves approximating the unknown smooth function using a polynomial series. For instance, one might specify a Kth order polynomial series and estimate model parameters using OLS regression. Then the specification for a chosen K is:

$Y_{it}(0) = \delta P_{it} + \beta_0 + \beta_1 A_i + \beta_2 A_i^2 + \dots + \beta_K A_i^K + \varepsilon_{it}$.

The equation can be estimated by applying OLS to all of the untreated cases in the sample. The key point is that the untreated sample includes pretest Medicaid expenditures from the full range of ages and also the posttest Medicaid expenditures of people under the design's age cutoff. In this setting, $\hat{\beta}_0 + \sum_{k=1}^{K} \hat{\beta}_k a^k$ represents an estimate of $E[Y_{i1}(0) \mid A_i = a]$, the untreated posttest regression function at age $a$. The idea here is that extrapolations

beyond the cutoff are now made with what might be called “partial” empirical support. Rather than

extrapolate outside the range of the data, extrapolations are made to the posttest outcomes on the

support of pretest data and under the maintained assumption that estimates of are sufficient to

account for any between period non-equivalence. This method provides estimates of the untreated

outcome function. To form estimates of treatment effects, we still need estimates of the treated

outcome function. An obvious strategy is to estimate polynomial regressions of expenditures on age

using the posttest data from sample members who are above the age cutoff. Then treatment effects can

be computed at the cutoff using differences between the fitted value of the treated function and the

untreated functions from the treated outcome model. Average treatment effects among all

observations above the cutoff can be formed by computing age-specific treatment effects for each age


above the cutoff and then forming a weighted average of these differences based on the relative

frequency of the ages above the cutoff.
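
The sketch below puts the polynomial version of this procedure into code. It assumes the stacked long-form data frame described earlier, with hypothetical columns age, y, period (0 = pretest, 1 = posttest), and treat, and it uses a quadratic only as an example of a chosen K.

import numpy as np
import statsmodels.formula.api as smf

def pretest_rdd_effects(long_df, cutoff):
    """Polynomial pretest-RDD estimates at the cutoff and averaged above it."""
    d = long_df.copy()
    d["pre"] = (d["period"] == 0).astype(int)
    # Untreated model: all pretest rows plus posttest control rows below the cutoff.
    m0 = smf.ols("y ~ pre + age + I(age ** 2)", data=d[d["treat"] == 0]).fit()
    # Treated model: posttest rows at or above the cutoff.
    treated = d[(d["treat"] == 1) & (d["period"] == 1)]
    m1 = smf.ols("y ~ age + I(age ** 2)", data=treated).fit()
    ages = d.loc[d["period"] == 1, "age"].to_numpy()
    grid = np.unique(ages[ages >= cutoff])
    mu1 = np.asarray(m1.predict({"age": grid}))
    mu0 = np.asarray(m0.predict({"age": grid, "pre": np.zeros(len(grid))}))
    effects = mu1 - mu0
    w = np.array([(ages[ages >= cutoff] == g).mean() for g in grid])
    return effects[0], float(np.sum(w * effects))   # at the cutoff, and above it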

We also implemented the pretest RDD model by estimating a version of Robinson’s (1988) partial linear

regression model. The model exploits the same assumptions that $f(\cdot)$ is a smooth function and that $E[\varepsilon_{it} \mid A_i, P_{it}] = 0$, but it also requires a support condition so that the pretest indicator in the parametric component of the model is not a deterministic function of the assignment variable. Formally, the requirement is that $0 < \Pr(P_{it} = 1 \mid A_i) < 1$. The support condition fails by definition in the full sample

because there are no untreated RDD observations above the cutoff. Our solution is to estimate the

parametric period effect using only data that fall on the common support of the different time periods.

In practice this means that we estimate the model using only the below the cutoff data in the two time

periods. Then, with estimates of $\delta$ in hand, we estimate the non-parametric component using the full

sample of observations both above and below the cutoff value. We calculated treatment effects at the

cutoff using differences in the predicted values from local linear regressions among treated observations

from the posttest time period and predicted values from the partially linear model. And we constructed

average treatment effects above the cutoff by taking age-specific differences between the predicted

values from the two models and weighting them by the relative frequency of each age in each state

sample.
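
The following sketch outlines one way to implement this two-step procedure, using Robinson-style partialling out in simplified form. It takes plain numpy arrays, reuses the local_linear helper from the earlier sketch, and is a rough illustration under the paper's assumptions rather than the exact estimator used in the study.

import numpy as np

def partial_linear_untreated(age, y, pre, cutoff, grid, bandwidth):
    """Two-step partially linear fit of the untreated outcome function."""
    # Step 1: estimate the period effect delta on the common support, i.e.
    # below the cutoff, where pretest and posttest untreated data coexist.
    below = age < cutoff
    ey = local_linear(age[below], y[below], age[below], bandwidth)                   # E[y | age]
    ep = local_linear(age[below], pre[below].astype(float), age[below], bandwidth)   # E[pre | age]
    ry, rp = y[below] - ey, pre[below] - ep
    delta = float(np.sum(rp * ry) / np.sum(rp * rp))     # partialled-out OLS slope
    # Step 2: remove the period effect and smooth over the full untreated sample.
    adjusted = y - delta * pre
    return local_linear(age, adjusted, grid, bandwidth), delta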

Procedures for Choosing Smoothing Parameters. Each of the methods described above depends on

assumptions about the degree of smoothing across different age groups. The local linear regressions

depend on a kernel function and a bandwidth parameter, and the OLS polynomial series regressions

require specifying an appropriate polynomial function. The goal of these flexible modeling approaches is

to allow the data to determine the specification rather than to impose a specification that is based on

theoretical reasoning alone. However, there is always some arbitrariness associated with selecting these


smoothing parameters, and it seems wise to define a model selection protocol in advance of the data

analysis.

To this end, we selected bandwidth parameters by using least squares cross-validation to evaluate a grid

of candidate bandwidths ranging from 1 year to 90 years in width. We then inspected the function

produced by using the bandwidth that minimized the cross-validation statistic4. When visual inspection

revealed that the bandwidth chosen by the cross validation exercise led to a function that appeared very

under-smoothed, we increased the bandwidth slightly to achieve a more regular function. Details about

the selected bandwidth for each of the research designs in the paper are reported in the appendix.

To choose a polynomial functional form, we used least squares cross-validation to evaluate a set of

candidate models that included linear, quadratic, cubic, and quartic polynomial functions, and also

models that fully interacted the polynomial terms with a treatment group indicator variable. In the

within study comparisons, we always worked with the polynomial models that minimized the cross-

validation function. That is, we conducted the cross validation for each of the candidate specifications

and chose the specification that produced the smallest mean square out of sample prediction errors.

The specific functional forms that were used in each part of the within study comparison are reported in

the appendix.
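
For illustration, the sketch below computes the leave-one-out cross-validation criterion over a grid of candidate bandwidths, again relying on the local_linear helper from the earlier sketch. Only the 1-to-90-year grid comes from the text; the rest is our own scaffolding.

import numpy as np

def loo_cv_bandwidth(x, y, bandwidths):
    """Pick the bandwidth minimizing leave-one-out squared prediction error."""
    scores = []
    for h in bandwidths:
        errs = []
        for i in range(len(x)):
            keep = np.arange(len(x)) != i
            pred = local_linear(x[keep], y[keep], np.array([x[i]]), h)[0]
            errs.append((y[i] - pred) ** 2)
        scores.append(np.mean(errs))
    return bandwidths[int(np.argmin(scores))], np.array(scores)

# Example call with the grid described in the text (1 to 90 years):
# best_h, cv = loo_cv_bandwidth(age, spending, np.arange(1, 91))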

Estimating Standard Errors. To ensure comparability across the different designs, we used a non-

parametric bootstrap to estimate standard errors for all treatment effect estimates. We always used

500 bootstrap replications. Point estimates were re-calculated for each replicate, and the standard

deviation of the point estimates across the 500 replicates was used as the bootstrap estimate of the

standard error of the estimate. Bandwidths, polynomial functional forms, and relative frequency

weights for computing above the cutoff averages were fixed across bootstrap replicates. In the pretest

4 The cross validation statistic we worked with is the standard mean squared out of sample prediction error formed

by predicting the value of each observation when it is left out of the estimation.


RDD designs we resampled individual participants rather than individual observations in order to

account for within-person dependencies in the error structure.
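
A sketch of this person-level resampling scheme is given below: participant identifiers, not individual rows, are redrawn with replacement so that each person's pretest and posttest observations stay together. The id column and the estimator argument are hypothetical placeholders.

import numpy as np
import pandas as pd

def person_bootstrap_se(long_df, estimator, n_reps=500, seed=0):
    """Bootstrap SE that resamples participants, keeping both periods together."""
    rng = np.random.default_rng(seed)
    ids = long_df["id"].unique()
    estimates = []
    for _ in range(n_reps):
        draw = rng.choice(ids, size=len(ids), replace=True)
        sample = pd.concat([long_df[long_df["id"] == i] for i in draw],
                           ignore_index=True)
        estimates.append(estimator(sample))
    return float(np.std(estimates, ddof=1)), np.array(estimates)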

Measuring Performance

Comparing point estimates and standard errors from the different designs is one approach to measuring

performance. We also examined two other measures of the performance of the different estimators

that clarify particular strengths and weaknesses. First, we considered a measure of the standardized bias

of the quasi-experimental point estimate. Let $\hat{\tau}_{QE}$ be the point estimate of a given parameter produced by a quasi-experimental estimator, QE. And let $\hat{\tau}_{RCT}$ be the point estimate of the same parameter produced by the RCT. Finally, let $\hat{\sigma}_{RCT}$ be the standard deviation of posttest Medicaid expenditures observed in the RCT. The standardized bias measure that we worked with is

$B_{QE} = (\hat{\tau}_{QE} - \hat{\tau}_{RCT}) / \hat{\sigma}_{RCT}$.

Essentially, $B_{QE}$ measures the magnitude of the bias in a particular quasi-experimental estimate in

standard deviation units. We computed this measure of standardized bias for each parameter estimated

with each cutoff age, state and research design. The metric offers a uniform account of the size of the

bias across different causal parameters and different research designs.

The second measure of performance combines bias and variance estimates in a mean squared error

framework. To compute the mean square error statistic, we centered the point estimates from each

bootstrap replicate around the experimental benchmark. Then we squared these deviations and

computed the average of the squared deviations across the 500 bootstrap replicates. Formally, the

statistic we work with is

$MSE(\hat{\tau}_{QE}) = \frac{1}{500} \sum_{b=1}^{500} (\hat{\tau}_{QE,b} - \hat{\tau}_{RCT})^2$, where $b$ indexes the bootstrap

replicates. To keep the scale of the statistic interpretable in dollar terms, we report the square root of

the MSE statistics in the tables of results. The advantage of this MSE measure is that it incorporates

information about the performance of the design in terms of both point bias and statistical precision.
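
Both performance measures are straightforward to compute once the bootstrap replicates are in hand; the short sketch below shows the calculations as we read the definitions above, with illustrative argument names.

import numpy as np

def performance_measures(qe_replicates, qe_point, rct_point, rct_sd):
    """Standardized bias and benchmark-centered root mean squared error."""
    bias_sd_units = (qe_point - rct_point) / rct_sd
    rmse = float(np.sqrt(np.mean((np.asarray(qe_replicates) - rct_point) ** 2)))
    return bias_sd_units, rmse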


Results

RCT Benchmarks. Figures 1-3 plot estimates of average Medicaid expenditures in the treatment and

control groups from each state across all ages. The parametric estimates are from OLS regressions and

the non-parametric estimates are from local linear regressions estimated separately for each treatment

and control group. The polynomial specifications and bandwidths were chosen using the procedure

described earlier in the paper. In each state, the two estimation approaches yield very similar results

even though the state-specific functional forms vary -- the expenditure-age function is linear in Florida,

highly non-linear in New Jersey, and somewhat non-linear in Arkansas.

Figure 4 shows local linear regression estimates of pretest and posttest Medicaid expenditures in each

state’s control group. The expenditure-age relationship is quite similar at each time period. More

specifically, the estimated functions are “most parallel” in regions with a large number of observations

and are “least parallel” for the youngest age groups where observations are noticeably rarer because

young disabled persons are rare in both the Cash and Counseling Demonstration and in society at large.

The posttest regression is usually higher than the pretest regression. But the key assumption in a

pretest RDD is that the difference between the pretest and posttest expenditure-age regression

functions is well described by a simple intercept shift and nothing more. So simple visual inspection of

the RCT control group data in Figure 4 indicates it is reasonable to assume that the pretest and posttest

untreated outcomes have similar functional forms along nearly all of the assignment variable

distribution.

Table 4 reports state RCT estimates of treatment effects at and above the three age cutoffs. The

estimates based on parametric and non-parametric regressions are very similar in all three states at

each cutoff age. There is only one discrepancy. Figure 2 shows that, in New Jersey at age 70, average

spending in the control group is always lower in the nonparametric regressions but is not so in the


polynomial model. But this one discrepancy aside, it is otherwise clear that the causal results are not

much affected by the estimation method.

Table 4 also shows that the age-specific RCT treatment effects have quite large standard errors. This is

to be expected since the RCT was not designed to estimate treatment effects in narrow one-year age

brackets around a cutoff. Fortunately, standard errors are much smaller for average effects across

all participants older than the ages we use as cutoffs in the RDD analysis, especially for the younger age

cutoffs that inevitably entail more cases on the treated side of the assignment variable. For these cutoffs, standard errors for the above-the-cutoff averages are about one-third of those at the cutoff.

Posttest RDD. Table 5 reports results from the within study comparison of the RCT and posttest RDD

estimates at each age cutoff5. The standard errors for the cut-off treatment effect estimates are about

three times larger for the posttest RDD than the RCT. As for bias, the results in table 5 show that

performance of the RDD estimators varies by age cutoff. At 35, both the parametric and nonparametric

analyses produce similar point estimates. But both are biased relative to the RCT benchmarks. For

example, the local linear regression estimates are biased by .45 standard deviations in Arkansas, by .20

SDs in New Jersey, though by only .06 in Florida. The root mean square error statistics (RMSE) also show that

the RDD did not perform very well with the cutoff at age 35. At the older cutoffs the balance of cases

on each side of the cutoff is better, and the posttest RDD performed noticeably better. Local linear

regressions with the age 50 cutoff resulted in bias of .03 SDs in New Jersey, .06 in Florida and .32 in

Arkansas. At age 70, the bias estimates from the local linear regressions were .06 SDs in Arkansas, .01 in

New Jersey, and .08 in Florida. Considering that the RCT estimates are noisily estimated, the RCT and

posttest RDD estimates for the two older age cutoffs are strikingly similar at the cutoff. These results

5 Note that we do not present any estimates of the average treatment effect across all ages above the cutoff for

the posttest RDD analysis because extrapolation beyond the cut-off is not possible with non-parametric methods and is not usually considered trustworthy with polynomial models either. This lack of generalizability is – of course – a key weakness of the standard posttest RDD.


provide another demonstration of the lack of bias in standard RDD, though only at one point -- the

cutoff (Cook et al, 2008; Green et al., 2009; Shadish et al., 2011).

Pretest RDD. The pretest RDD allows us to estimate the average treatment effect not just at the cutoff

but also above it. Estimates of treatment effects at the cutoff are in Table 6 and can be directly

compared to the posttest RDD results in Table 5. At the imbalanced age 35 cutoff where the posttest

RDD fared poorly, the pretest RDD shows bias estimates of .01 SDs in Arkansas and .15 in both New

Jersey and Florida, which compare quite favorably with the .45, .20, and .06 standardized biases in the

posttest RDD analysis. With cutoffs at ages 50 and 70, all the pretest RDD estimates of bias relative to

the RCT were below .10 SDs except for Arkansas at age 50. So in terms of bias at the cutoff, the pretest

RDD does as well as the posttest RDD when cases are balanced and somewhat better when they are not.

The RMSE statistic penalizes an estimator for both bias and high sampling variability. Using it provides

clear and consistent support (see Table 5) for the superiority of the pretest RDD over the posttest RDD

at the cutoff. However, as the same table illustrates, neither set of RMSE statistics is as impressive as the

corresponding RMSE statistics in the benchmark RCT.

Crucial to our causal generalization goal is estimating the average treatment effect among all those

scoring above the assignment cutoff in the RDD. Table 7 presents the relevant results and shows that the

pretest RDD estimators performed well. The flexible partially linear model performed best. Across all

three age cutoffs, its estimates were biased by less than .10 SDs in Arkansas and Florida. But in New

Jersey there is a complicated treatment by age interaction (see Figure 2) and standardized bias was .19

at age 35, .13 at 50, and .12 at 70. With respect to biases in estimates of the average treatment effect

among people above the cut-off, the pretest RDD does very well in Arkansas and Florida where it never

equals even .10 SDs. It also does quite well in New Jersey, where the bias is still less than .20 SDs.


Evaluation of the total impact of extrapolating causal connections also requires consideration of the

standard errors for the above-the-cutoff average effects. They are in Table 7 and are higher in the

pretest RDD than the RCT benchmark. For the partially linear model, the RMSE estimates are about 3 to

5 times higher. However, this difference decreases as the length of the required extrapolation

decreases. Extrapolation is greatest at age 35, and then the RMSE is from 3.5 to 5.6 times higher in the

pretest RDD depending on the state. But for the age 70 cutoff, where the required extrapolation is least,

the RMSE is only 1.5 to 3.1 times higher in the pretest RDD compared to the RCT. Not surprisingly

perhaps, smaller extrapolations entail lower variance. Even so, the best pretest RDD is not as good as

the RCT when both bias and sampling error are considered together.

Conclusions

This paper provides further empirical evidence that the standard posttest RDD is a dependable way of

estimating causal effects for the narrow subpopulation of cases immediately around the cutoff. In a

series of nine within study comparisons based on three states and three age cutoff values, the posttest

RDD provided treatment effect estimates at the cutoff that were generally close to those from the RCT

at this same point. The degree of correspondence improved to near perfect (considering the inevitability

of some sampling variability in the RCT) with older age cutoffs where the number of cases was more

balanced across the cutoff and where the functional form of the underlying regression was more linear

in the region of the cutoff. This empirical comparability between RCT and RDD estimates at the cutoff is

consistent with the results of other within study comparisons (Cook & Wong, 2008; Green et al, 2009;

Shadish et al., 2011).

Despite the strong performance of the posttest RDD, the paper also indicates that adding a pretest RDD

function ameliorates key weaknesses of the standard RDD. Consider statistical power first. Standard

errors of estimated treatment effects were consistently lower when pretest data were incorporated into


RDD analyses. So were RMSE statistics based on joint changes in bias and variance. This result is not

surprising in light of the gain in sample size due to supplementing the posttest data with pretest data.

Consider next confidence in assumptions about functional forms. Causal inference in the standard

posttest RDD depends on how well functional form assumptions are met. Modern non-parametric and

semi-parametric methods mitigate this problem, but they only really solve the problem when studies

have atypically large sample sizes. The present study showed that a pretest RDD offers the opportunity

to observe the untreated regression function in an earlier time period and to compare it to the posttest

regression function along the untreated part of the assignment variable – see Figure 4. Strong

correspondences between the functional forms suggest that pretest information can be used as a partial

validation of what the untreated potential outcomes regression function would have been in the treated

part of the assignment variable had there been no treatment – the missing counterfactual that bedevils

all posttest RDD studies. In the data we presented, the pretest and posttest functional forms were very

similar in the absence of treatment. Had they not been, we would probably not have proceeded with

the analysis. Use of a pretest RDD does not guarantee knowledge of the missing untreated outcomes on

the treated side of the cutoff. A case has to be made for each new application after examining the

untreated functional forms; the encouraging correspondences shown here do not constitute evidence

that pretest RDDs will work in all situations. Even functional forms that are comparable in the untreated

pretest and posttest data do not guarantee they would have continued to be similar in the treated part

of the assignment variable. But without the pretest RDD the analyst would have no evidence at all by

which to judge the suitability of functional form estimates. The pretest provides strong evidence about

comparability of functional forms along at least part of the assignment variable.

The third limitation of the posttest RDD is that treatment effect estimates are limited to the narrow and

often irrelevant sub-population immediately around the cutoff. Our results regarding


extrapolation/generalization are a novel contribution to the literature on quasi-experimental methods.

We used a simple set of tools to make extrapolations that were informed by the pretest functional form

estimates. These extrapolations performed well across different states and different age cutoff values.

In particular, when we used the pretest RDD to estimate the ATT parameter along the entire assignment

variable above the cutoff, very low amounts of standardized bias were obtained in the pretest RDD. That

is, it was successful in generalizing treatment effects to the entire sub-population above the cutoff

rather than just to those at the cutoff -- a task that the conventional posttest RDD cannot accomplish

credibly. So the pretest RDD can help researchers make progress on the difficult problem of generalizing

beyond the sub-population of persons immediately around the cutoff. This has to be done with care, of

course, and success depends on obtaining comparable functional forms in the untreated part of the

assignment variable.

Our results indicate that the standard posttest RDD can be improved by combining it with pretest

outcome data on the same persons who provide the posttest outcome data. In RDD work, design elements (Corrin & Cook, 1998) other than this pretest may also improve functional form estimation, statistical power, and

causal generalization. One possibility involves using repeated cross-sectional samples rather than

longitudinal data, as in Lohr’s study of how the introduction of Medicaid affected the number of doctor

visits (reported in Cook & Campbell, 1979). In that study, household income was the assignment

variable, an income threshold adjusted for family size was the cutoff, and the number of doctor visits in

the year after the introduction of Medicaid was the outcome variable. The supplemental RDD design

element was doctor visits the year before the introduction of Medicaid for a nationally representative

sample of families independent of the next year posttest sample. Thus, there were two national cohorts

instead of longitudinal pretest and posttest data on the same persons, as in the application examined

here. Lohr demonstrated that functional forms were very similar in the untreated part of the assignment


variable at both pretest and posttest, suggesting that this intact cohort-based design supplement would

also perform well beyond the cutoff.

Other RDD supplements that could be considered include contemporaneous but non-equivalent

comparison groups in which the treatment is not offered to some persons. These might come from a

different geographical area or from an institution where the treatment was not available. Depending on the

context, one can imagine supplementing an RDD with data from another city, state, school, or

workplace; perhaps matching on pre-treatment covariates to make the regression functions more

comparable. In the early stages of this study, we explored using non-equivalent comparison groups and

constructed them by pairing the RCT control groups from one state with the non-equivalent comparison

group for another state. For instance, we explored supplementing the standard posttest RDD in

Arkansas with the untreated control group data from New Jersey. We soon decided this strategy was

not viable because, as Figure 4 makes abundantly clear, the regression functions are very different

across the three states. The aim is to achieve RDD supplements that are especially likely to be valid

because they manifest functional forms in the untreated part of the assignment variable that are similar

to those found in the posttest data.

Another kind of supplement to the basic RDD is what Cook and Campbell (1979) have called “non-

equivalent dependent variables” – variables that should be affected by the most plausible alternative

interpretations operating at the cutoff but that are not related to treatment. It is now standard in RDD

analyses to examine variables other than the outcome to show that they do not change at the cutoff.

But at issue with “non-equivalent dependent variables” is the requirement that they should change at

the cutoff if a specific alternative interpretation holds at that cutoff. An example of this comes from

Ludwig and Miller (2007), who showed that spending on other poverty programs did not change discontinuously at the 300th poorest county cutoff – the cutoff for counties to get help in writing their grant

applications for Head Start funds.

Our analysis shows that the standard RDD is a strong design, but that a supplement to it can increase (i) statistical power, (ii) confidence in functional form assumptions, and (iii) causal generalization away from the cutoff value. The application we presented required an RCT in order to demonstrate such unbiased generalization, and no such RCT will typically be available to RDD researchers who use a design supplement. Even so, they will be able to examine empirically how similar the untreated functional forms are below the cutoff. If the functions correspond closely, an attempt at causal generalization is warranted; if they do not, causal generalization entails considerably greater risk.
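As an illustration, the similarity of the untreated pretest and posttest regression functions below the cutoff could be summarized with a simple local linear comparison such as the Python sketch below. The sketch is illustrative only: the particular summary statistic (the largest gap between the two demeaned fitted curves) is not a procedure prescribed in this paper, and the function and variable names are made up for the example.

```python
import numpy as np

def llr_curve(x, y, h, grid):
    """Epanechnikov-weighted local linear fit of y on x, evaluated over a grid."""
    fits = []
    for x0 in grid:
        w = np.maximum(0.0, 0.75 * (1.0 - ((x - x0) / h) ** 2))
        X = np.column_stack([np.ones_like(x), x - x0])
        beta = np.linalg.pinv(X.T @ (w[:, None] * X)) @ (X.T @ (w * y))
        fits.append(beta[0])  # local intercept = fitted mean at x0
    return np.array(fits)

def untreated_form_gap(assign, pre_y, post_y, cutoff, h, n_grid=50):
    """Largest gap between the demeaned pretest and posttest untreated
    regression functions below the cutoff (a level shift is allowed,
    so only differences in shape are penalized)."""
    below = assign < cutoff
    grid = np.linspace(assign[below].min(), cutoff, n_grid)
    pre_fit = llr_curve(assign[below], pre_y[below], h, grid)
    post_fit = llr_curve(assign[below], post_y[below], h, grid)
    return float(np.max(np.abs((pre_fit - pre_fit.mean()) -
                               (post_fit - post_fit.mean()))))
```

A small gap suggests that the supplemental untreated data track the posttest regression function closely below the cutoff; a large gap signals that extrapolation beyond the cutoff carries greater risk.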


Table 1: Descriptive statistics for the variables and samples forming the within-study comparisons.

                                        Arkansas                 Florida                New Jersey
Variable                           Control   Treatment     Control   Treatment     Control   Treatment
Pretest Medicaid Expenditures       $6,358      $6,439     $14,300     $14,377     $18,779     $18,215
Posttest Medicaid Expenditures      $7,583      $9,443     $18,088     $19,944     $20,100     $21,299
Mean Age                                70          70          55          55          62          63
N                                    1,004       1,004         906         907         869         861

Table 2: Sample Sizes in Nine Constructed Posttest-Only Regression Discontinuity Designs

State Age Cutoff Below The Cutoff Above the Cutoff Total

Arkansas 35 59 944 1003

Florida 35 296 609 905

New Jersey 35 106 770 876

Arkansas 50 143 868 1011

Florida 50 417 496 913

New Jersey 50 224 650 874

Arkansas 70 361 623 984

Florida 70 555 359 914

New Jersey 70 491 387 878


Table 3: Sample Sizes in Nine Constructed Pretest Regression Discontinuity Designs

State Age Cutoff Below The Cutoff Above the Cutoff Total

Arkansas 35 118 1888 2006

Florida 35 592 1218 1810

New Jersey 35 212 1540 1752

Arkansas 50 286 1736 2022

Florida 50 834 992 1826

New Jersey 50 448 1300 1748

Arkansas 70 722 1246 1968

Florida 70 1110 718 1828

New Jersey 70 982 774 1756


Table 4: Estimated Treatment Effects At and Above the Cutoff Value in the RCT (Benchmark) Data

State        Model       Estimation   Cutoff   Point Est.  Boot. Bias    SE    RMSE    Point Est.  Boot. Bias    SE    RMSE
Arkansas     Benchmark   Parametric     35        2980         16      1334    1334       1703         14       258     258
New Jersey   Benchmark   Parametric     35        3202        -69      1467    1469        622         30       717     718
Florida      Benchmark   Parametric     35        3329         35      1590    1590        788         29       728     729
Arkansas     Benchmark   LLR            35        2738         39      1330    1331       1772          1       256     256
New Jersey   Benchmark   LLR            35        3529         48      1594    1594        755         -3       724     724
Florida      Benchmark   LLR            35        3456          3      1139    1139        741        -17       562     562
Arkansas     Benchmark   Parametric     50        1467         24       616     616       1671         14       250     250
New Jersey   Benchmark   Parametric     50         822         10      1155    1155        387         41       726     728
Florida      Benchmark   Parametric     50        2034         32       978     978        357         28       795     795
Arkansas     Benchmark   LLR            50        1547         73       884     887       1752         -5       239     239
New Jersey   Benchmark   LLR            50         611         13      1235    1235        529         -8       727     727
Florida      Benchmark   LLR            50        2191         -9       774     774        224        -21       604     604
Arkansas     Benchmark   Parametric     70        1134         21       410     410       1872         10       249     249
New Jersey   Benchmark   Parametric     70         -84         52       901     902        562         43       785     786
Florida      Benchmark   Parametric     70         739         29       732     732          0         27       901     901
Arkansas     Benchmark   LLR            70        1294         -1       335     335       1934        -17       237     238
New Jersey   Benchmark   LLR            70         466        -20       788     788        679        -16       843     843
Florida      Benchmark   LLR            70         625        -20       568     569       -180        -23       677     677

Note: Point Est. = point estimate; Boot. Bias = bootstrap bias. The first set of four numeric columns gives the average treatment effect at the cutoff; the second set gives the average treatment effect above the cutoff.


Table 5: Performance of the Posttest-Only RDD at the Cutoff

State        Model   Estimation   Cutoff   Point Est.  Boot. Bias    SE    Bias/Poly   Bias/LLR   RMSE/Poly   RMSE/LLR
Arkansas     RDD     Parametric     35        6694         60       2583      0.49        0.52       4573        4775
New Jersey   RDD     Parametric     35        8173       -114       2338      0.24        0.22       5391        5098
Florida      RDD     Parametric     35         890        103       3715     -0.12       -0.13       4388        4458
Arkansas     RDD     LLR            35        6121       -161       3563      0.42        0.45       4645        4804
New Jersey   RDD     LLR            35        7712        158       5010      0.22        0.20       6848        6629
Florida      RDD     LLR            35        2342         23       4159     -0.05       -0.06       4269        4299
Arkansas     RDD     Parametric     50        1786        -58       1267      0.04        0.03       1293        1280
New Jersey   RDD     Parametric     50       -1356       -157       3676     -0.10       -0.09       4355        4245
Florida      RDD     Parametric     50        6969         79       3359      0.25        0.24       6035        5905
Arkansas     RDD     LLR            50        3966          3       2300      0.33        0.32       3398        3340
New Jersey   RDD     LLR            50         -69       -134       3689     -0.04       -0.03       3829        3778
Florida      RDD     LLR            50        3284        136       4705      0.06        0.06       4905        4863
Arkansas     RDD     Parametric     70         919         37        894     -0.03       -0.05        911         956
New Jersey   RDD     Parametric     70         832       -117       2322      0.04        0.02       2455        2335
Florida      RDD     Parametric     70        -717        -84       1785     -0.07       -0.07       2357        2284
Arkansas     RDD     LLR            70         830         -4        857     -0.04       -0.06        910         976
New Jersey   RDD     LLR            70         578        -57       2377      0.03        0.01       2453        2378
Florida      RDD     LLR            70        2207       -228       2642      0.07        0.08       2918        2968

Note: Bias/Poly and Bias/LLR give bias in standard deviation units relative to the polynomial and LLR versions of the RCT benchmark; RMSE/Poly and RMSE/LLR give the RMSE relative to the same benchmarks.


Table 6: Performance at the Cutoff when the Pretest RDD is Added to the Posttest-Only RDD

State        Model         Estimation   Cutoff   Point Est.  Boot. Bias    SE    Bias/Poly   Bias/LLR   RMSE/Poly   RMSE/LLR
Arkansas     Pretest RDD   Parametric     35        2423        -88       2949     -0.07       -0.04       3019        2822
New Jersey   Pretest RDD   Parametric     35        7583        105       2120      0.21        0.19       4962        2176
Florida      Pretest RDD   Parametric     35        -969         96       3313     -0.22       -0.22       5351        3407
Arkansas     Pretest RDD   LLR            35        2825         75       2114     -0.02        0.01       2116        2150
New Jersey   Pretest RDD   LLR            35        6724        -49       1634      0.17        0.15       3839        1546
Florida      Pretest RDD   LLR            35         424        152       3226     -0.15       -0.15       4241        3338
Arkansas     Pretest RDD   Parametric     50        8970        0.3        941      0.99        0.98       7563         869
New Jersey   Pretest RDD   Parametric     50        3128        -36       2292      0.11        0.12       3226        2243
Florida      Pretest RDD   Parametric     50        6309         -8       2743      0.22        0.21       5073        2744
Arkansas     Pretest RDD   LLR            50        5367        -56       2018      0.52        0.51       4341        1889
New Jersey   Pretest RDD   LLR            50        1678        106       2662      0.04        0.05       2830        2695
Florida      Pretest RDD   LLR            50        1333       -124       3943     -0.04       -0.04       4028        3746
Arkansas     Pretest RDD   Parametric     70        1611          5        536      0.06        0.04        721         542
New Jersey   Pretest RDD   Parametric     70         766        -57       1937      0.04        0.01       2093        1900
Florida      Pretest RDD   Parametric     70       -2555       -140       1387     -0.17       -0.16       3703        1267
Arkansas     Pretest RDD   LLR            70        1834         11        521      0.09        0.07        881         532
New Jersey   Pretest RDD   LLR            70        2187         31       1645      0.11        0.08       2829        1676
Florida      Pretest RDD   LLR            70         316        -27       2223     -0.02       -0.02       2268        2196

Note: Bias/Poly and Bias/LLR give bias in standard deviation units relative to the polynomial and LLR versions of the RCT benchmark; RMSE/Poly and RMSE/LLR give the RMSE relative to the same benchmarks.


Table 7: Performance above the Cutoff when the Pretest RDD is Added to the Posttest-Only RDD

State        Model         Estimation   Cutoff   Point Est.  Boot. Bias    SE    Bias/Poly   Bias/LLR   RMSE/Poly   RMSE/LLR
Arkansas     Pretest RDD   Parametric     35        3103        -83       1312      0.19        0.18       1859        1811
New Jersey   Pretest RDD   Parametric     35        4736          7       1394      0.20        0.19       4351        4225
Florida      Pretest RDD   Parametric     35       -1369        -31        841     -0.11       -0.11       2344        2301
Arkansas     Pretest RDD   LLR            35        2271         -9       1096      0.08        0.07       1230        1201
New Jersey   Pretest RDD   LLR            35        4661        -25       1181      0.19        0.19       4184        4056
Florida      Pretest RDD   LLR            35       -1110         21        704     -0.10       -0.09       2004        1961
Arkansas     Pretest RDD   Parametric     50        2508        -13        877      0.11        0.10       1203        1149
New Jersey   Pretest RDD   Parametric     50        2347         73       1176      0.09        0.09       2349        2227
Florida      Pretest RDD   Parametric     50       -1446         31        738     -0.09       -0.08       1920        1798
Arkansas     Pretest RDD   LLR            50        2155         44        709      0.06        0.05        884         838
New Jersey   Pretest RDD   LLR            50        3224         15       1002      0.14        0.13       3023        2890
Florida      Pretest RDD   LLR            50       -1360        -24        704     -0.09       -0.08       1878        1755
Arkansas     Pretest RDD   Parametric     70        2529         18        469      0.09        0.08        822         771
New Jersey   Pretest RDD   Parametric     70        3540         -5        933      0.14        0.14       3116        3004
Florida      Pretest RDD   Parametric     70       -1028       -130        891     -0.05       -0.04       1461        1323
Arkansas     Pretest RDD   LLR            70        2237          5        449      0.05        0.04        582         544
New Jersey   Pretest RDD   LLR            70        3114         56        886      0.12        0.12       2754        2644
Florida      Pretest RDD   LLR            70        -970         32        640     -0.05       -0.04       1136         992

Note: Bias/Poly and Bias/LLR give bias in standard deviation units relative to the polynomial and LLR versions of the RCT benchmark; RMSE/Poly and RMSE/LLR give the RMSE relative to the same benchmarks.


Figure 1: Average Posttest Medicaid Expenditures By Age For Treatment and Control Participants In Arkansas (RCT Benchmark)

[Plot of posttest Medicaid expenditures (roughly $5,000 to $25,000) against age (20 to 100); panel title: Arkansas: Parametric vs Non-parametric.]


Figure 2: Average Posttest Medicaid Expenditures By Age For Treatment And Control Participants In New Jersey (RCT Benchmark)

[Plot of posttest Medicaid expenditures (roughly $16,000 to $24,000) against age (20 to 100); panel title: New Jersey: Parametric vs Non-parametric.]


Figure 3: Average Posttest Medicaid Expenditures By Age For Treatment And Control Participants In Florida (RCT Benchmark)

[Plot of posttest Medicaid expenditures (roughly $10,000 to $30,000) against age (20 to 100); panel title: Florida: Parametric vs Non-parametric.]


Figure 4: Local Linear Regression Estimates of Mean Pretest and Posttest Medicaid Expenditures By Age In The Experimental Control Groups

[Three panels (Arkansas, New Jersey, Florida) plotting control-group Medicaid expenditures against age (20 to 90); series: Pretest Expenditures (Control Group) and Posttest Expenditures (Control Group); overall title: Pretest and Posttest Medicaid Expenditure In The Experimental Control Groups.]


Appendix 1: Selection of Bandwidths and Polynomial Series Lengths

Overview

The models we used to compute treatment effects required the selection of smoothing parameters. The local linear regressions and partially linear regressions required a bandwidth parameter that determines the width of the Epanechnikov weighting function. And the polynomial series models required a choice of the polynomial series used to approximate the unknown underlying function.

For both types of models, we used leave-one-out least squares cross validation to guide our choice of the smoothing parameters. The participants in the Cash and Counseling Demonstration ranged in age from 18 to 90+. We used cross validation to evaluate a grid of candidate bandwidths from 1 year to 90 years in 1-year increments. For the polynomial models we considered a grid of possible functions for the treatment and control data: linear, quadratic, cubic, and quartic. In the posttest RDD we expanded the grid to include functions that were fully interacted with the cutoff indicator variable.

When choosing a bandwidth for the local linear and partially linear regressions, we often found that cross validation led to an under-smoothed function: the function appeared excessively jagged when graphed. In these cases, we increased the bandwidth slightly until the estimated conditional mean function was smooth. This appendix describes the bandwidths and functional forms used for each model in our analysis.
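For concreteness, the following Python sketch illustrates the kind of leave-one-out cross validation search used for the local linear regressions. It is an illustration only: the function and variable names are invented for the example, and it is not the code used in the analysis. As described above, the selected bandwidth would then be inspected graphically and increased if the fitted function was under-smoothed.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel; zero outside |u| <= 1."""
    return np.where(np.abs(u) <= 1, 0.75 * (1.0 - u ** 2), 0.0)

def llr_predict(x0, x, y, h):
    """Local linear regression prediction at x0 with bandwidth h."""
    w = epanechnikov((x - x0) / h)
    if w.sum() <= 0:
        return np.nan
    X = np.column_stack([np.ones_like(x), x - x0])
    beta = np.linalg.pinv(X.T @ (w[:, None] * X)) @ (X.T @ (w * y))
    return beta[0]  # local intercept = fitted conditional mean at x0

def loo_cv_error(x, y, h):
    """Leave-one-out squared prediction error for bandwidth h."""
    n = len(x)
    errs = []
    for i in range(n):
        keep = np.arange(n) != i
        fit = llr_predict(x[i], x[keep], y[keep], h)
        if np.isfinite(fit):
            errs.append((y[i] - fit) ** 2)
    return np.mean(errs)

def select_bandwidth(x, y, grid=range(1, 91)):
    """Search candidate bandwidths from 1 to 90 years in 1-year increments."""
    return min(grid, key=lambda h: loo_cv_error(x, y, float(h)))
```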

RCT Benchmarks

Bandwidth Selection. To estimate the age-specific and above-the-cutoff average treatment effects using the benchmark RCTs, we estimated local linear regressions of Medicaid expenditures on age separately for the treatment and control groups in each state.

In Arkansas, the cross validation exercise implied an optimal bandwidth of 11 years for both the

treatment and control groups. But this bandwidth led to a very under-smoothed function and so we

increased the bandwidth to 20 years, which produced a smooth regression but did not change the basic

shape of the regression very much.

In New Jersey, the cross validation led to bandwidths of 13 years for the control group and 9 years for

the treatment group. These were both under-smoothed. In the analysis we worked with a bandwidth of

25 years for the treatment and control groups in the New Jersey data.

In Florida, the cross validation led to a bandwidth of 90 years for the treatment and control groups. With such a large bandwidth the functions are almost perfectly linear. To ensure that the functions were not over-smoothed, we experimented with smaller bandwidths. The estimated function remained essentially linear even with much smaller bandwidths, and so we worked with the 90-year bandwidth in our analysis.

Polynomial Selection. To choose functional forms for the polynomial series version of the benchmark model, we used cross validation to select a model from a set of eight candidates: an intercept shift for the treatment group combined with a linear, quadratic, cubic, or quartic function of age, each fit either with or without full interactions between the age terms and the treatment indicator. The cross validation exercise led us to work with the interacted quadratic models in Arkansas and New Jersey and the interacted linear model in Florida.
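The analogous search over polynomial candidates can be sketched as follows. This is an illustrative rendering of the procedure just described (linear through quartic terms in age, with and without full treatment interactions, compared by leave-one-out cross validation), not the analysis code itself; the helper names are invented for the example.

```python
import numpy as np

def design_matrix(age, treat, degree, interacted):
    """Intercept shift for the treatment group plus a polynomial in age,
    optionally fully interacted with the treatment indicator."""
    cols = [np.ones_like(age), treat.astype(float)]
    for d in range(1, degree + 1):
        cols.append(age ** d)
        if interacted:
            cols.append(treat * age ** d)
    return np.column_stack(cols)

def loo_cv_error(X, y):
    """Leave-one-out squared error for OLS via the hat-matrix shortcut."""
    H = X @ np.linalg.pinv(X.T @ X) @ X.T
    resid = y - H @ y
    return np.mean((resid / (1.0 - np.diag(H))) ** 2)

def select_polynomial(age, treat, y):
    """Search linear through quartic, with and without full interactions."""
    candidates = [(d, inter) for d in (1, 2, 3, 4) for inter in (False, True)]
    return min(candidates,
               key=lambda c: loo_cv_error(design_matrix(age, treat, *c), y))
```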

Posttest RDD

Bandwidth Selection. In the posttest RDD analysis, we selected bandwidths for each state and cutoff separately, and we allowed for different bandwidths below and above the cutoff. We used cross validation followed by visual inspection to choose the bandwidths for the analysis.

In Arkansas, the cross validation led to bandwidths of 90 years above and below the cutoff for the age 70 cutoff design. For the age 50 design, the cross validation led to bandwidths of 90 years above the cutoff and 2 years below the cutoff. The 2-year bandwidth led to under-smoothing below the cutoff, and so in the empirical work we used a bandwidth of 90 years above the cutoff and 20 years below the cutoff. Finally, in the age 35 design, we used the cross validation bandwidths of 90 years above the cutoff and 19 years below the cutoff.

In New Jersey, for the age 70 cutoff, the cross validation bandwidths were 13 years above the cutoff and 9 years below the cutoff. These were under-smoothed, and in the empirical work we used a bandwidth of 19 years above the cutoff and 13 years below the cutoff. In the age 50 design we used the cross validation bandwidths of 90 years above the cutoff and 15 years below the cutoff. Finally, in the age 35 cutoff design, we used the cross validation bandwidths of 14 years above the cutoff and 90 years below the cutoff.

In Florida, for the age 70 cutoff value, we used the cross validation bandwidths of 90 years above the cutoff and 11 years below the cutoff. For the age 50 design we used the cross validation bandwidth of 90 years above the cutoff; the cross validation bandwidth below the cutoff was 4 years, but this was under-smoothed and so we used a bandwidth of 11 years below the cutoff. For the age 35 design we used the cross validation bandwidths of 90 years above the cutoff and 17 years below the cutoff.

Polynomial Selection. We used cross validation to choose the polynomial specifications, following the same basic method used for the benchmark analysis.


In Arkansas, we used a quadratic model with an intercept shift at the cutoff for the age 35 design, a

linear model with an intercept shift for the age 50 design, and a linear model with an intercept shift for

the age 70 design.

In New Jersey, we used a linear model with an intercept shift for the age 35 design, a quartic model with

an intercept shift for the age 50 design, and a quartic model with an intercept shift for the age 70

design.

In Florida, we used a quartic model with an intercept shift for the age 35 design, a cubic model with an

intercept shift for the age 50 design, and an interacted linear model for the age 70 design.

Adding the Pretest RDD

Bandwidth Selection: Partially Linear Model. The partially linear model approach to the pretest RDD required four bandwidth parameters. First, we needed a bandwidth for the treated observations above the cutoff, and for this we used the same bandwidth used above the cutoff in the posttest RDD. Then, for the untreated pretest and posttest observations, we needed a bandwidth for the first-step pretest indicator function, a bandwidth for the first-step expenditure function, and a bandwidth for the residual function. In each case, we selected the bandwidths by first using cross validation and then adjusting the bandwidth manually to compensate for any under-smoothing. Here we discuss each bandwidth and design in sequence.
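The logic of the first-step regressions can be illustrated with a Robinson-style partialling-out sketch. The sketch assumes the untreated model is an additive posttest shift plus an unspecified function of age, and it omits the further smoothing step to which the residual-function bandwidth applies; it is a simplified illustration with invented names, not the exact estimator implemented in our analysis.

```python
import numpy as np

def llr_fit(x0, x, y, h):
    """Epanechnikov-weighted local linear fit of y on x, evaluated at x0."""
    w = np.maximum(0.0, 0.75 * (1.0 - ((x - x0) / h) ** 2))
    X = np.column_stack([np.ones_like(x), x - x0])
    beta = np.linalg.pinv(X.T @ (w[:, None] * X)) @ (X.T @ (w * y))
    return beta[0]

def partially_linear_post_shift(age, post, y, h_post, h_y):
    """Residual-on-residual (Robinson-type) estimate of the posttest shift
    in the untreated data, assuming y = shift * post + g(age) + error with
    g(age) left unspecified; post is a 0/1 posttest-period indicator."""
    post = post.astype(float)
    post_hat = np.array([llr_fit(a, age, post, h_post) for a in age])  # E[post | age]
    y_hat = np.array([llr_fit(a, age, y, h_y) for a in age])           # E[y | age]
    u, v = post - post_hat, y - y_hat
    return float(np.sum(u * v) / np.sum(u * u))
```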

We used a pretest indicator bandwidth of 90 years for all three age cutoff designs and all three states. This was the cross validation selection, and it produced an essentially linear function that held up in sensitivity analyses.

For the Medicaid expenditures bandwidth in Arkansas, we used a bandwidth of 90 years for the age 35

and age 50 cutoffs, and we used a bandwidth of 21 years for the age 70 design. In New Jersey the cross

validation led to a bandwidth of 1 year for all three age cutoffs. But this led to under-smoothed

functions and so we worked with a bandwidth of 10 years for the age 35 and age 50 designs and of 13

years for the age 70 design. In Florida, we used a bandwidth of 90 years for the age 50 and age 70

designs. For the age 35 design, the cross validation bandwidth was 2 years, but this was very under-smoothed and we increased the bandwidth to 20 years for the analysis.

Finally, for the residualized expenditures bandwidth, in Arkansas, we used the cross-validation

bandwidths of 13 for the age 35 design, 90 for the age 50 design, and 11 for the age 70 design. In New

Jersey, for the age 35 design, the cross validation bandwidth was 9 but this was under-smoothed and we

used a bandwidth of 18 in the empirical analysis. The cross validation minimum bandwidth was 7 years

for the New Jersey age 50 design but this was under-smoothed and we increased it to 15 for the

empirical work. For the New Jersey age 70 design we used the cross validation selection of 13 years.

Finally, in Florida we used the cross validation bandwidths of 66 years for the age 35 design, 66 years for the age 50 design, and 67 years for the age 70 design.


Polynomial Selection when the Pretest RDD is Added. To estimate the pretest RDD, we fit a regression of expenditures on a polynomial series in age among the treated samples (above the cutoff, posttest period) for each state. We then fit a regression of expenditures on a polynomial in age and a posttest dummy variable using the untreated samples (pretest data at all ages plus posttest data from below the cutoff) for each state. We used cross validation to select the order of the polynomial series for each model.
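A stylized rendering of these two regressions, and of one way their fits could be combined into effects at ages above the cutoff, is given below. The combination step (treated fit minus the untreated polynomial shifted by the posttest dummy coefficient) reflects a plausible reading of the design rather than a line-for-line reproduction of the analysis code, and the function names are invented for the example.

```python
import numpy as np

def poly_design(age, degree):
    """Columns 1, age, age^2, ..., age^degree."""
    return np.column_stack([age ** d for d in range(degree + 1)])

def fit_ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def pretest_rdd_effects(age_t, y_t, age_u, post_u, y_u, eval_ages,
                        deg_treated, deg_untreated):
    """Treated regression (above the cutoff, posttest period) minus the
    untreated regression (pretest data at all ages plus posttest data from
    below the cutoff), shifted by the estimated posttest dummy coefficient."""
    b_t = fit_ols(poly_design(age_t, deg_treated), y_t)
    X_u = np.column_stack([poly_design(age_u, deg_untreated),
                           post_u.astype(float)])
    b_u = fit_ols(X_u, y_u)
    treated_fit = poly_design(eval_ages, deg_treated) @ b_t
    counterfactual = poly_design(eval_ages, deg_untreated) @ b_u[:-1] + b_u[-1]
    return treated_fit - counterfactual
```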

In Arkansas, for the untreated samples, we used a quartic model in the age 35 design, a linear model for

the age 50 design, and a quartic model for the age 70 design. For the treated samples in Arkansas, we

used a quartic model for the age 35 design, a cubic model for the age 50 design, and a linear model for

the age 70 design.

In New Jersey, for the untreated samples, we used a quartic for all three designs. For the treated

samples in New Jersey, we used a linear model for the age 35 design, a quadratic model for the age 50

design, and a quartic model for the age 70 design.

In Florida, for the untreated samples, we used a linear model in the age 35 design, a quadratic model for

the age 50 design, and a linear model for the age 70 design. For the treated samples in Florida, we used

a cubic model for the age 35 design, a quadratic model for the age 50 design, and a linear model for the

age 70 design.


References

Brown, R. S., & Dale, S. B. (2007). The research design and methodological issues for the Cash and Counseling evaluation. Health Services Research, 42(1p2), 414-445.

Carlson, B. L., Foster, L., Dale, S. B., & Brown, R. (2007). Effects of Cash and Counseling on Personal Care and Well‐Being. Health Services Research, 42(1p2), 467-487.

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis for field settings. Chicago, IL: Rand McNally.

Cook, T. D., Shadish, W. R., & Wong, V. C. (2008). Three conditions under which experiments and observational studies often produce comparable causal estimates: New findings from within-study comparisons. Journal of Policy Analysis and Management, 27(4), 724-750.

Cook, T. D., & Steiner, P. M. (2010). Case Matching and the Reduction of Selection Bias in Quasi-Experiments: The Relative Importance of Covariate Choice, Unreliable Measurement and Mode of Data Analysis. Psychological Methods, 1(1), 56-68.

Cook, T. D., & Wong, V. C. (2008). Empirical tests of the validity of the regression discontinuity design. Annals of Economics and Statistics/Annales d'Economie et de Statistique, (91/92), 127-150.

Corrin, W., & Cook, T. (1998). Design elements of quasi-experimentation. Advances in Educational Productivity, 7, 35-57.

Dale, S. B., & Brown, R. S. (2007). How does Cash and Counseling affect costs? Health Services Research, 42(1p2), 488-509.

Doty, P., Mahoney, K. J., & Simon‐Rusinowitz, L. (2007). Designing the cash and counseling demonstration and evaluation. Health Services Research, 42(1p2), 378-396.

Goldberger, A. S. (1972a). Selection bias in evaluating treatment effects: Some formal illustrations. Institute for Research on Poverty. Madison, WI.

Goldberger, A. S. (1972b). Selection bias in evaluating treatment effects: The case of interaction. Institute for Research on Poverty. Madison, WI.

Green, D. P., Leong, T. Y., Kern, H. L., Gerber, A. S., & Larimer, C. W. (2009). Testing the accuracy of regression discontinuity analysis using experimental benchmarks. Political Analysis, 17(4), 400.

Hahn, J., Todd, P., & van der Klaauw, W. (2001). Identification and estimation of treatment effects with a regression-discontinuity design. Econometrica, 69(1), 201-209.

LaLonde, R. (1986). Evaluating the econometric evaluations of training with experimental data. The American Economic Review, 76(4), 604-620.

Lee, D. S. (2008). Randomized experiments from non-random selection in U.S. House elections. Journal of Econometrics, 142(2), 675-697.

Lee, D. S., & Card, D. (2008). Regression discontinuity inference with specification error. Journal of Econometrics, 142(2), 655-674.

Ludwig, J., & Miller, D. L. (2007). Does Head Start improve children's life chances? Evidence from a regression discontinuity design. The Quarterly Journal of Economics, 122(1), 159-208.

Porter, J. (2003) Estimation in the regression discontinuity model. Mimeo. Department of Economics, University of Wisconsin.

Rubin, D. B. (2008). For objective causal inference, design trumps analysis. The Annals of Applied Statistics, 808-840.

Schochet, P. Z. (2009). Statistical power for regression discontinuity designs in education evaluations. Journal of Educational and Behavioral Statistics, 34(2), 238-266.

Shadish, W., Galindo, R., Wong, V., Steiner, P., & Cook, T. (2011). A randomized experiment comparing random and cutoff-based assignment. Psychological Methods.

Thistlethwaite, D. L., & Campbell, D. T. (1960). Regression-discontinuity analysis: An alternative to the ex post facto experiment. Journal of Educational Psychology, 51, 309-317.
