Propensity Score


Overview:
What do we use a propensity score for?
How do we construct the propensity score?
How do we implement propensity score estimation in STATA?

Joke (kind of...)
Two heart surgeons (Jack and Jill) walk into a bar.
Jack: "I just finished my 100th heart surgery!"
Jill: "I finished my 100th heart surgery last week. Which probably means I'm a better heart surgeon. How many of your patients died within 3 months of surgery? I've only had 10 die."
Jack: "Five. So I'm probably the better surgeon."
Jill: "Or maybe mine are older and have a higher risk than your patients."
There may be differences in the patients' characteristics between Jack and Jill. We want to show the difference due to treatment (Jill). We want to compare apples to apples, not apples to oranges.

There may be important differences in the patients' characteristics between the treatment group (Jill) and the control group (Jack). We want to show the difference in outcome that is attributable to treatment, not to the fact that we are comparing apples to oranges.

Purpose of propensity scores
It can produce apples-to-apples comparisons when treatment is non-random (non-ignorable treatment assignment).
It provides a way to summarize covariate information about treatment selection into a single number (a scalar).
It can be used to adjust for differences via study design (matching) or during estimation of the treatment effect (e.g., subclassification or regression).

Propensity score estimation: some caveats
This is only relevant for selection on observables.
If you cannot write down a conditioning strategy such that conditioning on X will satisfy the backdoor criterion, then this is not the research design to choose.
You need to identify the confounders, X, that will block all back doors based on economic theory, and you will need data on them.

Better example: a case in which the propensity score is useful for causal inference
Suppose that we are interested in whether a scholarship program caused children to spend more years in high school (grades 9-12).
Suppose every 8th grade graduate is eligible for this program.
You have data on every child, including test scores, family income, age, gender, etc.
Scholarships are awarded based on some combination of test scores, family income, gender, etc., but you don't know the exact formula.

Say that we are interested in evaluating a scholarship program's effect on the number of years children attend school at the secondary level. Furthermore, assume that every primary school graduate is eligible for this program, and that we have data on every primary school graduate, including their test scores, family income, and demographics. But we don't know the precise formula that was used to place primary school graduates into this program.

Motivation (cont.)
Ignorable treatment assignment: scholarships are assigned to students randomly, independent of how a student is expected to perform in high school.
In that case, calculate the ATE by estimating the simple difference in mean outcomes:
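The equation on the slide did not survive extraction; a standard reconstruction of the simple difference in mean outcomes is

\[ \widehat{\mathrm{ATE}} = \frac{1}{N_1}\sum_{i:\,D_i=1} Y_i \;-\; \frac{1}{N_0}\sum_{i:\,D_i=0} Y_i \]

where D_i indicates receiving the scholarship and Y_i is years of schooling.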

But what if ignorability is violated?
For instance, assume you know that children with higher test scores are more likely to get the scholarship (positive selection), but you don't know how important this and other factors are; you just know that the decision is based on information you have (X) and some randomness. What can you do with this information?

Motivation (cont.)
In principle, you could estimate the effect using OLS, controlling for X:
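The regression itself was an image on the slide; a standard form consistent with the surrounding discussion is

\[ Y_i = \alpha + \delta D_i + X_i'\beta + \varepsilon_i \]

where \delta is the effect of the scholarship and X_i collects the covariates.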

Here X is a matrix of covariates that you think affect the probability of receiving a scholarship. OLS consistently estimates the conditional mean, but if the probability of getting a scholarship is not a linear function of X, this conditional mean estimate may not be informative. Usually, we won't know how the selection depended on X, only that it did. For instance, the administrators may use discrete cutoffs rather than a linear function.

Motivation (cont.)
Suppose your variables are not continuous but categorical (somewhat arbitrarily): e.g., family income above or below $50 per week, scores above or below the mean, sex, age, etc. Now you could put in dummy variables for each category and interactions between all dummies; this would distinguish every group formed by the categories. Or you could run separate regressions for each group, which is more flexible since it allows the effect of the scholarship to differ by group. These methods are in principle correct, but they are only feasible if you have a lot of data and few categories.

Constructing the Propensity Score
Estimation of average treatment effects based on propensity scores can handle sparseness and ignorance about the functional form associated with treatment assignment. You first need selection into the treatment (in our case, the scholarship) that is based on observables: "selection on observables." The following gives a brief overview of how the propensity score is constructed. In practice, you can download a canned Stata command that will do all of this for you.

Definition and General Idea
Definition: the propensity score is the conditional probability of being assigned to the treatment group (e.g., the grade 9-12 scholarship), conditional on the particular covariates X. Pr(D=1|X) is some probability (e.g., 55%).
The idea is to compare units who, based solely on their observables, had very similar probabilities of being placed into treatment. If, conditional on X, two units have a similar probability of treatment, then we say they have similar propensity scores. We then attribute the difference in the outcome variable to the treatment: if we compare a treatment group unit to a control group unit with a similar propensity score, then conditional on the propensity score, all remaining variation between the two is randomness, provided selection is on observables.
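In symbols, the definition just given is

\[ p(X_i) \equiv \Pr(D_i = 1 \mid X_i) \]

a single number between 0 and 1 for each unit.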

First stage
Estimation using this method is a two-stage procedure.
First stage: estimate the propensity score.
Second stage: calculate the average causal effect of interest by averaging differences in outcomes over units with similar propensity scores.
To estimate the propensity score, first estimate the following equation, with the binary treatment (D) on the LHS and the covariates (X) that determine selection into treatment on the RHS, using a logit or probit model:
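The equation was an image on the slide; the standard first-stage model is

\[ \Pr(D_i = 1 \mid X_i) = F(X_i'\beta) \]

where F is the logistic CDF (logit) or the standard normal CDF (probit).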

Second, using the estimated coefficients, calculate the predicted value of the LHS.

The propensity score is just the predicted conditional probability of treatment (using the estimated coefficients on X) for each unit.
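In Stata, this first stage can be done by hand in two commands; a minimal sketch, with treat and x1-x3 as placeholder variable names:

logit treat x1 x2 x3
predict pscore_hat, pr

The pr option stores the predicted probability Pr(treat = 1 | X), i.e., the estimated propensity score, in pscore_hat; substitute probit for logit if preferred.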

You start with a simple specification (e.g., just linear terms). Then you follow the following algorithm to decide whether this specification is good enough.

Algorithm
Sort your data by the propensity score and divide it into blocks (groups) of observations with similar propensity scores. Within each block, test (using a t-test) whether the means of the covariates are equal in the treatment and control group. If so, stop: you're done with the first stage.

If a particular block has one or more unbalanced covariates, divide that block into finer blocks and re-evaluate. If a particular covariate is unbalanced for multiple blocks, modify the initial logit or probit equation by including higher-order terms and/or interactions with that covariate and start again.
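The within-block balance test maps directly onto Stata's ttest command; a sketch assuming a hypothetical block identifier block and covariate x1:

ttest x1 if block == 1, by(treat)

This compares the mean of x1 across treatment and control within block 1; repeat for each covariate and each block.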

Second Stage
In the second stage, we look at the effect of the treatment on the outcome (in our example, the effect of getting the scholarship on years of schooling), using the propensity score.

Once you have determined your propensity score with the procedure above, there are several ways to use it. I'll present two of them (with a canned Stata version for both):

Stratifying on the propensity score
Divide the data into blocks based on the propensity score (blocks are determined with the algorithm above). Run the second stage regression within each block. Calculate the weighted mean of the within-block estimates to get the average treatment effect.
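With blocks indexed by k, the weighted mean takes the familiar form (a reconstruction; the slide shows no formula):

\[ \widehat{\mathrm{ATE}} = \sum_k \frac{N_k}{N}\, \hat{\delta}_k \]

where \hat{\delta}_k is the within-block estimate and N_k the number of observations in block k.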

Matching on the propensity score
Match each treatment observation with one or more control observations, based on similar propensity scores. You then include a dummy for each matched group, which controls for everything that is common within that group.
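One way to include a dummy for each matched group without creating the dummies by hand is a fixed-effects regression that absorbs them; a sketch assuming a hypothetical match_id variable labeling the matched groups:

areg y treat, absorb(match_id)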

Balancing within blocks
Sort the data by the propensity score.
Divide the data into groups, called blocks, that have similar propensity scores (e.g., 0.001 to 0.10, 0.10 to 0.20, etc.).
For each block, test whether the means of the covariates are equal for treatment and control using a t-test.
If they are, you are done with the first stage.
If a particular block has one or more unbalanced covariates (X), divide that block into finer blocks and re-evaluate.
If a particular covariate is unbalanced for multiple blocks, modify the initial logit or probit equation by including higher-order terms and/or interactions with that covariate and start again.

Implementation in STATA
There are multiple methods for estimating the propensity score.
Download psmatch2 from SSC:
ssc install psmatch2, replace
First stage:
pscore treat X1 X2 X3, pscore(scorename)
Second stage, attr (for matching) or atts (for stratifying):
attr outcome treat, pscore(scorename)
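As an alternative to the commands above, psmatch2 can estimate the score and match in a single call (it fits a probit by default and reports the ATT); a minimal sketch with placeholder names treat, y, and x1-x3:

ssc install psmatch2, replace
psmatch2 treat x1 x2 x3, outcome(y) neighbor(1) common

Here neighbor(1) requests one-to-one nearest-neighbor matching, common restricts the comparison to the region of common support, and the estimated score is stored in the generated variable _pscore.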

General Remarks
The propensity score approach becomes more appropriate the more randomness there is in determining who gets treatment (i.e., the closer we are to a randomized experiment).

The propensity score doesn't work very well if almost everyone with a high propensity score gets treatment and almost everyone with a low score doesn't: we need to be able to compare people with similar propensities who did and did not get treatment.
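This is the common support (overlap) requirement. A quick visual check in Stata, assuming the score was saved as pscore_hat as in the sketch above:

twoway (histogram pscore_hat if treat == 1) (histogram pscore_hat if treat == 0)

Overlapping histograms for the two groups indicate that comparable units exist at most values of the score.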

The propensity score approach doesn't correct for unobservable variables that affect whether observations receive treatment.

NSW example
Comparison of propensity score matching with experimental results.

NSW program
During the mid-1970s, the Manpower Demonstration Research Corporation (MDRC) operated the National Supported Work Demonstration (NSW).
NSW was a temporary employment program designed to help disadvantaged workers lacking basic job skills move into the labor market by giving them work experience and counseling in a sheltered environment.
Unlike other federally sponsored employment and training programs, though, the NSW program assigned qualified applicants to training positions randomly.
Treatment group: received all the benefits of the NSW program.
Control group: left to fend for themselves.
NSW admitted into the program AFDC women, ex-drug addicts, ex-criminal offenders, and high school dropouts of both sexes.

NSW program
Treatment group members were:
guaranteed a job for 9-18 months, depending on the target group and site;
divided into crews of 3-5 participants who worked together and met frequently with an NSW counselor to discuss grievances and performance;
paid for their work.
The wage schedule offered the trainees lower wage rates than they would've received on a regular job, but allowed their earnings to increase for satisfactory performance and attendance. After their term expired, they were forced to find regular employment.
The type of work varied within sites (e.g., gas station attendant, working at a print shop), and males and females frequently performed different kinds of work. This was why the program costs varied across sites and target groups. The program cost $9,100 per AFDC participant and approximately $6,800 per trainee in the other target groups, in 1982 US dollars.

NSW program
MDRC collected earnings and demographic information from both treatment and control groups at baseline and every 9 months thereafter, conducting up to 4 post-baseline interviews.

LaLonde (1986) study
LaLonde, Robert J. (1986). "Evaluating the Econometric Evaluations of Training Programs with Experimental Data." American Economic Review 76(4): 604-620.
LaLonde's idea:
Outcome variable: annual earnings in 1978.
Get an unbiased estimate of the job training program's effects using the randomized control group.
Compare that with what you get by selecting a control group from the entire population that looks like the treatment group, using various causal inference methods.

Need for a control group
The fundamental problem of causal inference is that causality is defined as the difference between two potential outcome states, but for each individual we observe only one of them. We are missing data on each trainee's counterfactual: what they would've earned had they not been in the NSW experiment.

Choice of a control group
Best option: randomize so that independence is satisfied. The control group and treatment group then differ only by random chance, which eliminates bias due to baseline differences between the two groups as well as heterogeneous treatment effects bias.
Oftentimes these kinds of randomized controls aren't available, so labor economists would instead sample from various datasets to create (non-experimental) control groups. LaLonde sampled non-experimental control groups from two surveys, the Current Population Survey (CPS) and the Panel Study of Income Dynamics (PSID):
he sampled the entire working population;
he sampled those not working in 1976;
he sampled those not working in 1975 or 1976.

Similarity of treatment and control groups
Treatment and control groups need to be similar. But in what way should they be similar? Most importantly, they need to be similar with regard to pre-treatment income, since income is what we'll be examining post-treatment.
So what did LaLonde find? The first column is treatment group earnings in 1978, the second column is the randomized control group, and everything else is the non-random control groups.

Annual earnings of treatment and control group members were the same in 1975. They diverged during the employment program and converged to some extent after the program ended. The post-training year was 1979 for the AFDC females and 1978 for the males: $1,641 higher earnings in 1978; $851 higher earnings in 1979. He uses four PSID samples.

In Table 3, they look just at the male participants. Notice the randomized sample is 297 treatment group units and 425 control units. There are also 2,493 ...

Lessons
What were the take-aways?
Fairly pessimistic findings: observational data and the causal inference methods available at that time performed poorly when trying to reproduce the known ATE from the randomization.
What did he do? Linear regression, fixed effects, latent variable selection modeling.
His estimated treatment effect for women tended to overestimate the impact of the program (positive self-selection), but it tended to underestimate the impact of the program for men (negative self-selection).
Why should you care? Even though the control group might seem like a good guess for the treatment group, your answers may still be significantly biased.

Dehejia and Wahba (1999; 2002)
Dehejia, Rajeev H. and Sadek Wahba (1999). "Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs." Journal of the American Statistical Association 94(448): 1053-1062.

Dehejia, Rajeev H. and Sadek Wahba (2002). "Propensity Score-Matching Methods for Nonexperimental Causal Studies." The Review of Economics and Statistics 84(1): 151-161.

These two studies introduced propensity score matching methods to economists and performed a kind of replication of LaLonde's study.

Dehejia and Wahba (1999)
DW (1999) re-analyze the data using propensity score matching and stratification. These methods were new to economists at the time, although they were first established in Rosenbaum and Rubin (1983).
Identifying assumptions:
(Y0, Y1) ⊥ D | p(X), where p(X) is the propensity score
0 < p(X) < 1 (common support)