Pitfalls in Analysis of Survey Data
Ec798, Lecture 4
Dilip Mookherjee
Fall 2010
1. Introduction
Purpose is to alert you to the key practical issues in analysis of survey data
By `analysis’, I mean making inferences concerning effectiveness of particular policies or programs, or behavioral patterns
Effectiveness assessment requires comparison of observed outcomes with a counterfactual: what would have happened in the absence of the program
Intro, contd.
Assessing counterfactuals requires appropriate benchmarks of comparison, and/or a theory which predicts how people and institutions would have behaved if the program had not been instituted, and how this would have changed the observed outcomes
Requires considerable creativity and ingenuity, in addition to an understanding of local context and institutions
Intro, contd.
Most people are prone to drawing inferences based on cross-sectional evidence (comparing areas with and without the program) or time-series evidence (comparing outcomes before and after the program), without being careful about assessing counterfactuals
This matters for how you react to and learn from almost any data pertaining to the effects of a given development program --- i.e., how you evaluate the work of others who make claims about effectiveness based on their analysis
Intro, contd.
Courses on statistics or econometrics will emphasize the assumptions needed to make valid inferences from data, how to assess the validity of these assumptions, and what to do if there is substantial doubt about their validity
In this session I will try to give you a practitioner’s perspective on this, based on my own experience
Will eschew technicalities, and provide an intuitive common-sense account
Pitfalls and Qualifications to Statistical Inference
I will try to give you a laundry list of the most common pitfalls and qualifications to what can be learned from analysis of statistical data concerning program effectiveness
And the most common techniques available for overcoming these
Even if you are not going to do this kind of analysis, it’s important for you to understand what others are doing, to review it critically, and raise appropriate questions
Laundry List
Pitfalls, concerning bias (of estimates):
Selection Bias and Endogeneity (reverse causality, omitted variables)
Measurement Error
Functional Form (non-linearity, censoring, truncation)
Qualifications, mainly concerning precision (calculating standard errors correctly):
Heteroscedasticity
Serial Correlation
Clustering
Selection Bias
Let’s say you compare outcomes of a program between areas which had and didn’t have it: e.g., decentralization of forest management to local user groups: how does forest degradation vary between areas with and without such decentralization? Or compare health of children in villages that received a sanitation program with villages lacking such a program
Selection Problem, contd.
Problem is that areas with more degraded forests may have been more likely to have forest user groups in the first place; the sanitation program is likely to have been targeted to villages with dirtier water and greater poverty
If so, your cross-sectional differences will underestimate the true effect of the program
However, you cannot be sure of the direction of the bias: communities more concerned about deforestation or health may have lobbied harder to get these programs
Selection Problem, contd.
Maybe you can get around this by looking at effects of the program before and after its implementation in the areas in which it was implemented
Let’s say you have a panel data-set and see an improvement after the program
But what if the areas which didn’t receive the program also witnessed an improvement? Maybe there was something else that was going on that explains the improvement in both sorts of areas?
Selection Problem, contd.
Then maybe you can compare the changes before and after in the treatment and control areas? (The diff-of-diff estimate)
Can we stop here? Can we trust/test the diff-of-diff estimate? What assumptions are needed? And so on…
Are there contexts where cross-sectional comparisons yield valid (unbiased) estimates? When might they be better than the panel data based diff-of-diff estimate?
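To make the contrast concrete, here is a minimal simulated sketch (all numbers invented, not from any actual survey): a program is purposively placed in worse-off areas, so the cross-sectional comparison of levels is badly biased, while the diff-of-diff comparison of changes recovers the true effect.

```python
import numpy as np

# Invented illustration: purposive placement biases the cross-section,
# diff-of-diff compares changes and recovers the true effect.
rng = np.random.default_rng(0)
n = 5000
true_effect = 2.0

# Areas with low baseline outcomes are more likely to get the program.
baseline = rng.normal(10, 2, n)
treated = (baseline + rng.normal(0, 1, n)) < 10   # purposive placement

# Common time trend of +1 for everyone; the program adds true_effect.
followup = baseline + 1.0 + true_effect * treated + rng.normal(0, 1, n)

# Naive cross-sectional estimate: compare post-program levels.
cross_section = followup[treated].mean() - followup[~treated].mean()

# Diff-of-diff: compare changes over time across the two groups.
did = ((followup[treated] - baseline[treated]).mean()
       - (followup[~treated] - baseline[~treated]).mean())

print(f"cross-section: {cross_section:.2f}")  # badly biased downward
print(f"diff-of-diff:  {did:.2f}")            # close to the true effect of 2
```

The diff-of-diff works here because the purposive placement operates only through baseline levels, which difference out; this is exactly the assumption questioned below.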
Pitfall No. 1: Endogeneity
Selection problems are part of a wider concern about endogeneity of program placement
One form of endogeneity: reverse causality (is forest degradation affecting creation of forest user groups? Is health driving placement of the sanitation program or the other way around?)
The other form is omitted variable bias: maybe some third, unobserved variable, such as the underlying social capital of the community, is driving both deforestation and user group formation?
Other Examples of Endogeneity Problems
Suppose you are interested in effectiveness of a price subsidy program for rice on rice consumption: does consumption cause price or the other way around? Do underlying tastes for rice affect both price and consumption?
Are small farms more productive than large farms? Or do more productive farms tend to be smaller (owing to greater subdivision)? Is unobserved soil quality driving both size and productivity?
Endogeneity Examples, contd.
What is the effectiveness of a fertilizer distribution program on farm productivity? Does fertilizer application drive productivity? Or is it the case that more hardworking, motivated farmers tend to respond to the program more actively and apply more fertilizer?
Does under-nutrition cause low productivity/earnings, or the other way around?
Pitfall No. 2: Measurement Error
Is the independent variable measured accurately?
Problems measuring income, consumption based on survey responses (recall, aggregation, purposive..)
May not have data concerning program implementation at a disaggregated enough scale (e.g., interested in village-level effects but only have program intensity at province level)
Measurement Error, contd.
`Iron Law of Econometrics’: measurement error in the independent variable (only) causes an under-estimate of the program effect (attenuation bias)
Intuitively this is because the estimate of the effect is based on how the independent and dependent variables co-vary, relative to the variation of the independent variable
Example of Attenuation Bias
Suppose you over-estimated placement of an effective fertilizer distribution program: some villages that didn’t get the program are mistakenly believed to have got it
Then you would be assessing program effectiveness by comparing mean farm yields in villages that are thought to have got the program, with those that appear not to have
You would under-estimate the effectiveness as some low-yield villages are mistakenly believed to have got the program
But Note That:
Measurement error in the dependent variable does not matter (for bias): if productivity is measured with error, this pertains to both kinds of villages equally
Not all kinds of independent variable error matter (e.g., when data is at a higher level of aggregation: the measurement error is orthogonal to the measured value of the independent variable, so it washes out in the aggregate)
Measurement error cannot reverse the sign of the effect, or raise its quantitative magnitude (unlike endogeneity problems)
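These three points can be checked in a small invented simulation: classical measurement error in the independent variable attenuates the OLS slope by the reliability ratio var(x) / (var(x) + var(error)), while error in the dependent variable leaves the slope unbiased.

```python
import numpy as np

# Invented illustration of attenuation bias; all numbers are made up.
rng = np.random.default_rng(1)
n = 20000
x = rng.normal(0, 1, n)            # true program intensity, var = 1
y = 3.0 * x + rng.normal(0, 1, n)  # true effect = 3

x_noisy = x + rng.normal(0, 1, n)  # measured with error, var(error) = 1
y_noisy = y + rng.normal(0, 1, n)  # error in the dependent variable

def ols_slope(a, b):
    # Bivariate OLS slope: cov(a, b) / var(a).
    return np.cov(a, b)[0, 1] / np.var(a, ddof=1)

print(ols_slope(x, y))        # ~3.0: unbiased with the true variable
print(ols_slope(x_noisy, y))  # ~1.5: attenuated by 1/(1+1) = 0.5
print(ols_slope(x, y_noisy))  # ~3.0: still unbiased, just less precise
```

Note the attenuated estimate keeps the right sign and a smaller magnitude, consistent with the claim that measurement error cannot reverse or inflate the effect.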
How Can You Tell How Serious Endogeneity or Measurement Error is?
Have to rely on your understanding of the situation, and your prior expectations based on theory
There is no easy test or measure
What you can do is analyze the data differently so as to correct for the problems, and see how much of a difference this makes
How to Correct for Endogeneity
Approach 1: Control for possible omitted variables: collect data on those and include them in the regression
What about unobserved omitted variables? Here panel data can be very useful: use fixed effects to control for unobserved heterogeneity
E.g., in the analysis of user groups and deforestation, unobserved `social capital’ which potentially affects both formation of user groups and deforestation is effectively controlled for, by looking at effects of formation of user groups on changes in forest quality
No longer comparing levels across areas, but changes over time --- the diff-of-diff estimate
Other Examples of Diff-of-Diff
Productivity variation by farm size or fertilizer application: control for soil fertility as best as you can, control for farmer ability/motivation with farmer fixed effects, for unobserved plot quality with plot fixed effects
Need data for same farmer over time as he changes scale of cultivation (for farmer fixed effects), for productivity of separate plots (for plot fixed effects) with differential fertilizer application
Assumptions Underlying Diff-of-Diff
Have to still assume that program placement or its timing was exogenous (i.e., uncorrelated with the error term)
At the level of changes over time, placement was not purposive (e.g., can you rule out the possibility that the creation of user groups was just one of many simultaneous changes, another of which was really driving the improvement?)
Test by looking at pre-program trends, other policies etc.
Other Assumptions underlying D-o-D
Effects of unobserved omitted variables are linear and additive so they can be washed out by looking at changes over time
No significant increase in measurement error when looking at changes over time (if panel responses are based on recall, a lot of the reported changes may just be the result of recall errors)
If this is the case, the cross-sectional estimate may involve less bias
Often, significant cross-country regression results disappear in panel data: we don’t know whether to interpret this as evidence of significant OV bias in the cross-section, or of significant attenuation bias in the panel
Instrumental Variables
Another qualification: DoD deals with unobserved heterogeneity, but not reverse causality (nutrition-earnings example)
IV estimator: the most commonly used method to deal with endogeneity problems and measurement error
Idea is to find an instrument for the independent variable: a source of variation in the independent variable which logically cannot have a direct impact on the dependent variable
Examples of IV/Natural Experiments
UK water quality-mortality study (1853 London cholera epidemic): which of two companies was supplying water to any given street
Cuban boatlift effect on labor supply in Miami
Middle East events that affect international oil prices, which shift the price of rice owing to higher transport costs, but not the consumption demand for rice
Regression discontinuity: class-size effects on learning; minimum wage laws across state borders
IV Assumptions
Two key assumptions for an instrument to be valid:
It has to predict significant variation in the independent variable in question (water quality/labor supply/rice price/class-size): the first-stage F
Exclusion restriction: conditional on the effect on the independent variable (and other controls) there is no direct effect on the dependent variable
No statistical tests for the exclusion restriction; based on theory and institutional knowledge
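A minimal simulated sketch of the IV idea (loosely in the spirit of the oil-price/rice example, with all numbers invented): an unobserved confounder biases OLS, while the instrument, which moves the regressor but is excluded from the outcome equation, recovers the true effect.

```python
import numpy as np

# Invented illustration of IV: z moves x but has no direct effect on y.
rng = np.random.default_rng(2)
n = 20000
z = rng.normal(0, 1, n)                 # instrument (e.g., a cost shock)
u = rng.normal(0, 1, n)                 # unobserved confounder
x = 0.8 * z + u + rng.normal(0, 1, n)   # endogenous regressor
y = 1.0 * x + u + rng.normal(0, 1, n)   # true effect = 1, but corr(x, u) != 0

# OLS is biased upward because u enters both x and y.
ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# IV (Wald) estimate: cov(z, y) / cov(z, x).
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"OLS: {ols:.2f}  IV: {iv:.2f}")  # OLS biased upward; IV close to 1
```

The exclusion restriction is built into the simulation by construction (z appears nowhere in the y equation); in real data that restriction is exactly what cannot be tested.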
Pitfall No. 3: Functional Form
Regression estimates of continuous treatment effects are based on the hypothesis of a linear relationship between independent and dependent variable (e.g., the effect of a drug does not depend on dosage)
In many cases, may expect this to be wrong (water on productivity, age on earnings, community heterogeneity on collective action, gender empowerment on ROSCA participation)
In other cases, may not know what pattern to expect
Additional problem: the program effect may be heterogeneous (a very serious practical problem)
Testing Linearity
Include higher-order terms, log transformations, interaction effects, etc.
Non-parametric analysis
Both have practical problems which can be resolved only with sufficient data
Can do only with respect to one variable at a time
Censoring and Truncation Bias
Particular form of functional form problem: limited dependent variables
Sometimes the variable is zero or one (e.g., member of a group or not, road built or not)
Sometimes it is endogenously truncated: you cannot work a negative number of hours, or collect a negative quantity of firewood
Ignoring the inherent nonlinearity of the data can give rise to significantly biased estimates
Censoring and Truncation Biases
What can you do?
Assume a functional form for the distribution of errors: e.g., probit or logit regressions for 0-1 variables, tobits for truncated variables
Results could be sensitive to what you assume here
Some newer methods don’t depend so much on error distributions (semi-parametric methods, such as LAD)
Warning: cannot easily extend to panel estimators such as diff-of-diff!
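To see why ignoring the censoring biases estimates, here is an invented simulation in the spirit of the hours-worked example: the latent outcome is censored at zero, and OLS on the censored variable understates the true slope (the kind of bias a tobit is designed to correct).

```python
import numpy as np

# Invented illustration of censoring bias; all numbers are made up.
rng = np.random.default_rng(3)
n = 20000
x = rng.normal(0, 1, n)
latent = 1.0 + 2.0 * x + rng.normal(0, 2, n)  # true slope = 2
hours = np.maximum(latent, 0.0)               # hours cannot be negative

def ols_slope(a, b):
    return np.cov(a, b)[0, 1] / np.var(a, ddof=1)

slope_latent = ols_slope(x, latent)    # unbiased on the latent variable
slope_censored = ols_slope(x, hours)   # biased toward zero by censoring

print(f"latent slope:   {slope_latent:.2f}")   # ~2.0
print(f"censored slope: {slope_censored:.2f}") # noticeably below 2.0
```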
Qualifications Laundry List
Problems emphasized in many econometrics texts concern correct assessment of the precision of estimates (how to calculate standard errors):
Heteroscedasticity
Serial correlation
Clustering (more important, less often discussed in textbooks)
Ignoring these may cause you to overlook more precise estimates, and more importantly overestimate your precision/statistical significance (thus biasing inferences)
Heteroscedasticity
Where precision varies with the `size’ of the independent variable, OLS is not the most precise estimator (the data needs to be re-weighted), and the standard errors are incorrectly calculated
STATA can make these corrections for you (`robust’ or White-corrected standard errors)
Case for quantile regressions
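The White correction mentioned above can be computed by hand with the standard `sandwich’ formula; the data-generating process below is invented for illustration, with error variance growing in x so the classical formula understates the slope’s standard error.

```python
import numpy as np

# Invented illustration of White/robust (HC0) standard errors.
rng = np.random.default_rng(4)
n = 10000
x = rng.uniform(1, 5, n)
y = 2.0 * x + rng.normal(0, x**2, n)     # error sd grows with x
X = np.column_stack([np.ones(n), x])     # regressors with intercept

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y                 # OLS coefficients
resid = y - X @ beta

# Classical SE assumes one common error variance.
s2 = resid @ resid / (n - 2)
se_classical = np.sqrt(s2 * XtX_inv[1, 1])

# White "sandwich" SE uses squared residuals observation by observation
# (what STATA's robust option does).
meat = X.T @ (X * resid[:, None] ** 2)
se_robust = np.sqrt((XtX_inv @ meat @ XtX_inv)[1, 1])

print(f"slope {beta[1]:.2f}, classical SE {se_classical:.4f}, "
      f"robust SE {se_robust:.4f}")
```

Here the robust standard error comes out larger than the classical one, because the largest error variances coincide with high-leverage observations.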
Serial Correlation
Problem when you have repeated observations for the same agent or unit over time: if they are not independent, treating them as such means you overestimate the precision of your estimates
Problem with macro time series data, also with panel data
Can test for severity (e.g., the Durbin-Watson statistic) and correct the standard error estimates
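The Durbin-Watson statistic is simple to compute from regression residuals; a sketch on made-up AR(1) data (a value near 2 suggests no first-order serial correlation; values well below 2 signal positive autocorrelation):

```python
import numpy as np

# Invented AR(1) time series to illustrate the Durbin-Watson statistic.
rng = np.random.default_rng(5)
T = 500
e = np.zeros(T)
for t in range(1, T):
    e[t] = 0.7 * e[t - 1] + rng.normal()   # positively autocorrelated errors
x = rng.normal(0, 1, T)
y = 1.0 + 2.0 * x + e

X = np.column_stack([np.ones(T), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# DW = sum of squared first differences of residuals / sum of squares.
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(f"Durbin-Watson: {dw:.2f}")  # well below 2, flagging serial correlation
```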
Clustering
Most serious problem is when the data is clustered (by village, industry, location, etc.) and different observations in each cluster are not independent: again results in an overestimate of precision (underestimation of s.e.’s)
STATA cluster command can correct your s.e. estimate (you have to specify the `level’ of clustering)
This can often blow the standard errors sky-high, at which point the statistical significance of all your results can disappear
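A hand-computed version of the cluster correction makes the point vivid; everything below is simulated, with treatment varying only at the village level (as with program placement) and a shared village-level shock, so observations within a village are not independent.

```python
import numpy as np

# Invented illustration of cluster-robust standard errors.
rng = np.random.default_rng(6)
n_villages, per_village = 100, 20
village = np.repeat(np.arange(n_villages), per_village)
n = n_villages * per_village

# Treatment varies only at the village level.
treated_village = rng.random(n_villages) < 0.5
x = treated_village[village].astype(float)
shock = rng.normal(0, 1, n_villages)[village]   # shared within village
y = 1.0 + 0.5 * x + shock + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Naive SE treats all n observations as independent.
se_naive = np.sqrt(resid @ resid / (n - 2) * XtX_inv[1, 1])

# Cluster-robust sandwich: sum score contributions within each village
# (what STATA's cluster option does, up to finite-sample factors).
meat = np.zeros((2, 2))
for g in range(n_villages):
    idx = village == g
    s = X[idx].T @ resid[idx]      # cluster-level score
    meat += np.outer(s, s)
se_cluster = np.sqrt((XtX_inv @ meat @ XtX_inv)[1, 1])

print(f"naive SE {se_naive:.3f}  cluster SE {se_cluster:.3f}")
```

With 20 correlated observations per village, the cluster-corrected standard error is several times the naive one, exactly the `blown sky high’ phenomenon described above.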
Concluding Comments
Many filters and pinches of salt are involved here, but these are absolutely fundamental to separating garbage from real evidence
Pitfalls (concerning bias) and Qualifications (concerning precision) are distinct, but both can result in misleading inferences
There are lots of techniques for detecting and correcting these problems
Cannot rely on `technical’ fixes alone: no substitute for good and sufficient data, common sense, intuition, theory and institutional knowledge
Ultimately to be useful and compelling, the analysis must be simple and clear