Introduction to Panel Data Analysis
Youngki ShinDepartment of Economics
Email: [email protected]
Statistics and Data Series at WesternNovember 21, 2012
1 / 40
Motivation
More observations mean more information.
More observations with a certain structure mean much moreinformation: pooled cross sections and panel data
How can we extract additional information from pooled cross sectionsor panel data?
2 / 40
Motivation
More observations mean more information.
More observations with a certain structure mean much moreinformation: pooled cross sections and panel data
How can we extract additional information from pooled cross sectionsor panel data?
2 / 40
Motivation
More observations mean more information.
More observations with a certain structure mean much moreinformation: pooled cross sections and panel data
How can we extract additional information from pooled cross sectionsor panel data?
2 / 40
Example
Effect of an Incinerator on Housing Prices
With cross-sectional data in 1981, we have
rprice = 101, 307.5− 30, 688.27nearinc
(3, 093.0) (5, 827.71)
n = 142
With another cross-sectional data in 1978 when there were no incinerator,we have
rprice = 82, 517.23− 18, 824.37nearinc
(2, 653.79) (4, 744.59)
n = 179
Therefore, the true effect of the incinerator in not −30, 688.27 but−30, 688.27− (−18, 824.37) = −11, 863.90.
3 / 40
Example
Effect of an Incinerator on Housing Prices
With cross-sectional data in 1981, we have
rprice = 101, 307.5− 30, 688.27nearinc
(3, 093.0) (5, 827.71)
n = 142
With another cross-sectional data in 1978 when there were no incinerator,we have
rprice = 82, 517.23− 18, 824.37nearinc
(2, 653.79) (4, 744.59)
n = 179
Therefore, the true effect of the incinerator in not −30, 688.27 but−30, 688.27− (−18, 824.37) = −11, 863.90.
3 / 40
Example
Effect of an Incinerator on Housing Prices
With cross-sectional data in 1981, we have
rprice = 101, 307.5− 30, 688.27nearinc
(3, 093.0) (5, 827.71)
n = 142
With another cross-sectional data in 1978 when there were no incinerator,we have
rprice = 82, 517.23− 18, 824.37nearinc
(2, 653.79) (4, 744.59)
n = 179
Therefore, the true effect of the incinerator in not −30, 688.27 but−30, 688.27− (−18, 824.37) = −11, 863.90.
3 / 40
Outline
Data Structure
Policy Evaluation with Pooled Cross Sections
Three Approaches in Panel Data Estimation
First Difference (FD) EstimatorFixed Effect (FE) EstimatorRandom Effect (RE) Estimator
Empirical Application: Smoking on Birth Outcomes
Concluding Remarks
4 / 40
Outline
Data Structure
Policy Evaluation with Pooled Cross Sections
Three Approaches in Panel Data Estimation
First Difference (FD) EstimatorFixed Effect (FE) EstimatorRandom Effect (RE) Estimator
Empirical Application: Smoking on Birth Outcomes
Concluding Remarks
5 / 40
Data Structure (cont.)
A set of pooled cross sections is obtained by sampling randomly froma large population at different time points.
A (typical) panel data set follow the same individuals over time.
For example, consider that I sample three individuals from this roomat two time points:
Time Pooled Panel
t=1 John, Jane, Evelyn Eric, Andrew, Rachelt=2 Kyle, Justin, Lisa Eric, Andrew, Rachel
6 / 40
Data Structure (cont.)A Snapshot of Data
Table: Pooled Data
year rprice nearinc y81
1978 60000 0 01978 54000 1 01978 38000 1 0
......
...1981 82000 1 11981 52000 0 11981 97000 0 1
Table: Panel Data
id year inf unem
12 1950 7.3 3.512 1951 9.1 2.716 1950 5.3 5.416 1951 4.6 6.7...
......
...43 1950 7.1 4.243 1951 8.5 3.247 1950 6.7 5.447 1951 2.6 9.4
7 / 40
Data Structure (cont.)
There are also very useful panel structures other than theindividual-time combination.
1 Twins data: i is for twins id, and t is for the individual among thespecific twins. Control for unobserved generic factors.
2 School data: students sampled from many schools (or classrooms).Then, i is for school id, and t is for the student in school i .
8 / 40
Data Structure (cont.)
Examples of pooled cross sections:
Current Population Survey (CPS), USA
Examples of a panel data:
Labor Market Activity Survey (LMAS), CanadaPanel Study of Income Dynamics (PSID), USANational Longitudinal Survey of Youth (NLSY), USAA time series of provincial (or country) level data.ex) inflation and unemployment rate of 50 countries in 1950–2010.
It is usually easier to collect pooled cross sections than to do paneldata.
9 / 40
Outline
Data Structure
Policy Evaluation with Pooled Cross Sections
Three Approaches in Panel Data Estimation
First Difference (FD) EstimatorFixed Effect (FE) EstimatorRandom Effect (RE) Estimator
Empirical Application: Smoking on Birth Outcomes
Concluding Remarks
10 / 40
Policy Evaluation with Pooled Cross SectionsDifference-in-Difference Estimator
Terminology:
Treatment Group: those who are affected by a policy (a treatment)Control Group: those who are not.
The object of policy evaluation is to measure the (mean) difference ofoutcomes between the treatment group and the control group. Thismeasure is also called the average treatment effect.
Consider that you are testing the effect of a new drug. How can youdesign the experiment? Randomization.
Recall the incinerator and housing prices example. Is randomizationpossible?
11 / 40
Policy Evaluation with Pooled Cross SectionsDifference-in-Difference Estimator
Consider an example of a drug test:
blprsi = β0 + β1treati + ui
If you randomized the control/treatment groups well, i.e.Cov(treati , ui ) = 0, then you can estimate the effect of the drug by asingle cross section.
In policy evaluation in social sciences, treati and ui are easilycorrelated:
log(wagei ) = β0 + β1jbtrni + ui
12 / 40
Policy Evaluation with Pooled Cross SectionsDifference-in-Difference Estimator
Pooled cross sections help us to evaluate the policy effect correctly bymeasuring the difference twice (before and after the policyimplementation.)
Recall the two regressions in the incinerator example:
rprice = γ0 + γ1nearinc + u in years 1978 and 1981
δ1 = γ1,81 − γ1,78=(rprice81,nr − rprice81,fr
)−(rprice78,nr − rprice78,fr
)If perfectly randomized, the second term is 0.
This estimator is called the Difference-in-Difference estimator.
13 / 40
Policy Evaluation with a Pooled Cross SectionDifference-in-Difference Estimator
The effect can be estimated just by a single regression with somedummy variable.
rprice = β0 + δ0y81 + β1nearinc + δ1y81 · nearinc + u
This result is not intuitive. Just follow the logic:
Before (y81 = 0) After (y81 = 1) After-Before
Control (nearinc = 0) β0 β0 + δ0 δ0Treatment (nearinc = 1) β0 + β1 β0 + δ0 + β1 + δ1 δ0 + δ1Treatment-Control β1 β1 + δ1 δ1
Therefore, δ1 in the above regression gives the same estimate of theDifference-in-Difference estimator.
14 / 40
Outline
Data Structure
Policy Evaluation with Pooled Cross Sections
Three Approaches in Panel Data Estimation
First Difference (FD) EstimatorFixed Effect (FE) EstimatorRandom Effect (RE) Estimator
Empirical Application: Smoking on Birth Outcomes
Concluding Remarks
15 / 40
Panel Data and the First Difference (FD) Estimator
In panel data, we follow the same individual over time. This specificstructure enables us to conduct a better analysis.
Specifically, we can control for certain types of omitted variablescalled unobserved heterogeneity.
Let us think about some examples:
log(wageit) = β0 + δ0d2t + β1educit + ai + uit︸ ︷︷ ︸vit
Notation: now we have two subscripts, i and t.
Both ai and uit are unobservables called a fixed effect and anidiosyncratic error, respectively.
16 / 40
Panel Data and the First Difference (FD) Estimator
For simplicity, consider two periods model:
yit = β0 + δ0d2t + β1xit + ai + uit t = 1, 2.
The pooled OLS does not work well since ai is usually correlated withxit , i.e. Cov(vit , xit) 6= 0.
A simple solution is the First-Difference (FD) estimator.
yi2 = (β0 + δ0) + β1xi2 + ai + ui2 t = 2
yi1 = β0 + β1xi1 + ai + ui1 t = 1
Taking a difference gives
yi2 − yi1 = δ0 + β1 (xi2 − xi1) + (ui2 − ui1)
or
∆yi = δ0 + β1∆xi + ∆ui .
17 / 40
Panel Data and the First Difference (FD) Estimator
The (pooled) OLS works in the new regression,
∆yi = δ0 + β1∆xi + ∆ui , if
1 ∆ui and ∆xi are uncorrelated;2 ∆xi has some variation.
The second condition is violated if xit does not change over time:ex) gender, race, etc.. Then, ∆xi = 0.
Even in the wage equation example,
log(wageit) = β0 + δ0d2t + β1educit + ai + uit ,
Most working population do not increase the years of educ .
18 / 40
Panel Data and the First Difference (FD) EstimatorMore than Two Time Periods
When panel data contain more than two time periods, we can stillapply the FD estimator to control for unobserved heterogeneity.
The sufficient condition for the estimator to be valid is
Cov(xit , uis) = 0 for all t and s.
This condition is violated when1 Future regressors react to the past dependent variable (feedback);2 Regressors contain a lagged dependent variable;3 An important (i.e. related to xit) time-varying regressor is omitted.
Take differences with adjacent time periods and run the followingregression when t = 1, 2, and 3:
∆yit = α0 + α3d3t + β1∆xit + ∆uit for t = 2, 3.
19 / 40
Additional Remarks on FD Estimator
Due to the expansion over the time dimension, serial correlation mayarise.
Also, we cannot exclude the heteroskedasticity problem.
Since we use the OLS estimator, we can apply the White correction orthe HAC estimation method as before.
20 / 40
Fixed Effect Estimator
Consider a simple error component model again:
yit = β1xit + ai + uit , t = 1, . . . ,T and i = 1, . . . , n.
We assume that the idiosyncratic error uit is ‘innocuous’ in the sense:
E (uit |Xi) = 0 or E (uit |xit) = 0.
However, the individual fixed effect ai could be arbitrarily correlatedwith xit .
We have already known that the FD estimator cancels out theunobserved heterogeneity ai .
21 / 40
Fixed Effect Estimator
There is a different way to cancel out unobserved heterogeneity.
First, fix the individual i and take an average over time:
yi = β1xi + ai + ui .
where
yi =1
T
T∑t=1
yit , xi =1
T
T∑t=1
xit , and ui =1
T
T∑t=1
uit .
The point is
ai =1
T
T∑t=1
ai =1
TTai = ai .
22 / 40
Fixed Effect Estimator
Now, take a difference between two equations:
yit = β1xit + ai + uit , t = 1, 2, . . . ,T .
yi = β1xi + ai + ui .
Then, what we have is
yit − yi = β1(xit − xi ) + (uit − ui ), t = 1, 2, . . . ,T
or
yit = β1xit + uit , t = 1, 2, . . . ,T .
We may apply the pooled OLS on the last equation.
23 / 40
Fixed Effect Estimator
The FE estimator uses information from within group (i) variation:
yi1 = yi1 − yi
yi2 = yi2 − yi
...
yiT = yiT − yi
For this reason, the FE estimator is also called within estimator.
This can be readily extended to a multiple regression model:
yit = β1x1it + β2x2it + . . .+ βk xkit + uit
24 / 40
Fixed Effect EstimatorFD vs. FE
If T = 2, the FD estimator and the FE estimator are identical:
yi2 ≡ yi2 − yi = yi2 −(yi1 + yi2
2
)=
y1 − y22
≡ 1
2∆yi2.
Therefore,
yi2 = β1xit + uit
⇐⇒ 1
2∆yi2 = β1
1
2∆xi2 +
1
2∆ui2
⇐⇒ ∆yi2 = β1∆xi2 + ∆ui2
However, they are different in a finite sample if T > 2. Unless there isa unit root (or severe serial correlation) problem, you would better usethe FE estimator.
25 / 40
Random Effect Estimator
In the random effect model:
yit = β0 + β1xit + ai + uit ,
we assume that
Cov(xit , ai ) = 0.
Then, we come back to the ‘nice’ world where we don’t need tocancel out ai . Just use the pooled OLS?
No. There is a serial correlation problem.
26 / 40
Random Effect Estimator
In the random effect model:
yit = β0 + β1xit + ai + uit ,
we assume that
Cov(xit , ai ) = 0.
Then, we come back to the ‘nice’ world where we don’t need tocancel out ai . Just use the pooled OLS?
No. There is a serial correlation problem.
26 / 40
Random Effect EstimatorSerial Correlation in the RE model
We have two components in the error term:
vit = ai + uit
Suppose that uit is totally innocuous again:
Cov(ai , uit) = Cov(uit , uis) = 0 for t 6= s.
Now, we calculate Corr(vit , vis) and show that it is not zero:
Var(vit) = Var(ai + uit) = σ2a + σ2u
Cov(vit , vis) = E ((ai + uit)(ai + uis))
= E (a2i + aiuis + aiuit + uituis)
= E (a2i ) = σ2a
27 / 40
Random Effect EstimatorSerial Correlation in the RE model
Therefore,
Corr(vit , vis) =σ2a
σ2a + σ2u6= 0
Any inference based on the pooled OLS would be incorrect.
However, we know how to fix this problem. Do GLS!
We want to transform the original model into
yit = β0 + β1xit + vit
where vit does not have the serial correlation anymore.
28 / 40
Random Effect Estimator
We multiplied ρ and took a difference when there is a AR(1) serialcorrelation. In this case, we multiply
λ = 1−[
σ2uσ2u + Tσ2a
](1/2)and take a difference as
yit − λyi = β0(1− λ) + β1(xit − λxi ) + vit − λvi
We can show that vit(= vit − λvi ) is not serially correlated.
The λ should be estimated by λ.
This specific GLS estimator is called the Random Effect (RE)estimator.
29 / 40
Random Effect Estimator
The RE estimator is something between the pooled OLS and the FEestimator. Note that in Equation:
yit − λyi = β0(1− λ) + β1(xit − λxi ) + vit − λvi ,
it becomes the pooled OLS when λ = 0, and does the FE estimatorwhen λ = 1.
The λ is always between 0 and 1 in the RE model.
As T →∞, the FE and RE estimators are equivalent since λ→ 1.
30 / 40
Random Effect EstimatorRE vs. FE
If you believe that there is obvious endogenous fixed factor, ai , inyour model, you should use the FE estimator.
Otherwise, the RE estimator will tell you more: non time-varyingregressors, efficiency etc.
Keep in mind that the RE estimator is not even consistent ifCov(xit , ai ) 6= 0.
We can test whether Cov(xit , ai ) = 0 or not.
31 / 40
Random Effect EstimatorHausman Test
The idea of the Hausman test is simple. The null hypothesis is
H0 :Cov(xit , ai ) = 0
H1 :Cov(xit , ai ) 6= 0
Under H0, both RE and FE are consistent:
βREp→ β, βFE
p→ β.
Thus, we can expect that βRE ≈ βFE .
However, under H1, only βFE is consistent. Therefore, we reject H0 ifthe difference between βRE and βFE is large enough.
32 / 40
Outline
Data Structure
Policy Evaluation with Pooled Cross Sections
Three Approaches in Panel Data Estimation
First Difference (FD) EstimatorFixed Effect (FE) EstimatorRandom Effect (RE) Estimator
Empirical Application: Smoking on Birth Outcomes
Concluding Remarks
33 / 40
Empirical Application: Smoking on Birth Outcomes
“Infants born to women who smoke during pregnancy have a lower averagebirthweight... Low birthweight is associated with increased risk forneonatal, perinatal, and infant morbidity and mortality.”
(Women and Smoking: A Report of the Surgeon General, 2001, requoted fromAbrevaya (2006))
34 / 40
Empirical Application: Smoking on Birth Outcomes
The direct medical costs: According to the estimates of Lewit et al.(1995), the low-birthweight (LBW) infants (less than 10% of births)account for more than 1/3 of health care costs during the first year of life.
The long-term costs:“Hack et al. (1995) find that LBW babies have developmental problems incognition, attention and neuromotor functioning that persist untiladolescence.” (Abrevaya (2006))
35 / 40
Empirical Application: Smoking on Birth OutcomesHow to Estimate
The OLS estimates would be biased into the negative direction due toendogeneity.
IV estimation?
Comparison between OLS and IV estimatesfrom Abrevaya (2006)
492 J. ABREVAYA
Figure 1. Previous estimates of smoking’s effect on birthweight
The remainder of the paper is outlined as follows. Section 2 describes the data and the matchingstrategies used to construct panel data sets for analysis. Section 3 considers the inconsistencycaused by incorrect matching. The direction of this inconsistency is the same as the omittedvariables bias, and a simple model suggests that the mismatched observations, ‘false matches’,have a disproportionate effect on the inconsistency. Section 4 reports the main empirical results.First, fixed effects estimates (based upon the matched panels of Section 2) are reported. Second,using a proxy for correct matching that is available in the earlier part of the sample, the effectof mismatching is empirically investigated. Third, an augmented model specification is used toallow for the possibility of heterogeneous smoking effects. Section 5 considers several possibleviolations of the strict exogeneity assumption made for fixed effects estimation. These potentialviolations include feedback effects (smoking influenced by a prior birth outcome), correlatedmaternal behaviour (smoking changes correlated with unobservable behavioural changes) andmisclassification of smoking status. Section 6 concludes.
2. DATA AND MATCHING STRATEGY
2.1. Federal Natality Data
The data used for this study come from the Natality Data Sets (released by the National Centerfor Health Statistics (NCHS)) from 1990 to 1998. The natality data are based upon birth recordsfrom every live birth that occurs in the United States and contain information on birth outcomes,maternal prenatal behaviour (including smoking behaviour for nearly all states) and demographicattributes. During the time period analysed, there are approximately 4 million births per year inthe natality data.
Unfortunately, the federal natality data does have limitations. There are no unique identifiers(e.g., social security number) for mothers, which makes construction of a panel data set non-trivial.Other (non-unique) identifiers that would make matching far easier (such as mother’s name and
Copyright 2006 John Wiley & Sons, Ltd. J. Appl. Econ. 21: 489–519 (2006)
36 / 40
Empirical Application: Smoking on Birth Outcomes
The fixed-effect (FE) estimation can be used if panel data areavailable.
Abrevaya (2006) constructed a pseudo panel data set and showedthat the FE estimate is smaller than that of the OLS.
yib = x ′ibβ + γsib + ci + uib
where i is Mom’s id and b is the order of a baby from Mom i .
The estimation results for γ by OLS and FE are −243.27(3.20)−144.04(4.75), respectively.
37 / 40
Concluding Remarks
Pooled cross sections are very similar to a single cross section, butobservations across different time points help evaluate the correctpolicy effect.
Extra information contained in panel data enables us to control forthe individual fixed effect by FD and FE estimators.
If the fixed effect is not correlated with regressors, we can apply REestimator, which is a GLS estimator.
Panel data are not restricted to the individual-time structure.
38 / 40
Concluding Remarks
Pooled cross sections are very similar to a single cross section, butobservations across different time points help evaluate the correctpolicy effect.
Extra information contained in panel data enables us to control forthe individual fixed effect by FD and FE estimators.
If the fixed effect is not correlated with regressors, we can apply REestimator, which is a GLS estimator.
Panel data are not restricted to the individual-time structure.
38 / 40
Concluding Remarks
Pooled cross sections are very similar to a single cross section, butobservations across different time points help evaluate the correctpolicy effect.
Extra information contained in panel data enables us to control forthe individual fixed effect by FD and FE estimators.
If the fixed effect is not correlated with regressors, we can apply REestimator, which is a GLS estimator.
Panel data are not restricted to the individual-time structure.
38 / 40
Concluding Remarks
Pooled cross sections are very similar to a single cross section, butobservations across different time points help evaluate the correctpolicy effect.
Extra information contained in panel data enables us to control forthe individual fixed effect by FD and FE estimators.
If the fixed effect is not correlated with regressors, we can apply REestimator, which is a GLS estimator.
Panel data are not restricted to the individual-time structure.
38 / 40
Concluding Remarks
Pooled cross sections are very similar to a single cross section, butobservations across different time points help evaluate the correctpolicy effect.
Extra information contained in panel data enables us to control forthe individual fixed effect by FD and FE estimators.
If the fixed effect is not correlated with regressors, we can apply REestimator, which is a GLS estimator.
Panel data are not restricted to the individual-time structure.
38 / 40
Stata Commands
Load the data set filename.dta.
First, we need to set an id variable and a time variable. Check therelevant variable names.
xtset id time.
Now type xtsum.
The command for the FE estimator isxtreg dep x1 x2 x3, . . ., fe
The command for the RE estimator isxtreg dep x1 x2 x3, . . ., re
39 / 40