
Data Analysis to NTA Sang-Hyop Lee 41 Summer Seminar June 8, 2010


Data Sets for Statistical Analysis

► Cross section vs. Time series
► Cross section time series; useful for analysis of aggregate variables (useful for cohort analysis).
► Panel (longitudinal)
  - Repeated cross-section design: most common
  - Rotating panel design (Côte d'Ivoire 1985 data)
  - Supplemental cross-section design (Kenya & Tanzania 1982/83 data, Malaysia and Indonesia LS)
► Cross section with retrospective information
► Micro vs. Macro

Quality of Survey Data

► Requires individual or household micro survey data sets.
► Zvi Griliches: "A good survey data set has the properties of..."
  - Extent (richness): it has the variables of interest at a certain level of detail.
  - Reliability: the variables are measured without error.
  - Validity: the data set is representative.

Data Problem (An example)

► FIES (64,433 households with 233,225 individuals)
  - Measured for only urban areas (Valid?)
  - No single-person households (Valid?)
  - No information on income for family-owned businesses (Rich?)
  - Measured for up to 8 household members (Reliable? Valid?)

Extent (Richness): Missing/Change of Variables

► Missing variables
  - Not measured or measured for a certain group
  - Labor portion of self-employment income
► Change of variables over time
  - Institutional/policy change
  - New consumption items, new jobs, etc.
► Change of survey instrument/collapsing

Reliability: Measurement Error

► Response/reporting error
  - Respondents do not know what is required
  - Incentive to understate/overstate
  - Recall bias: related to the period of the survey
  - Using wrong/different reporting units
  - Heaping
  - Outliers
► Coding error, top coding, missing observations
► Overestimate/underestimate
  - Parents do not report/register their children until the children have names
  - Detect by checking the survival rate of single ages
► Discrepancy between the aggregate and the sum of individual values
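One common check for the heaping mentioned above is Whipple's index, which measures how often reported ages between 23 and 62 end in 0 or 5. The sketch below is only an illustration: the data are simulated, and the variable names (age, ends05) and the 20% rounding share are assumptions, not part of the seminar material.

```stata
* Hypothetical data: 5,000 respondents, about 20% of whom round
* their reported age to the nearest multiple of 5 (heaping).
clear
set seed 12345
set obs 5000
generate age = floor(90 * runiform())
replace age = 5 * round(age / 5) if runiform() < 0.2

* Whipple's index: share of ages 23-62 ending in 0 or 5, times 500.
* About 100 means no heaping; 500 means every age ends in 0 or 5.
generate byte ends05 = mod(age, 5) == 0 if inrange(age, 23, 62)
summarize ends05, meanonly
display "Whipple's index = " %5.1f 500*r(mean)
```

With real survey data the same share would normally be computed using the sampling weights.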

Validity: Censoring

► Selection based on characteristics
► Censoring due to the time of survey
  - Duration of unemployment (left and right censoring)
  - Completed years of schooling
► Attrition (panel data)

Categorical/Qualitative Variables

► Converting categorical to single continuous variables
  - Grouped by age (population, public education consumption)
  - Income category (FPL)
► Inconsistency over time
► Categorical to continuous, and vice versa

Units, Real vs. Nominal

► Be careful about the reporting unit
  - Measurement units
  - Reporting period units (reference period, seasonal fluctuation, recall bias)
► Nominal vs. Real
  - Aggregation across items
  - Quality change (e.g. computer)
  - Where inflation is a substantial problem

Solution for Missing Variables

► Missing is not usually zero!!
► Ignore it; random non-response
► Give up; find other source of data sets
► Impute; "missing does not mean zero".
  - Based on their characteristics or mean value
  - Based on the value of other peer group
  - Modified zero order regressions (y on x)
    - Create dummy variable for missing variables of x (z)
    - Replace missing variable with 0 (x')
    - Regress y on x' and z, rather than y on x
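The three steps of the modified zero-order regression can be sketched in Stata as follows. This is a minimal illustration on simulated data; the variable name xprime (standing in for x') and the 30% missing rate are assumptions made for the example.

```stata
* Hypothetical data: y depends on x, and x is missing for about 30% of cases.
clear
set seed 2010
set obs 1000
generate x = rnormal(10, 2)
generate y = 1 + 0.5 * x + rnormal()
replace x = . if runiform() < 0.3

* Step 1: dummy z flags observations where x is missing.
generate byte z = missing(x)
* Step 2: x' (here xprime) equals x, with missing values replaced by 0.
generate xprime = cond(missing(x), 0, x)
* Step 3: regress y on x' and z, so all observations are kept.
regress y xprime z
```

The dummy z absorbs the arbitrary zero, so the zero is not treated as a real value of x. The device keeps the full sample but does not by itself correct for non-random missingness.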

Individuals, Households, Regions

► Some data sets are individual level, but a lot of data are gathered from households
  - What is a household (Hh)? How is the household head (Hhh) defined?
► Regional characteristics are often quite important
  - PSU, clustering
► Or data could be regional level

Headship (Thailand, 1996)

[Figure: headship rate (percent) by age (0 to 90+), with two series: Economic Head and Self-reported Head.]

Measuring Consumption

► Durables.
► Underestimation: e.g. British FES
  - Using aggregate control mitigates the problem.
► Home produced items: both income and consumption.
► Allocation across individuals is difficult
► Estimating some profiles, such as health expenditure, is also difficult, partly due to various sources of financing.
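One way to apply an aggregate control, sketched here under stated assumptions, is to scale the survey-based per capita age profile by a single factor so that its population-weighted total matches an external aggregate. The numbers, the variable names (cons, pop, agg, cons_adj) and the control total are hypothetical.

```stata
* Hypothetical per capita consumption profile and population for 3 ages.
clear
input age cons pop
0 10 100
1 20 200
2 30 100
end

* Assumed aggregate control total (for example, from national accounts).
scalar control = 12000

* Survey-implied aggregate: sum over ages of per capita value times population.
generate agg = cons * pop
quietly summarize agg
scalar factor = scalar(control) / r(sum)

* Scale every age by the same factor so the profile matches the control.
generate cons_adj = cons * scalar(factor)
list age cons cons_adj
```

Here the survey-implied aggregate is 8,000, so the factor is 1.5. Every age is scaled by the same amount, so the shape of the age profile is unchanged.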

Measuring Income

► "All of the difficulties of measuring consumption apply with greater force to the measurement of income" (Deaton, p. 29).
  - Need detailed information on "transactions" (inflow and outflow): an enormous task
  - Incentive to understate
  - Some surveys did not attempt to collect information on asset income

Data Cleaning and Variable Manipulation

► Case by case
► Find out what data sets are available and choose the best one
► Detect outliers and examine them carefully
► A serious examination is required when inflation matters, to check whether the actual estimation process generates a variable
► Make variables consistent
► Convert categorical variables to continuous variables, and vice versa.
► Data merge, data reshape, construct variables…

Weighting and Clustering

► Weights should be used in the summary of variables/direct tabulation/regression/smoothing.
► Frequency weights ("fw") indicate replicated data. The weight tells us how many observations each observation really represents.
    . tab edu [w=wgt]         is equivalent to   tab edu [fw=wgt]
► Analytic weights ("aw") are inversely proportional to the variance of an observation. They are appropriate when you are dealing with data containing averages.
    . su edu [w=wgt]          is equivalent to   su edu [aw=wgt]
    . reg wage edu [w=wgt]    is equivalent to   reg wage edu [aw=wgt]

Weighting and Clustering (cont'd)

► Probability weights ("pw") are the sample weights: the inverse of the probability that this observation was sampled. Free from the heteroskedasticity problem.
    . reg wage edu [pw=wgt]                  is equivalent to   reg wage edu [(a)w=wgt], robust
    . reg wage edu [pw=wgt], cluster(hhid)   is equivalent to   reg wage edu [(a)w=wgt], cluster(hhid)
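The equivalences above can be checked on simulated data, as in the sketch below. The variable names wage, edu, wgt and hhid follow the slides; the data-generating values are assumptions made only for illustration.

```stata
* Hypothetical data: 200 households (hhid) with 3 members each.
clear
set seed 41
set obs 600
generate hhid = ceil(_n / 3)
generate edu  = floor(17 * runiform())
generate wage = 5 + 0.8 * edu + rnormal(0, 3)
generate wgt  = 50 + 100 * runiform()   // assumed sampling weight

* Probability weights versus analytic weights with robust standard errors.
regress wage edu [pw=wgt]
regress wage edu [aw=wgt], vce(robust)

* The same comparison with standard errors clustered on the household.
regress wage edu [pw=wgt], vce(cluster hhid)
regress wage edu [aw=wgt], vce(cluster hhid)
```

Within each pair the coefficients and standard errors should agree. The option vce(cluster hhid) is the newer spelling of cluster(hhid).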

Smoothing (by Ivan on Thursday)

► Shows the pattern more clearly by reducing sampling variance
► Type of smoothing
  - "lowess" smoothing (Stata) does not allow the incorporation of weights. Expanding the data is computationally burdensome.
  - Friedman's super smoothing (R) does.
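The "expanding the data" workaround for lowess can be sketched as follows. This assumes integer weights (non-integer weights would first need rounding or rescaling); the simulated data and the bandwidth of 0.3 are illustrative choices only.

```stata
* Hypothetical age profile data with integer sampling weights.
clear
set seed 8
set obs 500
generate age  = floor(91 * runiform())
generate cons = 100 + 2 * age + rnormal(0, 20)
generate wgt  = 1 + floor(5 * runiform())   // assumed integer weight, 1 to 5

* lowess takes no weights, so replicate each record wgt times first.
expand wgt
lowess cons age, bwidth(0.3) generate(cons_sm) nograph
```

After expand, the data set has as many rows as the sum of the weights, which is why this approach becomes burdensome with real survey weights.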

Rule of smoothing

► Basic components should be smoothed, but not aggregations. For example, private health consumption and public health consumption profiles should be smoothed, but the sum of the two should not be smoothed.
► The objective is to reduce sampling variance but not eliminate what may be "real" features of the data.
  - Avoid too much smoothing (e.g., health consumption for old ages). Use the right bandwidth (span).
  - Public health spending may increase dramatically when individuals reach an age threshold, e.g., 65. This kind of feature of the data should not be smoothed away.
  - Due to unusually high health consumption by newborns, we tend not to smooth health consumption at age 0. This could be done by adding the estimated unsmoothed health consumption of newborns to the smoothed age profile of private health consumption for the other age groups.
  - Only adults (usually ages 15 and older) receive income, pay income taxes and make familial transfer outflows. Thus, when we smooth these age profiles, we begin smoothing from the adult ages, excluding the younger age groups who do not earn income.
  - However, a problem arises when some beginning age groups appear to have negative values for these variables. This could be solved by replacing the negative values with the unsmoothed values for the beginning age group.
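The last two rules can be sketched for a labor income profile as follows. This is only an illustration: the simulated profile, the bandwidth, and the choice of ages 15 to 19 as the "beginning age group" are assumptions, not values given in the seminar.

```stata
* Hypothetical unsmoothed labor income profile by single year of age.
clear
set seed 65
set obs 91
generate age = _n - 1
generate inc = 0
replace inc = 40 - abs(age - 45) + rnormal(0, 2) if age >= 15

* Smooth adult ages only; children keep a value of zero.
lowess inc age if age >= 15, bwidth(0.2) generate(inc_sm) nograph
replace inc_sm = 0 if age < 15

* Where the smoothed value is negative at the youngest adult ages,
* fall back to the unsmoothed value (ages 15-19 are an assumed range).
replace inc_sm = inc if inc_sm < 0 & inrange(age, 15, 19)
```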

The End