
Page 1: Essential econometrics for data scientists

Essential economics for data scientists

Benjamin S. Skrainka

February 10, 2016


Page 2: Essential econometrics for data scientists

Overview

Economics studies the allocation of resources under scarcity. Many of these tools are useful for data scientists:

- Econometric methods adapt classical statistics for applied problems
- Causal inference
  - Experimental design
  - Regression analysis
- Often require small or ‘medium’ data

Goal of talk: understand (the magnitude of) causal relationships


Page 3: Essential econometrics for data scientists

Theory I won’t discuss

Economic theory I won’t discuss:

- Understanding individual behavior
- Understanding firm behavior
- Strategic questions: products, pricing, auctions, platforms, incentives, M&A, new products
- Estimate demand & forecasting
- Structural modeling


Page 4: Essential econometrics for data scientists

Applied tools I won’t discuss

Econometric tools I won’t discuss:

- Structural vs. reduced form
- Bayesian vs. frequentist
- Counter-factual & welfare analysis
- Forecasting


Page 5: Essential econometrics for data scientists

Objectives

Today’s goals:

- List differences between the econometric & machine learning approaches
- Know when to use econometrics or machine learning
- Survey alternative types of experiments
- Overview of how to estimate causal effects using regression analysis


Page 6: Essential econometrics for data scientists

Agenda

Today’s agenda

1. Econometrics or machine learning?
2. Establishing causality
3. When A/B tests fail...
4. Causal regression analysis


Page 7: Essential econometrics for data scientists

References (1/2)

A few references:

- Angrist, Joshua D., and Jörn-Steffen Pischke. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press, 2008.
- Angrist, Joshua D., and Jörn-Steffen Pischke. Mastering ’Metrics: The Path from Cause to Effect. Princeton University Press, 2014.
- Breiman, Leo. “Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author).” Statistical Science 16.3 (2001): 199-231.
- Cameron, A. Colin, and Pravin K. Trivedi. Microeconometrics: Methods and Applications. Cambridge University Press, 2005.
- Card, David, and Alan B. Krueger. “Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania.” The American Economic Review 84.4 (1994): 772-793.


Page 8: Essential econometrics for data scientists

References (2/2)

A few more:

- Imbens, Guido W., and Donald B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
- LaLonde, Robert J. “Evaluating the Econometric Evaluations of Training Programs with Experimental Data.” The American Economic Review (1986): 604-620.
- Pearl, Judea. Causality. Cambridge University Press, 2009.
- Wooldridge, Jeffrey M. Econometric Analysis of Cross Section and Panel Data. MIT Press, 2010.


Page 9: Essential econometrics for data scientists

Econometrics or machine learning?


Page 10: Essential econometrics for data scientists

Econometrics vs. machine learning

               Econometrics                              Machine learning
Approach       statistical: data generating process     algorithmic model, DGP unknown
Driver         theory                                   fitting the data
Focus          hypothesis testing & interpretability    predictive accuracy
Model choice   parameter significance & in-sample       cross-validation of predictive
               goodness of fit                          accuracy on partitions of data
Strength       understand causal relationships          prediction
               & behavior

See Breiman (2001) and Matt Bogard’s blog


Page 11: Essential econometrics for data scientists

Establishing causality


Page 12: Essential econometrics for data scientists

How economists think about data

Data has a data generating process (DGP):

- Dependent and independent variables are stochastic
- A structure is the statistical & functional relationship that determines the observed outcomes
- Two structures are observationally equivalent if they produce the same process
- A structure is identified only if it is the unique structure that can generate the observed process
- Consequently:
  - Parameter estimates are random
  - Can perform inference on them...
  - ...if the inverse problem is well-posed, i.e., the model is identified


Page 13: Essential econometrics for data scientists

Variation in data

What is the nature of the variation in the data?

- Identify exogenous vs. endogenous sources of variation
- Exogenous:
  - Variable is determined outside the model
  - Examples: cost, weather, draft lottery number, parental income
- Endogenous:
  - Variable is determined inside the model
  - Caused by E[x · ε] ≠ 0 or a causal loop between y and x
  - Examples: crime & policing; price & demand; product characteristics

⇒ Mishandling endogenous features/variables almost always causes biased estimates


Page 14: Essential econometrics for data scientists

Experimental vs. observational data

In the Rubin Causal Model, experimental data satisfies:

1. Individualistic: whether I am assigned to treatment doesn’t affect whether you are
2. Probabilistic: non-zero probability of assignment to each treatment
3. Unconfoundedness: outcomes don’t affect the probability of assignment
4. Known, random assignment rule

If 4 is violated, your data is observational

Also need Stable Unit Treatment Value Assumption (SUTVA)


Page 15: Essential econometrics for data scientists

Causality and ceteris paribus

Measuring causality depends on ceteris paribus:

- Ceteris paribus means “all else being equal”
- I.e., compare apples to apples by conditioning on everything other than the variable under analysis
- E.g., data should be as good as randomly assigned

In the terminology of Wooldridge (2010):

- Want to understand how w affects y
- Must condition on other confounding influences, or correlation between w and c will bias results:

E[y | w, c]


Page 16: Essential econometrics for data scientists

Model Interpretation

Interpretation is based on:

- Partial effects:
  - Continuous w: βw = ∂E[y | w, c] / ∂w
  - Discrete w: βw = ∆w E[y | w, c]
- Elasticities:
  - η = (w / E[y | w, c]) · ∂E[y | w, c] / ∂w = ∂ log E[y | w, c] / ∂ log w
  - Captures a dimensionless change
  - E.g., market power: (price − marginal cost) / price = −1 / ηDemand
- Structural models permit more sophisticated counter-factual analysis

A simulated sketch of estimating an elasticity follows.

Page 17: Essential econometrics for data scientists

Establishing Causality

It is very difficult to establish causality:

- Provide strong evidence of causality:
  - Well-designed experimentation
  - Careful statistical analysis
  - Show the method controls for sources of possible bias
- Randomization is the gold standard
- For observational data, must show controls make (regression) analysis ‘as good as randomly assigned’
- See LaLonde (1986) or Angrist & Pischke (2008, 2014)


Page 18: Essential econometrics for data scientists

Example 1: selection bias (1/2)

Suppose we measure the impact of a policy intervention (e.g., advertising), γ̂:

Yi(Wi) = µ + γ · Wi + εi

Try differencing averages, but we only observe outcomes conditional on treatment status Wi:

observed effect = AVGn[Yi(1) | Wi = 1] − AVGn[Yi(0) | Wi = 0]

But this decomposes as:

observed effect = {AVGn[Yi(1) | Wi = 1] − AVGn[Yi(0) | Wi = 1]}   (direct effect)
                + {AVGn[Yi(0) | Wi = 1] − AVGn[Yi(0) | Wi = 0]}   (selection)

A short simulation of this selection bias follows.
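A hedged illustration of the decomposition above: the sketch below simulates the model with made-up values µ = 1 and γ = 0.5, lets units self-select into treatment on εi, and compares the naive difference in means against one from randomized assignment.

```python
# Sketch: selection bias in a naive difference of means.
# Simulated data; mu, gamma, and the self-selection rule are made up.
import numpy as np

rng = np.random.default_rng(1)
n, mu, gamma = 100_000, 1.0, 0.5
eps = rng.normal(0.0, 1.0, n)

w_selected = (eps + rng.normal(0.0, 1.0, n) > 0).astype(int)  # units with high eps opt in
w_random = rng.integers(0, 2, n)                              # randomized assignment

def diff_in_means(w):
    y = mu + gamma * w + eps
    return y[w == 1].mean() - y[w == 0].mean()

print(diff_in_means(w_selected))  # inflated by selection, well above 0.5
print(diff_in_means(w_random))    # close to the true effect gamma = 0.5
```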


Page 19: Essential econometrics for data scientists

Example 1: selection bias (2/2)

Selection bias is everywhere in behavioral data:

- Randomize to eliminate selection!
- In the absence of randomization:
  - Model the selection process
  - Choose sensible functional forms & distributions
  - Use a control function
  - Condition on suitable controls to compare groups which are as good as randomly assigned

See Wooldridge (2010)


Page 20: Essential econometrics for data scientists

Endogeneity

Consider a regression model:

yi = xi′β + εi

- If weak exogeneity, E[xi · εi] = 0, holds (among other assumptions) ⇒ β̂ is unbiased
- Endogeneity occurs if yi and xi are codetermined:
  - Weak exogeneity fails: E[xi · εi] ≠ 0
  - β̂ is biased


Page 21: Essential econometrics for data scientists

Types of endogeneity

There are several types of endogeneity:

- Simultaneity
- Omitted variable bias (OVB)
- Selection bias
- Measurement error

If any are present, E[xi · εi] ≠ 0 and your estimates are biased


Page 22: Essential econometrics for data scientists

Common endogenous variables

Endogeneity is everywhere:

- Price & demand
- Product characteristics
- Wages and schooling (or anything affected by ability)
- Labor-force participation
- Policing & crime


Page 23: Essential econometrics for data scientists

Example 2: OLS & endogeneity

Quick review of OLS in one dimension:

yi = β0 + β1 · xi + εi

Cov(yi, xi) = β1 · Var(xi) + Cov(εi, xi)

β̂1 = β1 + Cov(εi, xi) / Var(xi)

⇒ β̂1 is unbiased if and only if Cov(εi, xi) = 0 (see the simulation below)
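A quick simulated check of that bias formula; the coefficients and the way x is made endogenous are made-up choices.

```python
# Sketch: the OLS slope is off by Cov(eps, x) / Var(x) when x is endogenous.
# All numbers are made up for illustration.
import numpy as np

rng = np.random.default_rng(2)
n, beta0, beta1 = 200_000, 1.0, 2.0
eps = rng.normal(0.0, 1.0, n)
x = rng.normal(0.0, 1.0, n) + 0.5 * eps        # x is correlated with the error
y = beta0 + beta1 * x + eps

slope_hat = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
bias = np.cov(eps, x)[0, 1] / np.var(x, ddof=1)
print(slope_hat, beta1 + bias)                 # both ~2.4 rather than the true 2.0
```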


Page 24: Essential econometrics for data scientists

Omitted variable bias (OVB)

OVB is a common problem:

- yi = β · xi + α · zi + εi
- But zi is omitted from the regression
- Then the effective error term is ui = εi + α · zi and E[ui · xi] ≠ 0
- Estimates are biased:

β̂ = Cov(yi, xi) / Var(xi) = β + α · Cov(zi, xi) / Var(xi)

(A quick numerical check of this formula follows.)
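A matching numerical check of the OVB expression, again with made-up parameters for β, α, and the z–x relationship.

```python
# Sketch: omitted-variable bias equals alpha * Cov(z, x) / Var(x).
# Simulated data; beta, alpha, and the z-x relationship are made up.
import numpy as np

rng = np.random.default_rng(3)
n, beta, alpha = 200_000, 1.5, 2.0
x = rng.normal(0.0, 1.0, n)
z = 0.6 * x + rng.normal(0.0, 1.0, n)          # omitted regressor, correlated with x
y = beta * x + alpha * z + rng.normal(0.0, 1.0, n)

beta_short = np.cov(y, x)[0, 1] / np.var(x, ddof=1)           # y regressed on x only
predicted = beta + alpha * np.cov(z, x)[0, 1] / np.var(x, ddof=1)
print(beta_short, predicted)                   # both ~2.7 = 1.5 + 2.0 * 0.6
```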


Page 25: Essential econometrics for data scientists

Individual heterogeneity

Handling individual heterogeneity is a key triumph of modern econometrics:

- Unobserved effects from individuals and firms affect behavior
- Failure to model them can cause biased or inefficient estimates

Often, can use panel data models to control for an unobserved effect:

- Panel data is data where we observe individuals over time
- Can exploit this to eliminate temporal and individual biases:

yit = αi + δt + xit′β + uit


Page 26: Essential econometrics for data scientists

Assembling a dataset

Economists think hard about the relationship between features and outcomes, and what is lurking in the error term:

- Think about the error term:
  - Endogeneity?
  - Omitted variables?
  - Valid instruments?
- Seek out supplementary data which could explain behavior... or proxy for missing features
- Choose a functional form to control for ignorance yet measure what matters
- Create a panel to control for individual heterogeneity


Page 27: Essential econometrics for data scientists

Feature engineering à la economics

To handle these problems, economists add supplementary data:

- Instrumental variables
- Proxy variables
- Dummy variables for individuals

Or, use clever tricks:

- Panel data
- Regression discontinuity design
- Difference-in-differences


Page 28: Essential econometrics for data scientists

Natural experiments


Page 29: Essential econometrics for data scientists

Overview

Sometimes A/B testing is not possible:

- Impossible to run the experiment
- Look for natural randomization devices which provide ‘as good as random’ assignment to treatment:
  - Birthdays
  - Draft lottery numbers
  - Access decisions
- Natural randomization can eliminate selection
- Always check for balance!
- Requires some cleverness...


Page 30: Essential econometrics for data scientists

A natural experiment

Often ‘nature’ provides natural randomization which is as good as experimental randomization.

Example: needed to measure the lift of an experiential marketing campaign:

- No experimental design
- Ten treatment units
- Matching estimators failed
- But, the short list had 50 sites...
- Assume (and test) that access is as good as random ⇒ have valid treatment & control groups!


Page 31: Essential econometrics for data scientists

Causal regression analysis


Page 32: Essential econometrics for data scientists

Causal regression analysis

To establish causality, must condition on other variables so ceteris paribus applies:

- Want to estimate E[y | w, c], where:
  - w is the factor of interest
  - c are other factors, correlated with w, which could confound the analysis
- Need good measures of y, w, and c
- Beware of variables determined in equilibrium:
  - Must deal with simultaneous-equation modeling
  - Must avoid endogenous variables
  - E.g., crime vs. policing


Page 33: Essential econometrics for data scientists

Some common regression tools

Econometricians have developed many methods to assess causal relationships:

- Regression discontinuity design (RDD)
- Difference-in-differences (DID)
- Instrumental variables: use exogenous variables to instrument an endogenous variable
- Panel data
- Other methods include:
  - Matching estimators
  - Censoring & truncation
  - Discrete choice
  - Discrete/continuous choice


Page 34: Essential econometrics for data scientists

Regression discontinuity design

Can exploit policies or laws which cause a ‘natural’ treatment effect to establish causality:

- Example: impact of the drinking age on mortality
- People on either side of 21 get different treatment but are essentially identical
- Wi is a discontinuous function of age
- age is known as the running variable

mortalityi = α0 + α1 · agei + γ · Wi(agei) + εi

See Angrist & Pischke (2014); a simulated sketch of this regression follows
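A minimal sketch of that regression on simulated data; the jump of 2.0, the age slope, and the sample design are all made-up values, not estimates from the literature.

```python
# Sketch of the RDD regression above: estimate the jump in mortality at age 21.
# Simulated data; the true jump gamma = 2.0 and the age slope are made up.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
age = rng.uniform(19.0, 23.0, 5_000)               # running variable
w = (age >= 21.0).astype(int)                      # treatment switches on at the cutoff
mortality = 10.0 + 0.3 * age + 2.0 * w + rng.normal(0.0, 1.0, age.size)

df = pd.DataFrame({"mortality": mortality, "age": age, "w": w})
fit = smf.ols("mortality ~ age + w", data=df).fit()
print(fit.params["w"])                             # estimated jump, close to 2.0
```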


Page 35: Essential econometrics for data scientists

Difference-in-differences (DID) (1/2)

DID is useful when you have individual effects and common time trends:

- Must observe data over at least two periods
- Eliminates bias from individual and time effects
- E.g., impact of the minimum wage on employment; see Card & Krueger (1994)

yit = αi + δt + γ · Wi × PERIODt

γ̂ = (yNJ,1 − yNJ,0) − (yPA,1 − yPA,0)

γ̂ = (∆δt + γ) − ∆δt = γ


Page 36: Essential econometrics for data scientists

DID regression (2/2)

Can estimate DID using regression:

yit = αi + δt + β ·Wi + γ ·Wi × PERIODt + εit

- Wi is treatment status
- PERIODt ∈ {0, 1} for periods 0 and 1, respectively
- γ is the treatment effect (see the sketch below)
- Add additional covariates as controls as needed
- Verify the common-trends assumption holds (e.g., plot the data vs. time)
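A simulated sketch of this regression; the group effect, period effect, and treatment effect γ = 1.0 are made-up numbers.

```python
# Sketch of the DID regression above on simulated two-group, two-period data.
# The group effect, period effect, and true gamma = 1.0 are made up.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 4_000
w = rng.integers(0, 2, n)            # W_i: treated group indicator
period = rng.integers(0, 2, n)       # 0 = before, 1 = after
y = 2.0 * w + 1.5 * period + 1.0 * w * period + rng.normal(0.0, 1.0, n)

df = pd.DataFrame({"y": y, "w": w, "period": period})
fit = smf.ols("y ~ w + period + w:period", data=df).fit()
print(fit.params["w:period"])        # DID estimate of gamma, close to 1.0
```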


Page 37: Essential econometrics for data scientists

Instrumental variables

An instrumental variable, z, provides a way to correct for endogeneity:

- Assumptions:
  - E[εi · zi] = 0 (exogeneity)
  - E[xi · zi] ≠ 0 (relevance)
- Use z in the regression:

Cov(yi, zi) = β1 · Cov(xi, zi) + Cov(εi, zi)

β̂1,IV = β1 + Cov(εi, zi) / Cov(xi, zi)

A simulated sketch of the IV estimate follows.
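Following the covariance algebra above, a simulated sketch with a made-up instrument and error structure:

```python
# Sketch: the IV slope Cov(y, z) / Cov(x, z) removes the endogeneity bias in OLS.
# Simulated data; the instrument strength and error structure are made up.
import numpy as np

rng = np.random.default_rng(6)
n, beta1 = 200_000, 2.0
z = rng.normal(0.0, 1.0, n)                       # instrument: relevant and exogenous
u = rng.normal(0.0, 1.0, n)
x = 0.8 * z + 0.5 * u + rng.normal(0.0, 1.0, n)   # endogenous regressor
y = beta1 * x + u                                 # error u is correlated with x

ols = np.cov(y, x)[0, 1] / np.var(x, ddof=1)      # biased upward
iv = np.cov(y, z)[0, 1] / np.cov(x, z)[0, 1]      # close to the true beta1 = 2.0
print(ols, iv)
```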


Page 38: Essential econometrics for data scientists

Panel data

Panel data is a powerful tool to eliminate sources of bias:

yit = αi + δt + xit′β + εit

- Panel data consists of individuals observed over time, i.e., xit
- Has time-series and cross-section properties
- Can eliminate individual & time effects:
  - Within estimator: demean within each individual, ẍit = xit − x̄i
  - First differences (FD)
- Can also handle serial correlation of {εt}

A sketch of the within estimator follows.
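A sketch of the within transformation on a simulated panel; the individual effects, the x–α correlation, and β = 1.5 are made up.

```python
# Sketch: demeaning within each individual removes alpha_i (the within estimator).
# Simulated panel; all coefficients are made up.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n_units, n_periods, beta = 500, 8, 1.5
ids = np.repeat(np.arange(n_units), n_periods)
alpha = rng.normal(0.0, 2.0, n_units)[ids]        # unobserved individual effect
x = 0.7 * alpha + rng.normal(0.0, 1.0, ids.size)  # x correlated with alpha
y = alpha + beta * x + rng.normal(0.0, 1.0, ids.size)

df = pd.DataFrame({"id": ids, "x": x, "y": y})
dm = df.groupby("id")[["x", "y"]].transform(lambda c: c - c.mean())
beta_within = (dm["x"] * dm["y"]).sum() / (dm["x"] ** 2).sum()
beta_pooled = np.cov(df["y"], df["x"])[0, 1] / df["x"].var()
print(beta_pooled, beta_within)                   # pooled OLS is biased; within ~1.5
```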


Page 39: Essential econometrics for data scientists

Least squares dummy variable regression (LSDV)

Often, panel data estimation is equivalent to LSDV:

- Occurs when the individual effect can be modeled using a dummy variable for each individual
- Frisch-Waugh decomposition:
  - Powerful dimension reduction
  - Use when you need to control for an unobserved effect without estimating it

(A sketch comparing LSDV with the within estimator follows.)
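A small sketch of that equivalence on made-up panel data: a dummy per individual (LSDV) gives the same slope as demeaning (the within estimator).

```python
# Sketch: LSDV (a dummy per individual) and the within estimator give the same slope,
# as the Frisch-Waugh logic suggests. Simulated panel with made-up coefficients.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
ids = np.repeat(np.arange(50), 6)
alpha = rng.normal(0.0, 2.0, 50)[ids]
x = 0.7 * alpha + rng.normal(0.0, 1.0, ids.size)
y = alpha + 1.5 * x + rng.normal(0.0, 1.0, ids.size)
df = pd.DataFrame({"id": ids, "x": x, "y": y})

lsdv = smf.ols("y ~ x + C(id)", data=df).fit().params["x"]          # dummy per unit
xd = df["x"] - df.groupby("id")["x"].transform("mean")
yd = df["y"] - df.groupby("id")["y"].transform("mean")
within = (xd * yd).sum() / (xd ** 2).sum()
print(lsdv, within)                      # identical slopes (up to floating point)
```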


Page 40: Essential econometrics for data scientists

Conclusion

To establish a causal relationship:

- Must furnish evidence of correctness
- Experiments are the gold standard
- In the absence of random assignment to treatment:
  - Natural experiments can provide as-good-as-random assignment to treatment
  - Regression analysis can be causal if you condition on confounding influences and control for endogeneity
