Machine Learning and Econometrics
Sendhil Mullainathan
Understand OLS
• The real problem here is minimizing the “wrong” thing: In-sample fit vs out-of-sample fit
AVERAGES NOTATION
Decision Tree Example
Fitting
• Suppose we fit the best tree we could to some dataset
• What would we get?
• How do we resolve this problem?
OLS vs Subset Selection
• If the problem is that we are using too many variables, what if we…
  – Looked at functions that only use s of the k variables?
• Example:
  – The single variable that fits best
• Isn't there overfitting here too?
Let's do the same thing here
Unconstrained:
$$f_A = \arg\min_{f \in F_A} E_H L(f(x), y)$$
Constrained: why not do this instead?
$$\arg\min_{f \in F} E_H L(f(x), y) \quad \text{s.t.} \quad R(f) \le c$$
Complexity measure: tendency to overfit
Constrained minimization
• We could do a constrained minimization
• But notice that this is equivalent to:
• Want the complexity measure to capture the tendency to overfit
$$f_{A,\lambda} = \arg\min_{f \in F_A} E_H L(f(x), y) + \underbrace{\lambda R(f)}_{\text{want: } \approx\, L(f) - \hat{L}(f)}$$
Basic insight
• Data has signal and noise
• More complex function classes:
  – Allow us to pick up more of the signal
  – But also pick up more of the noise
• So the problem of prediction becomes the problem of choosing complexity
Overall Structure
• Create a regularizer that:
  – Measures complexity
• Penalize the algorithm for choosing more expressive functions
  – The tuning parameter lambda is the price
• Let it weigh this penalty against in-sample fit
Decision Tree Regularizer
• What makes a good regularizer?
  – Depth
  – Number of data points per leaf
  – Number of splits
• What happens as complexity gets higher?
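As a concrete illustration of how these knobs control overfitting, here is a minimal sketch (assuming scikit-learn and synthetic data, not from the original slides): in-sample loss keeps falling as the tree gets deeper, while out-of-sample loss eventually turns back up.

```python
# Sketch: tree depth as a complexity knob (assumes scikit-learn; data are synthetic).
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(1000, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.5, size=1000)  # signal + noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [2, 4, 8, 16]:
    tree = DecisionTreeRegressor(max_depth=depth, min_samples_leaf=5)
    tree.fit(X_train, y_train)
    in_sample = mean_squared_error(y_train, tree.predict(X_train))   # keeps falling with depth
    out_sample = mean_squared_error(y_test, tree.predict(X_test))    # eventually rises: overfit
    print(depth, round(in_sample, 3), round(out_sample, 3))
```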
Linear Example
• Linear function class
• Regularized linear regression
Regularizers for Linear Functions
• Linear functions are more expressive if they use more variables
• Can weight coefficients
$$R(f) = \sum_{j=1}^{k} 1_{\{\beta_j \neq 0\}}$$
$$R(\beta) = \sum_{j=1}^{k} |\beta_j|^p$$
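A minimal numeric illustration of these two penalties (the coefficient vector below is made up for illustration):

```python
import numpy as np

beta = np.array([1.5, 0.0, -0.3, 0.0, 2.0])   # hypothetical coefficient vector

subset_penalty = np.sum(beta != 0)            # R(f) = sum_j 1{beta_j != 0}  -> 3
l1_penalty = np.sum(np.abs(beta) ** 1)        # R(beta) with p = 1           -> 3.8
l2_penalty = np.sum(np.abs(beta) ** 2)        # R(beta) with p = 2           -> 6.34
print(subset_penalty, l1_penalty, l2_penalty)
```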
Computationally More Tractable
• Lasso
• Ridge
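For instance, a sketch using scikit-learn's Lasso (p = 1) and Ridge (p = 2) on synthetic data; the penalty weights are illustrative, not recommendations:

```python
# Sketch: lasso vs. ridge penalized regression (assumes scikit-learn; data are synthetic).
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                 # many candidate variables
beta_true = np.zeros(50)
beta_true[:5] = 1.0                            # only five of them matter
y = X @ beta_true + rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)             # L1 penalty: many coefficients exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)             # L2 penalty: coefficients shrunk toward zero

print((lasso.coef_ != 0).sum(), "nonzero lasso coefficients")
print((ridge.coef_ != 0).sum(), "nonzero ridge coefficients")
```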
Half the Sauce
• Regularization is one half of the secret sauce
• Gives a single-dimensional way of capturing expressiveness
• The missing ingredient is lambda: how much complexity do we want?
$$f_{A,\lambda} = \arg\min_{f \in F_A} E_H L(f(x), y) + \underbrace{\lambda R(f)}_{\text{want: } \approx\, L(f) - \hat{L}(f)}$$
Choosing lambda
• How much should we penalize expressiveness?
• How do you make the overfitting vs. approximation tradeoff?
• The tuning problem.
The tuning problem
• Back to where we started?
• We have parametrized the tradeoff
• But we still have no way of choosing the level of complexity
[Diagram: from the sample $S_n$ we can measure $E_{S_n}[L(f(x), y)]$, but on new data what we want is $E[L(f(x), y)]$. Splitting $S_n$ into Train and Tune sets makes both $E_{S_{\text{Train}}}[L(f(x), y)]$ and $E_{S_{\text{Tune}}}[L(f(x), y)]$ measurable.]
Want out of sample, but only have in sample.
Back to our original problem. In-sample: no regularization is the best regularization.
Traditional Model Selection: make structural assumptions on the DGP and analytically calculate the difference.
Empirical Tuning
• But now we can see what level of regularization does best out of sample
• So estimate for many values of lambda:
$$f_{A,\lambda} = \arg\min_{f \in F_A} E_H L(f(x), y) + \underbrace{\lambda R(f)}_{\text{want: } \approx\, L(f) - \hat{L}(f)}$$
Now in this case
• See performance of this in the new “tune” data
• A few assumptions and…
  – Simple convex optimization
  – So choosing between infinitely many procedures
$$\hat{\lambda} \approx \arg\min_{\lambda} E_H L(f_\lambda(x), y)$$
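A sketch of that tuning loop, assuming scikit-learn's Lasso as the regularized algorithm and an illustrative grid of lambda values:

```python
# Sketch: choose lambda by tune-set performance (assumes scikit-learn; data are synthetic).
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=500)

X_train, X_tune, y_train, y_tune = train_test_split(X, y, test_size=0.3, random_state=0)

lambdas = np.logspace(-3, 1, 20)                    # candidate penalty levels
tune_loss = []
for lam in lambdas:
    f_lam = Lasso(alpha=lam).fit(X_train, y_train)  # fit f_lambda on the train split
    tune_loss.append(mean_squared_error(y_tune, f_lam.predict(X_tune)))  # measure on tune split

lam_hat = lambdas[int(np.argmin(tune_loss))]        # lambda-hat: best out-of-sample performer
print(lam_hat)
```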
Overfit Dominates
Creating Out-of-Sample In Sample
• Major point:
  – Not many assumptions
  – Don't need to know the true model
  – Don't need to know much about the algorithm
• Something profound here: – We use the data itself to choose complexity
• Aside: What happens as sample goes up?
Why does this work?
1. Not just because we can split a sample and call it out of sample
– It’s because the thing we are optimizing is observable
This is more than a trick
• It illustrates what separates prediction from estimation:
  – I can't "observe" my prior
    • Whether the world is truly drawn from a linear model
  – But prediction quality is observable
• Put simply:
  – Validity of predictions is measurable
  – Validity of coefficient estimates requires structural knowledge
This is the essential ingredient of prediction: prediction quality is an empirical quantity, not a theoretical guarantee
Why does this work?
1. It's because the thing we are optimizing is observable
• Notice that this works irrespective of the number of variables
  – This was not directly hard-wired into our calculations
Why does this work?
1. It's because the thing we are optimizing is observable
2. By focusing on prediction quality we have reduced dimensionality
To understand this…
• Suppose you tried to use this to choose coefficients
  – Ask which set of coefficients worked well out of sample
• Does this work?
• Problem 1: Estimation quality is unobservable
  – Need the same assumptions as the algorithm to know whether you "work" out of sample
  – If you just go by fit, you are conceding that you want the best-predicting model
  – Coefficients don't exist in the same way predictions do
• Problem 2: No dimensionality reduction
  – You've got as many coefficients as before to search over
We can be more efficient than this
• Will use a tool called cross-validation
• Basic insight:
  – Why not use the hold-out to estimate another function and see how it does on the train set?
Cross Validation
[Diagram: rotate the Train/Tune split so that each fifth of the training set serves once as the tuning set; Tuning Set = 1/5 of Training Set.]
Some Notation
• Cross-validation is used for tuning
• But after we’ve done that, we cannot use it also to evaluate how well our algorithm is doing
• Why??
Secret Sauce
• Key ingredients:
  1. Dimensionality reduction through regularization
  2. A focus on predictions means quality is observable
• Which means we can empirically tune
[Diagram: Data → Engineering → Fitting → Testing. The fitting sample yields the prediction function $f$; the hold-out sample evaluates it with the loss function $L$, giving a loss range $[\underline{L}, \overline{L}]$ for $f$.]
Overview of ML Playbook
[Diagram: Train → Tune → Output. Train: $\forall \lambda$, fit predictors that are testable out-of-sample. Tune: use out-of-sample performance to choose $\hat{\lambda}$. Output: use $\hat{\lambda}$ to form $\hat{f}_{\lambda}$.]
Fitting
[Diagram: the playbook in detail on the fitting sample $T$. Divide the data $T$ into $k$ folds $\{F_1, \dots, F_k\}$, so that $(y_i, x_i) \in F_{\phi(i)}$. Train: $\forall \lambda$, estimate $\hat{f}^{-j}_{\lambda}$ on $T \setminus F_j$, making each fit testable out-of-sample. Tune: which $\lambda$ leads to the best prediction of $y_i$ by $\hat{f}^{-\phi(i)}_{\lambda}(x_i)$? Use this out-of-sample performance to choose $\hat{\lambda}$. Output: fit on $T$ using $\hat{\lambda}$ and output $\hat{f}_{A,T}$; sometimes instead output the average $\frac{1}{k}\sum_{j=1}^{k} \hat{f}^{-j}_{\hat{\lambda}}$.]
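A sketch of that playbook end to end, under the same assumptions as the earlier snippets (scikit-learn, synthetic data, an illustrative lambda grid):

```python
# Sketch of the cross-validation playbook (assumes scikit-learn; data are synthetic).
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 40))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=400)

lambdas = np.logspace(-3, 1, 15)
kf = KFold(n_splits=5, shuffle=True, random_state=0)      # divide T into k folds

cv_loss = np.zeros(len(lambdas))
for train_idx, fold_idx in kf.split(X):                   # hold out fold F_j
    for i, lam in enumerate(lambdas):                     # for all lambda: estimate on T \ F_j
        f = Lasso(alpha=lam).fit(X[train_idx], y[train_idx])
        cv_loss[i] += mean_squared_error(y[fold_idx], f.predict(X[fold_idx]))

lam_hat = lambdas[int(np.argmin(cv_loss))]                # lambda with best held-out predictions
f_hat = Lasso(alpha=lam_hat).fit(X, y)                    # output: refit on all of T with lambda-hat
print(lam_hat)
```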
Applications of Machine Learning
• New Data
• Prediction in Policy
New Data
• An Example
Xie et al. (2016)
What does this have to do with ML?
• Processing of data requires machine learning
Blumenstock 2015
What does this have to do with ML?
• Processing of data requires machine learning
Crop Yield
What does this have to do with ML?
• Processing of data requires machine learning
• Two kinds of processing:
  – Pre-processing
    • Extracting any sort of features from the image
  – Processing
    • Conversion of features to economically meaningful units
Find Farms
Relate visual features to yield
Considerations
• Need training data
  – Hand labeling
  – Merging to other data sets
• Don’t be stingy
New Data
• An Example
• Kinds of New Data – Satellite Data – Language data
"This class was a religious experience for me... I had to take it all on faith."
"I am convinced that you can learn by osmosis by just siKng in his class."
"Most of us spent the 1st 3 weeks terrified of the class. Then solidarity kicked in."
"The course was very thorough. What wasn't covered in class was covered on the final exam."
TEXT
Language Features
• Bag of words
Bag of Words
"This class was a religious experience for me... I had to take it all on faith."
"I am convinced that you can learn by osmosis by just siKng in his class."
"Most of us spent the 1st 3 weeks terrified of the class. Then solidarity kicked in."
"The course was very thorough. What wasn't covered in class was covered on the final exam."
TEXT → Dictionary
This Class Was A Religious Experience For Me I Had To Take It All On Faith
Am Convinced That You Can Learn By Osmosis Just Sitting In His
Most Of Us Spent The First Three Weeks . . .
"This class was a religious experience for me... I had to take it all on faith."
This
Am
Most
of
class
convinced
By
osmosis
Three
Weeks
1 0 0 0 1 0 0 0 0 0
0 1 0 0 1 1 1 1 0 0
"I am convinced that you can learn by osmosis by just siKng in his class."
Can Predict Which Bills Survive Committee
Yano, Smith, and Wilkerson
Financial Information
Kogan et al.
10-K Forms
Predicting Volatility
What predicts?
Language Features
• Bag of words
• Modifying bag of words: similarity/synonyms
• Syntactic structure
• Meaning: sentiment analysis
Can use sentiment as a feature
Language Features
• Bag of words
• Modifying bag of words: similarity/synonyms
• Syntactic structure
• Meaning: sentiment analysis
• LIWC
New Data
• An Example
• Kinds of New Data – Satellite Data – Language data – Digital Exhaust
Google Searches for “iPhone slow”
Choi and Varian
New Data
• An Example
• Kinds of New Data – Satellite Data – Language data – Digital Exhaust – Network Data – …...
Applications of Machine Learning
• New Data
• Prediction in Policy
Question
• Can prediction be directly useful in policy?
• These decisions seem inherently causal:
  – "Should we do policy X?"
  – "What will X do?"
  – "What happens with and without X?"
Two Toy Policy Decisions
• Rain Dance
• Umbrella
• Common Elements
  – Both are decisions with payoffs
  – Both rely on data of the type: Y = rain, X = variables correlated with rain
  – Both use data to estimate a function y = f(x)
Framework
[Diagrams: a decision $X_0$ yields a payoff $\Pi$, which depends on the outcome $Y$ (rain) and observables $X$ (atmospheric conditions). For the rain dance, the decision $X_0$ affects $Y$ itself, so the payoff hinges on causation; for the umbrella, $X_0$ does not affect $Y$, so the payoff hinges only on predicting $Y$. Causation problems call for experiments; prediction problems call for machine learning.]
Are there Umbrella Problems?
• Decisions where predictions matter…
• Where we can have big social impact
• And with enough data
• Prediction policy problems
Prediction
A Policy Problem in the US
• Each year police make over 12 million arrests
• Where do people wait for trial?
• Release vs. detain is high stakes
  – Pre-trial detention spells average 2-3 months (can be up to 9-12 months)
  – Nearly 750,000 people in jails in the US
  – Consequential for jobs and families as well as crime
Kleinberg, Lakkaraju, Leskovec, Ludwig, and Mullainathan
Judge’s Problem
• Judge must decide whether to release or not (bail)
• A defendant out on bail can behave badly:
  – Fail to appear for the case
  – Commit a crime
• The judge is making a prediction
[Diagrams: the defendant's past record is used to assess crime risk, which combines with the release decision to determine social costs. Inferring crime risk from the past record is a prediction problem; the effect of an intervention such as a bracelet on social costs is a causation problem.]
How big is this effect?
• Effect of another police officer (Chalfin and McCrary 2013)
  – To get a 4 percentage point reduction in crime…
    • Would need ~40,000 more officers nationwide
    • Costs more than 4.8 billion dollars per year
  – Or just implement this prediction rule
    • Some fixed costs and minimal flow cost
• And we’re not even considering the other benefits
Important Caveat in this Analysis
• Selective labels – The literature ignores this
• How do we resolve it?
Bail Not Unique
Prediction Policy Problems
• Decision aids, not a substitute for humans
• Must resolve important policy considerations