Machine Learning and Econometrics
Sendhil Mullainathan
Understand OLS
• The real problem here is minimizing the “wrong” thing: In-sample fit vs out-of-sample fit
AVERAGES NOTATION
Decision Tree Example
Fitting
• Suppose we fit the best tree we could to some dataset
• What would we get?
• How do we resolve this problem?
OLS vs Subset Selection
• If the problem is that we are using too many variables, what if we…
  – Looked at functions that only use s of the k variables?
• Example:
  – The single variable that fits best
• Isn't there overfitting here too?
Let's do the same thing here
Unconstrained:
$$f_A = \arg\min_{f \in F_A} E_H L(f(x), y)$$
Constrained: why not do this instead?
$$\arg\min_{f \in F} E_H L(f(x), y) \quad \text{s.t.} \quad R(f) \le c$$
Complexity measure: tendency to overfit
Constrained minimization
• We could do a constrained minimization
• But notice that this is equivalent to:
• Want the complexity measure to capture the tendency to overfit
$$f_{A,\lambda} = \arg\min_{f \in F_A} E_H L(f(x), y) + \underbrace{\lambda R(f)}_{\text{want: } \approx\, L(f) - \hat{L}(f)}$$
Basic insight
• Data has signal and noise
• More complex function classes:
  – Allow us to pick up more of the signal
  – But also pick up more of the noise
• So the problem of prediction becomes the problem of choosing complexity
Overall Structure
• Create a regularizer that:
  – Measures complexity
• Penalize the algorithm for choosing more expressive functions
  – The tuning parameter lambda is the price
• Let it weigh this penalty against in-sample fit
Decision Tree Regularizer
• What makes a good regularizer?
  – Depth
  – Number of data points per leaf
  – Number of splits
• What happens as complexity gets higher?
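As a concrete illustration of how these knobs control overfitting, here is a minimal sketch (assuming scikit-learn and synthetic data, not from the original slides): in-sample loss keeps falling as the tree gets deeper, while out-of-sample loss eventually turns back up.

```python
# Sketch: tree depth as a complexity knob (assumes scikit-learn; data are synthetic).
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(1000, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.5, size=1000)  # signal + noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [2, 4, 8, 16]:
    tree = DecisionTreeRegressor(max_depth=depth, min_samples_leaf=5)
    tree.fit(X_train, y_train)
    in_sample = mean_squared_error(y_train, tree.predict(X_train))   # keeps falling with depth
    out_sample = mean_squared_error(y_test, tree.predict(X_test))    # eventually rises: overfit
    print(depth, round(in_sample, 3), round(out_sample, 3))
```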
Linear Example
• Linear function class
• Regularized linear regression
Regularizers for Linear Functions
• Linear functions are more expressive if they use more variables
• Can weight coefficients
$$R(f) = \sum_{j=1}^{k} 1_{\{\beta_j \neq 0\}}$$
$$R(\beta) = \sum_{j=1}^{k} |\beta_j|^p$$
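A minimal numeric illustration of these two penalties (the coefficient vector below is made up for illustration):

```python
import numpy as np

beta = np.array([1.5, 0.0, -0.3, 0.0, 2.0])   # hypothetical coefficient vector

subset_penalty = np.sum(beta != 0)            # R(f) = sum_j 1{beta_j != 0}  -> 3
l1_penalty = np.sum(np.abs(beta) ** 1)        # R(beta) with p = 1           -> 3.8
l2_penalty = np.sum(np.abs(beta) ** 2)        # R(beta) with p = 2           -> 6.34
print(subset_penalty, l1_penalty, l2_penalty)
```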
Computationally More Tractable
• Lasso
• Ridge
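For instance, a sketch using scikit-learn's Lasso (p = 1) and Ridge (p = 2) on synthetic data; the penalty weights are illustrative, not recommendations:

```python
# Sketch: lasso vs. ridge penalized regression (assumes scikit-learn; data are synthetic).
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                 # many candidate variables
beta_true = np.zeros(50)
beta_true[:5] = 1.0                            # only five of them matter
y = X @ beta_true + rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)             # L1 penalty: many coefficients exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)             # L2 penalty: coefficients shrunk toward zero

print((lasso.coef_ != 0).sum(), "nonzero lasso coefficients")
print((ridge.coef_ != 0).sum(), "nonzero ridge coefficients")
```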
Half the Sauce
• Regularization is one half of the secret sauce
• Gives a single-dimensional way of capturing expressiveness
• The missing ingredient is lambda: how much complexity do we want?
$$f_{A,\lambda} = \arg\min_{f \in F_A} E_H L(f(x), y) + \underbrace{\lambda R(f)}_{\text{want: } \approx\, L(f) - \hat{L}(f)}$$
Choosing lambda
• How much should we penalize expressiveness?
• How do you make the overfitting vs. approximation tradeoff?
• The tuning problem.
The tuning problem
• Back to where we started?
• We have parametrized the tradeoff
• But we still have no way of choosing the level of complexity
[Diagram: from the sample $S_n$ we can measure $E_{S_n}[L(f(x), y)]$, but on new data what we want is $E[L(f(x), y)]$. Splitting $S_n$ into Train and Tune sets makes both $E_{S_{\text{Train}}}[L(f(x), y)]$ and $E_{S_{\text{Tune}}}[L(f(x), y)]$ measurable.]
Want out of sample, but only have in sample.
Back to our original problem. In-sample: no regularization is the best regularization.
Traditional Model Selection: make structural assumptions on the DGP and analytically calculate the difference.
Empirical Tuning
• But now we can see what level of regularization does best out of sample
• So estimate for many values of lambda:
$$f_{A,\lambda} = \arg\min_{f \in F_A} E_H L(f(x), y) + \underbrace{\lambda R(f)}_{\text{want: } \approx\, L(f) - \hat{L}(f)}$$
Now in this case
• See performance of this in the new “tune” data
• A few assumptions and…
  – Simple convex optimization
  – So choosing between infinitely many procedures
$$\hat{\lambda} \approx \arg\min_{\lambda} E_H L(f_\lambda(x), y)$$
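A sketch of that tuning loop, assuming scikit-learn's Lasso as the regularized algorithm and an illustrative grid of lambda values:

```python
# Sketch: choose lambda by tune-set performance (assumes scikit-learn; data are synthetic).
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=500)

X_train, X_tune, y_train, y_tune = train_test_split(X, y, test_size=0.3, random_state=0)

lambdas = np.logspace(-3, 1, 20)                    # candidate penalty levels
tune_loss = []
for lam in lambdas:
    f_lam = Lasso(alpha=lam).fit(X_train, y_train)  # fit f_lambda on the train split
    tune_loss.append(mean_squared_error(y_tune, f_lam.predict(X_tune)))  # measure on tune split

lam_hat = lambdas[int(np.argmin(tune_loss))]        # lambda-hat: best out-of-sample performer
print(lam_hat)
```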
Overfit Dominates
Creating Out-of-Sample In Sample
• Major point:
  – Not many assumptions
  – Don't need to know the true model
  – Don't need to know much about the algorithm
• Something profound here: – We use the data itself to choose complexity
• Aside: What happens as sample goes up?
Why does this work?
1. Not just because we can split a sample and call it out of sample
– It’s because the thing we are optimizing is observable
This is more than a trick
• It illustrates what separates prediction from estimation:
  – I can't "observe" my prior
    • Whether the world is truly drawn from a linear model
  – But prediction quality is observable
• Put simply:
  – Validity of predictions is measurable
  – Validity of coefficient estimates requires structural knowledge
This is the essential ingredient of prediction: prediction quality is an empirical quantity, not a theoretical guarantee
Why does this work?
1. It's because the thing we are optimizing is observable
• Notice that this works irrespective of the number of variables
  – This was not directly hard-wired into our calculations
Why does this work?
1. It's because the thing we are optimizing is observable
2. By focusing on prediction quality we have reduced dimensionality
To understand this…
• Suppose you tried to use this to choose coefficients
  – Ask which set of coefficients worked well out of sample
• Does this work?
• Problem 1: Estimation quality is unobservable
  – Need the same assumptions as the algorithm to know whether you "work" out of sample
  – If you just go by fit, you are conceding that you want the best-predicting model
  – Coefficients don't exist in the same way predictions do
• Problem 2: No dimensionality reduction
  – You've got as many coefficients as before to search over
We can be more efficient than this
• Will use a tool called cross-validation
• Basic insight:
  – Why not use the hold-out to estimate another function and see how it does on the train set?
Cross Validation
[Diagram: rotate the Train/Tune split so that each fifth of the training set serves once as the tuning set; Tuning Set = 1/5 of Training Set.]
Some Notation
• Cross-validation is used for tuning
• But after we’ve done that, we cannot use it also to evaluate how well our algorithm is doing
• Why??
Secret Sauce
• Key ingredients:
  1. Dimensionality reduction through regularization
  2. A focus on predictions means quality is observable
• Which means we can empirically tune
[Diagram: Data → Engineering → Fitting → Testing. The fitting sample yields the prediction function $f$; the hold-out sample evaluates it with the loss function $L$, giving a loss range $[\underline{L}, \overline{L}]$ for $f$.]
Overview of ML Playbook
[Diagram: Train → Tune → Output. Train: $\forall \lambda$, fit predictors that are testable out-of-sample. Tune: use out-of-sample performance to choose $\hat{\lambda}$. Output: use $\hat{\lambda}$ to form $\hat{f}_{\lambda}$.]
Fitting
[Diagram: the playbook in detail on the fitting sample $T$. Divide the data $T$ into $k$ folds $\{F_1, \dots, F_k\}$, so that $(y_i, x_i) \in F_{\phi(i)}$. Train: $\forall \lambda$, estimate $\hat{f}^{-j}_{\lambda}$ on $T \setminus F_j$, making each fit testable out-of-sample. Tune: which $\lambda$ leads to the best prediction of $y_i$ by $\hat{f}^{-\phi(i)}_{\lambda}(x_i)$? Use this out-of-sample performance to choose $\hat{\lambda}$. Output: fit on $T$ using $\hat{\lambda}$ and output $\hat{f}_{A,T}$; sometimes instead output the average $\frac{1}{k}\sum_{j=1}^{k} \hat{f}^{-j}_{\hat{\lambda}}$.]
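A sketch of that playbook end to end, under the same assumptions as the earlier snippets (scikit-learn, synthetic data, an illustrative lambda grid):

```python
# Sketch of the cross-validation playbook (assumes scikit-learn; data are synthetic).
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 40))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=400)

lambdas = np.logspace(-3, 1, 15)
kf = KFold(n_splits=5, shuffle=True, random_state=0)      # divide T into k folds

cv_loss = np.zeros(len(lambdas))
for train_idx, fold_idx in kf.split(X):                   # hold out fold F_j
    for i, lam in enumerate(lambdas):                     # for all lambda: estimate on T \ F_j
        f = Lasso(alpha=lam).fit(X[train_idx], y[train_idx])
        cv_loss[i] += mean_squared_error(y[fold_idx], f.predict(X[fold_idx]))

lam_hat = lambdas[int(np.argmin(cv_loss))]                # lambda with best held-out predictions
f_hat = Lasso(alpha=lam_hat).fit(X, y)                    # output: refit on all of T with lambda-hat
print(lam_hat)
```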
Applications of Machine Learning
• New Data
• Prediction in Policy
New Data
• An Example
Xie et al. (2016)
What does this have to do with ML?
• Processing of data requires machine learning
Blumenstock 2015
What does this have to do with ML?
• Processing of data requires machine learning
Crop Yield
What does this have to do with ML?
• Processing of data requires machine learning
• Two kinds of processing:
  – Pre-processing
    • Extracting any sort of features from the image
  – Processing
    • Conversion of features to economically meaningful units
Find Farms
Relate visual features to yield
Considerations
• Need training data
  – Hand labeling
  – Merging to other data sets
• Don’t be stingy
New Data
• An Example
• Kinds of New Data – Satellite Data – Language data
"This class was a religious experience for me... I had to take it all on faith."
"I am convinced that you can learn by osmosis by just siKng in his class."
"Most of us spent the 1st 3 weeks terrified of the class. Then solidarity kicked in."
"The course was very thorough. What wasn't covered in class was covered on the final exam."
TEXT
Language Features
• Bag of words
Bag of Words
"This class was a religious experience for me... I had to take it all on faith."
"I am convinced that you can learn by osmosis by just siKng in his class."
"Most of us spent the 1st 3 weeks terrified of the class. Then solidarity kicked in."
"The course was very thorough. What wasn't covered in class was covered on the final exam."
TEXT → Dictionary
This Class Was A Religious Experience For Me I Had To Take It All On Faith
Am Convinced That You Can Learn By Osmosis Just Sitting In His
Most Of Us Spent The First Three Weeks . . .
"This class was a religious experience for me... I had to take it all on faith."
This
Am
Most
of
class
convinced
By
osmosis
Three
Weeks
1 0 0 0 1 0 0 0 0 0
0 1 0 0 1 1 1 1 0 0
"I am convinced that you can learn by osmosis by just siKng in his class."
Can Predict Which Bills Survive Committee
Yano, Smith, and Wilkerson
Financial Information
Kogan et al.
10-K Forms
Predicting Volatility
What predicts?
Language Features
• Bag of words
• Modifying bag of words: similarity/synonyms
• Syntactic structure
• Meaning: sentiment analysis
Can use sentiment as a feature
Language Features
• Bag of words
• Modifying bag of words: similarity/synonyms
• Syntactic structure
• Meaning: sentiment analysis
• LIWC
New Data
• An Example
• Kinds of New Data – Satellite Data – Language data – Digital Exhaust
Google Searches for “iPhone slow”
Choi and Varian
New Data
• An Example
• Kinds of New Data – Satellite Data – Language data – Digital Exhaust – Network Data – …...
Applications of Machine Learning
• New Data
• Prediction in Policy
Question
• Can prediction be directly useful in policy?
• These decisions seem inherently causal:
  – "Should we do policy X?"
  – "What will X do?"
  – "What happens with and without X?"
Two Toy Policy Decisions
• Rain Dance
• Umbrella
• Common Elements
  – Both are decisions with payoffs
  – Both rely on data of the type: Y = rain, X = variables correlated with rain
  – Both use data to estimate a function y = f(x)
Framework
[Diagrams: a decision $X_0$ yields a payoff $\Pi$, which depends on the outcome $Y$ (rain) and observables $X$ (atmospheric conditions). For the rain dance, the decision $X_0$ affects $Y$ itself, so the payoff hinges on causation; for the umbrella, $X_0$ does not affect $Y$, so the payoff hinges only on predicting $Y$. Causation problems call for experiments; prediction problems call for machine learning.]
Are there Umbrella Problems?
• Decisions where predictions matter…
• Where we can have big social impact
• And with enough data
• Prediction policy problems
Prediction
A Policy Problem in the US
• Each year police make over 12 million arrests
• Where do people wait for trial?
• Release vs. detain is high stakes
  – Pre-trial detention spells average 2-3 months (can be up to 9-12 months)
  – Nearly 750,000 people in jails in the US
  – Consequential for jobs and families as well as crime
Kleinberg, Lakkaraju, Leskovec, Ludwig, and Mullainathan
Judge’s Problem
• Judge must decide whether to release or not (bail)
• A defendant out on bail can behave badly:
  – Fail to appear for the case
  – Commit a crime
• The judge is making a prediction
[Diagrams: the defendant's past record is used to assess crime risk, which combines with the release decision to determine social costs. Inferring crime risk from the past record is a prediction problem; the effect of an intervention such as a bracelet on social costs is a causation problem.]
How big is this effect?
• Effect of another police officer (Chalfin and McCrary 2013)
  – To get a 4 percentage point reduction in crime…
    • Would need ~40,000 more officers nationwide
    • Costs more than 4.8 billion dollars per year
  – Or just implement this prediction rule
    • Some fixed costs and minimal flow cost
• And we’re not even considering the other benefits
Important Caveat in this Analysis
• Selective labels – The literature ignores this
• How do we resolve it?
Bail Not Unique
Prediction Policy Problems
• Decision aids, not a substitute for humans
• Must resolve important policy considerations