
Page 1: A Conceptual Review


BAYESIAN STATISTICS AND MACHINE LEARNING

A Conceptual Review

Nicolaj Nørgaard Mühlbach, Ph.D. student

Department of Economics and Business Economics, Aarhus University

November 28, 2017

Page 2: A Conceptual Review

Overview of methodology and purpose

• We will cover the link between Bayesian statistics and machine learning

Figure: Different methods and different purposes

Page 3: A Conceptual Review

How likely are you to be a fan of Bayesian/ML methods?

• Assume data D = {(x_i, y_i)}_{i=1}^N are samples of B.Sc. Oecon students, where:

• x_i ∈ [0,1]^p denotes to which degree student i is a fan of the p courses

• Assume for simplicity that (1/N) ∑_{i=1}^N x_ij = 1/2 ∀j ∈ {1, …, p}

• y_i ∈ [0,1] denotes to which degree student i is a fan of Bayesian ML

• Let f̂_D : [0,1]^p ↦ [0,1] be learned from D (estimated, if you will)

• Now, consider an unseen student x_0 ∈ R^p, who could be any one of you

Question: Is x_0 likely to be a fan of Bayesian ML?

Page 4: A Conceptual Review

Figure: A decision tree predicting y from the course preferences. The nodes split on x_1 (Math/Stat, 1st) > 1/2, x_2 (Stat, 2nd) > 1/2, x_3 (Math, 3rd) > 1/2, x_4 (QQL, 4th) > 1/2, and x_5 (Econometrics, 5th) > 1/2; the leaves predict y = 1 (⇒ huge fan), y = 0.8, y = 0.6, y = 0.4, y = 0.1, and y = 0 (⇒ not a fan).
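A minimal sketch (added here, not in the original deck) of how such an f̂_D could be fitted in practice; the simulated data set and the preference rule below are invented purely for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
N, p = 500, 5                        # students and courses

# x[i, j]: degree to which student i is a fan of course j, uniform on [0, 1]
X = rng.uniform(size=(N, p))

# Hypothetical rule: fans of the quantitative courses tend to like Bayesian ML
y = np.clip(X.mean(axis=1) + rng.normal(scale=0.1, size=N), 0.0, 1.0)

# Learn f_hat_D : [0,1]^p -> [0,1] with a shallow regression tree
f_hat = DecisionTreeRegressor(max_depth=5).fit(X, y)

x0 = rng.uniform(size=(1, p))        # an unseen student
print(f_hat.predict(x0))             # predicted degree of Bayesian-ML fandom
```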

Page 5: A Conceptual Review

Our goals for this evening

Understand the fundamentals of Bayesian statistics
• Wrap up on fundamental Bayesian theory
• The Bayesian approach to machine learning (or anything)
• The exponential-gamma Bayesian model
• The computational challenge
• Distinctive features of the Bayesian approach
• How Bayesian methods differ from machine learning

Understand the opportunities of machine learning in economics
• Econometric challenges in terms of prediction
• Bias-variance trade-off
• What machine learning does differently

Understand an example of shrinkage models
• The Ridge estimator
• Ridge vs. Lasso from a graphical perspective

Page 6: A Conceptual Review

Understand The Fundamentals of Bayesian Statistics

“Indeed, if you accept the argument that the false positive rate should be higher for theories that are unlikely, then you have already adopted a fundamentally Bayesian line of reasoning.”

Harvey (2017)

Page 7: A Conceptual Review

Wrap up on fundamental Bayesian theory

• The unit of statistical inference for both frequentists and Bayesians is a family of probability densities

  F = {f_θ(x) ; x ∈ X, θ ∈ Θ},  where X is the input space and Θ the parameter space

• The posterior combines the prior and the conditional likelihood of the parameters

• This is done using Bayes’ rule

  P(parameters|data) = P(parameters) × P(data|parameters) / P(data)

  Posterior ∝ Prior × Likelihood

• The data are fixed at their observed values while θ varies over Θ (the opposite of the frequentist view)

• We make predictions by integrating with respect to the posterior

  P(new data|data) = ∫_parameters P(new data|parameters) × P(parameters|data) d(parameters)
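A numeric sketch of this recipe (added here, not on the slides), assuming a Bernoulli coin-flip model and a flat prior chosen purely for brevity: discretize Θ on a grid and normalize prior × likelihood into a posterior.

```python
import numpy as np

# Grid approximation of Bayes' rule for a Bernoulli parameter theta
theta = np.linspace(0.001, 0.999, 999)   # parameter grid over Theta
prior = np.ones_like(theta)              # flat prior P(parameters)
data = np.array([1, 0, 1, 1, 0, 1])      # observed coin flips (invented)

# Likelihood P(data | parameters) evaluated on the grid
likelihood = theta ** data.sum() * (1 - theta) ** (len(data) - data.sum())

posterior = prior * likelihood
posterior /= posterior.sum()             # normalize by P(data)

# Posterior predictive P(new flip = 1 | data): integrate over the posterior
print((theta * posterior).sum())
```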

Page 8: A Conceptual Review

• This illustrates how uncertainty is generated in the model

Figure: Bayesian inference proceeds vertically given x; frequentist inference proceeds horizontally given µ

Page 9: A Conceptual Review

The Bayesian approach to machine learning (or anything)

• Bayesian modeling applies Bayes’ rule to the unknown variables in a model

• Many machine learning algorithms involve exactly this step

• In a simple, generic form we can write this process as

  x ~ p(x|θ)  ← the data-generating distribution; this is called the model
  θ ~ p(θ)    ← the a priori knowledge of θ; this is called the model prior

• We want to learn θ from data using

  p(θ|x) = p(x|θ) p(θ) / p(x) ∝ p(x|θ) p(θ),

  where “=” is exact and “∝” holds up to the normalizing constant p(x)

• but, unfortunately, p(θ|x) is unknown

• However, we have defined p(x|θ) p(θ), and Bayes guides us, although this is non-trivial:
  – the model p(x|θ) can be quite complicated
  – p(θ|x) can be intractable due to the integral in the normalizing constant

Page 10: A Conceptual Review

The exponential-gamma Bayesian model

• Suppose X|λ ~ E(λ), i.e. the data are exponentially distributed with parameter λ. Then

  f_X(x) = λ e^{−λx},  x > 0, λ > 0  (probability density function)

• The exponential distribution describes the time between events in a Poisson process

• We want to learn f(λ|x) because E[X] = 1/λ and V[X] = 1/λ²

• Assume λ ~ G(α, β), i.e. that the constant average rate is gamma-distributed with θ = {α, β}

• Then

  p(λ) = (β^α / Γ(α)) λ^{α−1} e^{−βλ},  λ > 0  (probability density function)

Page 11: A Conceptual Review

Posterior in the exponential-gamma model

• Given sample data x ∈ R^N, the posterior distribution is then

  p(λ|x) ∝ p(λ) × ℓ(x|λ)
         = (β^α / Γ(α)) λ^{α−1} e^{−βλ} × ∏_{i=1}^N λ e^{−λx_i}   [prior (gamma PDF) × likelihood]
         = (β^α / Γ(α)) λ^{α−1} e^{−βλ} × λ^N (e^{−λx_1} × e^{−λx_2} × ··· × e^{−λx_N})
         = (β^α / Γ(α)) λ^{α+N−1} e^{−βλ} × e^{−λ(x_1 + x_2 + … + x_N)}
         = (β^α / Γ(α)) λ^{α+N−1} e^{−βλ} × e^{−λN x̄}
         ∝ λ^{α+N−1} e^{−(β + N x̄)λ},  i.e. distributed as G(α + N, β + N x̄)

• Only the sufficient statistics (N, x̄) are required to transit from prior to posterior

• From this, we may calculate the predictive distribution of a new sample x_new
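A small sketch of this conjugate update in code (added here, not on the slides); the hyperparameters, the true rate, and the simulated data are assumptions chosen purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

alpha, beta = 2.0, 1.0                              # assumed gamma prior G(alpha, beta)
lam_true = 3.0
x = rng.exponential(scale=1 / lam_true, size=200)   # X | lam ~ E(lam)

# Conjugate update: posterior is G(alpha + N, beta + N * x_bar)
N, x_bar = len(x), x.mean()
post = stats.gamma(a=alpha + N, scale=1 / (beta + N * x_bar))

print(post.mean())          # posterior mean of lam, close to lam_true
print(post.interval(0.95))  # 95% credible interval for lam
```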

Page 12: A Conceptual Review

Conjugate priors in general

Definition 1 (Conjugate priors)
If F is a class of sampling distributions f(x|θ) and P is a class of prior distributions p(θ) for θ, we say that P is conjugate to F if

  p(θ|x) ∈ P  ∀ f(·|θ) ∈ F ∧ p(·) ∈ P

Example
• Let F = {E}, i.e. the family we draw x from is only the exponential distribution
• Let P = {G}, i.e. the family we draw λ from is only the gamma distribution
• As derived above, p(θ|x) ∈ P, i.e. the posterior is also gamma-distributed
• Then, the gamma distribution is a conjugate prior to the exponential distribution

  Likelihood                                Conjugate prior
  Binomial, Negative binomial, Geometric    Beta
  Poisson                                   Gamma

Table: Discrete distributions

Page 13: A Conceptual Review

The computational challenge

• Apart from specifying priors, the big challenge is computing the posterior distribution

• There are four main approaches (but we will not cover variational approximation):

  – Analytical integration: use conjugate priors, which combine nicely with the likelihood; usually too much to hope for

  – Laplace approximation: treat the posterior as approximately Gaussian by choosing {µ, Σ} such that

      p(θ|X) ≈d N(µ, Σ)

    Works well when there is a lot of data and low model complexity

  – Monte Carlo integration (Gibbs sampling or MCMC): simulate a Markov chain that eventually converges to the posterior distribution. Imagine we are interested in E_p(X)[f(X)]; the idea follows (a code sketch appears after this list)

      Sample X_1, …, X_N ~iid p(X)  ⇒  E_p(X)[f(X)] ≈ (1/N) ∑_{i=1}^N f(X_i)

    As N → ∞, the approximation converges to the true expectation; it can be applied to a remarkable variety of problems (the dominant approach)
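The Monte Carlo sketch referenced above (added, not on the slides); the target density p(X) = N(0, 1) and the function f(x) = x² are arbitrary choices, picked so the true expectation is known to equal 1:

```python
import numpy as np

rng = np.random.default_rng(2)

# Target: E[f(X)] with X ~ N(0, 1) and f(x) = x**2 (true value: 1)
samples = rng.normal(size=100_000)   # X_1, ..., X_N ~iid p(X)
estimate = np.mean(samples ** 2)     # (1/N) * sum of f(X_i)

print(estimate)   # close to 1.0; the error shrinks at rate 1/sqrt(N)
```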

Page 14: A Conceptual Review

Distinctive features of the Bayesian approach

Probability
• Probability is used not only to describe “physical” randomness, but also to describe uncertainty regarding the true values of the parameters
• These prior and posterior probabilities represent degrees of belief

Modeling
• The Bayesian approach takes modeling seriously; a Bayesian model includes a suitable prior distribution for model parameters
• If the model/prior are chosen without regard for the actual situation, there is no justification for believing the results of Bayesian inference

Finite-sample justification
• The model and prior are chosen based on our knowledge of the problem
• Thus, we do not rely on asymptotic theory, and we do not restrict the complexity of the model just because we have only a small amount of data

Also, distinctive features include model complexity justification and problem breakdown (see the next two slides)

Page 15: A Conceptual Review

Model complexity justification

• Imagine a complex and high-dimensional model

• Often, the log-likelihood will be flat along some dimension

• Adding prior knowledge corresponds to adding curvature to the likelihood

Figure: An Illustration of Bayesian Estimation

Page 16: A Conceptual Review

Problem breakdown

• Imagine a management consultant aiming to increase profits for a client

• Imagine a researcher trying to estimate the marginal propensity to consume

• How would you proceed?

Figure: The Bayesian approach from a business perspective

Page 17: A Conceptual Review

How Bayesian methods differ from machine learning

• Pure machine learning arises from statistical learning, best illustrated as follows

Figure: An Illustration of Machine Learning

• The learning machine has various knobs, whose settings change the prediction

• Learning is about twiddling the knobs to make better predictions

• This differs profoundly from the Bayesian view, as the learning is arbitrary

• Unlike a model, the machine has no meaningful semantics in terms of beliefs

• The knobs do not correspond to the parameters of a Bayesian model

Page 18: A Conceptual Review

Opportunities of Machine Learning in Economics

“There have been very fruitful collaborations between computer scientists and statisticians in the last decade or so, and I expect collaborations between computer scientists and econometricians will also be productive in the future.”

Varian (2014)

Page 19: A Conceptual Review

The focus of econometricians is to draw inference ...

• Econometricians rely on statistical properties to draw inference

• Suppose a model P = {P_θ : θ ∈ Θ} and data X ∈ R^{N×p} with P_θ(X) = P(X|θ)

• An estimator θ̂ = θ̂(X) is an unbiased estimator of the parameter θ iff

  E[θ̂] = θ  ∀θ ∈ Θ

• The sequence of estimators θ̂_N = θ̂_N(x_1, …, x_N) is consistent iff

  ∀ε > 0 :  lim_{N→∞} P(|θ̂_N − θ| < ε) = 1  ∀θ ∈ Θ

• An unbiased estimator is efficient if its variance attains the Cramér–Rao bound

  V(θ̂) = I_θ^{−1}  ∀θ ∈ Θ  (Cramér–Rao bound)

Page 20: A Conceptual Review

... whereas the focus in machine learning is prediction

• Let L(f) = E_{P(X,Y)}[ℓ(f(X), Y)], and let {f*, f̂} denote the optimal and the feasible predictor

• The objective in machine learning is to minimize the error of prediction, given by

  L(f̂_D) = [L(f̂_D) − L(f*)] + [L(f*) − L*] + L*,

  i.e. prediction error = estimation error + approximation error + irreducible error

• In causal inference, the objective is instead to minimize the in-sample prediction error

  L(f̂_D) − L(f*) = [L̂(f̂_D) − L̂(f*)] + [L(f̂_D) − L̂(f̂_D)] + [L̂(f*) − L(f*)],

  i.e. estimation error = in-sample prediction error + unseen overfit + random variation

• This comes at the cost of unseen overfit and, often, an increase in the approximation error

Page 21: A Conceptual Review

Common specification reveals bias-variance trade-off

• Consider input and output spaces X, Y, and assume (X, Y) ∈ X × Y are stochastic variables distributed according to an unknown joint probability distribution P(X, Y)

• Specify the squared loss function ℓ(z) = z², common among econometricians

• Observe data D = {(x_i, y_i)}, x_i ∈ R^p, y_i ∈ R ∀i ∈ {1, …, N}, D ~iid P(X, Y)

• Given D, we estimate a function f̂ ∈ F, f̂ : X ↦ Y, that predicts Y from X

  L(f̂) = E_{P(X,Y)}[(Y − f*(X) + f*(X) − f̂(X))²]   (f*(X) artificially added and subtracted)
        = E_{P(X,Y)}[(Y − f*(X))²]   (irreducible noise)
        + E_{P(X,Y)}[(f*(X) − f̂(X))²]   (mean squared error)
        + 2 E_{P(X,Y)}[(Y − f*(X))(f*(X) − f̂(X))]   (covariance of noise and bias)

Page 22: A Conceptual Review

Irreducible error and covariance of noise and bias

• Denote by σ²_ε the irreducible noise E_{P(X,Y)}[(Y − f*(X))²]

• The covariance of noise and bias vanishes:

  E_{P(X,Y)}[(Y − f*(X))(f*(X) − f̂(X))]
    = ∫∫ (Y − f*(X))(f*(X) − f̂(X)) P(Y|X) P(X) dY dX   (using P(Y|X)P(X) = P(Y, X))
    = ∫ E_{Y|X}[Y − f*(X)] (f*(X) − f̂(X)) P(X) dX = 0,

  since f*(X) = E[Y|X] makes the inner expectation E_{Y|X}[Y − f*(X)] equal to zero

Page 23: A Conceptual Review

Mean squared error and bias-variance trade-off

  MSE = E_{P(X,Y)}[(f*(X) − E_D[f̂(X)] + E_D[f̂(X)] − f̂(X))²]
      = E_{P(X,Y)}[(f*(X) − E_D[f̂(X)])² + (E_D[f̂(X)] − f̂(X))²
        + 2 (f*(X) − E_D[f̂(X)])(E_D[f̂(X)] − f̂(X))]

• Taking the expectation with respect to D (Fubini’s theorem) causes the last term to vanish

  E_{D,P(X,Y)}[(Y − f̂(X))²] = σ²_ε   (noise variance)
                             + E_{D,P(X,Y)}[(f*(X) − E_D[f̂(X)])²]   (expected squared bias)
                             + E_{D,P(X,Y)}[(E_D[f̂(X)] − f̂(X))²]   (expected variance)
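A simulation sketch of this decomposition (added, not on the slides), assuming a sine data-generating process and polynomial model classes purely for illustration: over many resampled training sets D, squared bias falls and variance rises with model complexity.

```python
import numpy as np

rng = np.random.default_rng(3)
f_star = np.sin                      # true regression function (assumed)
sigma = 0.3                          # irreducible noise level
x_test = np.linspace(0, 3, 50)

for degree in (1, 3, 9):
    preds = []
    for _ in range(200):             # 200 independent training sets D
        x = rng.uniform(0, 3, 30)
        y = f_star(x) + rng.normal(scale=sigma, size=30)
        coef = np.polyfit(x, y, degree)
        preds.append(np.polyval(coef, x_test))
    preds = np.array(preds)
    bias2 = np.mean((f_star(x_test) - preds.mean(axis=0)) ** 2)
    var = np.mean(preds.var(axis=0))
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```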

Page 24: A Conceptual Review

The bias-variance trade-off illustrates the different objectives

Figure: Bias-variance trade-off

Page 25: A Conceptual Review

What does machine learning do differently?

• The function class F is much wider, F = {f : X ↦ Y}, compared to e.g. OLS

• Regularization to avoid overfitting, via an algorithm A : (λ, Θ, D) ↦ f̂_D ∈ F

  f̂_D = argmin_{f ∈ F} L̂(f) + λ J(f)   (Tikhonov regularization)

• The role of λ, and tuning via cross-validation (a code sketch follows this list):

  Step 1: Split the training data D into k equal-sized folds with N/k observations per fold

  Step 2: Denote by f̂_D^{−κ(i)}(x_i) the prediction for observation (y_i, x_i) from the model fitted on the other folds

  Step 3: Calculate the CV error as the average prediction loss on the left-out fold j ∈ {1, …, k}:

    CV_j(Θ, λ) = (1/(N/k)) ∑_{i : κ(i)=j} ℓ(y_i, f̂_D^{−κ(i)}(x_i))

  Step 4: Iterate steps 2-3 for each of the k folds to obtain a k-vector of CV_j(Θ, λ), which is averaged as

    CV(Θ, λ) = (1/k) ∑_{j=1}^k (1/(N/k)) ∑_{i : κ(i)=j} ℓ(y_i, f̂_D^{−κ(i)}(x_i)) = (1/N) ∑_{i=1}^N ℓ(y_i, f̂_D^{−κ(i)}(x_i))
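The cross-validation sketch referenced above (added, not on the slides); the simulated data, the ridge penalty standing in for J(f), and the grid of λ values are all assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(4)
N, p = 200, 10
X = rng.normal(size=(N, p))
beta = np.zeros(p); beta[:3] = [2.0, -1.0, 0.5]     # sparse truth (assumed)
y = X @ beta + rng.normal(scale=1.0, size=N)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for lam in (0.01, 0.1, 1.0, 10.0):
    fold_errors = []
    for train_idx, test_idx in kf.split(X):         # leave fold j out
        model = Ridge(alpha=lam).fit(X[train_idx], y[train_idx])
        resid = y[test_idx] - model.predict(X[test_idx])
        fold_errors.append(np.mean(resid ** 2))     # CV_j(lambda)
    print(f"lambda = {lam:5.2f}: CV error = {np.mean(fold_errors):.3f}")
```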

Page 26: A Conceptual Review

An example: Shrinkage models

Page 27: A Conceptual Review

The Ridge estimator

• Observe data D = {(x_i, y_i)}, x_i ∈ R^p, y_i ∈ R ∀i ∈ {1, …, N}, D ~iid P(X, Y)

  L̂_Ridge(β̂, λ) = (y − Xβ̂)ᵀ(y − Xβ̂) + λ β̂ᵀβ̂ = yᵀy − 2 β̂ᵀXᵀy + β̂ᵀXᵀXβ̂ + λ β̂ᵀβ̂

• The FOC implies that

  ∂L̂(β̂, λ)/∂β̂ = −2Xᵀy + 2XᵀXβ̂ + 2λβ̂ = 0  ⇔  (XᵀX + λI_p) β̂ = Xᵀy

  β̂_ridge = (XᵀX + λI_p)^{−1} Xᵀy

  which essentially means that

  E[β̂_ridge | X] ≤ β = E[β̂_OLS | X]   (bias increases)
  ∧  tr(V[β̂_ridge | X]) ≤ tr(V[β̂_OLS | X])   (variance decreases)

• Thus, we intentionally introduce a bias, which decreases the variance

• In addition, XᵀX + λI_p is generally non-singular even if XᵀX is singular
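A quick sketch (added, not on the slides) checking the closed form against scikit-learn's Ridge on simulated data; with no intercept, the two should coincide:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
N, p, lam = 100, 5, 2.0
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=N)

# Closed form: (X'X + lam * I_p)^{-1} X'y
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# scikit-learn's Ridge minimizes ||y - Xb||^2 + alpha * ||b||^2
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.allclose(beta_closed, beta_sklearn))   # True
```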

Page 28: A Conceptual Review

Ridge vs. Lasso from a graphical perspective

• The choice of the ℓ2-norm is somewhat arbitrary. An extension is the Lasso

  Lasso: β̂ = argmin_β ‖y − Xβ‖²₂ subject to ∑_{j=1}^p |β_j| ≤ t   (using the ℓ1-norm)

  Ridge: β̂ = argmin_β ‖y − Xβ‖²₂ subject to ∑_{j=1}^p β_j² ≤ t   (using the squared ℓ2-norm)

Figure: Lasso (left) and ridge (right) visualizations
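A brief sketch of the practical difference (added, not on the slides), assuming simulated sparse data and arbitrary penalty levels: the ℓ1 penalty produces exact zeros, while ridge only shrinks.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(6)
N, p = 200, 10
X = rng.normal(size=(N, p))
beta = np.zeros(p); beta[:2] = [3.0, -2.0]          # only two active coefficients
y = X @ beta + rng.normal(scale=0.5, size=N)

lasso = Lasso(alpha=0.1).fit(X, y)                  # l1 penalty: exact zeros
ridge = Ridge(alpha=10.0).fit(X, y)                 # l2 penalty: shrinkage only

print("Lasso zeros:", np.sum(lasso.coef_ == 0.0))   # typically most of the 8 nulls
print("Ridge zeros:", np.sum(ridge.coef_ == 0.0))   # typically 0 of 10
```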

Page 29: A Conceptual Review

Oracle Properties

• Suppose E[Y|X] = β_1X_1 + … + β_pX_p, and let A = {j : β_j ≠ 0} with |A| < p

• One can show that the Lasso enjoys the first Oracle property (but not the second!)

  P(β̂_Lasso,A^c = 0) → 1,

  i.e. the coefficients outside the true active set are set exactly to zero with probability tending to one
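A toy simulation of this selection property (added, not on the slides), assuming a Gaussian design, a fixed penalty, and a sparse β; the share of replications with β̂ exactly zero on A^c should approach one as N grows:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
p, n_reps = 10, 200
beta = np.zeros(p); beta[:3] = [2.0, -1.5, 1.0]     # active set A = {0, 1, 2}

for N in (50, 200, 1000):
    hits = 0
    for _ in range(n_reps):
        X = rng.normal(size=(N, p))
        y = X @ beta + rng.normal(size=N)
        coef = Lasso(alpha=0.2).fit(X, y).coef_
        hits += np.all(coef[3:] == 0.0)             # beta_hat on A^c exactly zero?
    print(f"N = {N:4d}: P(beta_hat_Ac = 0) approx {hits / n_reps:.2f}")
```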
