
Page 1: A Conceptual Review


BAYESIAN STATISTICS AND MACHINE LEARNING

A Conceptual Review

Nicolaj Nørgaard Mühlbach, Ph.D. student

Department of Economics and Business Economics, Aarhus University

November 28, 2017

Page 2: A Conceptual Review

Overview of methodology and purpose

• We will cover the link between Bayesian statistics and machine learning

Figure: Different methods and different purposes

Page 3: A Conceptual Review

How likely are you to be a fan of Bayesian/ML methods?

• Assume data D = {(x_i, y_i)}_{i=1}^N are samples of B.Sc. Oecon students, where:

• x_i ∈ [0,1]^p denotes to which degree student i is a fan of the p courses

• Assume for simplicity that (1/N) ∑_{i=1}^N x_ij = 1/2 ∀j ∈ {1, …, p}

• y_i ∈ [0,1] denotes to which degree student i is a fan of Bayesian ML

• Let f̂_D : [0,1]^p ↦ [0,1] be learned from D (estimated, if you will)

• Now, consider an unseen student x_0 ∈ R^p, who could be any one of you

Question: Is x_0 likely to be a fan of Bayesian ML?

Page 4: A Conceptual Review

Figure: A decision tree predicting y from the course preferences. The nodes split on x_1 (Math/Stat, 1st) > 1/2, x_2 (Stat, 2nd) > 1/2, x_3 (Math, 3rd) > 1/2, x_4 (QQL, 4th) > 1/2, and x_5 (Econometrics, 5th) > 1/2; the leaves predict y = 1 (⇒ huge fan), y = 0.8, y = 0.6, y = 0.4, y = 0.1, and y = 0 (⇒ not a fan).
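A minimal sketch (added here, not in the original deck) of how such an f̂_D could be fitted in practice; the simulated data set and the preference rule below are invented purely for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
N, p = 500, 5                        # students and courses

# x[i, j]: degree to which student i is a fan of course j, uniform on [0, 1]
X = rng.uniform(size=(N, p))

# Hypothetical rule: fans of the quantitative courses tend to like Bayesian ML
y = np.clip(X.mean(axis=1) + rng.normal(scale=0.1, size=N), 0.0, 1.0)

# Learn f_hat_D : [0,1]^p -> [0,1] with a shallow regression tree
f_hat = DecisionTreeRegressor(max_depth=5).fit(X, y)

x0 = rng.uniform(size=(1, p))        # an unseen student
print(f_hat.predict(x0))             # predicted degree of Bayesian-ML fandom
```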

Page 5: A Conceptual Review

Our goals for this evening

Understand the fundamentals of Bayesian statistics
• Wrap up on fundamental Bayesian theory
• The Bayesian approach to machine learning (or anything)
• The exponential-gamma Bayesian model
• The computational challenge
• Distinctive features of the Bayesian approach
• How Bayesian methods differ from machine learning

Understand the opportunities of machine learning in economics
• Econometric challenges in terms of prediction
• Bias-variance trade-off
• What machine learning does differently

Understand an example of shrinkage models
• The Ridge estimator
• Ridge vs. Lasso from a graphical perspective

Page 6: A Conceptual Review

Understand The Fundamentals of Bayesian Statistics

“Indeed, if you accept the argument that the false positive rate should be higher for theories that are unlikely, then you have already adopted a fundamentally Bayesian line of reasoning.”

Harvey (2017)

Page 7: A Conceptual Review

Wrap up on fundamental Bayesian theory

• The unit of statistical inference for both frequentists and Bayesians is a family of probability densities

  F = {f_θ(x) ; x ∈ X, θ ∈ Θ},  where X is the input space and Θ the parameter space

• The posterior combines the prior and the conditional likelihood of the parameters

• This is done using Bayes’ rule

  P(parameters|data) = P(parameters) × P(data|parameters) / P(data)

  Posterior ∝ Prior × Likelihood

• The data are fixed at their observed values while θ varies over Θ (the opposite of the frequentist view)

• We make predictions by integrating with respect to the posterior

  P(new data|data) = ∫_parameters P(new data|parameters) × P(parameters|data) d(parameters)
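A numeric sketch of this recipe (added here, not on the slides), assuming a Bernoulli coin-flip model and a flat prior chosen purely for brevity: discretize Θ on a grid and normalize prior × likelihood into a posterior.

```python
import numpy as np

# Grid approximation of Bayes' rule for a Bernoulli parameter theta
theta = np.linspace(0.001, 0.999, 999)   # parameter grid over Theta
prior = np.ones_like(theta)              # flat prior P(parameters)
data = np.array([1, 0, 1, 1, 0, 1])      # observed coin flips (invented)

# Likelihood P(data | parameters) evaluated on the grid
likelihood = theta ** data.sum() * (1 - theta) ** (len(data) - data.sum())

posterior = prior * likelihood
posterior /= posterior.sum()             # normalize by P(data)

# Posterior predictive P(new flip = 1 | data): integrate over the posterior
print((theta * posterior).sum())
```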

Page 8: A Conceptual Review

• This illustrates how uncertainty is generated in the model

Figure: Bayesian inference proceeds vertically given x; frequentist inference proceeds horizontally given µ

Page 9: A Conceptual Review

The Bayesian approach to machine learning (or anything)

• Bayesian modeling applies Bayes’ rule to the unknown variables in a model

• Many machine learning algorithms involve exactly this step

• In a simple, generic form we can write this process as

  x ~ p(x|θ)  ← the data-generating distribution; this is called the model
  θ ~ p(θ)    ← the a priori knowledge of θ; this is called the model prior

• We want to learn θ from data using

  p(θ|x) = p(x|θ) p(θ) / p(x) ∝ p(x|θ) p(θ),

  where “=” is exact and “∝” holds up to the normalizing constant p(x)

• but, unfortunately, p(θ|x) is unknown

• However, we have defined p(x|θ) p(θ), and Bayes guides us, although this is non-trivial:
  – the model p(x|θ) can be quite complicated
  – p(θ|x) can be intractable due to the integral in the normalizing constant

Page 10: A Conceptual Review

The exponential-gamma Bayesian model

• Suppose X|λ ~ E(λ), i.e. the data are exponentially distributed with parameter λ. Then

  f_X(x) = λ e^{−λx},  x > 0, λ > 0  (probability density function)

• The exponential distribution describes the time between events in a Poisson process

• We want to learn f(λ|x) because E[X] = 1/λ and V[X] = 1/λ²

• Assume λ ~ G(α, β), i.e. that the constant average rate is gamma-distributed with θ = {α, β}

• Then

  p(λ) = (β^α / Γ(α)) λ^{α−1} e^{−βλ},  λ > 0  (probability density function)

Page 11: A Conceptual Review

Posterior in the exponential-gamma model

• Given sample data x ∈ R^N, the posterior distribution is then

  p(λ|x) ∝ p(λ) × ℓ(x|λ)
         = (β^α / Γ(α)) λ^{α−1} e^{−βλ} × ∏_{i=1}^N λ e^{−λx_i}   [prior (gamma PDF) × likelihood]
         = (β^α / Γ(α)) λ^{α−1} e^{−βλ} × λ^N (e^{−λx_1} × e^{−λx_2} × ··· × e^{−λx_N})
         = (β^α / Γ(α)) λ^{α+N−1} e^{−βλ} × e^{−λ(x_1 + x_2 + … + x_N)}
         = (β^α / Γ(α)) λ^{α+N−1} e^{−βλ} × e^{−λN x̄}
         ∝ λ^{α+N−1} e^{−(β + N x̄)λ},  i.e. distributed as G(α + N, β + N x̄)

• Only the sufficient statistics (N, x̄) are required to transit from prior to posterior

• From this, we may calculate the predictive distribution of a new sample x_new
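A small sketch of this conjugate update in code (added here, not on the slides); the hyperparameters, the true rate, and the simulated data are assumptions chosen purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

alpha, beta = 2.0, 1.0                              # assumed gamma prior G(alpha, beta)
lam_true = 3.0
x = rng.exponential(scale=1 / lam_true, size=200)   # X | lam ~ E(lam)

# Conjugate update: posterior is G(alpha + N, beta + N * x_bar)
N, x_bar = len(x), x.mean()
post = stats.gamma(a=alpha + N, scale=1 / (beta + N * x_bar))

print(post.mean())          # posterior mean of lam, close to lam_true
print(post.interval(0.95))  # 95% credible interval for lam
```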

Page 12: A Conceptual Review

Conjugate priors in general

Definition 1 (Conjugate priors)
If F is a class of sampling distributions f(x|θ) and P is a class of prior distributions p(θ) for θ, we say that P is conjugate to F if

  p(θ|x) ∈ P  ∀ f(·|θ) ∈ F ∧ p(·) ∈ P

Example
• Let F = {E}, i.e. the family we draw x from is only the exponential distribution
• Let P = {G}, i.e. the family we draw λ from is only the gamma distribution
• As derived above, p(θ|x) ∈ P, i.e. the posterior is also gamma-distributed
• Then, the gamma distribution is a conjugate prior to the exponential distribution

  Likelihood                                Conjugate prior
  Binomial, Negative binomial, Geometric    Beta
  Poisson                                   Gamma

Table: Discrete distributions

Page 13: A Conceptual Review

The computational challenge

• Apart from specifying priors, the big challenge is computing the posterior distribution

• There are four main approaches (but we will not cover variational approximation):

  – Analytical integration: use conjugate priors, which combine nicely with the likelihood; usually too much to hope for

  – Laplace approximation: treat the posterior as approximately Gaussian by choosing {µ, Σ} such that

      p(θ|X) ≈d N(µ, Σ)

    Works well when there is a lot of data and low model complexity

  – Monte Carlo integration (Gibbs sampling or MCMC): simulate a Markov chain that eventually converges to the posterior distribution. Imagine we are interested in E_p(X)[f(X)]; the idea follows (a code sketch appears after this list)

      Sample X_1, …, X_N ~iid p(X)  ⇒  E_p(X)[f(X)] ≈ (1/N) ∑_{i=1}^N f(X_i)

    As N → ∞, the approximation converges to the true expectation; it can be applied to a remarkable variety of problems (the dominant approach)
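The Monte Carlo sketch referenced above (added, not on the slides); the target density p(X) = N(0, 1) and the function f(x) = x² are arbitrary choices, picked so the true expectation is known to equal 1:

```python
import numpy as np

rng = np.random.default_rng(2)

# Target: E[f(X)] with X ~ N(0, 1) and f(x) = x**2 (true value: 1)
samples = rng.normal(size=100_000)   # X_1, ..., X_N ~iid p(X)
estimate = np.mean(samples ** 2)     # (1/N) * sum of f(X_i)

print(estimate)   # close to 1.0; the error shrinks at rate 1/sqrt(N)
```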

Page 14: A Conceptual Review

Distinctive features of the Bayesian approach

Probability
• Probability is used not only to describe “physical” randomness, but also to describe uncertainty regarding the true values of the parameters
• These prior and posterior probabilities represent degrees of belief

Modeling
• The Bayesian approach takes modeling seriously; a Bayesian model includes a suitable prior distribution for model parameters
• If the model/prior are chosen without regard for the actual situation, there is no justification for believing the results of Bayesian inference

Finite-sample justification
• The model and prior are chosen based on our knowledge of the problem
• Thus, we do not rely on asymptotic theory, and we do not restrict the complexity of the model just because we have only a small amount of data

Also, distinctive features include model complexity justification and problem breakdown (see the next two slides)

Page 15: A Conceptual Review

Model complexity justification

• Imagine a complex and high-dimensional model

• Often, the log-likelihood will be flat along some dimension

• Adding prior knowledge corresponds to adding curvature to the likelihood

Figure: An Illustration of Bayesian Estimation

Page 16: A Conceptual Review

Problem breakdown

• Imagine a management consultant aiming to increase profits for a client

• Imagine a researcher trying to estimate the marginal propensity to consume

• How would you proceed?

Figure: The Bayesian approach from a business perspective

Page 17: A Conceptual Review

How Bayesian methods differ from machine learning

• Pure machine learning arises from statistical learning, best illustrated as follows

Figure: An Illustration of Machine Learning

• The learning machine has various knobs, whose settings change the prediction

• Learning is about twiddling the knobs to make better predictions

• This differs profoundly from the Bayesian view, as the learning is arbitrary

• Unlike a model, the machine has no meaningful semantics in terms of beliefs

• The knobs do not correspond to the parameters of a Bayesian model

Page 18: A Conceptual Review

Opportunities of Machine Learning in Economics

“There have been very fruitful collaborations between computer scientists and statisticians in the last decade or so, and I expect collaborations between computer scientists and econometricians will also be productive in the future.”

Varian (2014)

Page 19: A Conceptual Review

The focus of econometricians is to draw inference ...

• Econometricians rely on statistical properties to draw inference

• Suppose a model P = {P_θ : θ ∈ Θ} and data X ∈ R^{N×p} with P_θ(X) = P(X|θ)

• An estimator θ̂ = θ̂(X) is an unbiased estimator of the parameter θ iff

  E[θ̂] = θ  ∀θ ∈ Θ

• The sequence of estimators θ̂_N = θ̂_N(x_1, …, x_N) is consistent iff

  ∀ε > 0 :  lim_{N→∞} P(|θ̂_N − θ| < ε) = 1  ∀θ ∈ Θ

• An unbiased estimator is efficient if its variance attains the Cramér–Rao bound

  V(θ̂) = I_θ^{−1}  ∀θ ∈ Θ  (Cramér–Rao bound)

Page 20: A Conceptual Review

... whereas the focus in machine learning is prediction

• Let L(f) = E_{P(X,Y)}[ℓ(f(X), Y)], and let {f*, f̂} denote the optimal and the feasible predictor

• The objective in machine learning is to minimize the error of prediction, given by

  L(f̂_D) = [L(f̂_D) − L(f*)] + [L(f*) − L*] + L*,

  i.e. prediction error = estimation error + approximation error + irreducible error

• In causal inference, the objective is instead to minimize the in-sample prediction error

  L(f̂_D) − L(f*) = [L̂(f̂_D) − L̂(f*)] + [L(f̂_D) − L̂(f̂_D)] + [L̂(f*) − L(f*)],

  i.e. estimation error = in-sample prediction error + unseen overfit + random variation

• This comes at the cost of unseen overfit and, often, an increase in the approximation error

Page 21: A Conceptual Review

Common specification reveals bias-variance trade-off

• Consider input and output spaces X, Y, and assume (X, Y) ∈ X × Y are stochastic variables distributed according to an unknown joint probability distribution P(X, Y)

• Specify the squared loss function ℓ(z) = z², common among econometricians

• Observe data D = {(x_i, y_i)}, x_i ∈ R^p, y_i ∈ R ∀i ∈ {1, …, N}, D ~iid P(X, Y)

• Given D, we estimate a function f̂ ∈ F, f̂ : X ↦ Y, that predicts Y from X

  L(f̂) = E_{P(X,Y)}[(Y − f*(X) + f*(X) − f̂(X))²]   (f*(X) artificially added and subtracted)
        = E_{P(X,Y)}[(Y − f*(X))²]   (irreducible noise)
        + E_{P(X,Y)}[(f*(X) − f̂(X))²]   (mean squared error)
        + 2 E_{P(X,Y)}[(Y − f*(X))(f*(X) − f̂(X))]   (covariance of noise and bias)

Page 22: A Conceptual Review

Irreducible error and covariance of noise and bias

• Denote by σ²_ε the irreducible noise E_{P(X,Y)}[(Y − f*(X))²]

• The covariance of noise and bias vanishes:

  E_{P(X,Y)}[(Y − f*(X))(f*(X) − f̂(X))]
    = ∫∫ (Y − f*(X))(f*(X) − f̂(X)) P(Y|X) P(X) dY dX   (using P(Y|X)P(X) = P(Y, X))
    = ∫ E_{Y|X}[Y − f*(X)] (f*(X) − f̂(X)) P(X) dX = 0,

  since f*(X) = E[Y|X] makes the inner expectation E_{Y|X}[Y − f*(X)] equal to zero

Page 23: A Conceptual Review

Mean squared error and bias-variance trade-off

  MSE = E_{P(X,Y)}[(f*(X) − E_D[f̂(X)] + E_D[f̂(X)] − f̂(X))²]
      = E_{P(X,Y)}[(f*(X) − E_D[f̂(X)])² + (E_D[f̂(X)] − f̂(X))²
        + 2 (f*(X) − E_D[f̂(X)])(E_D[f̂(X)] − f̂(X))]

• Taking the expectation with respect to D (Fubini’s theorem) causes the last term to vanish

  E_{D,P(X,Y)}[(Y − f̂(X))²] = σ²_ε   (noise variance)
                             + E_{D,P(X,Y)}[(f*(X) − E_D[f̂(X)])²]   (expected squared bias)
                             + E_{D,P(X,Y)}[(E_D[f̂(X)] − f̂(X))²]   (expected variance)
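A simulation sketch of this decomposition (added, not on the slides), assuming a sine data-generating process and polynomial model classes purely for illustration: over many resampled training sets D, squared bias falls and variance rises with model complexity.

```python
import numpy as np

rng = np.random.default_rng(3)
f_star = np.sin                      # true regression function (assumed)
sigma = 0.3                          # irreducible noise level
x_test = np.linspace(0, 3, 50)

for degree in (1, 3, 9):
    preds = []
    for _ in range(200):             # 200 independent training sets D
        x = rng.uniform(0, 3, 30)
        y = f_star(x) + rng.normal(scale=sigma, size=30)
        coef = np.polyfit(x, y, degree)
        preds.append(np.polyval(coef, x_test))
    preds = np.array(preds)
    bias2 = np.mean((f_star(x_test) - preds.mean(axis=0)) ** 2)
    var = np.mean(preds.var(axis=0))
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```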

Page 24: A Conceptual Review

The bias-variance trade-off illustrates the different objectives

Figure: Bias-variance trade-off

Page 25: A Conceptual Review

What does machine learning do differently?

• The function class F is much wider, F = {f : X ↦ Y}, compared to e.g. OLS

• Regularization to avoid overfitting, via an algorithm A : (λ, Θ, D) ↦ f̂_D ∈ F

  f̂_D = argmin_{f ∈ F} L̂(f) + λ J(f)   (Tikhonov regularization)

• The role of λ, and tuning via cross-validation (a code sketch follows this list):

  Step 1: Split the training data D into k equal-sized folds with N/k observations per fold

  Step 2: Denote by f̂_D^{−κ(i)}(x_i) the prediction for observation (y_i, x_i) from the model fitted on the other folds

  Step 3: Calculate the CV error as the average prediction loss on the left-out fold j ∈ {1, …, k}:

    CV_j(Θ, λ) = (1/(N/k)) ∑_{i : κ(i)=j} ℓ(y_i, f̂_D^{−κ(i)}(x_i))

  Step 4: Iterate steps 2-3 for each of the k folds to obtain a k-vector of CV_j(Θ, λ), which is averaged as

    CV(Θ, λ) = (1/k) ∑_{j=1}^k (1/(N/k)) ∑_{i : κ(i)=j} ℓ(y_i, f̂_D^{−κ(i)}(x_i)) = (1/N) ∑_{i=1}^N ℓ(y_i, f̂_D^{−κ(i)}(x_i))
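The cross-validation sketch referenced above (added, not on the slides); the simulated data, the ridge penalty standing in for J(f), and the grid of λ values are all assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(4)
N, p = 200, 10
X = rng.normal(size=(N, p))
beta = np.zeros(p); beta[:3] = [2.0, -1.0, 0.5]     # sparse truth (assumed)
y = X @ beta + rng.normal(scale=1.0, size=N)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for lam in (0.01, 0.1, 1.0, 10.0):
    fold_errors = []
    for train_idx, test_idx in kf.split(X):         # leave fold j out
        model = Ridge(alpha=lam).fit(X[train_idx], y[train_idx])
        resid = y[test_idx] - model.predict(X[test_idx])
        fold_errors.append(np.mean(resid ** 2))     # CV_j(lambda)
    print(f"lambda = {lam:5.2f}: CV error = {np.mean(fold_errors):.3f}")
```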

Page 26: A Conceptual Review

An example: Shrinkage models

Page 27: A Conceptual Review

The Ridge estimator

• Observe data D = {(x_i, y_i)}, x_i ∈ R^p, y_i ∈ R ∀i ∈ {1, …, N}, D ~iid P(X, Y)

  L̂_Ridge(β̂, λ) = (y − Xβ̂)ᵀ(y − Xβ̂) + λ β̂ᵀβ̂ = yᵀy − 2 β̂ᵀXᵀy + β̂ᵀXᵀXβ̂ + λ β̂ᵀβ̂

• The FOC implies that

  ∂L̂(β̂, λ)/∂β̂ = −2Xᵀy + 2XᵀXβ̂ + 2λβ̂ = 0  ⇔  (XᵀX + λI_p) β̂ = Xᵀy

  β̂_ridge = (XᵀX + λI_p)^{−1} Xᵀy

  which essentially means that

  E[β̂_ridge | X] ≤ β = E[β̂_OLS | X]   (bias increases)
  ∧  tr(V[β̂_ridge | X]) ≤ tr(V[β̂_OLS | X])   (variance decreases)

• Thus, we intentionally introduce a bias, which decreases the variance

• In addition, XᵀX + λI_p is generally non-singular even if XᵀX is singular
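A quick sketch (added, not on the slides) checking the closed form against scikit-learn's Ridge on simulated data; with no intercept, the two should coincide:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
N, p, lam = 100, 5, 2.0
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=N)

# Closed form: (X'X + lam * I_p)^{-1} X'y
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# scikit-learn's Ridge minimizes ||y - Xb||^2 + alpha * ||b||^2
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.allclose(beta_closed, beta_sklearn))   # True
```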

Page 28: A Conceptual Review

Ridge vs. Lasso from a graphical perspective

• The choice of the ℓ2-norm is somewhat arbitrary. An extension is the Lasso

  Lasso: β̂ = argmin_β ‖y − Xβ‖²₂ subject to ∑_{j=1}^p |β_j| ≤ t   (using the ℓ1-norm)

  Ridge: β̂ = argmin_β ‖y − Xβ‖²₂ subject to ∑_{j=1}^p β_j² ≤ t   (using the squared ℓ2-norm)

Figure: Lasso (left) and ridge (right) visualizations
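A brief sketch of the practical difference (added, not on the slides), assuming simulated sparse data and arbitrary penalty levels: the ℓ1 penalty produces exact zeros, while ridge only shrinks.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(6)
N, p = 200, 10
X = rng.normal(size=(N, p))
beta = np.zeros(p); beta[:2] = [3.0, -2.0]          # only two active coefficients
y = X @ beta + rng.normal(scale=0.5, size=N)

lasso = Lasso(alpha=0.1).fit(X, y)                  # l1 penalty: exact zeros
ridge = Ridge(alpha=10.0).fit(X, y)                 # l2 penalty: shrinkage only

print("Lasso zeros:", np.sum(lasso.coef_ == 0.0))   # typically most of the 8 nulls
print("Ridge zeros:", np.sum(ridge.coef_ == 0.0))   # typically 0 of 10
```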

Page 29: A Conceptual Review

Oracle Properties

• Suppose E[Y|X] = β_1X_1 + … + β_pX_p, and let A = {j : β_j ≠ 0} with |A| < p

• One can show that the Lasso enjoys the first Oracle property (but not the second!)

  P(β̂_Lasso,A^c = 0) → 1,

  i.e. the coefficients outside the true active set are set exactly to zero with probability tending to one
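A toy simulation of this selection property (added, not on the slides), assuming a Gaussian design, a fixed penalty, and a sparse β; the share of replications with β̂ exactly zero on A^c should approach one as N grows:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
p, n_reps = 10, 200
beta = np.zeros(p); beta[:3] = [2.0, -1.5, 1.0]     # active set A = {0, 1, 2}

for N in (50, 200, 1000):
    hits = 0
    for _ in range(n_reps):
        X = rng.normal(size=(N, p))
        y = X @ beta + rng.normal(size=N)
        coef = Lasso(alpha=0.2).fit(X, y).coef_
        hits += np.all(coef[3:] == 0.0)             # beta_hat on A^c exactly zero?
    print(f"N = {N:4d}: P(beta_hat_Ac = 0) approx {hits / n_reps:.2f}")
```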
