
On the sparse estimation

ATR Computational Neuroscience Laboratories

Masa-aki Sato

Contents

1. Why is sparse estimation necessary? Generalization ability for ill-posed problems

2. Why can sparse estimation be achieved? The role of Bayesian estimation

3. An example of sparse estimation for a real problem

Ill-posed problem

A large number of parameters must be estimated from a small number of data points.

• Maximum likelihood (minimum squared error)
– Estimate the optimal parameters that maximize the likelihood
– Overfitting: complex models with many adjustable parameters tend to fit noise in the training data and degrade generalization ability

• Sparse estimation
– Extract relevant features and discard irrelevant features to attain good generalization ability

Function approximation by polynomial

50 data points are randomly generated from a quadratic function with additive noise:

$y = 2x - x^2 + \mathrm{noise}$

Linear parameter model:

$y = w_0 + w_1 x + w_2 x^2 + \cdots + w_N x^N = W \cdot X$

$X = \left[\, 1,\ x,\ x^2,\ \ldots,\ x^N \,\right]$ : input (feature) variables

$W = \left[\, w_0,\ w_1,\ w_2,\ \ldots,\ w_N \,\right]$ : weight parameter vector
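As a concrete illustration, here is a minimal Python/NumPy sketch (the coefficients, noise level, and input range are assumptions; only the quadratic form is given above) that generates the 50 noisy samples and builds the feature vector X = [1, x, ..., x^N]:

import numpy as np

rng = np.random.default_rng(0)

# 50 training samples from a quadratic with additive noise
# (coefficients, noise scale, and input range are illustrative assumptions)
T = 50
x = rng.uniform(-1.0, 1.0, size=T)
y = 2.0 * x - x**2 + 0.1 * rng.normal(size=T)

def design_matrix(x, degree):
    # Stack the feature vector X = [1, x, x^2, ..., x^N] for every sample
    return np.vander(x, degree + 1, increasing=True)  # shape (T, degree + 1)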

Maximum likelihood (Quadratic function)

Find the optimal W by minimizing the squared error:

$\mathrm{error} = \sum_{t=1}^{T} \left( y(t) - W \cdot X(t) \right)^2$

Maximum likelihood (20-degree polynomial)

Overfitting: the optimal W fits the noise in the training data, and generalization is poor.
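Continuing the sketch above, an ordinary least-squares (maximum-likelihood) fit makes the contrast concrete: the quadratic model generalizes well, while the 20-degree polynomial achieves a lower training error by fitting noise and does worse on fresh data (the held-out test set is an assumption used to measure generalization):

# Maximum likelihood: minimize the squared error sum_t ( y(t) - W . X(t) )^2
def fit_ml(x, y, degree):
    X = design_matrix(x, degree)
    W, *_ = np.linalg.lstsq(X, y, rcond=None)
    return W

def mse(x, y, W, degree):
    return np.mean((y - design_matrix(x, degree) @ W) ** 2)

# Held-out samples from the same quadratic, used to measure generalization
x_test = rng.uniform(-1.0, 1.0, size=1000)
y_test = 2.0 * x_test - x_test**2 + 0.1 * rng.normal(size=1000)

for degree in (2, 20):
    W = fit_ml(x, y, degree)
    print(degree, mse(x, y, W, degree), mse(x_test, y_test, W, degree))
    # degree 20: smaller training error, larger test error (overfitting)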

Regularization method (20-degree polynomial)

Minimization: $\mathrm{error} = \sum_{t=1}^{T} \left( y(t) - W \cdot X(t) \right)^2 + \alpha\, |W|^2$

Prior: $P(W) \propto \exp\left( -\tfrac{1}{2}\alpha\, |W|^2 \right)$
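A minimal sketch of this regularized fit, continuing the example above (the penalty weight alpha is illustrative): the penalized squared error has a closed-form minimizer, which is also the MAP estimate under the Gaussian prior above.

# Regularized least squares: minimize sum_t (y(t) - W . X(t))^2 + alpha * |W|^2,
# i.e. the MAP estimate under the Gaussian prior P(W) ∝ exp(-alpha |W|^2 / 2)
def fit_regularized(x, y, degree, alpha):
    X = design_matrix(x, degree)
    n_params = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_params), X.T @ y)

W_reg = fit_regularized(x, y, degree=20, alpha=1e-2)  # alpha chosen for illustration
print(mse(x_test, y_test, W_reg, 20))  # generalizes better than the unregularized fit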

Model selection
• Search for the model that gives the best generalization error by changing the number of parameters (polynomial degree); a sketch of this search follows below
• A combinatorial search is almost impossible for a large number of degrees of freedom

Training error and generalization error vs. number of parameters (polynomial degree).
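A sketch of such a search over the polynomial degree only (held-out error is used here as a stand-in for the generalization error); a combinatorial search over all feature subsets would be infeasible:

# Model selection over the polynomial degree: fit each degree and keep the one
# with the smallest held-out (generalization) error
degrees = range(1, 21)
test_errors = [mse(x_test, y_test, fit_ml(x, y, d), d) for d in degrees]
best_degree = list(degrees)[int(np.argmin(test_errors))]
print(best_degree)  # typically close to 2, the true degree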

Sparse estimation by the Bayesian method
• Parameters are treated as random variables
• The posterior probability is calculated for every possible parameter value
• Estimation is done by integrating over the possible values according to the posterior probability distribution

Sparse estimation (20-degree polynomial)
• A precision parameter $\alpha_n$ is introduced for each weight component $w_n$ and is estimated from the observed data (see the sketch below)

Prior: $P(w_n \mid \alpha_n) \propto \exp\left( -\tfrac{1}{2}\alpha_n w_n^2 \right)$, with $\alpha_n$ left undetermined

Posterior for the 2nd-order weight; posterior for the 3rd-order weight.

This prunes irrelevant features from the model and increases generalization ability.
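A minimal sketch of this estimator using scikit-learn's ARDRegression (a standard automatic-relevance-determination implementation, used here as a stand-in for the method on these slides, not the author's own code): each weight w_n gets its own precision alpha_n, and the weights of irrelevant polynomial orders collapse toward zero.

# Sparse estimation with one precision alpha_n per weight
# (automatic relevance determination via scikit-learn)
from sklearn.linear_model import ARDRegression

X20 = design_matrix(x, 20)
ard = ARDRegression(fit_intercept=False)  # the constant term is already a column of X20
ard.fit(X20, y)
print(np.round(ard.coef_, 3))             # weights of irrelevant orders are pruned toward zero
print(mse(x_test, y_test, ard.coef_, 20))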

Sparse Estimation

MAP / ML Estimation (Maximum a Posteriori / Maximum Likelihood)

• Posterior: $P(\theta \mid X) = \dfrac{P(X \mid \theta)\, P_0(\theta)}{P(X)}$  (likelihood × prior / marginal likelihood)

• Estimate the optimal parameter: $\theta_{\mathrm{MAP}} = \arg\max_{\theta} \log\left( P(X \mid \theta)\, P_0(\theta) \right)$

Posterior $P(\theta \mid X)$ over $(\theta_1, \theta_2)$ for a regular model.

Full Bayesian Estimation

• Estimate the posterior parameter distribution and integrate over the parameters according to the posterior.

• Posterior: $P(\theta \mid X) = \dfrac{P(X \mid \theta)\, P_0(\theta)}{P(X)}$  (likelihood × prior / marginal likelihood)

Marginal likelihood: $P(X) = \int d\theta\; P(X \mid \theta)\, P_0(\theta)$

Estimated parameter: $\bar{\theta} = \int d\theta\; \theta\, P(\theta \mid X)$
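For the linear-Gaussian polynomial model used earlier, these integrals are available in closed form, so the full-Bayesian computation can be written out directly; in the sketch below the noise precision beta and prior precision alpha are assumed fixed (illustrative values) rather than estimated:

# Full Bayesian estimation for the linear model with Gaussian noise and a
# Gaussian prior: the posterior over W is Gaussian, so integration over the
# parameters is exact.
def bayes_posterior(x, y, degree, alpha=1e-2, beta=100.0):
    X = design_matrix(x, degree)
    S_inv = alpha * np.eye(X.shape[1]) + beta * X.T @ X  # posterior precision
    S = np.linalg.inv(S_inv)                             # posterior covariance
    m = beta * S @ X.T @ y                               # posterior mean = estimated parameter
    return m, S

m, S = bayes_posterior(x, y, degree=20)
# Predictive mean at new inputs: integrating W . X(x*) over the posterior gives m . X(x*)
y_pred = design_matrix(x_test, 20) @ m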

Model reduction by the Bayesian method

Mixture-of-Gaussians example

Redundant model (ill-posed problem): the estimation model is a mixture of two Gaussian units

$P(\mathbf{x} \mid \theta) = g_0\, N(\mathbf{x} \mid \theta_1) + (1 - g_0)\, N(\mathbf{x} \mid \theta_2)$

Assume the data are generated by a single Gaussian. The single-unit model corresponds to three cases:

$P(\mathbf{x} \mid \theta) = N(\mathbf{x} \mid \theta_1)$ : $g_0 = 1$, $\theta_2$ arbitrary
$P(\mathbf{x} \mid \theta) = N(\mathbf{x} \mid \theta_2)$ : $g_0 = 0$, $\theta_1$ arbitrary
$P(\mathbf{x} \mid \theta) = N(\mathbf{x} \mid \theta_1)$, $\theta_1 = \theta_2$ : $g_0$ arbitrary

Posterior distribution for the redundant model: $P(\theta \mid X)$ over $(\theta_1, \theta_2)$, shown for $g_0 = 1$ and $g_0 = 0$. The Fisher information matrix becomes singular.

Pruning of redundant parameters

• MAP: complex models explain the given data better than simpler models and attain a higher posterior value, so all parameters are used for prediction.
• Full Bayesian: the reduced (simpler) model dominates after integration over the parameters.

Posterior distribution $P(\theta \mid X)$ for 100 samples generated by a single-Gaussian model, showing the MAP optimum and the reduced models at $g_0 = 0$ and $g_0 = 1$.
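The same behavior can be reproduced numerically with a variational-Bayesian Gaussian mixture; scikit-learn's BayesianGaussianMixture is used below as a convenient stand-in for the full-Bayesian treatment. Fitting a redundant two-component model to data drawn from a single Gaussian drives one mixing weight toward zero, whereas an ordinary maximum-likelihood mixture keeps both components active.

# Redundant two-Gaussian mixture fitted to 100 samples from a single Gaussian;
# the (approximate) Bayesian fit prunes the superfluous component
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)
data = rng.normal(loc=0.0, scale=1.0, size=(100, 1))  # single-Gaussian data

vb = BayesianGaussianMixture(n_components=2, weight_concentration_prior=1e-2,
                             max_iter=500, random_state=0)
vb.fit(data)
print(np.round(vb.weights_, 3))  # one mixing weight is pushed close to zero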

Parameter pruning in Sparse Estimation

Prior in Sparse Estimation Model

$P(w_n \mid \alpha_n) \propto \sqrt{\alpha_n}\, \exp\left( -\tfrac{1}{2}\alpha_n w_n^2 \right)$

$P(\log \alpha_n) = \mathrm{const.}$  (non-informative prior)

$\alpha_n$ controls the precision (width) of the weight parameter distribution.

Prior: $w_n = 0$, $\alpha_n$ arbitrary.

Posterior parameter distribution for $(w_n, \alpha_n)$: posterior for a relevant parameter vs. posterior for an irrelevant parameter ($w_n = 0$, $\alpha_n$ arbitrary). The other parameters are integrated out, and their effects are taken into account.
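A hand-rolled version of this hierarchical estimation for the polynomial example above (an EM-style fixed-point iteration; the noise precision beta and the iteration count are assumptions): relevant weights keep a finite alpha_n, while for irrelevant weights alpha_n grows without bound and the corresponding w_n collapses onto zero.

# Hierarchical prior P(w_n | alpha_n) ∝ sqrt(alpha_n) exp(-alpha_n w_n^2 / 2) with a
# non-informative prior on alpha_n. Fixed point: given the precisions, the weight
# posterior is Gaussian; given that posterior, alpha_n is re-estimated as 1 / <w_n^2>.
def ard_em(X, y, beta=100.0, n_iter=200):
    alpha = np.ones(X.shape[1])
    for _ in range(n_iter):
        S = np.linalg.inv(np.diag(alpha) + beta * X.T @ X)  # posterior covariance
        m = beta * S @ X.T @ y                              # posterior mean
        alpha = 1.0 / (m**2 + np.diag(S))                   # <w_n^2> = m_n^2 + S_nn
    return m, alpha

m_ard, alpha_ard = ard_em(design_matrix(x, 20), y)
# Large alpha_n  ->  w_n pinned at zero  ->  polynomial order n is pruned from the model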

Calculation of posterior distribution

Free energy maximization ⇔ posterior calculation

Distance between the trial posterior $Q(J, \alpha)$ and the true posterior $P(J, \alpha \mid B)$:

$F[Q] = \ln P(B) - \mathrm{KL}\left[\, Q(J, \alpha) \,\|\, P(J, \alpha \mid B) \,\right]$

where $\ln P(B)$ is the log marginal likelihood (evidence).

• Posterior: $P(J, \alpha \mid B) = \dfrac{P(B \mid J)\, P_0(J \mid \alpha)\, P_0(\alpha)}{P(B)}$  (likelihood × hierarchical prior / marginal likelihood)

Marginal likelihood: $P(B) = \int dJ\, d\alpha\; P(B \mid J)\, P_0(J \mid \alpha)\, P_0(\alpha)$

Factorization assumption: $Q(J, \alpha) = Q_J(J)\, Q_\alpha(\alpha)$

Variational Bayesian (VB) method: maximization of F(Q)

• Maximization of F(Q) w.r.t. $Q_J(J)$
• Maximization of F(Q) w.r.t. $Q_\alpha(\alpha)$

Repeated until convergence. At the maximum, the posterior distribution satisfies $Q_J(J) \approx P(J \mid B)$ and the log marginal likelihood satisfies $\log P(B) \approx F(Q)$ (maximized).
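The overall shape of this alternation can be sketched as a generic driver loop; the update and free-energy functions below are hypothetical placeholders for the model-specific closed-form solutions (they are not a library API), and the loop simply alternates the two maximizations until F stops increasing.

# Generic VB alternation: maximize F(Q) w.r.t. one factor with the other held
# fixed, and repeat until the free energy converges.
def variational_bayes(B, update_QJ, update_Qalpha, free_energy,
                      Qalpha_init, tol=1e-8, max_iter=1000):
    Qalpha = Qalpha_init
    F_old = float("-inf")
    for _ in range(max_iter):
        QJ = update_QJ(B, Qalpha)        # maximize F(Q) w.r.t. Q_J(J)
        Qalpha = update_Qalpha(B, QJ)    # maximize F(Q) w.r.t. Q_alpha(alpha)
        F = free_energy(B, QJ, Qalpha)   # non-decreasing across iterations
        if F - F_old < tol:
            break
        F_old = F
    return QJ, Qalpha, F                 # at the maximum: Q_J ≈ P(J | B), F ≈ log P(B)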

Free energy after integration over the current distribution:

$F = -\tfrac{1}{2}\left[\, \mathrm{Tr}\left( \Sigma_{VB}^{-1} \cdot B \cdot B' \right) + \log\left| \Sigma_{VB} \right| \,\right]$

$B \cdot B' = G \cdot J \cdot J' \cdot G' + \sigma_0 I$ : MEG covariance matrix

$\Sigma_{VB} = \sigma_0 \left( G \cdot W \cdot \bar{\alpha}^{-1} \cdot W' \cdot G' + I \right)$ : estimated MEG covariance

Optimal condition (free-energy maximum): the estimated current is obtained through the estimation gain, $\bar{J}_{VB} = G_J \cdot B$.

Variational Bayesian method: the posterior calculation is converted into free-energy maximization

Free energy = (Likelihood)+ (Model complexity)

$L = -\tfrac{1}{2}\left[\, \left| Y - W \cdot X \right|^2 + \alpha\, W \cdot W' \,\right] = -\tfrac{1}{2}\left[\, Y \cdot Y' - Y \cdot X' \cdot \left( X \cdot X' + \alpha I \right)^{-1} \cdot X \cdot Y' \,\right]$

Likelihood (error) term: L is a decreasing function of α, since the error increases with α.

Model complexity term: an increasing function of α; for finite α and T ≫ 1 it approaches the BIC penalty (below).

BIC:

$H = -\tfrac{1}{2} \log\left| I + \alpha^{-1} X' \cdot X \right| \;\to\; -\tfrac{1}{2}\, N \log T$  for finite α, T ≫ 1

One-dimensional case:

$H = -\tfrac{1}{2} \log\left( 1 + \alpha^{-1} X' \cdot X \right) \;\to\; 0$  as α → ∞

General dimension:

$H = -\tfrac{1}{2} \log\left| I + \alpha^{-1} X' \cdot X \right| \;\to\; -\tfrac{1}{2}\, N_{\mathrm{eff}} \log T$,  where $N_{\mathrm{eff}}$ is the number of finite $\alpha_n$ (T ≫ 1)
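A quick numerical check of these limits in the one-dimensional case (T and the alpha values are illustrative): H(α) tends to 0 as α → ∞, and for finite α with T ≫ 1 it is close to −(1/2) log T, the per-parameter BIC penalty.

# One-dimensional model complexity H(alpha) = -1/2 log(1 + X'X / alpha)
import numpy as np

rng = np.random.default_rng(2)
T = 10_000                      # number of samples (illustrative)
X = rng.normal(size=T)          # single unit-variance feature, so X'X ≈ T
XtX = X @ X

def H(alpha):
    return -0.5 * np.log(1.0 + XtX / alpha)

print(H(1.0), -0.5 * np.log(T))  # finite alpha, T >> 1: H ≈ -(1/2) log T (BIC penalty)
print(H(1e12))                   # alpha -> infinity: H -> 0 (pruned parameter, no penalty)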

Example of sparse estimation for a real problem

I. Nambu, R. Osu, M. Sato, S. Ando, M. Kawato, E. Naito, "Single-trial reconstruction of finger-pinch forces from human motor-cortical activation measured by near-infrared spectroscopy (NIRS)", NeuroImage 47 (2009) 628–637

Sparse estimation (Nambu et al.): estimate the pinching force from 24-channel × 21-second NIRS data.

Estimated weights obtained by brute-force model search vs. by sparse estimation.

Estimated force (force vs. time) reconstructed from the NIRS data.
