
On the sparse estimation

ATR Computational Neuroscience Laboratories

Masa-aki Sato

Contents

1. Why is sparse estimation necessary? Generalization ability for ill-posed problems

2. Why can sparse estimation be achieved? The role of Bayesian estimation

3. An example of sparse estimation for a real problem

Ill-posed problem

A large number of parameters must be estimated from a small number of data points.

• Maximum likelihood (minimum squared error)
– Estimate the optimal parameters that maximize the likelihood
– Overfitting: complex models with many adjustable parameters tend to fit noise in the training data and degrade generalization ability

• Sparse estimation
– Extract relevant features and discard irrelevant features to attain good generalization ability

Function approximation by polynomial

50 data points are randomly generated from a quadratic function with additive noise:

$y = 2x - x^2 + \mathrm{noise}$

Linear parameter model:

$y = w_0 + w_1 x + w_2 x^2 + \cdots + w_N x^N = W \cdot X$

$X = \left[\, 1,\ x,\ x^2,\ \ldots,\ x^N \,\right]$ : input (feature) variables

$W = \left[\, w_0,\ w_1,\ w_2,\ \ldots,\ w_N \,\right]$ : weight parameter vector
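As a concrete illustration, here is a minimal Python/NumPy sketch (the coefficients, noise level, and input range are assumptions; only the quadratic form is given above) that generates the 50 noisy samples and builds the feature vector X = [1, x, ..., x^N]:

import numpy as np

rng = np.random.default_rng(0)

# 50 training samples from a quadratic with additive noise
# (coefficients, noise scale, and input range are illustrative assumptions)
T = 50
x = rng.uniform(-1.0, 1.0, size=T)
y = 2.0 * x - x**2 + 0.1 * rng.normal(size=T)

def design_matrix(x, degree):
    # Stack the feature vector X = [1, x, x^2, ..., x^N] for every sample
    return np.vander(x, degree + 1, increasing=True)  # shape (T, degree + 1)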

Maximum likelihood (Quadratic function)

Find the optimal W by minimizing the squared error:

$\mathrm{error} = \sum_{t=1}^{T} \left( y(t) - W \cdot X(t) \right)^2$

Maximum likelihood (20-degree polynomial)

Overfitting: the optimal W fits the noise in the training data, and generalization is poor.
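Continuing the sketch above, an ordinary least-squares (maximum-likelihood) fit makes the contrast concrete: the quadratic model generalizes well, while the 20-degree polynomial achieves a lower training error by fitting noise and does worse on fresh data (the held-out test set is an assumption used to measure generalization):

# Maximum likelihood: minimize the squared error sum_t ( y(t) - W . X(t) )^2
def fit_ml(x, y, degree):
    X = design_matrix(x, degree)
    W, *_ = np.linalg.lstsq(X, y, rcond=None)
    return W

def mse(x, y, W, degree):
    return np.mean((y - design_matrix(x, degree) @ W) ** 2)

# Held-out samples from the same quadratic, used to measure generalization
x_test = rng.uniform(-1.0, 1.0, size=1000)
y_test = 2.0 * x_test - x_test**2 + 0.1 * rng.normal(size=1000)

for degree in (2, 20):
    W = fit_ml(x, y, degree)
    print(degree, mse(x, y, W, degree), mse(x_test, y_test, W, degree))
    # degree 20: smaller training error, larger test error (overfitting)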

Regularization method (20-degree polynomial)

Minimization: $\mathrm{error} = \sum_{t=1}^{T} \left( y(t) - W \cdot X(t) \right)^2 + \alpha\, |W|^2$

Prior: $P(W) \propto \exp\left( -\tfrac{1}{2}\alpha\, |W|^2 \right)$
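A minimal sketch of this regularized fit, continuing the example above (the penalty weight alpha is illustrative): the penalized squared error has a closed-form minimizer, which is also the MAP estimate under the Gaussian prior above.

# Regularized least squares: minimize sum_t (y(t) - W . X(t))^2 + alpha * |W|^2,
# i.e. the MAP estimate under the Gaussian prior P(W) ∝ exp(-alpha |W|^2 / 2)
def fit_regularized(x, y, degree, alpha):
    X = design_matrix(x, degree)
    n_params = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_params), X.T @ y)

W_reg = fit_regularized(x, y, degree=20, alpha=1e-2)  # alpha chosen for illustration
print(mse(x_test, y_test, W_reg, 20))  # generalizes better than the unregularized fit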

Model selection
• Search for the model that gives the best generalization error by changing the number of parameters (polynomial degree); a sketch of this search follows below
• A combinatorial search is almost impossible for a large number of degrees of freedom

Training error and generalization error vs. number of parameters (polynomial degree).
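A sketch of such a search over the polynomial degree only (held-out error is used here as a stand-in for the generalization error); a combinatorial search over all feature subsets would be infeasible:

# Model selection over the polynomial degree: fit each degree and keep the one
# with the smallest held-out (generalization) error
degrees = range(1, 21)
test_errors = [mse(x_test, y_test, fit_ml(x, y, d), d) for d in degrees]
best_degree = list(degrees)[int(np.argmin(test_errors))]
print(best_degree)  # typically close to 2, the true degree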

Sparse estimation by the Bayesian method
• Parameters are treated as random variables
• The posterior probability is calculated for every possible parameter value
• Estimation is done by integrating over the possible values according to the posterior probability distribution

Sparse estimation (20-degree polynomial)
• A precision parameter $\alpha_n$ is introduced for each weight component $w_n$ and is estimated from the observed data (see the sketch below)

Prior: $P(w_n \mid \alpha_n) \propto \exp\left( -\tfrac{1}{2}\alpha_n w_n^2 \right)$, with $\alpha_n$ left undetermined

Posterior for the 2nd-order weight; posterior for the 3rd-order weight.

This prunes irrelevant features from the model and increases generalization ability.
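A minimal sketch of this estimator using scikit-learn's ARDRegression (a standard automatic-relevance-determination implementation, used here as a stand-in for the method on these slides, not the author's own code): each weight w_n gets its own precision alpha_n, and the weights of irrelevant polynomial orders collapse toward zero.

# Sparse estimation with one precision alpha_n per weight
# (automatic relevance determination via scikit-learn)
from sklearn.linear_model import ARDRegression

X20 = design_matrix(x, 20)
ard = ARDRegression(fit_intercept=False)  # the constant term is already a column of X20
ard.fit(X20, y)
print(np.round(ard.coef_, 3))             # weights of irrelevant orders are pruned toward zero
print(mse(x_test, y_test, ard.coef_, 20))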

Sparse Estimation

MAP / ML Estimation (Maximum a Posteriori / Maximum Likelihood)

• Posterior: $P(\theta \mid X) = \dfrac{P(X \mid \theta)\, P_0(\theta)}{P(X)}$  (likelihood × prior / marginal likelihood)

• Estimate the optimal parameter: $\theta_{\mathrm{MAP}} = \arg\max_{\theta} \log\left( P(X \mid \theta)\, P_0(\theta) \right)$

Posterior $P(\theta \mid X)$ over $(\theta_1, \theta_2)$ for a regular model.

Full Bayesian Estimation

• Estimate the posterior parameter distribution and integrate over the parameters according to the posterior.

• Posterior: $P(\theta \mid X) = \dfrac{P(X \mid \theta)\, P_0(\theta)}{P(X)}$  (likelihood × prior / marginal likelihood)

Marginal likelihood: $P(X) = \int d\theta\; P(X \mid \theta)\, P_0(\theta)$

Estimated parameter: $\bar{\theta} = \int d\theta\; \theta\, P(\theta \mid X)$
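For the linear-Gaussian polynomial model used earlier, these integrals are available in closed form, so the full-Bayesian computation can be written out directly; in the sketch below the noise precision beta and prior precision alpha are assumed fixed (illustrative values) rather than estimated:

# Full Bayesian estimation for the linear model with Gaussian noise and a
# Gaussian prior: the posterior over W is Gaussian, so integration over the
# parameters is exact.
def bayes_posterior(x, y, degree, alpha=1e-2, beta=100.0):
    X = design_matrix(x, degree)
    S_inv = alpha * np.eye(X.shape[1]) + beta * X.T @ X  # posterior precision
    S = np.linalg.inv(S_inv)                             # posterior covariance
    m = beta * S @ X.T @ y                               # posterior mean = estimated parameter
    return m, S

m, S = bayes_posterior(x, y, degree=20)
# Predictive mean at new inputs: integrating W . X(x*) over the posterior gives m . X(x*)
y_pred = design_matrix(x_test, 20) @ m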

Model reduction by the Bayesian method

Mixture-of-Gaussians example

Redundant model (ill-posed problem): the estimation model is a mixture of two Gaussian units

$P(\mathbf{x} \mid \theta) = g_0\, N(\mathbf{x} \mid \theta_1) + (1 - g_0)\, N(\mathbf{x} \mid \theta_2)$

Assume the data are generated by a single Gaussian. The single-unit model corresponds to three cases:

$P(\mathbf{x} \mid \theta) = N(\mathbf{x} \mid \theta_1)$ : $g_0 = 1$, $\theta_2$ arbitrary
$P(\mathbf{x} \mid \theta) = N(\mathbf{x} \mid \theta_2)$ : $g_0 = 0$, $\theta_1$ arbitrary
$P(\mathbf{x} \mid \theta) = N(\mathbf{x} \mid \theta_1)$, $\theta_1 = \theta_2$ : $g_0$ arbitrary

Posterior distribution for the redundant model: $P(\theta \mid X)$ over $(\theta_1, \theta_2)$, shown for $g_0 = 1$ and $g_0 = 0$. The Fisher information matrix becomes singular.

Pruning of redundant parameters

• MAP: complex models explain the given data better than simpler models and attain a higher posterior value, so all parameters are used for prediction.
• Full Bayesian: the reduced (simpler) model dominates after integration over the parameters.

Posterior distribution $P(\theta \mid X)$ for 100 samples generated by a single-Gaussian model, showing the MAP optimum and the reduced models at $g_0 = 0$ and $g_0 = 1$.
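The same behavior can be reproduced numerically with a variational-Bayesian Gaussian mixture; scikit-learn's BayesianGaussianMixture is used below as a convenient stand-in for the full-Bayesian treatment. Fitting a redundant two-component model to data drawn from a single Gaussian drives one mixing weight toward zero, whereas an ordinary maximum-likelihood mixture keeps both components active.

# Redundant two-Gaussian mixture fitted to 100 samples from a single Gaussian;
# the (approximate) Bayesian fit prunes the superfluous component
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)
data = rng.normal(loc=0.0, scale=1.0, size=(100, 1))  # single-Gaussian data

vb = BayesianGaussianMixture(n_components=2, weight_concentration_prior=1e-2,
                             max_iter=500, random_state=0)
vb.fit(data)
print(np.round(vb.weights_, 3))  # one mixing weight is pushed close to zero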

Parameter pruning in Sparse Estimation

Prior in Sparse Estimation Model

$P(w_n \mid \alpha_n) \propto \sqrt{\alpha_n}\, \exp\left( -\tfrac{1}{2}\alpha_n w_n^2 \right)$

$P(\log \alpha_n) = \mathrm{const.}$  (non-informative prior)

$\alpha_n$ controls the precision (width) of the weight parameter distribution.

Prior: $w_n = 0$, $\alpha_n$ arbitrary.

Posterior parameter distribution for $(w_n, \alpha_n)$: posterior for a relevant parameter vs. posterior for an irrelevant parameter ($w_n = 0$, $\alpha_n$ arbitrary). The other parameters are integrated out, and their effects are taken into account.
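A hand-rolled version of this hierarchical estimation for the polynomial example above (an EM-style fixed-point iteration; the noise precision beta and the iteration count are assumptions): relevant weights keep a finite alpha_n, while for irrelevant weights alpha_n grows without bound and the corresponding w_n collapses onto zero.

# Hierarchical prior P(w_n | alpha_n) ∝ sqrt(alpha_n) exp(-alpha_n w_n^2 / 2) with a
# non-informative prior on alpha_n. Fixed point: given the precisions, the weight
# posterior is Gaussian; given that posterior, alpha_n is re-estimated as 1 / <w_n^2>.
def ard_em(X, y, beta=100.0, n_iter=200):
    alpha = np.ones(X.shape[1])
    for _ in range(n_iter):
        S = np.linalg.inv(np.diag(alpha) + beta * X.T @ X)  # posterior covariance
        m = beta * S @ X.T @ y                              # posterior mean
        alpha = 1.0 / (m**2 + np.diag(S))                   # <w_n^2> = m_n^2 + S_nn
    return m, alpha

m_ard, alpha_ard = ard_em(design_matrix(x, 20), y)
# Large alpha_n  ->  w_n pinned at zero  ->  polynomial order n is pruned from the model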

Calculation of posterior distribution

Free energy maximization ⇔ posterior calculation

Distance between the trial posterior $Q(J, \alpha)$ and the true posterior $P(J, \alpha \mid B)$:

$F[Q] = \ln P(B) - \mathrm{KL}\left[\, Q(J, \alpha) \,\|\, P(J, \alpha \mid B) \,\right]$

where $\ln P(B)$ is the log marginal likelihood (evidence).

• Posterior: $P(J, \alpha \mid B) = \dfrac{P(B \mid J)\, P_0(J \mid \alpha)\, P_0(\alpha)}{P(B)}$  (likelihood × hierarchical prior / marginal likelihood)

Marginal likelihood: $P(B) = \int dJ\, d\alpha\; P(B \mid J)\, P_0(J \mid \alpha)\, P_0(\alpha)$

Factorization assumption: $Q(J, \alpha) = Q_J(J)\, Q_\alpha(\alpha)$

Variational Bayesian (VB) method: maximization of F(Q)

• Maximization of F(Q) w.r.t. $Q_J(J)$
• Maximization of F(Q) w.r.t. $Q_\alpha(\alpha)$

Repeated until convergence. At the maximum, the posterior distribution satisfies $Q_J(J) \approx P(J \mid B)$ and the log marginal likelihood satisfies $\log P(B) \approx F(Q)$ (maximized).
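The overall shape of this alternation can be sketched as a generic driver loop; the update and free-energy functions below are hypothetical placeholders for the model-specific closed-form solutions (they are not a library API), and the loop simply alternates the two maximizations until F stops increasing.

# Generic VB alternation: maximize F(Q) w.r.t. one factor with the other held
# fixed, and repeat until the free energy converges.
def variational_bayes(B, update_QJ, update_Qalpha, free_energy,
                      Qalpha_init, tol=1e-8, max_iter=1000):
    Qalpha = Qalpha_init
    F_old = float("-inf")
    for _ in range(max_iter):
        QJ = update_QJ(B, Qalpha)        # maximize F(Q) w.r.t. Q_J(J)
        Qalpha = update_Qalpha(B, QJ)    # maximize F(Q) w.r.t. Q_alpha(alpha)
        F = free_energy(B, QJ, Qalpha)   # non-decreasing across iterations
        if F - F_old < tol:
            break
        F_old = F
    return QJ, Qalpha, F                 # at the maximum: Q_J ≈ P(J | B), F ≈ log P(B)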

Free energy after integration over the current distribution:

$F = -\tfrac{1}{2}\left[\, \mathrm{Tr}\left( \Sigma_{VB}^{-1} \cdot B \cdot B' \right) + \log\left| \Sigma_{VB} \right| \,\right]$

$B \cdot B' = G \cdot J \cdot J' \cdot G' + \sigma_0 I$ : MEG covariance matrix

$\Sigma_{VB} = \sigma_0 \left( G \cdot W \cdot \bar{\alpha}^{-1} \cdot W' \cdot G' + I \right)$ : estimated MEG covariance

Optimal condition (free-energy maximum): the estimated current is obtained through the estimation gain, $\bar{J}_{VB} = G_J \cdot B$.

Variational Bayesian method: the posterior calculation is converted into free-energy maximization

Free energy = (Likelihood)+ (Model complexity)

$L = -\tfrac{1}{2}\left[\, \left| Y - W \cdot X \right|^2 + \alpha\, W \cdot W' \,\right] = -\tfrac{1}{2}\left[\, Y \cdot Y' - Y \cdot X' \cdot \left( X \cdot X' + \alpha I \right)^{-1} \cdot X \cdot Y' \,\right]$

Likelihood (error) term: L is a decreasing function of α, since the error increases with α.

Model complexity term: an increasing function of α; for finite α and T ≫ 1 it approaches the BIC penalty (below).

BIC:

$H = -\tfrac{1}{2} \log\left| I + \alpha^{-1} X' \cdot X \right| \;\to\; -\tfrac{1}{2}\, N \log T$  for finite α, T ≫ 1

One-dimensional case:

$H = -\tfrac{1}{2} \log\left( 1 + \alpha^{-1} X' \cdot X \right) \;\to\; 0$  as α → ∞

General dimension:

$H = -\tfrac{1}{2} \log\left| I + \alpha^{-1} X' \cdot X \right| \;\to\; -\tfrac{1}{2}\, N_{\mathrm{eff}} \log T$,  where $N_{\mathrm{eff}}$ is the number of finite $\alpha_n$ (T ≫ 1)
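A quick numerical check of these limits in the one-dimensional case (T and the alpha values are illustrative): H(α) tends to 0 as α → ∞, and for finite α with T ≫ 1 it is close to −(1/2) log T, the per-parameter BIC penalty.

# One-dimensional model complexity H(alpha) = -1/2 log(1 + X'X / alpha)
import numpy as np

rng = np.random.default_rng(2)
T = 10_000                      # number of samples (illustrative)
X = rng.normal(size=T)          # single unit-variance feature, so X'X ≈ T
XtX = X @ X

def H(alpha):
    return -0.5 * np.log(1.0 + XtX / alpha)

print(H(1.0), -0.5 * np.log(T))  # finite alpha, T >> 1: H ≈ -(1/2) log T (BIC penalty)
print(H(1e12))                   # alpha -> infinity: H -> 0 (pruned parameter, no penalty)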

Example of sparse estimation for a real problem

I. Nambu, R. Osu, M. Sato, S. Ando, M. Kawato, E. Naito, "Single-trial reconstruction of finger-pinch forces from human motor-cortical activation measured by near-infrared spectroscopy (NIRS)", NeuroImage 47 (2009) 628–637

Sparse estimation (Nambu et al.): estimate the pinching force from 24-channel × 21-second NIRS data.

Estimated weights obtained by brute-force model search vs. by sparse estimation.

Estimated force (force vs. time) reconstructed from the NIRS data.
