
ISSN 0965-5425, Computational Mathematics and Mathematical Physics, 2009, Vol. 49, No. 11, pp. 1972–1985. © Pleiades Publishing, Ltd., 2009.
Original Russian Text © D.P. Vetrov, D.A. Kropotov, N.O. Ptashko, 2009, published in Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 2009, Vol. 49, No. 11, pp. 2066–2080.

An Efficient Method for Feature Selection in Linear Regression Based on an Extended Akaike's Information Criterion

D. P. Vetrov^a, D. A. Kropotov^b, and N. O. Ptashko^a

^a Faculty of Computational Mathematics and Cybernetics, Moscow State University, Moscow, 119992 Russia
^b Dorodnicyn Computing Center, Russian Academy of Sciences, ul. Vavilova 40, Moscow, 119333 Russia
e-mail: [email protected], [email protected], [email protected]

Received May 12, 2009

Abstract—A method for feature selection in linear regression based on an extension of Akaike's information criterion is proposed. The use of the classical Akaike's information criterion (AIC) for feature selection assumes an exhaustive search through all subsets of features, which has an unreasonably high computational cost. A new information criterion is proposed that is a continuous extension of AIC. As a result, the feature selection problem is reduced to a smooth optimization problem. An efficient procedure for solving this problem is derived. Experiments show that the proposed method enables one to efficiently select features in linear regression. In the experiments, the proposed procedure is compared with the relevance vector machine, which is a feature selection method based on the Bayesian approach. It is shown that both procedures yield similar results. The main distinction of the proposed method is that certain regularization coefficients are identically zero. This makes it possible to avoid the underfitting effect, which is characteristic of the relevance vector machine. A special case (the so-called nondiagonal regularization) is considered in which both methods are identical.

DOI: 10.1134/S096554250911013X

Key words: pattern recognition, linear regression, feature selection, Akaike's information criterion.

1. INTRODUCTION

Bayesian methods are widely used for constructing procedures of automatic model selection. The relevance vector machine (RVM) proposed in [1] provides an example of using the Bayesian paradigm in linear regression. In RVM, every weight of the regressor in the linear decision rule is assigned an individual regularization coefficient (L2 regularization). The regularization coefficients are found automatically by maximizing the likelihood of the model (its evidence). As a result of this procedure, which is also known as automatic relevance determination (ARD) [2], the majority of regularization coefficients become infinite, which corresponds to zero regressor weights and to removing the corresponding regressors from the model. The L1 regularization, which uses a common regularization coefficient (see [3]), and the regularization based on the improper Jeffreys prior with subsequent integration with respect to the regularization coefficients also demonstrate high sparseness (the number of zero weights) while preserving a good generalization ability (see [4–6]).

The well-known Schwarz–Bayes information criterion [7] can be considered as a coarse approximation of the marginal log-likelihood (evidence) (see [8]). Akaike's information criterion (AIC [9]), which provides an alternative approach based on information theory, is also widely used for the model selection problem. Although this criterion was first proposed for selection from a finite set of models, it can be extended to the case of a continual family of models. In this paper, we propose an extension of Akaike's information criterion for model selection with a subsequent application to linear regression. The regularization coefficient for every individual weight is selected by maximizing a continual analog of Akaike's criterion (EAIC). Of particular interest is the sparseness of the solutions obtained using the proposed method and its comparison with the classical RVM.

In Section 2, we derive the continual analog of Akaike's criterion (EAIC). In Section 3, we demonstrate the application of EAIC to the generalized linear regression problem and derive iterative formulas for recalculating the regularization coefficients. In Section 4, experimental results are discussed and EAIC is compared with the classical RVM.


2. EXTENSION OF AKAIKE’S INFORMATION CRITERION

Here, we describe an extension of AIC to the continuous case. Let a training sample Z = (z_1, …, z_n), z_i ∈ ℝ^d, be given. We want to reconstruct the unknown probability density function p(x) on the elements of the set X, where Z and X are samples of length n from the same probability distribution.

In order to describe the generic scheme of reconstructing p(x), we introduce the concept of a model.

Definition 1. A probability model of algorithms for the reconstruction of densities is a triple ⟨Ω, p(X|w), p(w)⟩, where Ω = {w} is the set of values of the parameters of the probability distribution functions, $p(X|w) = \prod_{i=1}^n p(x_i|w)$ is the likelihood function of the sample X for a fixed w, and p(w) is the a priori distribution on w.

Assume that the a priori distribution depends on a parameter A; i.e., it can be written as p(w|A). Then, we can vary A to obtain a parametric family of models of density reconstruction algorithms {⟨Ω, p(X|w), p(w|A)⟩, A ∈ 𝒜}. In this case, A is called the parameter of the probabilistic model.

Definition 2. The Bayesian estimate of the parameter w is defined as the value w_MP(Z, A) that maximizes the regularized likelihood:

$$w_{MP}(Z, A) \triangleq \arg\max_w p(Z|w)\, p(w|A).$$

Note that the Bayesian estimate depends both on the training sample Z and on the parameter of the probabilistic model A. Let

$$w_n^*(A) \triangleq \mathbb{E}_Z\, w_{MP}(Z, A), \qquad C_n(A) \triangleq \mathbb{E}_Z\, [w_{MP}(Z, A) - w_n^*(A)][w_{MP}(Z, A) - w_n^*(A)]^T,$$

where the expectations are taken over all samples of length n from the given distribution p(x).

Definition 3. The Fisher matrix is defined as

$$F \triangleq -\int \nabla_w \nabla_w \log p(x|w)\, p(x)\, dx.$$

Let F_n = nF. Note that $F_n = -\mathbb{E}_X \nabla_w \nabla_w \log p(X|w) = -\nabla_w \nabla_w \mathbb{E}_X \log p(X|w)$. Below, the symbol ∇ is interpreted as ∇_w.

In the case of a fixed A (i.e., a fixed probabilistic model), we will use p(x|w_MP(Z, A)) as an estimate of p(x).

The parameter of the model can be selected using Akaike's idea to maximize the Kullback information with respect to A:

$$\mathbb{E}_X \mathbb{E}_Z \log p(X|w_{MP}(Z, A)) = \int\!\!\int p(Z)\, p(X) \log p(X|w_{MP}(Z, A))\, dX\, dZ \to \max_A. \tag{2.1}$$

The task of training is to find the value of the parameter A of the probabilistic model that maximizes (2.1). This integral cannot be calculated analytically. We formulate the conditions imposed on the family of probabilistic models under which this expression can be simplified.

Theorem 1. Let A be a symmetric nonnegative definite real square matrix. Assume that, for any A, the probabilistic model Ω(A) satisfies the following assumptions:

(1) $p(X|w) = \prod_{i=1}^n p(x_i|w)$; that is, the objects of the training sample are independent identically distributed random variables;

(2) log p(X|w) is a quadratic function in w;

(3) $p(w|A) = \mathcal{N}(w\,|\,0, A^{-1})$; that is, w has a normal distribution centered at zero with the covariance matrix A^{-1};

(4) the random variables w_MP(X, A) and ∇∇ log p(X|w_MP(X, A)) are independent.

Then, it holds that

$$\mathbb{E}_X \mathbb{E}_Z \log p(X|w_{MP}(Z, A)) = \mathbb{E}_X \log p(X|w_{MP}(X, A)) - \mathrm{tr}\,[(F_n + A)\, C_n(A)]. \tag{2.2}$$


To simplify the notation, we will drop the dependence of w_MP, w_n^*, and C_n on A (this dependence is assumed). Using the theorem on the change of variables under the Lebesgue integral (see [10, Chapter 2, Section 6]), we can rewrite (2.2) in the form

$$\mathbb{E}_X \mathbb{E}_Z \log p(X|w_{MP}(Z)) = \mathbb{E}_{w_{MP}} \mathbb{E}_X \log p(X|w_{MP}). \tag{2.3}$$

Expanding the internal expectation in the Taylor series about the point w_n^*, we obtain

$$\begin{aligned}
\mathbb{E}_{w_{MP}} \mathbb{E}_X \log p(X|w_{MP}) &= \mathbb{E}_{w_{MP}} \mathbb{E}_X \log p(X|w_n^*) + \mathbb{E}_{w_{MP}} [\nabla \mathbb{E}_X \log p(X|w_n^*)]^T (w_{MP} - w_n^*) \\
&\quad + \tfrac{1}{2}\, \mathbb{E}_{w_{MP}} (w_{MP} - w_n^*)^T [\nabla\nabla \mathbb{E}_X \log p(X|w_n^*)] (w_{MP} - w_n^*) \\
&= \mathbb{E}_X \log p(X|w_n^*) - \tfrac{1}{2}\, \mathrm{tr}\, F_n C_n.
\end{aligned} \tag{2.4}$$

To estimate the first term in (2.4), we expand log p(X|w_n^*) in the Taylor series about the point w_MP(X). Then, we have

$$\begin{aligned}
\mathbb{E}_X \log p(X|w_n^*) &= \mathbb{E}_X \log p(X|w_{MP}(X)) + \mathbb{E}_X \{[\nabla \log p(X|w_{MP}(X))]^T [w_n^* - w_{MP}(X)]\} \\
&\quad + \tfrac{1}{2}\, \mathbb{E}_X \{[w_n^* - w_{MP}(X)]^T\, \nabla\nabla \log p(X|w_{MP}(X))\, [w_n^* - w_{MP}(X)]\}.
\end{aligned} \tag{2.5}$$

Because log p(X|w) is quadratic in w, it is easy to verify that

$$\nabla \log p(X|w_{MP}) = A w_{MP}.$$

Using this fact and the independence condition of the random variables w_MP(X) and ∇∇ log p(X|w_MP(X)), we can simplify the two last terms in (2.5):

$$\begin{aligned}
&\mathbb{E}_X [\nabla \log p(X|w_{MP}(X))]^T [w_n^* - w_{MP}(X)] + \tfrac{1}{2}\, \mathbb{E}_X [w_n^* - w_{MP}(X)]^T\, \nabla\nabla \log p(X|w_{MP}(X))\, [w_n^* - w_{MP}(X)] \\
&\qquad = \mathbb{E}_X\, w_{MP}^T(X)\, A\, [w_n^* - w_{MP}(X)] + \tfrac{1}{2}\, \mathrm{tr}\left\{\mathbb{E}_X \nabla\nabla \log p(X|w_{MP}(X))\; \mathbb{E}_X [w_n^* - w_{MP}(X)][w_n^* - w_{MP}(X)]^T\right\} \\
&\qquad = -\mathrm{tr}\, A C_n - \tfrac{1}{2}\, \mathrm{tr}\, F_n C_n.
\end{aligned}$$

Substitute this expression into (2.5) and combine the result with (2.4) to obtain

$$\begin{aligned}
\mathbb{E}_Z \mathbb{E}_X \log p(X|w_{MP}(Z, A)) &= \mathbb{E}_X \log p(X|w_{MP}(X, A)) - \mathrm{tr}\, A C_n - \tfrac{1}{2}\, \mathrm{tr}\, F_n C_n - \tfrac{1}{2}\, \mathrm{tr}\, F_n C_n \\
&= \mathbb{E}_X \log p(X|w_{MP}(X, A)) - \mathrm{tr}\,[(F_n + A)\, C_n],
\end{aligned} \tag{2.6}$$

which proves the theorem.

Corollary 1. When criterion (2.1) is used in recognition problems, it can be approximately calculated by the formula

$$\mathbb{E}_X \mathbb{E}_Z \log p(X|w_{MP}(Z, A)) \approx \log p(Z|w_{MP}(Z, A)) - \mathrm{tr}\,[(H(Z) + A)^{-1} H(Z)], \tag{2.7}$$

where $H(Z) = -\nabla\nabla \log p(Z|w) = -\sum_{i=1}^n \nabla\nabla \log p(z_i|w)$ is the Hessian of the negative log-likelihood.

We show that $C_n \approx (F_n + A)^{-1} F_n (F_n + A)^{-1}$. Denote by w_ML(X) the maximum likelihood estimate on the sample X. It is known (see [11]) that $w_{ML} \sim \mathcal{N}(w_{ML}\,|\,w_*, F_n^{-1})$, where $w_* = \arg\max_w \int p(x) \log p(x|w)\, dx$. When log p(x|w) is quadratic in w, it is easy to verify that

$$w_{MP}(X, A) = [H(X) + A]^{-1} H(X)\, w_{ML}(X), \tag{2.8}$$

where $H(X) = -\sum_{i=1}^n \nabla\nabla \log p(x_i|w)$.


Taking into account the fact that $-\mathbb{E}\, \nabla\nabla \log p(x_i|w) = F$ and using the law of large numbers (see [10, Chapter III, Section 3]), we obtain

$$\forall \varepsilon > 0: \quad P\!\left(\frac{\|H(X) - F_n\|}{n} \ge \varepsilon\right) \to 0 \quad \text{as} \quad n \to \infty. \tag{2.9}$$

Consider the set $I_n(\varepsilon) = \left\{X : \dfrac{\|H(X) - F_n\|}{n} \ge \varepsilon\right\}$. It follows from (2.9) that P(I_n(ε)) → 0 as n → ∞; moreover, on the set ℝ^{n×d}\I_n(ε), we have

$$\frac{H(X)}{n} = \frac{F_n}{n} + \delta_1(n), \quad \text{where} \quad \|\delta_1(n)\| \to 0 \quad \text{as} \quad n \to \infty.$$

For fixed ε > 0 and n > 0, we have

$$\mathbb{E}\, w_{MP}(X) = \int_{\mathbb{R}^{n\times d}\setminus I_n(\varepsilon)} w_{MP}(X)\, p(X)\, dX + \int_{I_n(\varepsilon)} w_{MP}(X)\, p(X)\, dX. \tag{2.10}$$

Assuming that w_MP(X) is bounded, we apply the mean value theorem to the second term in (2.10) to obtain

$$\mathbb{E}\, w_{MP}(X) = \int_{\mathbb{R}^{n\times d}\setminus I_n(\varepsilon)} [H(X) + A]^{-1} H(X)\, w_{ML}(X)\, p(X)\, dX + L_{w_{MP}} P(I_n(\varepsilon)), \tag{2.11}$$

where $L_{w_{MP}}$ is a positive constant. Note that, at sufficiently large n, we have the bound $\|\delta_1(n)\| \le \|(F_n/n + A/n)^{-1}\|^{-1}$. Then, we have the expansion (see [12, Chapter 5, Section 6]) $[F_n/n + A/n + \delta_1(n)]^{-1} = (F_n/n + A/n)^{-1} + \delta_2(n)$, where $\|\delta_2(n)\| \to 0$ as n → ∞. Furthermore,

$$\begin{aligned}
\int_{\mathbb{R}^{n\times d}\setminus I_n(\varepsilon)} [H(X) + A]^{-1} H(X)\, w_{ML}(X)\, p(X)\, dX
&= \int_{\mathbb{R}^{n\times d}\setminus I_n(\varepsilon)} \left[\frac{H(X)}{n} + \frac{A}{n}\right]^{-1} \frac{H(X)}{n}\, w_{ML}(X)\, p(X)\, dX \\
&= \int_{\mathbb{R}^{n\times d}\setminus I_n(\varepsilon)} \left[\frac{F_n}{n} + \frac{A}{n} + \delta_1(n)\right]^{-1} \left[\frac{F_n}{n} + \delta_1(n)\right] w_{ML}(X)\, p(X)\, dX \\
&= \int_{\mathbb{R}^{n\times d}\setminus I_n(\varepsilon)} \left[\left(\frac{F_n}{n} + \frac{A}{n}\right)^{-1} + \delta_2(n)\right] \left[\frac{F_n}{n} + \delta_1(n)\right] w_{ML}(X)\, p(X)\, dX \\
&= \int_{\mathbb{R}^{n\times d}\setminus I_n(\varepsilon)} (F_n + A)^{-1} F_n\, w_{ML}(X)\, p(X)\, dX + \int_{\mathbb{R}^{n\times d}\setminus I_n(\varepsilon)} \delta_3(n)\, p(X)\, dX,
\end{aligned} \tag{2.12}$$

where $\|\delta_3(n)\| \to 0$ as n → ∞. Therefore, we obtain

$$\mathbb{E}\, w_{MP}(X) = \int_{\mathbb{R}^{n\times d}\setminus I_n(\varepsilon)} (F_n + A)^{-1} F_n\, w_{ML}(X)\, p(X)\, dX + \int_{\mathbb{R}^{n\times d}\setminus I_n(\varepsilon)} \delta_3(n)\, p(X)\, dX + L_{w_{MP}} P(I_n(\varepsilon)). \tag{2.13}$$


Due to the facts that $L_{w_{MP}}$ is a positive constant, P(I_n(ε)) → 0, and ‖δ_3(n)‖ → 0 as n → ∞, the set ℝ^{n×d}\I_n(ε) tends to the set ℝ^{n×d} when the size of the sample increases, and the two last terms in (2.13) vanish. Neglecting the two last terms, we obtain the following approximation for 𝔼 w_MP(X):

$$\mathbb{E}\, w_{MP}(X) \approx (F_n + A)^{-1} F_n\, \mathbb{E}\, w_{ML}(X).$$

Similarly, we can obtain an approximation for $\mathbb{E}\, w_{MP}(X)\, w_{MP}^T(X)$:

$$\mathbb{E}\, w_{MP}(X)\, w_{MP}^T(X) \approx (F_n + A)^{-1} F_n\, \mathbb{E}\,[w_{ML}(X)\, w_{ML}^T(X)]\, F_n (F_n + A)^{-1}.$$

Using the equality $F_n^{-1} = \mathbb{E}\, w_{ML}(X)\, w_{ML}^T(X) - (\mathbb{E}\, w_{ML}(X))(\mathbb{E}\, w_{ML}(X))^T$, we obtain

$$\begin{aligned}
C_n &= \mathbb{E}\, w_{MP}(X)\, w_{MP}^T(X) - (\mathbb{E}\, w_{MP}(X))(\mathbb{E}\, w_{MP}(X))^T \\
&\approx (F_n + A)^{-1} F_n \left[\mathbb{E}\, w_{ML}(X)\, w_{ML}^T(X) - (\mathbb{E}\, w_{ML}(X))(\mathbb{E}\, w_{ML}(X))^T\right] F_n (F_n + A)^{-1} \\
&= (F_n + A)^{-1} F_n F_n^{-1} F_n (F_n + A)^{-1} = (F_n + A)^{-1} F_n (F_n + A)^{-1}.
\end{aligned} \tag{2.14}$$

Substituting the approximate value of C_n into the second term in (2.2), we obtain

$$\mathbb{E}_X \mathbb{E}_Z \log p(X|w_{MP}(Z, A)) = \mathbb{E}_X \log p(X|w_{MP}(X, A)) - \mathrm{tr}\,[(F_n + A)^{-1} F_n].$$

Substituting in this expression the training sample Z for X and the unbiased estimate H(Z) for the matrix F_n, we obtain the assertion of Corollary 1.

Note that, if A is a diagonal matrix with the entries 0 or +∞, then tr(F_n(F_n + A)^{-1}) = k, where k is the number of zero diagonal entries in A, i.e., the number of retained regressors. In this case, (2.7) is equivalent to the classical AIC up to an infinitesimally small term. Such a continuous extension of Akaike's criterion (EAIC) can also be considered as a particular case of the deviance information criterion described in [13].
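As an illustration of the last remark (added here as a reading aid, not part of the original text), the following sketch builds a small Gaussian linear model and checks numerically that tr[(H + A)^{-1} H] coincides with the number of basis functions whose regularization coefficients are zero when every α_j is either 0 or effectively infinite; the sample size, dimension, and noise level are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma2 = 50, 6, 0.25

Phi = rng.normal(size=(n, m))        # design matrix of basis functions
H = Phi.T @ Phi / sigma2             # H = sigma^{-2} Phi^T Phi

# alpha_j = 0 keeps a basis function; a huge alpha_j effectively removes it
alpha = np.array([0.0, 0.0, 1e12, 0.0, 1e12, 1e12])
A = np.diag(alpha)

penalty = np.trace(np.linalg.solve(H + A, H))   # tr[(H + A)^{-1} H]
k = int(np.sum(alpha == 0))                     # number of retained basis functions

print(round(float(penalty), 6), k)              # the two numbers coincide
```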

3. APPLICATION OF EAIC IN THE GENERALIZED LINEAR REGRESSION PROBLEM

3.1. Statement of the Optimization Problem

Consider the classical generalized linear regression problem. Let (X, t) = {(x_1, t_1), …, (x_n, t_n)} be the training sample, where $x_i = (x_i^1, \dots, x_i^d) \in \mathbb{R}^d$ is the vector of observed features of the object and $t_i \in \mathbb{R}$ is the value of the dependent variable. Let us fix a set of basis functions {φ_1(x), …, φ_m(x)}, where $\varphi_j : \mathbb{R}^d \to \mathbb{R}$. We want to find the vector of weights $w \in \mathbb{R}^m$ such that the function

$$y(x) = w^T \varphi(x) = \sum_{j=1}^m w_j \varphi_j(x)$$

approximates the values of t at the objects of the training sample X. Let $\Phi = (\varphi_{ij})_{n\times m} = (\varphi_j(x_i))_{n\times m}$ be the matrix of the basis functions calculated for each object of the training sample. The classical approach to learning the linear regression is to optimize the regularized likelihood

$$w_{MP} = \arg\max_w p(t|X, w)\, p(w|\alpha), \tag{3.1}$$

where

$$p(t|X, w) = \frac{1}{(2\pi)^{n/2} \sigma^n} \exp\!\left(-\frac{1}{2\sigma^2} \|\Phi w - t\|^2\right) \tag{3.2}$$

is the likelihood function and

$$p(w|\alpha) = \left(\frac{\alpha}{2\pi}\right)^{m/2} \exp\!\left(-\frac{\alpha}{2} \|w\|^2\right)$$


is the a priori distribution of the weights. The a priori distribution acts as a regularizer that penalizes large values of w. A more general family of regularizers is considered in the RVM [1], where an individual regularization coefficient is used for each weight w_j and the a priori distribution function is

$$p(w|\alpha) = \prod_{j=1}^m \sqrt{\frac{\alpha_j}{2\pi}} \exp\!\left(-\frac{\alpha_j}{2} w_j^2\right) = \frac{\sqrt{\det A}}{(2\pi)^{m/2}} \exp\!\left(-\frac{1}{2} w^T A w\right), \tag{3.3}$$

where A = diag(α_1, …, α_m) is the regularization matrix and α_j ≥ 0. Such an a priori distribution makes it possible to select basis functions. If α_j = 0, no additional constraints on the weight w_MP,j are imposed, and its value coincides with the maximizer of the likelihood w_ML,j. If the regularization parameter α_j tends to positive infinity, the corresponding basis function φ_j(·) is removed from the model because its weight w_MP,j = 0. Therefore, a priori distribution (3.3), together with likelihood function (3.2) and Bayesian estimation method (3.1), enables one to solve the problem of basis function selection. This problem turns into the problem of feature selection if the original features are used as the basis functions, i.e., when φ_j(x) = x^j. If the kernel or potential functions centered at the objects of the training sample are used as the basis functions (φ_j(x) = K(x_j, x)), this approach makes it possible to select relevant objects.

Note that the problem of the regression reconstruction in the statistical statement can be considered as a particular case of the density reconstruction problem. Using the functions introduced above, we describe the learning procedure (the selection of w and α) in terms of the probabilistic density reconstruction algorithms. The parametric family of probabilistic models can be written in the form

$$\{\langle \mathbb{R}^m,\, p(t|X, w),\, p(w|\alpha)\rangle,\; \alpha \in \mathbb{R}^m\}. \tag{3.4}$$

If the probabilistic model α is fixed, we choose the Bayesian estimate w_MP(X, α) of the weight vector w as the solution of the problem. Let us discuss how α can be selected.

The family of probabilistic models (3.4) satisfies the assumptions of Theorem 1. For that reason, we use the corollary to this theorem to select the best model α:

$$\alpha = \arg\max_\alpha f(\alpha) = \arg\max_\alpha \{\log p(t|X, w_{MP}) - \mathrm{tr}[H (H + A)^{-1}]\} = \arg\max_\alpha \{\ell(w_{MP}) - \mathrm{tr}[H (H + A)^{-1}]\}. \tag{3.5}$$

Here, $H = -\nabla\nabla \log p(t|X, w) = \sigma^{-2} \Phi^T \Phi$ and $\ell(w) = \log p(t|X, w)$ denotes the log-likelihood.
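As a reading aid (an added sketch, not from the original text), here is how the objective in (3.5) can be evaluated for a given diagonal A in the linear-Gaussian setting of this section; the toy data at the bottom are arbitrary.

```python
import numpy as np

def eaic_objective(Phi, t, alpha, sigma2):
    """Evaluate f(alpha) = log p(t | X, w_MP) - tr[H (H + A)^{-1}] for diagonal A = diag(alpha)."""
    n, m = Phi.shape
    H = Phi.T @ Phi / sigma2                              # H = sigma^{-2} Phi^T Phi
    A = np.diag(alpha)
    # Bayesian estimate of the weights: w_MP = (H + A)^{-1} sigma^{-2} Phi^T t
    w_mp = np.linalg.solve(H + A, Phi.T @ t / sigma2)
    resid = t - Phi @ w_mp
    log_lik = -0.5 * n * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)
    penalty = np.trace(np.linalg.solve(H + A, H))         # tr[(H + A)^{-1} H] = tr[H (H + A)^{-1}]
    return log_lik - penalty

# toy usage with arbitrary numbers
rng = np.random.default_rng(1)
Phi = rng.normal(size=(30, 4))
t = Phi @ np.array([1.0, 0.0, -2.0, 0.0]) + 0.1 * rng.normal(size=30)
print(eaic_objective(Phi, t, alpha=np.ones(4), sigma2=0.01))
```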

3.2. Optimization Procedure

We will seek the solution of optimization problem (3.5) using the coordinate descent method separately for each component α_j. To derive the iterative equations for the recalculation of α_j, we use the block matrix inversion identity

$$\begin{pmatrix} P & Q \\ R & S \end{pmatrix}^{-1} = \begin{pmatrix} P^{-1} + P^{-1} Q B R P^{-1} & -P^{-1} Q B \\ -B R P^{-1} & B \end{pmatrix}. \tag{3.6}$$

Here, $P \in \mathbb{R}^{p\times p}$, $Q \in \mathbb{R}^{p\times q}$, $R \in \mathbb{R}^{q\times p}$, and $S \in \mathbb{R}^{q\times q}$ are certain matrices and $B = (S - R P^{-1} Q)^{-1}$ is the Schur complement.

Next, we apply this identity to the matrix (H + A) written in the form

$$(H + A) = \begin{pmatrix} P & q \\ q^T & h_{mm} + \alpha_m \end{pmatrix}.$$

To simplify the presentation without loss of generality, we derive the iterative equations for α_m. The other components of the vector α are recalculated similarly. Using (3.6), we obtain

$$(H + A)^{-1} = \begin{pmatrix} P^{-1} + \beta_m P^{-1} q q^T P^{-1} & -\beta_m P^{-1} q \\ -\beta_m q^T P^{-1} & \beta_m \end{pmatrix},$$


where $\beta_m = (h_{mm} + \alpha_m - q^T P^{-1} q)^{-1}$ is the scalar Schur complement. Then, tr[H(H + A)^{-1}] can be represented as a function of α_m under the condition that the other α_j are fixed:

$$\mathrm{tr}[H (H + A)^{-1}] = \mathrm{tr}(H_{(m)} P^{-1}) + \beta_m \left[q^T P^{-1} H_{(m)} P^{-1} q - 2 q^T P^{-1} q + h_{mm}\right].$$

Here, the subscript (m) denotes a vector (or matrix) from which the mth row (row and column) is removed. Note that tr(H_{(m)} P^{-1}) is exactly identical to tr[H(H + A)^{-1}] when α_m = +∞, that is, when the mth basis function is removed from the model.

Consider the difference between the maximizer of the a posteriori distribution w_MP and the maximizer of the a posteriori distribution for the infinitely large regularization coefficient at the mth basis function (the values of the other components of α remain unchanged) $w_{MP}^* = w_{MP}\big|_{\alpha_m = +\infty} \in \mathbb{R}^m$. Denote ψ = H w_ML. Using the relation $w_{MP} = (H + A)^{-1} H w_{ML}$, we obtain

$$\begin{aligned}
w_{MP} - w_{MP}^* &= \left[(H + A)^{-1} - (H + A)^{-1}\big|_{\alpha_m = +\infty}\right]\psi \\
&= \left[\begin{pmatrix} P^{-1} + \beta_m P^{-1} q q^T P^{-1} & -\beta_m P^{-1} q \\ -\beta_m q^T P^{-1} & \beta_m \end{pmatrix} - \begin{pmatrix} P^{-1} & 0 \\ 0 & 0 \end{pmatrix}\right] \begin{pmatrix} \psi_{(m)} \\ \psi_m \end{pmatrix} \\
&= \beta_m \begin{pmatrix} P^{-1} q\, q^T P^{-1} \psi_{(m)} - \psi_m P^{-1} q \\ -q^T P^{-1} \psi_{(m)} + \psi_m \end{pmatrix} = \beta_m \xi_m.
\end{aligned} \tag{3.7}$$

Consider the difference between the values of the log-likelihood at the points w_MP and w_MP^*:

$$\ell(w_{MP}) - \ell(w_{MP}^*) = \beta_m \nabla \ell(w_{MP}^*)^T \xi_m - \frac{\beta_m^2}{2}\, \xi_m^T H \xi_m = \beta_m \zeta_m^T \xi_m - \frac{\beta_m^2}{2}\, \xi_m^T H \xi_m.$$

Using (2.8), we write the gradient in the form

$$\zeta_m = \nabla \ell(w_{MP}^*) = -H (w_{MP}^* - w_{ML}) = -\left[H (H + A)^{-1}\big|_{\alpha_m = +\infty} - I\right]\psi = -\begin{pmatrix} (H_{(m)} P^{-1} - I)\, \psi_{(m)} \\ q^T P^{-1} \psi_{(m)} - \psi_m \end{pmatrix}. \tag{3.8}$$

As a result, the value of the EAIC as a function of β_m for fixed α_j (j ≠ m) is

$$f(\beta_m) = f(0) - \frac{1}{2} \beta_m^2\, \xi_m^T H \xi_m + \beta_m \zeta_m^T \xi_m - \beta_m q^T P^{-1} H_{(m)} P^{-1} q + 2 \beta_m q^T P^{-1} q - \beta_m h_{mm}. \tag{3.9}$$

This criterion is quadratic in β_m; therefore, it has a unique maximizer, which is given by the formula

$$\beta_m^* = \frac{\zeta_m^T \xi_m - q^T P^{-1} H_{(m)} P^{-1} q + 2 q^T P^{-1} q - h_{mm}}{\xi_m^T H \xi_m} = \frac{b}{a}.$$

Using the expressions for ξ_m (3.7) and ζ_m (3.8), a and b are found as

$$a = \left(q^T P^{-1} \psi_{(m)} - \psi_m\right)^2 \left(q^T P^{-1} H_{(m)} P^{-1} q - 2 q^T P^{-1} q + h_{mm}\right), \tag{3.10}$$

$$b = -\left(\psi_{(m)}^T P^{-1} H_{(m)} P^{-1} q - 2 \psi_{(m)}^T P^{-1} q + \psi_m\right)\left(q^T P^{-1} \psi_{(m)} - \psi_m\right) - q^T P^{-1} H_{(m)} P^{-1} q + 2 q^T P^{-1} q - h_{mm}. \tag{3.11}$$

Let us pass from the auxiliary variable β_m to the original variable α_m:

$$\beta_m = \left(h_{mm} + \alpha_m - q^T P^{-1} q\right)^{-1}.$$


Therefore,

$$\alpha_m^* = q^T P^{-1} q - h_{mm} + \frac{1}{\beta_m^*}.$$

Using this formula, one must take into account the fact that α_m ≥ 0. The dependence of the EAIC on α_m has a characteristic shape of an iris shown in Fig. 1. In case (a), the maximum is attained for a positive α_j. In case (b), the criterion monotonically decreases in the region α_j ≥ 0; therefore, the optimal value is α_j = 0. In case (c), the optimal nonnegative value of α_j is +∞. The criterion is minus infinity when the matrix H + A is singular, that is, when α_m = q^T P^{-1} q − h_mm. Due to a property of the Schur complement, q^T P^{-1} q − h_mm ≤ 0; that is, the "stem" always corresponds to nonpositive α_m. Depending on the mutual location of the maximizer and the stem, α_m is recalculated differently:

$$\alpha_m^{(\mathrm{new})} = \begin{cases} \alpha_m^*, & \alpha_m^* \ge 0, \\ 0, & q^T P^{-1} q - h_{mm} < \alpha_m^* < 0, \\ +\infty, & \alpha_m^* < q^T P^{-1} q - h_{mm}. \end{cases} \tag{3.12}$$

For α_j (j ≠ m), we have similar formulas.

To select the parameter σ², we differentiate the EAIC with respect to σ^{-2}. Equating the derivative to zero, we obtain the recalculation formula

$$\sigma^{2\,(\mathrm{new})} = \frac{\|t - \Phi w_{MP}\|^2}{n - 2\, \mathrm{tr}\!\left(A (H + A)^{-1} H (H + A)^{-1}\right)}. \tag{3.13}$$

Therefore, we have proved the following result.

Theorem 2. The relations

$$\arg\max_{\alpha_j \ge 0} f(\alpha_j) = \alpha_j^{(\mathrm{new})}, \quad j = 1, 2, \dots, m, \qquad \arg\max_{\sigma^2 \ge 0} f(\sigma^2) = \sigma^{2\,(\mathrm{new})}$$

hold, where $\alpha_j^{(\mathrm{new})}$ and $\sigma^{2\,(\mathrm{new})}$ are calculated by formulas (3.12) and (3.13), respectively. Formula (3.12) assumes that the criterion is optimized successively in each component α_j (j = 1, 2, …, m) when the other components of α are fixed.

Fig. 1.
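The three-way rule (3.12) is easy to implement directly; the following small function is an added illustration (not from the original text), with a and b assumed to be precomputed from (3.10) and (3.11) and qPq standing for q^T P^{-1} q.

```python
def recalc_alpha(qPq, h_mm, a, b):
    """Case analysis of (3.12); qPq = q^T P^{-1} q, a and b as in (3.10)-(3.11)."""
    alpha_star = qPq - h_mm + a / b    # maximizer of the criterion, since beta_m^* = b / a
    stem = qPq - h_mm                  # value at which H + A becomes singular (always <= 0)
    if alpha_star >= 0:
        return alpha_star              # case (a): interior maximum at a positive alpha
    if alpha_star > stem:
        return 0.0                     # case (b): the criterion decreases on [0, +inf)
    return float("inf")                # case (c): the basis function is removed
```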


Using this result, we can construct an iterative process for the criterion optimization. At each step, the parameter α_j that provides the maximum increase of the criterion is optimized (see Algorithm 1). Such an algorithm is akin to the RVM learning proposed in [14].

Algorithm 1 (EAIC-RVM)

Input: the training sample (X, t) = {x_i, t_i}_{i=1}^n (x_i ∈ ℝ^d, t_i ∈ ℝ) and the set of basis functions {φ_j(x)}_{j=1}^m.

Initialize α_j = +∞ for all j = 1, 2, …, m, σ² = σ_0², Φ = {φ_j(x_i)}_{i,j=1}^{n,m}, and A = diag(α_1, …, α_m).
Find the maximum of the log-likelihood w_ML = (Φ^TΦ)^{-1}Φ^T t.
repeat
    Calculate H = σ^{-2}Φ^TΦ and ψ = H w_ML.
    for j = 1, 2, …, m do
        Calculate H_(j), P^{-1} = (H_(j) + A_(j))^{-1}, and q (the jth column of H without the jth element).
        Calculate a and b using (3.10) and (3.11).
        Calculate the optimal α_j^* = q^T P^{-1} q − h_jj + a/b and the current increment of the EAIC Δ_j = b²/(2a).
        if α_j^* < 0 then
            if α_j^* > q^T P^{-1} q − h_jj then
                α_j^* = 0, β_0 = 1/(h_jj − q^T P^{-1} q), Δ_j = β_0 b − a β_0²/2.
            else
                α_j^* = +∞, Δ_j = 0.
            end if
        end if
        if α_j ≠ +∞ then
            β_old = 1/(h_jj − q^T P^{-1} q + α_j), Δ_j^old = β_old b − a β_old²/2, Δ_j = Δ_j − Δ_j^old.
        end if
    end for
    Find j* = argmax_j Δ_j and set α_{j*} = α_{j*}^*.
    Calculate A = diag(α_1, …, α_m), w_MP = (H + A)^{-1} H w_ML, and recalculate σ² using (3.13).
until the process converges
Output: the decision rule for a new object x: f(x) = Σ_{j=1}^m w_MP,j φ_j(x).
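For readers who prefer executable code, here is a rough Python transcription of Algorithm 1. It is an illustrative sketch only, not the authors' implementation: +∞ is represented by a large finite constant, degenerate coordinates are simply skipped, and the stopping rule is a fixed iteration budget combined with a small-gain check.

```python
import numpy as np

BIG = 1e12  # finite surrogate for alpha_j = +infinity (an implementation convenience)

def eaic_rvm(Phi, t, sigma2, n_iter=50):
    """Rough transcription of Algorithm 1 (EAIC-RVM); illustrative sketch only."""
    n, m = Phi.shape
    alpha = np.full(m, BIG)
    w_ml = np.linalg.lstsq(Phi, t, rcond=None)[0]           # maximum-likelihood weights
    H = Phi.T @ Phi / sigma2
    for _ in range(n_iter):
        H = Phi.T @ Phi / sigma2
        psi = H @ w_ml
        best_gain, best_j, best_alpha = 0.0, None, None
        for j in range(m):
            idx = [k for k in range(m) if k != j]
            P = H[np.ix_(idx, idx)] + np.diag(alpha[idx])    # H_(j) + A_(j)
            q = H[idx, j]                                    # column j of H without element j
            h_jj = H[j, j]
            Pq = np.linalg.solve(P, q)                       # P^{-1} q
            Ppsi = np.linalg.solve(P, psi[idx])              # P^{-1} psi_(j)
            Hj = H[np.ix_(idx, idx)]                         # H_(j)
            c = q @ Ppsi - psi[j]                            # q^T P^{-1} psi_(j) - psi_j
            s = Pq @ Hj @ Pq - 2 * q @ Pq + h_jj
            a = c * c * s                                    # (3.10)
            b = -(Ppsi @ Hj @ Pq - 2 * psi[idx] @ Pq + psi[j]) * c - s   # (3.11)
            if a <= 0 or b == 0:
                continue                                     # skip degenerate coordinates
            alpha_star = q @ Pq - h_jj + a / b
            stem = q @ Pq - h_jj
            if alpha_star >= 0:
                new_alpha, gain = alpha_star, b * b / (2 * a)
            elif alpha_star > stem:
                beta0 = 1.0 / (h_jj - q @ Pq)
                new_alpha, gain = 0.0, beta0 * b - a * beta0 ** 2 / 2
            else:
                new_alpha, gain = BIG, 0.0
            beta_old = 1.0 / (h_jj - q @ Pq + alpha[j])
            gain -= beta_old * b - a * beta_old ** 2 / 2     # increment relative to the current alpha_j
            if gain > best_gain:
                best_gain, best_j, best_alpha = gain, j, new_alpha
        if best_j is None or best_gain < 1e-10:
            break                                            # no coordinate improves the criterion
        alpha[best_j] = best_alpha
        A = np.diag(alpha)
        w_mp = np.linalg.solve(H + A, H @ w_ml)
        S = np.linalg.solve(H + A, H) @ np.linalg.inv(H + A)             # (H+A)^{-1} H (H+A)^{-1}
        sigma2 = np.sum((t - Phi @ w_mp) ** 2) / (n - 2 * np.trace(A @ S))   # (3.13)
    w_mp = np.linalg.solve(H + np.diag(alpha), H @ w_ml)
    return w_mp, alpha, sigma2
```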

3.3. Nondiagonal Regularization

An alternative approach to determining the regularization coefficients α was proposed in the RVM [15]. This method uses the Bayesian paradigm for estimating the parameters, and the regularization coefficients are found by maximizing the likelihood of the model (evidence):

$$EV(\alpha) = \int p(t|X, w)\, p(w|\alpha)\, dw \to \max_\alpha. \tag{3.14}$$

Here, the likelihood function and the a priori distribution function are chosen as before by formulas (3.2) and (3.3). The evidence is maximized using an iterative procedure in which the integrand is approximated by a normal distribution.

Another method for optimizing the evidence is based on the nondiagonal regularization approach (see [16]). In this case, it is assumed that the regularization matrix A is a diagonal one in the basis consisting of


the eigenvectors of the Hessian of the log-likelihood. Then, we can pass to the new variables u that are linear combinations of w such that the matrices H, A, and H + A are diagonal in the new basis. This change of variables considerably simplifies the optimization of the evidence because multidimensional integral (3.14) becomes a product of one-dimensional integrals, each of which depends on a single regularization parameter α_j. As a result, the iterative process converges in one iteration step because the optimal values of the regularization coefficients are independent of each other. Moreover, in distinction from the RVM and ridge regression, this procedure is invariant under linear transformations of the basis functions (that is, the result of regression learning is the same under an arbitrary nonsingular linear transformation of φ(x)).

The experimental comparison of the classical RVM and the method proposed in this paper based on the EAIC is performed in the next section. In this section, we show that both approaches are equivalent in the case of the nondiagonal regularization.

Assuming that H + A is a diagonal matrix and, therefore, q = 0, we obtain

$$\beta_m = \frac{1}{h_{mm} + \alpha_m}, \qquad \xi_m = \begin{pmatrix} 0 \\ h_{mm} u_{ML,m} \end{pmatrix}, \qquad \zeta_m^T \xi_m = (h_{mm} u_{ML,m})^2.$$

The optimal β is determined by the formula

$$\beta_m^* = \frac{(h_{mm} u_{ML,m})^2 - h_{mm}}{h_{mm}^3\, u_{ML,m}^2}.$$

Hence, we have

$$\alpha_m = \begin{cases} \dfrac{h_{mm}}{h_{mm} u_{ML,m}^2 - 1}, & h_{mm} u_{ML,m}^2 > 1, \\[2mm] +\infty, & 0 < h_{mm} u_{ML,m}^2 < 1. \end{cases} \tag{3.15}$$

The case corresponding to α_j = 0 (see Fig. 1b) is impossible because $h_{mm} u_{ML,m}^2$ is always nonnegative. Thus, we have the same expressions for α_j as in the case when the optimization of the evidence is performed simultaneously with the adjustment of the regularization coefficients associated with the eigenvectors of the Hessian (see [16]). Therefore, the nondiagonal regularization based on Akaike's criterion is equivalent to the Bayesian nondiagonal regularization in the regression problem.
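A sketch of this closed form (an added illustration; the eigenbasis construction follows my reading of the nondiagonal regularization idea rather than the exact recipe of [16]): diagonalize H, express the maximum-likelihood weights in the new variables u, and assign each α by (3.15).

```python
import numpy as np

def nondiagonal_alphas(Phi, t, sigma2):
    """Regularization coefficients by (3.15) in the eigenbasis of H = sigma^{-2} Phi^T Phi (sketch)."""
    H = Phi.T @ Phi / sigma2
    w_ml = np.linalg.lstsq(Phi, t, rcond=None)[0]
    eigvals, V = np.linalg.eigh(H)             # H = V diag(eigvals) V^T
    u_ml = V.T @ w_ml                          # maximum-likelihood estimate in the new variables u
    denom = eigvals * u_ml ** 2 - 1.0
    alphas = np.full(len(eigvals), np.inf)     # h_mm u_ML,m^2 <= 1: the direction is removed
    keep = denom > 0
    alphas[keep] = eigvals[keep] / denom[keep]   # (3.15)
    return alphas, eigvals, V
```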

4. EXPERIMENTS AND DISCUSSION

4.1. Feature Selection

Akaike's information criterion is widely used for selecting regressors in the classical linear regression. However, this procedure is very costly because all possible subsets of regressors must be examined. The proposed method (EAIC) considerably facilitates this procedure because the discrete optimization procedure is reduced to a smooth optimization problem for which an efficient iterative procedure can be constructed.

Consider the toy regression problem with 49 features that have standard Gaussian distributions, 100 objects, and the target variable

$$t = x^2 + 3x^6 + 2x^{22} + \varepsilon,$$

where ε ~ 𝒩(ε|0, 0.5) and the superscripts denote feature indices. Performing the EAIC procedure with φ_j(x) = x^j, we obtain 15 relevant features with regularization coefficients α_j > 100; among the remaining coefficients, α_2 = 0.93, α_6 = 0.10, and α_22 = 0. The corresponding weights are shown in Fig. 2. One can see that only three weights are considerably greater than zero, and they are close to the actual values.
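This toy setup can be reproduced approximately as follows (an added sketch; the random seed and the reading of 0.5 as the noise variance are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)                 # arbitrary seed
n, d = 100, 49
X = rng.normal(size=(n, d))                     # 49 standard Gaussian features, 100 objects
eps = rng.normal(scale=np.sqrt(0.5), size=n)    # 0.5 read as the noise variance
t = X[:, 1] + 3 * X[:, 5] + 2 * X[:, 21] + eps  # features x^2, x^6, x^22 (1-based indexing)
Phi = X                                         # phi_j(x) = x^j: the features themselves form the basis
```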


4.2. The Function Sinc

The EAIC has the sparseness property, and it can be considered as an alternative to the RVM, in which the regularization coefficients are found using the maximum evidence principle:

$$\alpha = \arg\max_\alpha \int p(t|X, w)\, p(w|\alpha)\, dw.$$

The difference between the two methods can be seen using the following toy problem as an example. Consider the noisy function sin(x)/x with uniform noise on the interval [−0.2, 0.2] and 100 objects in the sample. Figure 3 shows the regressions obtained using the EAIC (Fig. 3a) and the RVM (Fig. 3b). The basis functions are

$$\varphi_j(x) = \exp\!\left(-\frac{\|x_j - x\|^2}{2\sigma^2}\right), \quad j = 1, 2, \dots, n,$$

where x_j are the objects of the training sample, and the width of the Gaussian is σ = 0.4. The actual dependence is shown by the solid line, the dotted lines correspond to the predicted values, and open dots depict the relevant objects. The RVM and EAIC detect six and four relevant objects, respectively. One can see from the plots that the regression obtained using the RVM is closer to zero, especially near the endpoints of the interval. This can be explained by the fact that all the regularization coefficients are distinct from zero; therefore, even the relevant basis functions are slightly regularized. On the other hand, the solution obtained using the EAIC is sparser. Two of the four basis functions have zero regularization coefficients. We see that, compared with the EAIC, the RVM is characterized by the underfitting effect (oversimplification of the model), which is often mentioned for this method (see [17]).

0

0 5

3.5

50

3.0

2.5

2.0

1.5

1.0

0.5

−0.510 15 20 25 30 35 40 45

Fig. 2.

Fig. 3.
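The data and the Gaussian basis for this experiment can be generated along the following lines (an added sketch; the input interval is my assumption, judged roughly from Fig. 3, while the noise range and σ are as stated in the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(-2.5, 2.5, size=n)                        # assumed input range
t = np.sinc(x / np.pi) + rng.uniform(-0.2, 0.2, size=n)   # sin(x)/x plus uniform noise on [-0.2, 0.2]

sigma = 0.4
# phi_j(x) = exp(-|x_j - x|^2 / (2 sigma^2)), one basis function per training object
Phi = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))
```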


4.3. Comparative Evaluation

We compared the methods RVM, EAIC, and linear ridge regression (LR) on eleven problems from the UCI repository (http://archive.ics.uci.edu/ml/) and from the Regression Toolbox by Heikki Hyotyniemi (http://www.control.hut.fi/Hyotyniemi/publications/01_report125/RegrToolbox).

The following parameters were used in all the methods. The number of basis functions is m = n + 1,

$$\varphi_j(x) = \exp\!\left(-\frac{\|x - x_j\|^2}{2\sigma^2}\right), \qquad \varphi_{n+1}(x) = 1.$$

The width parameter σ was adjusted using the 5 × 2-fold cross validation (see [18]). For each training sample, the root-mean-square deviation was also estimated using the 5 × 2-fold cross validation.

In the ridge regression, the regularization coefficients in all the problems were set to 10^{-6}. For the EAIC and RVM, the sparseness (the number of nonzero weights) was additionally calculated. Tables 1 and 2 present the experimental results. Figure 4 shows the number of relevant objects for the EAIC in different problems.
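The 5 × 2-fold protocol of [18] amounts to five random halvings of the data, training and testing both ways each time; a generic sketch is given below (the fit and predict callables, as well as the σ grid one would loop over, are placeholders, not part of the paper).

```python
import numpy as np

def rmse_5x2cv(fit, predict, X, t, n_rep=5, seed=0):
    """5 x 2-fold cross validation: five random halvings, each used in both directions."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_rep):
        perm = rng.permutation(len(t))
        half = len(t) // 2
        for train, test in ((perm[:half], perm[half:]), (perm[half:], perm[:half])):
            model = fit(X[train], t[train])
            pred = predict(model, X[test])
            errs.append(np.sqrt(np.mean((pred - t[test]) ** 2)))
    return np.mean(errs), np.std(errs)
```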

Table 1. The root-mean-square deviation for different algorithms

Problem            EAIC           RVM            LR
Auto-mpg           2.95 ± 0.06    2.93 ± 0.04    2.92 ± 0.03
Boston             3.78 ± 0.22    3.75 ± 0.19    3.86 ± 0.11
HeatExchange-1     7.90 ± 0.09    7.88 ± 0.10    9.16 ± 0.68
HeatExchange-2     8.75 ± 0.97    9.27 ± 1.07    8.35 ± 1.17
HeatExchange-3     0.80 ± 0.07    0.82 ± 0.04    0.84 ± 0.05
Pyrimidines        0.10 ± 0.01    0.10 ± 0.01    0.11 ± 0.01
Servo              0.91 ± 0.07    0.95 ± 0.06    0.90 ± 0.02
Triazines          0.16 ± 0.01    0.17 ± 0.00    0.17 ± 0.01
Wisconsin (wdbc)   25.80 ± 2.31   25.27 ± 1.60   29.18 ± 4.52
Autos              0.33 ± 0.06    0.33 ± 0.02    0.46 ± 0.04
Cpu-performance    0.36 ± 0.04    0.40 ± 0.04    0.48 ± 0.20
Rank               19.00          20.50          26.50

Table 2. Sparseness for different algorithms (the number of relevant objects)

Problem            EAIC            RVM             LR
Auto-mpg           7.10 ± 4.94     8.60 ± 2.99     199.00 ± 0.00
Boston             34.00 ± 5.24    24.50 ± 3.02    253.00 ± 0.00
HeatExchange-1     1.20 ± 0.45     5.60 ± 5.47     45.00 ± 0.00
HeatExchange-2     13.10 ± 10.17   10.10 ± 4.39    45.00 ± 0.00
HeatExchange-3     3.90 ± 2.86     2.60 ± 0.22     45.00 ± 0.00
Pyrimidines        10.10 ± 2.97    9.80 ± 5.53     37.00 ± 0.00
Servo              22.00 ± 11.80   15.80 ± 3.47    83.50 ± 0.00
Triazines          8.90 ± 5.48     31.30 ± 19.84   93.00 ± 0.00
Wisconsin (wdbc)   2.60 ± 1.39     8.30 ± 4.96     23.50 ± 0.00
Autos              36.30 ± 6.23    23.80 ± 3.27    100.50 ± 0.00
Cpu-performance    23.90 ± 0.89    26.70 ± 3.03    104.50 ± 0.00


The dark part of each column indicates the number of relevant objects that have zero regularization coefficients.

Fig. 4.

CONCLUSIONS

Note that in the EAIC, as well as in the RVM, the majority of the α_j tend to infinity, thus ensuring the sparseness of the resulting solution. Moreover, these methods often yield similar results. The main conclusion of this paper is that Akaike's information criterion can be used for automatic relevance determination interchangeably with Bayesian methods.

In distinction from the RVM, some regularization coefficients in the EAIC become identically zero. The approach based on the EAIC can be helpful in the problem of feature selection in linear regression, which is usually solved using Akaike's information criterion. Instead of using the computationally costly exhaustive search procedure in the discrete feature selection problem, a continuous smooth optimization problem can be solved using the EAIC.

The experimental results suggest the conclusion that Bayesian learning and the information approach have much in common, and it is possible that they are two indirect characteristics of the same phenomenon.

A direction of future research is the application of the EAIC to the classification problem. Here, a possible way is to reduce the classification problem to the regression problem [15].

ACKNOWLEDGMENTS

This work was supported by the Russian Foundation for Basic Research, project nos. 08-01-00405, 08-01-90016, 08-01-90427, and 07-01-00211.

We are grateful to V.V. Mottl’ for useful remarks and the discussion of this study.

REFERENCES

1. M. E. Tipping, "The Relevance Vector Machine," Advances Neural Inform. Processing Syst. 12, 652–658 (2000).
2. D. J. C. MacKay, "The Evidence Framework Applied to Classification Networks," Neural Comput. 4, 720–736 (1992).
3. R. Tibshirani, "Regression Shrinkage and Selection via the Lasso," J. Roy. Stat. Soc. 58, 267–288 (1996).
4. M. Figueiredo, "Adaptive Sparseness for Supervised Learning," IEEE Trans. Pattern Analys. Mach. Intell. 25, 1150–1159 (2003).



5. P. M. Williams, "Bayesian Regularization and Pruning Using a Laplace Prior," Neural Comput. 7, 117–143 (1995).
6. G. C. Cawley, N. L. C. Talbot, and M. Girolami, "Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation," Advances Neural Inform. Processing Syst. 19, 209–216 (2007).
7. G. Schwarz, "Estimating the Dimension of a Model," Ann. Statistics 6, 461–464 (1978).
8. C. M. Bishop, Pattern Recognition and Machine Learning (Springer, New York, 2006).
9. H. Akaike, "A New Look at the Statistical Model Identification," IEEE Trans. Autom. Control 19, 716–723 (1974).
10. A. N. Shiryaev, Probability (Nauka, Moscow, 1979) [in Russian].
11. A. A. Borovkov, Mathematical Statistics (Fizmatlit, Moscow, 2007) [in Russian].
12. R. A. Horn and C. R. Johnson, Matrix Analysis (Cambridge Univ. Press, Cambridge, 1985; Mir, Moscow, 1989).
13. D. Spiegelhalter, N. Best, B. Carlin, and A. van der Linde, "Bayesian Measures of Model Complexity and Fit," J. Roy. Statist. Soc. 64, 583–640 (2002).
14. A. C. Faul and M. E. Tipping, "Analysis of Sparse Bayesian Learning," Advances Neural Inform. Processing Syst. 14, 383–389 (2002).
15. M. E. Tipping, "Sparse Bayesian Learning and the Relevance Vector Machine," J. Mach. Learning Res. 1, 211–244 (2001).
16. D. A. Kropotov and D. P. Vetrov, "On One Method of Non-Diagonal Regularization in Sparse Bayesian Learning," Proc. 24th Int. Conf. on Machine Learning (Omnipress, Corvalis, 2007), pp. 457–464.
17. Y. Qi, T. Minka, R. Picard, and Z. Ghahramani, "Predictive Automatic Relevance Determination by Expectation Propagation," Proc. 21st Int. Conf. on Machine Learning (Omnipress, Banff, 2004), pp. 671–678.
18. T. G. Dietterich, "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms," Neural Comput. 10, 1895–1924 (1998).