
Enhanced ridge regressions



Mathematical and Computer Modelling 51 (2010) 338–348


Stan Lipovetsky
GfK Custom Research North America, 8401 Golden Valley Road, Minneapolis, MN 55427, United States

Article history: Received 22 October 2009; accepted 21 December 2009.

Keywords: Least squares objective; Modified ridge regressions; Multicollinearity; Stable solutions

Abstract

With a simple transformation, the ordinary least squares objective can yield a family of modified ridge regressions which outperforms the regular ridge model. These models have more stable coefficients and a higher quality of fit as the profile parameter grows. With an additional adjustment based on minimization of the residual variance, all the characteristics become even better: the coefficients of these regressions do not shrink to zero when the ridge parameter increases, the coefficient of multiple determination stays high, while bias and generalized cross-validation are low. In contrast to regular ridge regression, the modified ridge models yield robust solutions for various values of the ridge parameter, with interpretable coefficients and good quality characteristics.


1. Introduction

Ridge regression was originated by Hoerl [1–3] and Hoerl and Kennard [4,5] to overcome the effects of multicollinearity in linear models. Multicollinearity can make confidence intervals so wide that coefficients are incorrectly identified as insignificant, theoretically important variables receive negligible coefficients, or the coefficients have signs opposite to those of the corresponding pair relations, so it is hardly possible to identify the individual predictors' importance in the regression [6,7]. Ridge regression and its modifications have been developed in numerous works, for instance [8–16], and used in various applications, for example [17–20]. Among the further innovations, regularization methods based on the quadratic $L_2$-metric, the lasso $L_1$-metric, and other $L_p$-metrics and their combinations have been considered [21–28]. Most of the ridge model modifications use the least squares objective with different added penalizing and regularizing items to prevent inflation of the regression coefficients.

This paper can be considered as a further development of the techniques suggested in [29,30]. It shows that instead of a rather arbitrary insertion of a regularizing and penalizing item, it is possible to transform the ordinary least squares (OLS) objective itself so that it produces improved ridge solutions outperforming the regular ridge models. In contrast to the regular ridge (RR) model, the coefficients of the improved models and their quality of fit do not diminish to zero when the profile ridge parameter grows. This permits a high quality of fit and yields a multiple ridge model whose coefficients have the same signs as the pair correlations of the dependent variable with the predictors, which facilitates interpretability of the individual regressors in the model. A special further adjustment of the model improves its quality and diminishes the coefficients' bias and the generalized cross-validation measure of the residual error. Six modified ridge regressions are considered and compared with the regular ridge model. One of these variants corresponds to the technique constructed under different assumptions in [29,30], but all the other models are newly developed ones with better properties. The enhanced models surpass the regular ridge models, and the best of them is identified.

The paper is organized as follows. Section 2 describes the main features of ordinary least squares (OLS) and regular ridge (RR) regressions, and Section 3 introduces three enhanced ridge models. Section 4 considers a modification of each model by adjusting it to the maximum quality of fit, and Section 5 describes some other characteristics of the models' quality. Section 6 presents numerical results, and Section 7 summarizes.

2. OLS and RR models

OLS regression. Consider some properties of the ordinary least squares (OLS) model needed for further analysis. For standardized (centered and normalized by the standard deviation) variables, the model of multiple linear regression is:

$$ y_i = \beta_1 x_{i1} + \cdots + \beta_n x_{in} + \varepsilon_i, \qquad (1) $$

where $y_i$ and $x_{ij}$ are the $i$-th observations ($i = 1, \ldots, N$) of the dependent variable $y$ and of each $j$-th independent variable $x_j$ ($j = 1, \ldots, n$), $\beta_j$ are the theoretical beta-coefficients, and $\varepsilon_i$ are the deviations of the observed $y_i$ from the theoretical model. The least squares (LS) objective for estimation of the coefficients consists of minimizing the sum of squared deviations:

$$ S^2 = \sum_{i=1}^{N} \varepsilon_i^2 = \sum_{i=1}^{N} \left( y_i - \beta_1 x_{i1} - \cdots - \beta_n x_{in} \right)^2, \qquad (2) $$

or in matrix form:

$$ S^2 = \|\varepsilon\|^2 = (y - X\beta)'(y - X\beta) = 1 - 2\beta' r + \beta' C \beta, \qquad (3) $$

where $X$ is the $N \times n$ matrix of observations of the independent variables, $y$ is the $N$-th order vector-column of observations of the dependent variable, $\beta$ is the $n$-th order vector-column of the beta-coefficient estimates, and $\varepsilon$ is the vector of deviations. The prime in (3) denotes transposition, the variance of the standardized $y$ equals $y'y = 1$, and the notations $C$ and $r$ correspond to the correlation matrix $C = X'X$ among the $x$-s and the vector of correlations $r = X'y$ between the $x$-s and $y$.

The first-order condition $\partial S^2 / \partial \beta = 0$ of minimization of (3) by the vector of coefficients yields a normal system of equations with the corresponding solution:

$$ C\beta_{OLS} = r, \qquad \beta_{OLS} = C^{-1} r, \qquad (4) $$

where the vector $\beta_{OLS}$ denotes the vector of the OLS estimates defined via the inverse correlation matrix $C^{-1}$. The model quality is estimated by the residual sum of squares (3), or by the coefficient of multiple determination defined as:

$$ R^2 = 1 - S^2 = 2\beta' r - \beta' C \beta = \beta'(2r - C\beta). \qquad (5) $$

The minimum of the objective (3) corresponds to the equality $C\beta_{OLS} = r$ in (4), with which the coefficient of multiple determination for the OLS regression reaches its maximum and reduces to the following forms:

$$ R^2(\beta_{OLS}) = \beta_{OLS}' r = \beta_{OLS}' C \beta_{OLS}. \qquad (6) $$
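To make formulas (1)-(6) concrete, here is a minimal numpy sketch of OLS on standardized variables; the data, function names, and shapes are illustrative assumptions, not part of the original paper.

```python
import numpy as np

def standardize(a):
    """Center each column and scale so that a'a = 1 (the paper's standardization)."""
    a = a - a.mean(axis=0)
    return a / np.sqrt((a ** 2).sum(axis=0))

def ols_standardized(X_raw, y_raw):
    """OLS beta-coefficients and R^2 for standardized variables, eqs. (4)-(6)."""
    X = standardize(np.asarray(X_raw, dtype=float))
    y = standardize(np.asarray(y_raw, dtype=float).reshape(-1, 1)).ravel()
    C = X.T @ X                     # correlation matrix among the x-s
    r = X.T @ y                     # correlations of the x-s with y
    beta = np.linalg.solve(C, r)    # beta_OLS = C^{-1} r, eq. (4)
    R2 = beta @ r                   # R^2(beta_OLS) = beta' r, eq. (6)
    return C, r, beta, R2
```

The matrix C and vector r returned here are the building blocks used by all the ridge variants sketched below.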

If any $x$-s are highly correlated or multicollinear, the matrix $C$ in (4) becomes ill-conditioned, so its determinant is close to zero, and with the inverse matrix $C^{-1}$ the OLS solution in (4) has vastly inflated values of the coefficients of regression. These coefficients often have signs opposite to the matching pair correlations of the $x$-s with $y$. Such a regression can be used for prediction, but is worthless in the analysis and interpretation of the individual predictors' role in the model.

Regular ridge (RR) regression. A regular ridge model is usually constructed by adding to the LS objective (3) a penalizing function of the square norm of the vector of coefficients:

$$ S^2 = \|\varepsilon\|^2 + k\|\beta_{RR}\|^2 = 1 - 2\beta_{RR}' r + \beta_{RR}' C \beta_{RR} + k\beta_{RR}'\beta_{RR}, \qquad (7) $$

where $\beta_{RR}$ denotes the vector of the RR coefficient estimates, and $k$ is a so-called ''ridge profile'' positive parameter. Minimizing this objective by the vector $\beta_{RR}$ yields the following system of equations and its solution:

$$ (C + kI)\beta_{RR} = r, \qquad \beta_{RR} = (C + kI)^{-1} r, \qquad (8) $$

where $I$ is the identity matrix of $n$-th order. The solution $\beta_{RR}$ (8) exists even for a singular matrix of correlations $C$. For $k = 0$, the RR model (7)-(8) reduces to the OLS regression model (3)-(4). Multiplying Eq. (8) by $\beta_{RR}'$ yields the relation $\beta_{RR}' C \beta_{RR} + k\beta_{RR}'\beta_{RR} = \beta_{RR}' r$, and using it in the expression (5) shows that the coefficient of multiple determination for the RR solution can be represented in several forms:

$$ R^2(\beta_{RR}) = 2\beta_{RR}' r - \beta_{RR}' C \beta_{RR} = \beta_{RR}' r + k\beta_{RR}'\beta_{RR} = \beta_{RR}' C \beta_{RR} + 2k\beta_{RR}'\beta_{RR}. \qquad (9) $$

The last two expressions in (9) show that an equality of the type (6) does not hold in this case.

With the increase of the profile parameter $k$, the matrix $C + kI$ in (8) approaches the scalar matrix $kI$, so the inverted matrix reduces to $(C + kI)^{-1} \approx k^{-1} I$. Then the RR solution (8) and the coefficient of multiple determination (9) go asymptotically to the expressions:

$$ \beta_{RR} = k^{-1} r, \qquad R^2(\beta_{RR}) = 2k^{-1} r' r - k^{-2} r' C r. \qquad (10) $$

Thus, the RR solution becomes proportional to the pair correlations and keeps their signs, which is convenient for interpretation of the predictors. On the other hand, this solution quickly reaches zero as $k$ grows, and the quality of fit also reduces towards zero.
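As an illustration of (8)-(9), the following sketch (function name illustrative, assuming the correlation matrix C and vector r have already been computed, e.g. as in the OLS sketch above) evaluates the regular ridge solution and its fit for a given profile parameter k.

```python
import numpy as np

def ridge_rr(C, r, k):
    """Regular ridge solution (8) and its coefficient of multiple determination (9)."""
    n = C.shape[0]
    beta = np.linalg.solve(C + k * np.eye(n), r)   # (C + kI) beta = r
    R2 = 2 * beta @ r - beta @ C @ beta            # general expression (5) for R^2
    return beta, R2
```

For large k this beta shrinks approximately like r/k, matching the asymptote (10).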


3. Enhanced ridge regressions

Consider several straightforward transformations of the LS objective which lead to a family of ridge regressions with better properties than the regular ridge model.

Ridge Enhanced 1 (RE1). Regrouping and squaring the sum of the items in the LS objective (2) yields:

$$ S^2 = \sum_{i=1}^{N} \left( y_i - \beta_1 x_{i1} - \cdots - \beta_n x_{in} \right)^2 = \sum_{i=1}^{N} \left[ \left( \frac{1}{n} y_i - \beta_1 x_{i1} \right) + \cdots + \left( \frac{1}{n} y_i - \beta_n x_{in} \right) \right]^2 $$
$$ = \sum_{j=1}^{n} \sum_{i=1}^{N} \left( \frac{1}{n} y_i - \beta_j x_{ij} \right)^2 + 2 \sum_{j>k}^{n} \sum_{i=1}^{N} \left( \frac{1}{n} y_i - \beta_j x_{ij} \right) \left( \frac{1}{n} y_i - \beta_k x_{ik} \right). \qquad (11) $$

So the LS objective for multiple OLS regression can be represented as the total of the paired regressions of the $1/n$-th portion of $y$ by each $x_j$ separately, plus the cross-products of the residuals $y_i/n - \beta_j x_{ij}$ in each two pair-wise regressions by $x_j$ and $x_k$. Each $j$-th paired regression has a coefficient $\beta_j = r_{yj}/n$ equal to the pair correlation of the quotient $y/n$ with $x_j$. If we use a term $g$ with the second part of the objective (11),

$$ S^2 = \sum_{j=1}^{n} \sum_{i=1}^{N} \left( \frac{1}{n} y_i - \beta_j x_{ij} \right)^2 + g \cdot 2 \sum_{j>k}^{n} \sum_{i=1}^{N} \left( \frac{1}{n} y_i - \beta_j x_{ij} \right) \left( \frac{1}{n} y_i - \beta_k x_{ik} \right), \qquad (12) $$

then for $g = 0$ this objective reduces to the total of the paired regressions, for $g = 1$ it coincides with the LS objective (2), and for intermediate $g$ ranging from 0 to 1 it corresponds to the models between the pair-wise and multiple OLS regressions. The objective (12) can be represented as:

$$ S^2 = g \cdot \sum_{i=1}^{N} \left( y_i - \beta_1 x_{i1} - \cdots - \beta_n x_{in} \right)^2 + (1 - g) \cdot \sum_{j=1}^{n} \sum_{i=1}^{N} \left( \frac{1}{n} y_i - \beta_j x_{ij} \right)^2. \qquad (13) $$

Indeed, using (11) in (13) returns it to (12), so the last two expressions are identical, but the latter one is more convenient for derivations. Let us divide (13) by $g$ and use another parameter $k = (1 - g)/g$. For $g = 1$, or $k = 0$, the objective (13) coincides with the LS objective (2); and if $g$ diminishes to zero, or $k$ grows, the objective (13) reduces to the total of the pair-wise objectives.
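A quick numerical check of the equivalence between the regrouped form (12) and the blended form (13) may help; this sketch uses random illustrative data and an arbitrary beta and g, all of which are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 50, 4
X = rng.normal(size=(N, n))
y = rng.normal(size=N)
beta = rng.normal(size=n)
g = 0.3

# Objective (13): blend of the multiple-regression and pair-wise sums of squares.
multi = np.sum((y - X @ beta) ** 2)
pair = sum(np.sum((y / n - beta[j] * X[:, j]) ** 2) for j in range(n))
S2_13 = g * multi + (1 - g) * pair

# Objective (12): pair-wise part plus g times the doubled cross-products (j > k).
cross = sum(
    np.sum((y / n - beta[j] * X[:, j]) * (y / n - beta[k] * X[:, k]))
    for j in range(n) for k in range(j)
)
S2_12 = pair + g * 2 * cross

assert np.isclose(S2_13, S2_12)   # the two representations coincide
```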

Minimizing the objective (13) yields the system of equations:

$$ \frac{\partial S^2}{\partial \beta_j} = -2 \sum_{i=1}^{N} \left( y_i - \beta_1 x_{i1} - \cdots - \beta_n x_{in} \right) x_{ij} - 2k \sum_{i=1}^{N} \left( \frac{y_i}{n} - \beta_j x_{ij} \right) x_{ij} = 0. \qquad (14) $$

This system and its solution can be presented in matrix form as follows:

$$ (C + kI)\beta_{RE1} = \left( 1 + \frac{k}{n} \right) r, \qquad \beta_{RE1} = \left( 1 + \frac{k}{n} \right)(C + kI)^{-1} r = \left( 1 + \frac{k}{n} \right)\beta_{RR}, \qquad (15) $$

where $C = X'X$ and $r = X'y$ are the correlations among the $x$-s and between the $x$-s and $y$, $\beta_{RE1}$ denotes the vector of estimates of the enhanced ridge RE1 model in the approach (13)-(14), and it has been taken into account that the variance of a standardized $x_j$ equals $x_j' x_j = 1$. Using the solution (15) in (5) yields the coefficient of multiple determination for the RE1 model:

$$ R^2(\beta_{RE1}) = 2\left( 1 + \frac{k}{n} \right) r'(C + kI)^{-1} r - \left( 1 + \frac{k}{n} \right)^2 r'(C + kI)^{-2} C r. \qquad (16) $$

The results (15)-(16) are similar to those of the regular ridge (8), with one exception: a new term $1 + k/n$ enters into the $\beta_{RE1}$ solution, which is proportional to the regular ridge vector $\beta_{RR}$. It seems a minor modification of the regular ridge, especially for small $k$ with $1 + k/n$ close to one. However, this term leads to a very noticeable enhancement in the ridge regression results. For a large $k$, the inverted matrix is $(C + kI)^{-1} \approx k^{-1} I$, so the RE1 solution (15) and its quality of fit (16) go to the following non-zero asymptotes, respectively:

$$ \beta_{RE1} = \left( \frac{1}{k} + \frac{1}{n} \right) r \approx \frac{1}{n} r, \qquad R^2(\beta_{RE1}) = \frac{2}{n} r' r - \frac{1}{n^2} r' C r. \qquad (17) $$

Thus, in contrast to the regular ridge (10), the enhanced solution does not diminish to zero with increasing $k$ but reaches the stable levels (17). So it is possible to increase $k$ until all the coefficients of the multiple regression become interpretable, proportional to the pair correlations of the $x$-s with $y$, while keeping a high quality of fit.
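A hedged sketch of the RE1 estimator (15) follows; again the function name is illustrative and C, r are assumed to be the correlation matrix and vector defined earlier.

```python
import numpy as np

def ridge_re1(C, r, k):
    """Enhanced ridge RE1 solution (15): a (1 + k/n)-rescaled regular ridge vector."""
    n = C.shape[0]
    beta = (1 + k / n) * np.linalg.solve(C + k * np.eye(n), r)
    R2 = 2 * beta @ r - beta @ C @ beta    # R^2 from the general expression (5)
    return beta, R2
```

For large k the result tends to r/n, the non-zero asymptote (17).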

Ridge Enhanced 2 (RE2). Consider a more general partitioning of the items in the objective (11), when in place of the same shares $1/n$ some different fractions $p_j$ of $y$ are used with each $x_j$. Then in place of the objective (13), its generalization becomes:

$$ S^2 = \sum_{i=1}^{N} \left( y_i - \beta_1 x_{i1} - \cdots - \beta_n x_{in} \right)^2 + k \cdot \sum_{j=1}^{n} \sum_{i=1}^{N} \left( p_j y_i - \beta_j x_{ij} \right)^2, \qquad (18) $$

with the parameter $k$ as used in (14). Minimizing (18) by the unknown coefficients of regression $\beta_j$ and by the unknown fractions $p_j$ yields:

$$ \frac{\partial S^2}{\partial \beta_j} = -2 \sum_{i=1}^{N} \left( y_i - \beta_1 x_{i1} - \cdots - \beta_n x_{in} \right) x_{ij} - 2k \sum_{i=1}^{N} \left( p_j y_i - \beta_j x_{ij} \right) x_{ij} = 0, $$
$$ \frac{\partial S^2}{\partial p_j} = 2k \sum_{i=1}^{N} \left( p_j y_i - \beta_j x_{ij} \right) y_i = 0. \qquad (19) $$

This system can be represented in matrix form as:

$$ \begin{cases} -r + C\beta - k\,(\mathrm{diag}(r)\,p - \beta) = 0, \\ p - \mathrm{diag}(r)\,\beta = 0, \end{cases} \qquad (20) $$

where $\mathrm{diag}(r)$ is the diagonal matrix with the elements of the pair correlations $r_{yj}$ between $y$ and $x_j$, and $p$ is the vector of fractions $p_j$. Substituting $p$ from the second expression of (20) into the first one yields the following equation and its solution:

$$ \left( C + kI - k \cdot \mathrm{diag}(r^2) \right) \beta_{RE2} = r, \qquad \beta_{RE2} = \left( C + kI - k \cdot \mathrm{diag}(r^2) \right)^{-1} r, \qquad (21) $$

where $\beta_{RE2}$ denotes the vector of estimates of the enhanced ridge RE2 model with two items depending on the term $k$ in its matrix, and $\mathrm{diag}(r^2)$ is the diagonal matrix of the squared pair correlations of the $x$-s with $y$.

The obtained solution (21) is similar to the regular ridge (8); the only difference consists of using the diagonal matrix $I - \mathrm{diag}(r^2)$ in (21) in place of the identity matrix $I$. The coefficient of multiple determination $R^2(\beta_{RE2})$ for the solution (21) can be constructed similarly to (9), and various numerical runs show that the enhanced RE2 regression always outperforms the regular RR regression by this characteristic of fit quality. As in the RR model (10), with increasing $k$ the matrix in (21) reduces to the scalar matrix $k(I - \mathrm{diag}(r^2))$, so the inverted matrix becomes $k^{-1}\mathrm{diag}(1/(1 - r^2))$. Then it is easy to show that the RE2 solution (21) and its coefficient of multiple determination go to the asymptotic levels:

$$ \beta_{RE2} = \frac{1}{k} D r, \qquad R^2(\beta_{RE2}) = \frac{2}{k} r' D r - \frac{1}{k^2} r' D C D r, \qquad (22) $$

where the notation $D = \mathrm{diag}(1/(1 - r^2))$ is used. Similar to the RR behavior (10), RE2 (22) attains the signs of the pair correlations, but the coefficients and quality of fit diminish to zero with the growth of $k$.
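The RE2 estimator (21) can be sketched in a few lines; as before, the function name is illustrative and C, r are the assumed correlation matrix and vector.

```python
import numpy as np

def ridge_re2(C, r, k):
    """Enhanced ridge RE2 solution (21): I in the ridge term replaced by I - diag(r^2)."""
    n = C.shape[0]
    M = C + k * (np.eye(n) - np.diag(r ** 2))
    beta = np.linalg.solve(M, r)
    R2 = 2 * beta @ r - beta @ C @ beta
    return beta, R2
```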

Ridge Enhanced 3 (RE3). Another generalized partitioning of the items in the objective (11), with different fractions $p_j$ restricted by their total equal to one, can be expressed by the conditional objective:

$$ S^2 = \sum_{i=1}^{N} \left( y_i - \beta_1 x_{i1} - \cdots - \beta_n x_{in} \right)^2 + k \cdot \sum_{j=1}^{n} \sum_{i=1}^{N} \left( p_j y_i - \beta_j x_{ij} \right)^2 - 2\lambda \left( \sum_{j=1}^{n} p_j - 1 \right), \qquad (23) $$

where $\lambda$ is a Lagrange term. The objective is similar to (18) but contains the normalizing relation $\sum p_j = 1$. Minimization of (23) by $\beta_j$ produces the same first system of equations as in (19), and minimization by $p_j$ yields the second system of (19) with the additional item $\lambda$. The total system can be represented in matrix form as:

$$ \begin{cases} -r + C\beta - k\,(\mathrm{diag}(r)\,p - \beta) = 0, \\ k\,(p - \mathrm{diag}(r)\,\beta) - \lambda e = 0, \end{cases} \qquad (24) $$

where $e$ is the $n$-th order identity vector (vector of ones), and all the other notations are the same as in (20). Multiplying the second equation of (24) by $e'$ and taking into account the normalization $e'p = 1$ and the relation $e'e = n$, we obtain the term $\lambda = (1 - r'\beta)(k/n)$. Using the latter expression in (24) and substituting $p$ from the second of these equations into the first one yields the following system and its solution:

$$ \left( C + kI - k \cdot \mathrm{diag}(r^2) + \frac{k}{n} r r' \right) \beta_{RE3} = \left( 1 + \frac{k}{n} \right) r, \qquad (25a) $$
$$ \beta_{RE3} = \left( 1 + \frac{k}{n} \right) \left( C + kI - k \cdot \mathrm{diag}(r^2) + \frac{k}{n} r r' \right)^{-1} r. \qquad (25b) $$

This is the enhanced RE3 solution with three items containing the term $k$ in its matrix. In comparison with RE2 (21), the matrix in (25) has one additional item $(k/n) r r'$ proportional to the outer product $r r'$ of the correlation vector. But the term $1 + k/n$ in the solution (25b) makes its behavior more similar to that of the RE1 model (15) than to the RE2 model (21). Indeed, with increasing $k$, the item $C$ in (25) becomes negligible in comparison with the other items, so inversion of the remaining part with the help of the known Sherman–Morrison formula (see [31]) can be reduced to:

$$ \frac{1}{k} \left( \left( I - \mathrm{diag}(r^2) \right) + \left( \frac{r}{\sqrt{n}} \right)\left( \frac{r}{\sqrt{n}} \right)' \right)^{-1} = \frac{1}{k} \left( D - \frac{D r r' D}{n + r' D r} \right), \qquad (26) $$

where $D = \mathrm{diag}(1/(1 - r^2))$ denotes the same diagonal matrix as is used in (22). Using this inverted matrix in (25b) for large $k$ leads to the following solution and the corresponding coefficient of multiple determination:

$$ \beta_{RE3} = \left( \frac{1}{k} + \frac{1}{n} \right) \left( D r - \frac{r' D r}{n + r' D r}\, D r \right) \approx \frac{1}{n + r' D r}\, D r, \qquad (27a) $$
$$ R^2(\beta_{RE3}) = \frac{2\, r' D r}{n + r' D r} - \frac{r' D C D r}{(n + r' D r)^2}. \qquad (27b) $$

So in contrast to the RR (10) and RE2 (22) asymptotic behaviors, but similar to the RE1 model (17), the enhanced RE3 solution and quality of fit converge to the asymptotic levels (27), independent of the ridge parameter $k$. This means that, without losing much quality of fit when $k$ increases, the RE3 model can produce interpretable coefficients of multiple regression with the signs of the pair correlations of the $x$-s with $y$.
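For completeness, here is a hedged sketch of the RE3 estimator (25b), with the same assumed C, r inputs; the outer-product term (k/n) r r' is the only addition relative to the RE2 sketch.

```python
import numpy as np

def ridge_re3(C, r, k):
    """Enhanced ridge RE3 solution (25b), with the extra (k/n) r r' term in the matrix."""
    n = C.shape[0]
    M = C + k * (np.eye(n) - np.diag(r ** 2)) + (k / n) * np.outer(r, r)
    beta = (1 + k / n) * np.linalg.solve(M, r)
    R2 = 2 * beta @ r - beta @ C @ beta
    return beta, R2
```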

4. Adjustment to the best fit

The next step in constructing the enhanced ridge models consists of the following procedure. For any of the ridge solutions $\beta$ (this could be the regular ridge RR (8), or the ridge estimates RE1 (15), RE2 (21), and RE3 (25)) it is possible to improve it by adjusting to the maximum possible quality of fit, which is estimated by the residual sum of squares $S^2$ (proportional to the residual variance), or by the convenient characteristic of the coefficient of multiple determination $R^2 = 1 - S^2$ (5), which ranges from zero to one for the worst models to the best ones, respectively. For any given solution $\beta$, consider a proportionally modified (adjusted) vector:

$$ \beta_{adj} = q\beta, \qquad (28) $$

and substitute it into the general expression for the coefficient of multiple determination (5):

$$ R^2(\beta_{adj}) = 2q\, r'\beta - q^2 \beta' C \beta. \qquad (29) $$

This is a concave quadratic function of the unknown parameter $q$, and it reaches its maximum at the point between its roots, at the value:

$$ q = (r'\beta)/(\beta' C \beta). \qquad (30) $$

Using (30) in (28)-(29) yields the adjusted solution and the maximum of the coefficient of multiple determination which can be attained with such an adjusted solution, as follows:

$$ \beta_{adj} = \frac{r'\beta}{\beta' C \beta} \cdot \beta, \qquad R^2(\beta_{adj}) = \frac{(r'\beta)^2}{\beta' C \beta}. \qquad (31) $$

This new adjusted solution $\beta_{adj}$ is easy to find, and it produces the maximum fit for a given vector $\beta$ at any value of the ridge parameter $k$. The coefficient of multiple determination in (31) can be presented in the two following equivalent forms:

$$ R^2(\beta_{adj}) = \frac{r'\beta}{\beta' C \beta}\,(r'\beta) = q\,(r'\beta) = r'\beta_{adj}, \qquad (32a) $$
$$ R^2(\beta_{adj}) = \left( \frac{r'\beta}{\beta' C \beta} \right)^2 (\beta' C \beta) = q^2 (\beta' C \beta) = \beta_{adj}' C \beta_{adj}. \qquad (32b) $$

This interesting result shows that the equality $R^2(\beta_{adj}) = r'\beta_{adj} = \beta_{adj}' C \beta_{adj}$ holds for any adjusted solution, similar to the OLS model (6). Let us consider explicitly the adjusted solutions for the considered ridge models.
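The adjustment (28)-(31) is a one-line rescaling; a hedged sketch (names illustrative) that can be applied to any of the ridge solutions above:

```python
import numpy as np

def adjust(beta, C, r):
    """Proportional adjustment (28)-(31): rescale a given solution to its best attainable fit."""
    q = (r @ beta) / (beta @ C @ beta)            # optimal multiplier (30)
    beta_adj = q * beta                           # adjusted solution (31)
    R2_adj = (r @ beta) ** 2 / (beta @ C @ beta)  # maximum attainable R^2 (31)
    return beta_adj, R2_adj
```

For instance, adjust(ridge_rr(C, r, k)[0], C, r) would give the RR.adj model discussed next (assuming the earlier sketches).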

Regular Ridge adjusted (RR.adj). For the regular ridge solution (8), the coefficient of adjustment (30) is:

$$ q_{RR} = \frac{r'\beta_{RR}}{\beta_{RR}' C \beta_{RR}} = \frac{r'(C + kI)^{-1} r}{r'(C + kI)^{-2} C r}, \qquad (33) $$

because the matrices $C$ and $(C + kI)^{-1}$ are commutative. The adjusted solution is defined as:

$$ \beta_{RR.adj} = \frac{r'(C + kI)^{-1} r}{r'(C + kI)^{-2} C r}\,(C + kI)^{-1} r, \qquad (34) $$

and with it, the coefficient of multiple determination can be found by (31) or (32). In the limit of large $k$, as was considered for (10), the matrix inversion is $(C + kI)^{-1} \approx k^{-1} I$, so the coefficient (33) becomes proportional to the ridge parameter:

$$ q_{RR} = k\,\frac{r'r}{r'Cr}. \qquad (35) $$

Then the adjusted solution converges to the asymptote:

$$ \beta_{RR.adj} = k\,\frac{r'r}{r'Cr} \cdot k^{-1} r = \frac{r'r}{r'Cr} \cdot r, \qquad (36) $$

which does not depend on $k$, so it does not reduce to zero as the RR solution (10) does. The coefficient of multiple determination and its simplification with large $k$ are as follows:

$$ R^2(\beta_{RR.adj}) = \frac{\left( r'(C + kI)^{-1} r \right)^2}{r'(C + kI)^{-2} C r} \approx \frac{(r'r)^2}{r'Cr}, \qquad (37) $$

so it approaches a constant independent of $k$, and the quality of fit does not decrease steeply as in the case of the regular ridge model (10). The results (33)-(37) for the ridge regression were obtained in [29,30].

Ridge Enhanced 1 adjusted (RE1.adj). For the ridge enhanced solution RE1 (15), the coefficient of adjustment (30) equals:

$$ q_{RE1} = \frac{r'\beta_{RE1}}{\beta_{RE1}' C \beta_{RE1}} = \left( \frac{1}{1 + k/n} \right) \cdot \frac{r'(C + kI)^{-1} r}{r'(C + kI)^{-2} C r} = \frac{q_{RR}}{1 + k/n}, \qquad (38) $$

which is proportional to the coefficient (33). The adjusted solution (15) becomes:

$$ \beta_{RE1.adj} = q_{RE1} \left( 1 + \frac{k}{n} \right) \beta_{RR} = q_{RR}\,\beta_{RR} = \beta_{RR.adj}, \qquad (39) $$

so it coincides with the solution (34). Then the asymptotic behavior of the RE1.adj solution agrees with that of RR.adj (36), and also $R^2(\beta_{RE1.adj}) = R^2(\beta_{RR.adj})$ as given in (37). Thus, with the help of the adjustment, the behavior of the regular ridge (10) is significantly improved and made stable, similarly to the enhanced solution (17); and both models RR.adj and RE1.adj have an even higher level of fit due to the attained maximum coefficient of multiple determination over a large span of increasing $k$.

Ridge Enhanced 2 adjusted (RE2.adj). The behavior of the next enhanced ridge model RE2 (21), with its undesirable features (22), can also be drastically recovered by the adjustment. Indeed, the coefficient of adjustment (30) for this model is:

$$ q_{RE2} = \frac{r'\beta_{RE2}}{\beta_{RE2}' C \beta_{RE2}} = \frac{r'\left( C + k(I - \mathrm{diag}(r^2)) \right)^{-1} r}{r'\left( C + k(I - \mathrm{diag}(r^2)) \right)^{-1} C \left( C + k(I - \mathrm{diag}(r^2)) \right)^{-1} r}. \qquad (40) $$

With the coefficient (40) the adjusted solution becomes:

$$ \beta_{RE2.adj} = \frac{\left( r'\left( C + k(I - \mathrm{diag}(r^2)) \right)^{-1} r \right) \cdot \left( C + k(I - \mathrm{diag}(r^2)) \right)^{-1}}{r'\left( C + k(I - \mathrm{diag}(r^2)) \right)^{-1} C \left( C + k(I - \mathrm{diag}(r^2)) \right)^{-1} r} \cdot r, \qquad (41) $$

and the coefficient of multiple determination can be found by (31)-(32). For large $k$, as was considered for (22), the inverted matrix equals $k^{-1}\mathrm{diag}(1/(1 - r^2))$, so the coefficient (40) is proportional to the ridge parameter:

$$ q_{RE2} = k\,\frac{r'Dr}{r'DCDr}, \qquad (42) $$

with the diagonal matrix $D = \mathrm{diag}(1/(1 - r^2))$ used in (22). Then the adjusted solution converges to the asymptote:

$$ \beta_{RE2.adj} = k\,\frac{r'Dr}{r'DCDr} \cdot k^{-1} D r = \frac{r'Dr}{r'DCDr} \cdot D r. \qquad (43) $$

The coefficient of multiple determination and its simplification with large $k$ are as follows:

$$ R^2(\beta_{RE2.adj}) = \frac{\left( r'\left( C + k(I - \mathrm{diag}(r^2)) \right)^{-1} r \right)^2}{r'\left( C + k(I - \mathrm{diag}(r^2)) \right)^{-1} C \left( C + k(I - \mathrm{diag}(r^2)) \right)^{-1} r} \approx \frac{(r'Dr)^2}{r'DCDr}. \qquad (44) $$

In striking contrast to (22), the adjusted solution (43) and the quality of fit (44) do not depend on $k$ for its large values.

Ridge Enhanced 3 adjusted (RE3.adj). The enhanced ridge RE3 model (25) already has good features (27), but its quality of fit can nevertheless be amplified by the adjustment procedure. The coefficient of adjustment (30) for this model is:

$$ q_{RE3} = \frac{r'\left( C + k\left( I - \mathrm{diag}(r^2) + \frac{r r'}{n} \right) \right)^{-1} r}{\left( 1 + \frac{k}{n} \right) r'\left( C + k\left( I - \mathrm{diag}(r^2) + \frac{r r'}{n} \right) \right)^{-1} C \left( C + k\left( I - \mathrm{diag}(r^2) + \frac{r r'}{n} \right) \right)^{-1} r}. \qquad (45) $$

The solution (25b) adjusted by the coefficient (45) is:

$$ \beta_{RE3.adj} = \frac{\left( r'\left( C + k\left( I - \mathrm{diag}(r^2) + \frac{r r'}{n} \right) \right)^{-1} r \right) \cdot \left( C + k\left( I - \mathrm{diag}(r^2) + \frac{r r'}{n} \right) \right)^{-1}}{r'\left( C + k\left( I - \mathrm{diag}(r^2) + \frac{r r'}{n} \right) \right)^{-1} C \left( C + k\left( I - \mathrm{diag}(r^2) + \frac{r r'}{n} \right) \right)^{-1} r} \cdot r, \qquad (46) $$

and again the corresponding coefficient of multiple determination can be estimated by (31) or (32). With large $k$, the matrix inversion in (45)-(46) can be taken by the formula (26), so the coefficient (45) reduces to the following constant:

$$ q_{RE3} = \frac{(r'Dr)(n + r'Dr)}{r'DCDr}, \qquad (47) $$

with the same diagonal matrix $D$ as in (22) and (42). The adjusted solution in the limit of large $k$ converges to:

$$ \beta_{RE3.adj} = \left( 1 + \frac{k}{n} \right) \frac{(r'Dr)(n + r'Dr)}{r'DCDr} \cdot k^{-1} \frac{n}{n + r'Dr}\, D r \approx \frac{r'Dr}{r'DCDr} \cdot D r, \qquad (48) $$

which is the same solution (43) as for the previous model. So for growing $k$, the coefficient of multiple determination eventually reduces to the last expression in (44).

Summing up, the family of seven ridge-regression solutions includes: the regular ridge RR (8); the ridge enhanced models RE1 (15), RE2 (21), and RE3 (25); and their adjusted versions: the coinciding models RR.adj (34) and RE1.adj (39) (denoted further as one solution, RR&RE1.adj), then RE2.adj (41) and RE3.adj (46). Besides the classical RR and the previously considered RR&RE1.adj, all the other models are newly developed.

5. Several other characteristics of ridge models

Besides the above-considered coefficients of the ridge regressions and their multiple determination, let us describe several other characteristics of the obtained models. For this aim, the notation $A$ will be used for the matrix operator in the transformation $\tilde\beta = A r$ of the vector of pair correlations $r$ into the vector of beta-coefficients $\tilde\beta$, where the tilde denotes estimates obtained by any of the considered ridge techniques. For instance, in the RR model (8) this transformation is fulfilled by the matrix $A = (C + kI)^{-1}$ for the vector $\tilde\beta = \beta_{RR}$, in RE1 (15) by the matrix $A = (1 + k/n)(C + kI)^{-1}$ for the vector $\tilde\beta = \beta_{RE1}$, etc., and in RE3.adj by the whole complicated structure shown in (46) for the vector $\tilde\beta = \beta_{RE3.adj}$.

The effective number of parameters for a ridge model is defined as:

$$ n^* = \mathrm{Tr}(AC), \qquad (49) $$

where $\mathrm{Tr}$ denotes the trace of a matrix, or the total of its diagonal elements. For $k = 0$ the value (49) coincides with the total number $n$ of the predictors, and with a larger $k$ the value $n^*$ diminishes. The residual error variance equals:

$$ S^2_{res} = (1 - R^2)/(N - n^* - 1), \qquad (50) $$

where $R^2$ is the coefficient of multiple determination in any of the above-considered models.

The mathematical expectation of an estimated vector of parameters is:

$$ M(\tilde\beta) = A \cdot M(r) = A X' M(y) = A X' M(X\beta + \varepsilon) = A X' X \beta = A C \beta, \qquad (51) $$

with the errors' expectation $M(\varepsilon) = 0$, where $\beta$ denotes the theoretical vector of coefficients. If the matrix product $AC$ does not equal the identity matrix $I$, then a ridge solution is biased. Actually, any ridge solution is biased, and only when $k = 0$ does it reduce to the unbiased OLS solution. A convenient measure of the bias can be built as the squared norm of the difference between the matrix in (51) and the identity matrix, divided by the number of parameters:

$$ \mathrm{Bias}(\tilde\beta) = \frac{1}{n} \left\| AC - I \right\|^2. \qquad (52) $$

The lower the total bias is, the closer the measure (52) is to zero; for instance, in the OLS solution it equals exactly zero. The covariance matrix of the parameters' estimates can be written as:

$$ \mathrm{cov}(\tilde\beta) = S^2_{res}\, A X' X A, \qquad (53) $$

with the residual error variance estimated by (50). The trace of this matrix (53) can be used as the efficiency, or the total variance of the estimated coefficients:

$$ \mathrm{Var}(\tilde\beta) = S^2_{res}\, \mathrm{Tr}(A X' X A) = S^2_{res}\, \mathrm{Tr}(A^2 C). \qquad (54) $$


[Fig. 1. Ridge profile of quality characteristics: (a) coefficient of multiple determination; (b) bias of estimates; (c) efficiency of estimates; (d) generalized cross-validation.]

Another measure of the residual variance is the generalized cross-validation criterion well known in regression modeling [32–34], which can be presented as follows:

$$ GCV = N\,\frac{\|(I - H)y\|^2}{\left( \mathrm{Tr}(I - H) \right)^2} = N\,\frac{1 - R^2}{\left( N - \mathrm{Tr}(AC) \right)^2}, \qquad (55) $$

where the hat-matrix $H$ corresponds to the projection of the empirical vector of the dependent variable onto the theoretical one, $Hy = X\tilde\beta = X A X' y$. In (55) it is also taken into account that the squared norm of the residuals $y - Hy$ can be expressed via the coefficient of multiple determination for each type of ridge model, and that the trace $\mathrm{Tr}(X A X') = \mathrm{Tr}(A X' X)$.
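Under the assumption that A is the operator of a chosen ridge model (for example, A = (C + kI)^{-1} for RR), a short sketch of the quality characteristics (49), (52), and (55) might look as follows; names are illustrative.

```python
import numpy as np

def ridge_quality(A, C, r, N):
    """Quality characteristics of a ridge operator beta = A r: R^2, n* (49), bias (52), GCV (55)."""
    n = C.shape[0]
    beta = A @ r
    R2 = 2 * beta @ r - beta @ C @ beta             # coefficient of multiple determination (5)
    n_eff = np.trace(A @ C)                         # effective number of parameters (49)
    bias = np.sum((A @ C - np.eye(n)) ** 2) / n     # bias measure (52), squared Frobenius norm
    gcv = N * (1 - R2) / (N - n_eff) ** 2           # generalized cross-validation (55)
    return R2, n_eff, bias, gcv
```

Tracing these values over a grid of k for each model's operator A reproduces the kind of profile curves summarized in Fig. 1.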

6. Numerical example

Consider a numerical example with the above-described ridge models traced by the profile parameter $k$. The data present various cars' characteristics given in [35], also available in [36] (as the ''car.all'' data). The data describe the dimensions and mechanical specifications of 105 cars, supplied by the manufacturers and measured by Consumer Reports. The variables are: $y$, price of a car (US$ K); $x_1$, weight (pounds); $x_2$, overall length (inches); $x_3$, wheel base length (inches); $x_4$, width (inches); $x_5$, maximum front leg room (inches); $x_6$, front shoulder room (inches); $x_7$, turning circle radius (feet); $x_8$, engine displacement (cubic inches); $x_9$, net horsepower (HP); $x_{10}$, fuel tank refill capacity (gallons). The cars' price is estimated in the regression model by the dimension and specification variables, which can help to find better design solutions.


Table 1
Correlations, OLS and ridge regressions.

k = 2      r_yx     RR       RE1      RE2      RE3      RR&RE1.adj  RE2.adj  RE3.adj
x1         0.653    0.085    0.102    0.106    0.114    0.125       0.115    0.125
x2         0.533    0.056    0.067    0.057    0.061    0.083       0.062    0.067
x3         0.496    0.035    0.042    0.012    0.013    0.051       0.013    0.015
x4         0.478    0.025    0.030   −0.001   −0.003    0.037      −0.003   −0.001
x5         0.567    0.124    0.149    0.160    0.172    0.183       0.174    0.189
x6         0.371    0.003    0.004   −0.010   −0.015    0.005      −0.016   −0.020
x7         0.378    0.002    0.002   −0.030   −0.031    0.002      −0.031   −0.030
x8         0.642    0.083    0.099    0.085    0.092    0.122       0.093    0.101
x9         0.783    0.160    0.192    0.333    0.359    0.236       0.363    0.394
x10        0.657    0.092    0.110    0.114    0.123    0.135       0.125    0.135
R2                  0.562    0.605    0.667    0.679    0.627       0.680    0.686
R2/R2_OLS           0.778    0.839    0.923    0.940    0.869       0.942    0.951

k = 6      OLS      RR       RE1      RE2      RE3      RR&RE1.adj  RE2.adj  RE3.adj
x1         0.278    0.053    0.085    0.081    0.102    0.112       0.105    0.116
x2         0.225    0.039    0.063    0.043    0.054    0.082       0.056    0.062
x3        −0.085    0.032    0.052    0.026    0.033    0.068       0.034    0.038
x4        −0.144    0.029    0.046    0.018    0.023    0.060       0.024    0.026
x5         0.245    0.063    0.101    0.099    0.125    0.132       0.129    0.143
x6        −0.060    0.017    0.027    0.005    0.006    0.036       0.006    0.007
x7        −0.199    0.017    0.028    0.002    0.003    0.037       0.003    0.003
x8         0.101    0.053    0.084    0.075    0.094    0.111       0.097    0.107
x9         0.409    0.082    0.131    0.224    0.284    0.171       0.293    0.323
x10        0.160    0.056    0.090    0.088    0.112    0.118       0.116    0.128
R2         0.722    0.409    0.532    0.579    0.633    0.564       0.637    0.645
R2/R2_OLS           0.567    0.737    0.802    0.877    0.781       0.883    0.894

Fig. 1 shows several main characteristics of the regression quality traced by the parameter $k$ for all seven considered models: the regular ridge RR (8); the enhanced models RE1 (15), RE2 (21), RE3 (25); and the adjusted models RR&RE1.adj (coinciding RR.adj (34) and RE1.adj (39)), RE2.adj (41), and RE3.adj (46). Fig. 1 consists of: (a) the coefficient of multiple determination $R^2$ for each model; (b) the bias of the estimates (52); (c) the efficiency of the estimates (54) (the logarithm of variance is shown); and (d) the generalized cross-validation (55). All curves start at the OLS solution (4) corresponding to $k = 0$ in the ridge models. We see in panel (a) that the regular ridge RR has the worst $R^2$ behavior, while the enhanced models are better, and the adjusted models have the most stable $R^2$ values when $k$ increases. The enhanced RE3, and especially its adjusted version RE3.adj, are the best of all the models. The next panels (b), (c), and (d) of Fig. 1 support this conclusion, showing that all the other models perform between the RR and RE3.adj ridge models. For this reason, the models RR and RE3.adj are chosen for presenting all ten beta-coefficients in Fig. 2. As discussed above, the RR solution reaches the zero level (10) for all the estimates, while the RE3.adj solution (48) reaches stable, constant levels. All the other coefficients behave within the range of these two models. Several coefficients (for beta 3, 4, 6, and 7) are negative at the origin corresponding to the OLS regression, but with increasing $k$ they become positive, as the pair correlations are. So we can always find a solution where all the coefficients are interpretable and the quality of fit is high.

[Fig. 2. Ridge profile for the RR and RE3.adj solutions.]

Table 1 presents more results. In its first numerical column there are the vector of pair correlations $r_{yx}$ of $y$ with the $x$-s and, below it, the vector of beta-coefficients of OLS (4). All correlations are positive, but because of multicollinearity the variables $x_3$, $x_4$, $x_6$, and $x_7$ have negative coefficients in the multiple OLS regression, although it has a good coefficient of multiple determination $R^2_{OLS} = 0.722$. The next seven columns of Table 1 present in their upper part all the considered ridge models for the parameter $k = 2$, and below them the same models with the value $k = 6$. Below each model, its coefficient of multiple determination is shown, together with the quotient of this coefficient to its value for the OLS model, $R^2/R^2_{OLS}$. The regular ridge RR in the upper part of Table 1 has all positive coefficients and $R^2 = 0.562$; the enhanced models outperform it, and the adjusted models have the best quality of fit. All the ridge models in the lower part of Table 1 have positive coefficients and a high quality of fit. The values of $R^2/R^2_{OLS}$ for the two regular ridge RR models are 56.7% and 77.8%, while for the best RE3 adjusted model these values are 89.4% and 95.1%. The RE3 adjusted model systematically demonstrates the best characteristics and suggests interpretable coefficients of regression. This data set had also been used for comparison across several other regularization techniques in [26], and the current results are among the best regressions. The discussed results are very typical and have been observed with different data sets.

7. Summary

A modified least squares objective is used to produce a family of new ridge regressions with enhanced properties. These models are additionally adjusted to attain the best possible quality of fit. Together with the regular ridge regression, six newly developed models are described and compared. Each of them outperforms the regular ridge model: in contrast to it, the enhanced and adjusted ridge solutions have a stabilized profile behavior and a better quality of fit. The enhanced models are less biased, remain efficient, and encompass other helpful features of ridge regression. The results of the enhanced ridge models are stable and easily interpretable. Judging by the theoretical features and numerical validation, the best of the enhanced models is the RE3 adjusted ridge regression. The suggested approach is useful for theoretical consideration and practical applications of regression modeling and analysis.

References

[1] A.E. Hoerl, Optimal solution of many variables equations, Chemical Engineering Progress 55 (1959) 69–78.
[2] A.E. Hoerl, Application of ridge analysis to regression problems, Chemical Engineering Progress 58 (1962) 54–59.
[3] A.E. Hoerl, Ridge analysis, Chemical Engineering Progress Symposium Series 60 (1964) 69–78.
[4] A.E. Hoerl, R.W. Kennard, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12 (1970) 55–67 (reprinted in 42 (2000) 80–86).
[5] A.E. Hoerl, R.W. Kennard, Ridge regression: Iterative estimation of the biasing parameter, Communications in Statistics, Part A 5 (1976) 77–88.
[6] A. Grapentine, Managing multicollinearity, Marketing Research 9 (1997) 11–21.
[7] C.H. Mason, W.D. Perreault, Collinearity, power, and interpretation of multiple regression analysis, Journal of Marketing Research 28 (1991) 268–280.
[8] M. Aldrin, Multivariate prediction using softly shrunk reduced-rank regression, The American Statistician 54 (2000) 29–34.
[9] P.J. Brown, Measurement, Regression and Calibration, Oxford University Press, Oxford, 1994.
[10] G. Casella, Condition numbers and minimax ridge regression estimators, Journal of the American Statistical Association 80 (1985) 753–758.
[11] G. Diderrich, The Kalman filter from the perspective of Goldberger–Theil estimators, The American Statistician 39 (1985) 193–198.
[12] N.R. Draper, A.M. Herzberg, A ridge-regression sidelight, The American Statistician 41 (1987) 282–283.
[13] R.W. Hoerl, Ridge analysis 25 years later, The American Statistician 39 (1985) 186–192.
[14] D.R. Jensen, D.E. Ramirez, Anomalies in the foundations of ridge regression, International Statistical Review 76 (2008) 89–105.
[15] D.W. Marquardt, R.D. Snee, Ridge regression in practice, The American Statistician 29 (1975) 3–20.
[16] Y. Maruyama, W.E. Strawderman, A new class of generalized Bayes minimax ridge regression estimators, The Annals of Statistics 33 (2005) 1753–1770.
[17] G.M. Erickson, Using ridge regression to estimate directly lagged effects in marketing, Journal of the American Statistical Association 76 (1981) 766–773.
[18] E.C. Malthouse, Ridge regression and direct marketing scoring models, Journal of Interactive Marketing 13 (1999) 10–23.
[19] K.B. Newman, J. Rice, Modeling the survival of Chinook salmon smolts outmigrating through the lower Sacramento river system, Journal of the American Statistical Association 97 (2002) 983–993.
[20] E. Vago, S. Kemeny, Logistic ridge regression for clinical data analysis (a case study), Applied Ecology and Environmental Research 4 (2006) 171–179.
[21] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, The Annals of Statistics 32 (2004) 407–489.
[22] C. Fraley, T. Hesterberg, Least angle regression and LASSO for large datasets, Statistical Analysis and Data Mining 1 (2009) 251–259.
[23] G.M. James, P. Radchenko, J. Lv, DASSO: Connections between the Dantzig selector and lasso, Journal of the Royal Statistical Society, Series B 71 (2009) 127–142.
[24] S. Lipovetsky, Optimal Lp-metric for minimizing powered deviations in regression, Journal of Modern Applied Statistical Methods 6 (2007) 219–227.
[25] S. Lipovetsky, Equidistant regression modeling, Model Assisted Statistics and Applications 2 (2007) 71–80.
[26] S. Lipovetsky, Linear regression with special coefficient features attained via parameterization in exponential, logistic, and multinomial-logit forms, Mathematical and Computer Modelling 49 (2009) 1427–1435.
[27] L. Meier, S. van de Geer, P. Buhlmann, The group lasso for logistic regression, Journal of the Royal Statistical Society, Series B 70 (2008) 53–71.
[28] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Series B 58 (1996) 267–288.
[29] S. Lipovetsky, M. Conklin, Ridge regression in two parameter solution, Applied Stochastic Models in Business and Industry 21 (2005) 525–540.
[30] S. Lipovetsky, Two-parameter ridge regression and its convergence to the eventual pairwise model, Mathematical and Computer Modelling 44 (2006) 304–318.
[31] W.H. Press, S.A. Teukolsky, W.T. Wetterling, B.P. Flannery, Numerical Recipes: The Art of Scientific Computing, 3rd ed., Cambridge University Press, New York, 2007.
[32] P. Craven, G. Wahba, Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation, Numerical Mathematics 31 (1979) 317–403.
[33] G.H. Golub, M. Heath, G. Wahba, Generalized cross-validation as a method for choosing a good ridge parameter, Technometrics 21 (1979) 215–223.
[34] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, 2001.
[35] J.M. Chambers, T.J. Hastie, Statistical Models in S, Wadsworth and Brooks, Pacific Grove, CA, 1992.
[36] S-PLUS'2000, MathSoft, Seattle, WA, 1999.