View
3
Download
0
Category
Preview:
Citation preview
AD-A168 349 ON THE USE OF RIDGE AND STEIN-TYPE ESTIMATORS IN i/iPREDICTION(U) STANFORD UNIV CR DEPT OF STATISTICSA E GELFAND 21 MAY 86 TR-374 N@8814-86-K-@i56
UNCLASSIFIED F/G 12/1
1- 128 112
MtWGP __SLU -E
SO NON THE USE OF RIDGE AND STEIN-TYPE ESTIMATORS IN PREDICTION
BY
0)ALAN E. GELFAIID
(0 TECHNICAL REPORT N0. 374
SMAY 21, 1986
a PREPARED UNDER CONTRACT
N00014-86-K-0156 (NR-042-267)
FOR THE OFFICE OF NAVAL RESEARCH
Reproduction In Whole or In Part is Permitted
for any purpose of the United States Government
Approved for public release; distribution unlimited.
DEPARTMENT OF STATISTICS DTICSTANFORD UNIVERSITY ELECTE
*C.STANFORD, CALIFORNIA JUN9 W
*
99
4 86 6 9 07
i.4
* :-, ON THE USE OF RIDGE AND STEIN-TYPE ESTIMATORS IN PREDICTION
BY
ALAN E. GELFANDS':
TECHNICAL REPORT NO. 374z.'. MAY 21, 1986
Prepared Under Contract
N00014-86-K-0156 (NR-042-267)
For the Office of Naval Research
Herbert Solomon, Project Director
Reproduction ui Whole or in Part is Permittedfor any purpose of the United States Government
Approved for public release; distribution unlimited.
DTICr!ECTE
" JUN 9 1986.9-. DEPARTMENT OF STATISTICS w
-.. , STANFORD UNIVERSITY
STANFORD, CALIFORNIA B4'..#m
ON THE USE OF RIDGE AND STEIN-TYPE ESTIMATORS IN PREDICTION
Alan E. Gelfand
1. Introduction
For the usual regression model with fixed regressors,
Y = X8 + C, Ynxl' Xnxp full rank, BpxI and cnxl - (0,021),
there is considerable literature devoted to alternatives to
the ordinary least squares estimator, BOLS of B. From work
'P. -originally dating to Stein (1956) and James and Stein (1961)
when c is normally distributed and p _ 3, 8OLS is inadmissible
under loss (8-0) Q(6-8), Q an unrestricted positive definite
matrix. Thus, much of this extant discussion focuses on the
development of biased estimators with small "variances" which
achieve a smaller expected loss either uniformly over p-dimen-
sional Euclidean space or at least in the vicinity of some speci-
fied B*. Two "classes" of.such reduced variance regression
estimators are particularly well discussed - ridge estimators
and Stein-type estimators. Either directly or upon orthogonal
transformation these estimators take the form
(1) = Co + (I-C)B'-,.,C OLS
where C is a diagonal matrix, usually data dependent. They may
also be seen to be Bayes or "Empirical" Bayes procedures as well.
The review paper by Draper and Van Nostrand (1979) provides an
-S-
F: { . < "- . ,,' _ZK_ ..* .v.-. .
4. .rN_"
!' 32
excellent summary of both the theoretical and simulated effort
in this area. . In the context of cross-validation, i.e. of
examining the performance of an estimator obtained in one
sample in prediction in a second independent sample, the work
of Stone (1974) leads to estimators of the form in (1) as well.
Herein we consider the simplest such cross-validation
problem. At a new vector of predictor values, X0 , we seek to
estimate XTB. We take as loss function (6 (Y)-XTB)2 for an0 0
estimator 6(Y) and we assume henceforth that c is normally dis-
tributed with a2 unknown. Our problem differs from that of
esti.mating the vector B since the results of Cohen (1965) showT^ T
that aXoBOLS is an admissible estimator of X 0 for 0 < a* _,
i.e. the UMVU estimator is admissible. (In fact, 6(Y) of the
form yTY is admissible for X T i.f.f.0
T -1T( Tx-1 <lXTx)-I(2y-X(X X) ). X < X2OO. ) Nonetheless,
if we have some confidence in B', i.e. that B0 is near the true
value B, then it makes sense to attempt to improve upon XO0OLS
in the "vicinity of 0*" using estimators of the form (1). More
specifically, how well do the "classes" of ridge estimators and
of Stein-type estimators perform in this prediction? Can we
make a "best" choice within these classes for a particular
prediction?
The problem of prediction of an independent observation Y
at X0 using the loss (6(Y)-Y 0) is equivalent to that of pre-
dicting XT8, i.e. E (6(Y)-Y 0 )2 C2 + ET(6(Y)-X 2
iVe p 4(W)
t
3
For an estimator of the form XS, the expected loss be-
comes
(2) E B-0)T C -T E EX0iC I-0i) 20.
In the sequel we take the generalized ridge estimator BR to be
A-(XTX AlXy
(3) (XTy+ABI)"
where A is p.d. symmetric and possibly dependent on Y. We take
the general Stein-type estimator BS to be
(4) B3 (1- c/Q)BOLS + c/Q B0
' where Q - (BOLS-BU) TxTx(BOLS-) and c may depend on Y. In
practice c/Q is usually replaced by min(c/Q, 1).
In section 2 we calculate the risK (2), of the estimators
(3) and (4) when A, c are constant. We then investigate "best"
choices for A, c. Since these choices will be functions of
and a' as well as X0 , A and c must be estimated from Y. In section
3 we summarize a simulation study which compares the performance
of versions of (3) and (4) which are discussed for the estimation
of B along with others motivated by work in section 2. In section
4 we offer concluding remarks in particular with regard to
multiple prediction.
* -
4
2. Theoretical Results
We first .note that for 0OLS (2) becomes
(5) a2XT (XTX) lX .
We now claim that
Theorem 1: For 8R as in (3), (2) becomes
a 2xT(TX+A)-IXTX(XTx+A)-I0
(6)+TTX+A)-l T T+ x-A(-*)(B*) x +A)-
Proof: We transform to principal components form. Let R be
nonsingular such that RXT XRT = I, RART = D, D a diagonal matrix
with diagonal entries di. For any point 0 in p-dimensional'V*
Euclidean space, let a - (R-1)T0. Then
OR (R-)T O a + (I+D) Da*' (I+D)R 0OLS + (ID,
i.e. of the form (1) with C - (I+D)-1 In terms of a, (2)
becomes Ea(a-a)w0w0 (a-a), w0 - RX. Since a - N(a,o 2 I), this
expectation is readily calculated to be
Ii, _
ai 2. TC2. + wT(IC)(a-ca*)(saa)T(IC)w0
Substitution for a, w0 and C yields (6). 0
/
m :m :W 4:2--Kh *'Z-KZ
5
Note: Normality is not employed in this calculation.
In (3) A.is usually taken to be diagonal and, in fact,
the class of ridge (as opposed to generalized ridge) estimators
sets A = al, a > 0. The case where either by design or trans-
formation xTx = I reduces (5) to a 2EX and reduces (6), for
generalized ridge estimators (ai are the diagonal elements of
the diagonal matrix A) to
(7) a2 EX2 1 2 + ]EX () -a 1)2
(l+ai ) i
Investigation of this expression reveals that an optimal
choice for the ai to minimize (7) needn't exist although local
minima can be found. In the case of ridge estimation, i.e.
all a. = a, a unique minimum can be found. This occurs at
•2 - 2 T
(8) a0 = 0 0 X 0
where y = xT(8-'). Note that a > 0 and finite provided 8-8k
isn't orthogonal to XO . The associated minimum equals
2 2Toy XoX02 2 T
0 0
~2.~2T-" T
When 6 is such that 8-8* is orthogonal to X0 , then X 8 predicts
perfectly. For such 8's we can obtain zero expected loss and
q
-'
would want no weight attached to 8 OLS' I.e. would want a = -.
In fact, it is clear that for X0 ,8 fixed there will be a set
of B' 's which predict X perfectly and that 8' needn't be0
close to B in Euclidean distance. Thus the appropriate pseudo-
metric for the prediction problem is (0 1-0 2 )TX 0X T(1I-2). This
pseudometric clarifies the earlier notion of "vicinity of B"'
and under this distance the further V' is from 8 the closer a
is to 0, i.e. the more weight is placed on BOLS, the closer 0*
is to 0 the larger a becomes, i.e. the more weight is placed
on- o . As would be expected, a 0 is invariant to scaling of Xo,
although the risk clearly isn't.T
Using (8) our estimator of X0 is
T = (la) XOBOLS + (l+a 0 ) a0X0B'a 0
and, in fact, for any fixed a > 0, Ta improves upon XTa2 o2a_1 O
whenever y < 2 a (2+a).
,From (8) a convenient estimator of a0 is:
(9) o Y XTX0. 0 0
when 02 is the usual UMVU estimator of a2 and X TY 0 (SOLS-)-
The fact that Ey- 2 doesn't exist suggests that a will be very
dwunstable and that T will perform poorly. We return to thisa0untabl andthatTa0^2
point in the discussion of the simulation study. Since a is
t . ~
7A A A TA
independent of 8 0LS and a0 depends on O only through XOOLS,we may compute, the expected loss for Ta0 If [2 = a XT, then
2 ^ 2 A2 T 0A 0y - N(Y,) and, with T= a X X0, (2) for T- becomes
" 2 2 (T; )2 +2r ; Y. 2_ 2-r 2( r2(10) E Y Y)2 T2 +E (A2 (22 2 )A )-
-. Y+T (YY Ty
The equality (10) is seen using the identity E yf(y)(Y-y),a-..1
SA 2E () (Stein (1973)) valid provided E IAf(Y) which,," Y jhc
as the following calculations show, is the case). Now^22 2 2 2 2 2 2Y P2r X1 , y
2 /2T independent of T /2 ~ X Hence•2A .12 -n2Ll2
(y^ 1^2 )tI-3--,-p) where L - Po(y2/2t2). The expectation
of each term in (10) can thus be evaluated and (10) becomes
(11 T (L(n-)E~-P+L-13)- I(n-p+2) (2L+I) -n-p-2L+l I ,.(ii 2{L(n-)E(n-+2 3)I[(n-p+2L+5) n-p+2L+l~ j
-.. 2-,' :If we divide (11) by T . i.e. consider the risk relative
to that of OLS, then this relative risk is a function of2 2 2
aY /" T. Hence we set T 1 and examine the simpler estimator* A2 3
( +1)-1 3 which may be thought of as an "empirical" Ba-yes estimator
against a normal prior centered at 0, adjusted to have no singulariti
in RI . The risk of this estimator is readily obtained to be
1 + E C +1)-(3y -2) by an argument similar to that leading toY(10). This risk (symmetric about 0) is graphed in Figure I against
% > 0 to illustrate what may be expected, up to scaling, if (11)
is evaluated. Note that the risk is bounded and considerably
.,- , .- -
8
less than I for y small. Because (2+)-23has sngularites
in the complex plane it is not admissible.
If we restore xTx, not necessarily diagonal, our estimator
T ^Tin (3) has A = a0X X or a0 X X according to (8) or (9).
Theorem 2: For BS as in (4) with p > 2, (2) becomes
(12) a X0 (XTX) -x 0 + X0T(XX)-1I0( +4co2)r /a2- 2cr2)
where
r E 2L+l r =E11 (p+2(L+M))(p+2(L+M)-2) ' 2 E p+2(L+M)-2
with
L - Po(X) , /2o 2xT(xTx)-IX
(13)
M- Po(6) , 6-- (AXT(XTX)- X0 _ 2 )/ 2 O2 Xo(XTX)-l
where L,M independent and A ( -*) T xTx(0-0).
Proof: As in Theorem 1, let R be nonsingular such that
XT T ~1 T 2Rx = I and let a = (R )T(SoLS-B*). Then a - N(a,o I) with-1,
.(a = (B-)T(8), Q = a a, and (2) becomes
P .
IK-V
g:N
c, T- T 2
.~ p-T XTXTX)lwhere w = RX 0 (and w w - 0
If we expand (14) we obtain
• (wT^ 2 w (-))(15) E ( (a-O C.EE (CT)2 /(T)2-2cE (WT a
The last term may be written as -2c~wiE af(a)(Cii-Cii) where
f(a) (C (TC)- (w Tc). Using the Stein identity,2 3f(cz)- A
( E C Eaf(a)(a i-i), which is valid here), on this
expression, after manipulation (15) becomes
a(16) 2wTw + (c2+4ca 2 )E ( TA2/( T)2 - 2c 2w wEa (i/MTa).
TA 2 T^A T2:.(W CT ^ (wT0)2Finally if we let U (w w V = (w then
U- xIL 2 with L as in (13), _ .: - 1+2Lw
a
2 (3Y'IM X with M as in (13)vz..0 p-l+2M
;.;U I +2L p-l+2Mand given L and M, U Be(2 --, 2 ) independent of
U+V 2 H:.q -- Hence2. p+2(L+M)"
. "-?.
P.-"
I / % . ".- , ." - . ." . . • , ° .% . " . r - . -. , - o .-r.- , -. - r ~ r -, - r - v_-w-w - .. - . " -4• -
..
.
.
. .
10
(w a)T Ta[ U iLM) T r 1 IL,T 2 w E(- 2 = w wE(V LV( a) (U+V)2
T T-w w E 2L+I 1 w w1
G 2 (p+2(L+MY) p+2(L+M-)-2 21
and similarly
1 1
a ^T^ 2 2aia a
M akinm these substitutions into (16) and restoring Xwe obtain
m0
(12). 0
Note: The proof reveals that the expected loss,3 (2), for more general-
Sestimators of the form (th(Q))os +h(Q) can be developed. In
fact, if
* E8I3h(Q)y Ia OLS,i
the loss Is
2 T T -1 2 22a 2XT XT -_l XEhQ-a2 Eh()(17) a X (X X) X +E h(Q)Y"..4
U Note:Inspectionof ( reveals that the uniquee loss, c2) is oe eea
Sinc Xe loss Inain osclnsfXsoi ti2~~~~~ 0*I0Eh ,. Io oE h ^
apparent that for X0 fixed as a - 0, r2 /rl p, i.e.2
C o- a (p-2).. Hence if we believe 8* is close to 8 the
"usual" constant, a 2 (p-2), may be employed. Using this constant,
if A is near 0, the relative risk of X T to X OBOLS will, from
(12), be near 2/p, as it is in the case of estimating 8. As in
remarks after (10), If C2 is the usual estimator of a independent
of 8OLS we may compute (2) for 8S as in (4) withc - o (p-2).
We obtain
o : (19) a X0T(xTx)- o(1+r (p 2 _- 4 + 2 (n - p ) -1(p-2) 2)-2r2(p-2)).
Expressions similar to (19) can be obtained for instances
of the more general estimators mentioned above (17). This suggests
that the risk (2) may be calculated for commonly used (adaptive)
ridge .estimators, e.g. those considered in section 3. However,
without restrictions on the design matrix X, these estimators
often fail to either provide a in closed form or to define
a as a function of Q.
Since to a first order approximation c0 Z o2(2+1)- 1(p-2+2(6-1))
we may estimate c 0 by
^ --
(20) co = a2(2x+l) (p-2+2(0-.))
where X,6 are the expressions in (13) with 1 replaced by .
We would truncate c0 to the interval CO,Q]. Since E - doesn't
doen'
.1 .. -v . . - ... ' . . . . .[ ' . > -. . - .] .. - . - . .. .. -/ ...< .. <
12AA
exist c0 will be very unstable (as with a0 in (9)) suggesting
that the resulting predictor will perform poorly. Again we
return to this point in the next section.
4,
3. A Simulation Study
A simulation was conducted to compare the use of the OLS
predictor with the predictors discussed in the previous section
and with predictors arising from other estimators of 0 which
have been discussed in the literature. For convenience we set2 T 2a = 1 and take X X= I, i.e. OLS " N($,I). Without loss of
generality we set 0* = 0 and X 0 = 1. Under this setup ridge
estimators become (+a) 0LS and Stein estimators become
,(1 - c/LS OLS) TOLS In addition to OOLS we consider the
following six estimators of a (4 ridge type, 2 Stein type).
4-. AT
(i) HK - arising from a = P/ BOLSOLS" The ridge estimators
discussed by Hoerl and Kennard (1970), Hoerl, Kennard
and Baldwin (1975) and, in fact, Lawless and Wang (1976)
reduce to 0HK in our setup.4-T
(ii) BRM - arising from a = p/( LSBOLS-P) with (l+a) - = 0ATA
if 6OLSOOLS : p. The RIDGM and STEINM estimators discussed
4- by Dempster, Shatzoff, and Wermuth (1977) reduce to 8RM"
(ili) 8 MG - arising from a - (1 - T ^ -OLSOLS) /0 1 OLS/OLS
with (1+a)- 0 if BOLSOLS I p. The ridge estimator
of McDonald and Galarneau (1975) reduces to 8MG
.4
[- I I
13AA
(iv) B^ - arising from a 0 given in (9).a 0
UA
(v) 8p - arising from c = p-2, i.e. the "usual" Stein
estimator.
(vi) B^ - arising from co given in (20), truncated to [0,-).c 0
"Positive part" restrictions were applied to all "shrinkages"
in (v) and (vi).
We note that under the above assumptions the risks in (3)
and (4) and, in fact, of the predictors arising from (i) - (vi)
depend on X 0 and 8 only through (XTS) 2 and aT. Since (XT8) 2
- T 0 < r < 1, we may summarize the results in terms of
T $ and r. We consider p = 3,6,10. For a given p we generated
sets of 2p independent uniform random variables on the interval
[-1,1. In each case we considered the first p observations as
a p vector, standardized to length 1 and designated it as an X0'
Similarly the second p observations are considered as a B vector
with scaling by .1,1,10. Hence we have large OB , i.e. BB - 100,
moderate B T, i.e. B = 1 and small B T, i.e. T B .01. For
each X 0 , pair 1000 B 's were generated from (SI) and using
X each of the seven predictors were calculated for each of the
1000 replications. Bias, variance and mean square error (MSE)
were estimated. A large number of Xo,B pairs (approximately 400)
were investigated enabling a wide range of r's. Table 1 provides
a brief summary indicating the best 0 for prediction over ranges
.A
14
for r along with the typical percentage reduction in risk
(using the best predictor over that range), loo(MSE XTAOL
-MSE X 0 )/MSE X 0 L0 0OLL
Several comments are appropriate:
(I) The cross-over points in Table 1 are approximate, but
.in the vicinity of the cross-over competing predictors are
indistinguishable with respect to MSE.
(ii) It is not surprising that regardless of B, if OT8 large
and r large, the OLS predictor is best. In fact, if STB large
and .01 < r < .5, the percent improvement of the best
predictor over OLS is never greater than 5%.
'-. (iii) As expected, BA , BA performed very badly, always sixthSo ao0 T
or seventh, doing well only when BTB large and r very
small (regardless of p). However,. in such cases, improve-
ments will be substantial, increasing as r decreases, while
the other five predictors are indistinguishable. Near r - .0
SB- is best; much below .01, BA is best.a c
V (iv) Op-2 is likely the best overall choice always amongst
the two or three best apart from cases in (i1) above.
TT. ."(v) When 0is small or moderate,an$
, .RM' OHK':: Op-2 MG
were always the best four. When BT B is small and p' 3,
BMG is close in performance to p-2' When is smallA A Tand p 6, RM and BMG split for second best. When B 8
:4' is small and p = 10, p-2 is second with 8 MG third.p-wM
• 15A 1 A A
(vi) When r is large c 0 is almost always <0 whence - B OLS.
-o C 0
Table 1 reveals that in many cases substantial reduction
in squared error loss over the OLS predictor can be achieved.
It further suggests the possibility of selecting the predictor
according to 8T$ and r. However, finely detailed selection,
e.g. according to X0 , will be unsuccessful as the performance
.of. and 0, reveals. In practice we will have Aia2 instead of BT
co a0
T T -1 -12and r = (AX0(X X) X0 )- , and we might define estimators^% ^ 2A,r with 8 replaced by 8OLS, a replaced by a • We may calculate
A n-i -E(A)= - (p+A) whence A' = n-l -p Is UM.FVU for A. By an argu-
n-3 -ment similar to that contained in Theorem 2, we may show
E(r) = E(p+2(L+M)) (2L+l) : r+(l-rp)((p+A)-+(p+A) 3 2A) where
L,M are distributed as in (13). For individual predictions,
preliminary calculation of A' and r should enable a Judicious
choice of predictor.
As Thisted and Morris (1980, p. 19) observe, the poorest
estimation case for ridge procedures occurs when (with X TX - I,
B* = 0) BT = (l,O,0,...,o) with 81 large. This is also the
poorest estimation case for-Stein type procedures in the sense
that the first coordinate will account for about half of the
total risk and all coordinate risks would decrease if B1 was
excluded (see Baranchik (1964)). For prediction this implies
2 TA large and r = X0 1/XoX0 . Hence this is the poorest case for
prediction as well, i.e. depending upon X0 1, Improvement will
be small or, in fact, the OLS predictor will be better.
-ale
16
i!ow will multicollinearity in XT affect prediction?
Let xTx D, a diagonal matrix with diagonal elements d and
assume d <d 2 ... d . Then the extent of multicollinearity
is usually measured in terms of how close d1 is to 0. Although
ridge methods have been advocated for improved estimation when
d is quite small2 particularly relative to the other d., Thisted
and Morris (p. 21) and others have established that in this case
optimal ridge as well as Stein-type estimators will produce
inconsequential improvement over 0OLS. For prediction (with
2 T22 2St = 0), A = Edi i and r - (X0B)0/(Xi/diEd iB).- - Bingham and
$, Larntz (1977, p. 102) observe that (in this notation) the worst
case for ridge estimation occurs when large 8i are associated
with small d This tells us little about the magnitude of A. How-
ever for fixed X0 and B as XT becomes more severely multicollinear
T^ 2var(X0BOLS) E Xi/di will grow larger and r will become smaller.
As the simulation suggests when r is small, using an appropriate
predictor, we can expect significant improvement over the OLS
predictor.
4. Multiple Prediction
In concluding we offer several comments regarding multiple
or simultaneous prediction. Suppose we wish to make r predictions
* defined by XlX 2 ,...,Xr and we set X* = (Xl,...,Xr). For
-.: convenience we assume the X are a linearly independent set whencei
rank (X*) =r < p. What is an appropriate loss to employ? One
17
choice is unweighted sum of squared error loss, i.e.
E(X(a-_8)) 2 =.( -)TGI(8-8) with G = XuXaT. A second choice
T ^arises from the Joint distribution of the XiBOLS, i.e.
XOT OLS - N(x*ToV2(x*T(xTx)-Ix*)-I ) suggests (8-B)T G2 (0-B)
where G2 X*(XT(XTx)-Ix*)-lxT. Others may be envisioned as
well. If r < p, GI,G2 are positive semi-definite whence, as
noted earlier for individual prediction, we may have loss equal
to 0 but B not close to B. Nonetheless, it is well-known that
if P is any positive definite matrix X*T oLS Is admissible for
X'TB under loss ( -0)x (-0) if p 1 2, inadmissible if
p ! 3. In fact, work done by Berger (1976), Bhattacharya (1966),
Bock (1975), Casella (1977), Efron and Morris (1976) and
Strawderman (1978) leads to explicit minimax predictors which
improve upon x*T^OLS. These predictors will be generalized
adaptive ridge of the form
. (I + A(xTOOLS2 ,xgT(x x)-X,,P))-x*TSoLS* (see e.g. Strawderman,
Theorem 6, p. 626, for a family of such predictors),' parelleling,
in a sense, the individual predictors X . and X Bao 0
As Strawderman notes (p. 626) there is no one predictor
which will dominate the OLS predictor for all P. For P - I (i.e.
G0) a very simple procedure Is to use estimates of A and r to
select a good predictor for each Xi. If we are not prepared to
specify P the simulation study suggests that using 8 p-2 (i.e.
c 2 (p-2) in (4)) regardless of X. is a simple but perhaps
adequate choice.
A.6.
~18
References
Baranchik, A. (1964). "Multiple Regression and Estimationof the Mean of Multivariate Normal Distribution," Dept.of Statistics, Stanford University, Technical Report #51.
Berger, James 0. (1976). "Admissible Minimax Estimation of aMultivariate Normal Mean with Arbitrary Quadratic Loss,"Ann. Statist. 4, 223-226.
Bhattacharya, P.K. (1966). "Estimating the Mean of a Multi-variate Normal Population with General Quadratic LossFunction," Ann. Math. Statist. 37, 1819-1824.
Bingham, C. and K. Larntz (1977). Comment.J. Amer. Statist. Assoc.72, 97-102.
Bock, Mary Ellen (1975). "Minimax Estimators of the Mean of aMultivariate Normal Distribution," Ann. Statist. 3, 209-218.
Casella, George (1977). "Minimax Ridge Regression Estimation,"Mimeograph Series #497, Dept. of Statistics, PurdueUniversIty.
Cohen, A. (1965). "Estimates. of Linear Combinations of theParameters in the Mean Vector of a Multivariate Distribution,"Ann. Math. Statist. 36, 78-87.
-. Dempster, A.P., Martin Schatzoff and Nanny Wermuth (1977)."A Simulation Stidy of Alternatives to Ordinary LeastSquares," J. Amer. Statist.-Assoc. 72, 77-106.
Draper, N., and R.C. Van Nostrand (1979). "Ridge Regression andJames-Stein Estimation: Review and Comments," Technometrics21 (4), p. 451-466.
Efron, B., and C. Morris (1972). "Limiting Risk of Bayes andEmpirical Bayes Estimators - Part II," J. Amer. Statist.Assoc. 67-, 130-139.
_____, __ii-_1. (1976). "Families of Minimax Estimators
of the Mean of a Multivariate Normal Distribution," Ann. Statist.4, 11-21.
Hoerl, Arthur E., and Robert W. Kennard (1970). "Ridge Regression:Biased Estimation for Nonorthogonal Problems," Technometrics 12,55-67.
Regression: Some Simulations," Comm. in Statist. 43 105-123.
19
James, W., and C. Stein (1961). "Estimation with QuadraticLoss," Proceedins of the Fourth Berkeley Symposium, vol. I,ed. by Jerzy Neyman, pp. 361-379. Berkeley and Los Angeles:University of California Press.
Lawless, J. F., and P. Wang (1976). "A Simulation Study of Ridgeand Other Regression Estimators," Comm. in Statist. A5, 307-323.
V McDonald, Gary C., and Diane I. Galarneau (1975). "A Monte CarloEvaluation of Some Ridge Type Estimators," J. Amer. Statist.Assoc. 70, 407-416.
Stein, Charles (1956). "Inadmissibility of the Usual Estimatorfor the Mean of a Multivariate Normal Distribution," Proceed-ings of the Third Berkeley Symposium, Vcl. I, ed. by JerzyNeyman, pp. 197-206. Berkeley and Los Angeles: University ofCalifornia Press.
_______(1973). "Estimation of the Mean of a Multivariate-Normal Distribution," Proceedings of the Prague Symposium on
Asymptotic Statistics, 345-361.
Stone, 14. (1974). "Cross-validatory Choice and Assessment ofStatistical Predictions," J. Roy. Statist. Soc. B-36, 111-147.
Strtwderman, William E. (1978). "Minimax Adaptive GeneralizedRidge Regression Estimators," J. Amer. Statist. Assoc. 73,623-627.
Thisted, R., and C. Morris (1980). "Theoretical Results forAdaptive Ordinary Ridge Regression Estimators," Universityof Chicago Technical Report #94.
,v.I.
-4-:
20
Footnctes
1. A possible refinement to using the predictor defined by
S with c = p-2) would employ the "limited translation"
approach as discussed in Efron and Morris (1972, p. 136).
Limiting the amount of shift for each coordinate of BOLS
toward the corresponding coordinate of 0S using a relevance
function, p, leads to an estimator 0 and resulting predictor
X 8 . We also note that the estimator, O resulting from0-1.
James and Stein (1961, p. 366) sets c = 2
With this c (19) becomes
a2TXI( T )-l X{+np2- 1(-p( 2_ 4-raX x) x 0 {I+(n-p+2) (n-p) (p2-1)f-2(n-p+2) (n-2)(p-2) r2).
2. We recognize that these simplifying assumptions diminish the
utility of the simulation study. In particular, certain
estimators which differ in a more general model become
equivalent. The study is only intended to be illustrativeand suggestive. Certainly a more elaborate one might be
undertaken. We also recognize that with these simplification,Pd"
expressions for the exact risks of some of the predictors
below (ignoring restrictions) may be obtained using (17).-%
2 ,.a
21
va
4.,
,,
"4-
-4
4,...
<.4I
'.,.
0'0
4' ('3vs
* 22
;m 0
4) 0A *i 02<
L44
to (a(~4
r. a CM a0.m4 . 1 '0 ca0ca 0c
4 0av vv(a cv a
-H t4 t4 04 H r- 0- t- 9 0 0
hO' S.-4 Ct) C) %-IA - 0 C) ~ L .- J ~ N r
.4 0 * (v 0 v v ~ 06~
a v a
0 A U"\ UN i-I v A Ul% fH V A LA\ LA\ H- V wl.4- m N~ H- 0 N~ H- 0 M r-I 0
* *4 41
SE-4 ~L
.5.. Cl) C)
94 b4 \.4 co '0o d % :0
V ti ca a4c a an a a2 a 4)IL0 tc
co 0 0 V) o 0
2) _4Z Xr -7 _- z
4 A LA L A IA v
$4 9. %4:4
4-')
H C7\
'I LA A %.
Cu 0% LA\2 %- % 0 2L
00. 4) U .
0.. 4V
OC~4 LiO .4 9
IV .
%- 94 .4
.4-r to
Nih N-J"-)
W. 1
23
UNCLASSIFIEDSECURITY CLASSIFICATION OF THIS PAGE (Whn Data Enteare
IN," REPORT DOCUMENTATION PAGE REAF STRUCTIORSi" J'BEFORE COMPLETING FORMREOTNUMBER GVCSON ______________
. . 374
4. TITLE (and Subtitle) S. TYPE OF REPORT A PERIOD COVERED
On The Use Of Ridge And Stein-Type Estimators TECHNICAL REPORTIn Prediction
6. PERFORMING ORG. REPORT NUMBER
7. AUTHOR(a) 1. CONTRACT OR GRANT NUMBER(O)
Alan E. Gelfand N00014-86-K-0156
9. PERFORMING ORGANIZATION NAME AND ADDRESS 10. PROGRAM ELEMENT. PROJECT. TASK
Department of Statistics ARA A WORK UNIT NUMBERS
Stanford University NR-042-267Stanford, CA 94305I, CONTROLLING OFFICE NAME AND ADDRESS 12. REPORT DATE
Office of Naval Research May 21, 1986Statistics & Probability Program Code 1111- 13. NUMBER OF PAGES
24I. MONITORING AGENCY NAME & ADDRESS(If dilferent from Controlllng Olllce) IS. SECURITY CLASS. (ol this report)
UNCLASSIFIEDI. DECLASSIFICATION/DOWNGRAOING
SCHEDULE
1. DISTRIBUTION STATEMENT (of thil Repore)
APPROVED FOR PUBLIC RELEASE: DISTRIBUTION UNLIMITED.
17. DISTRIBUTION STATEMENT (of the abstract entered in Slock 30, it dJllerent frm Repmrt)
18. SUPPLEMENTARY NOTES% ak
19. KEY WORDS (Continue on t.eee side iI neceeeary and idenltify by block nuimber)
Ridge-type estimators, Stein-type estimators, Regression.
20. ABSTRACT (Contirnue an reverse side iI necessary and identify by block number)
PLEASE SEE FOLLOWING PAGE.
....
rYFORM a.DD I FA 73 1473 EDITION Of I NOV65 IS OBSOLETES/ 0102- 014- 6601 1 UNCLASSIFIED_SECURITY CLASSIFICATION OF THIS PAGE (Whe Date Bntered)
do r r., , , , . . . ., . , . . . . , . , . 2 : ",". .,. : -, ,"." " . . ." . , . .",."." .". .'. .
* - - . - -- -- -. - -. -_ 7 -m - _ - -C .-
.- UTCLASSIFIED- - aUcUTt CLASIFICATION OF ?Nil PAgE1 Mhs . & Rhore
TECHNICAL REPORT NO. 374
20. ABSTRACT
For the usual regression model with fixed regressors, there is a con-
siderable literature devoted to alternatives to ordinary least squares esti-
mators of the regression parameters. These alternatives are biased with
"small" variances resulting in reduced mean square error over some (perhaps
all) of the parameter space. Two prominent classes of such estimators are
ridge-type and Stein-type estimators.
Consider the simplest prediction problem in this context, i.e. prediction
at a single new vector of prediction values. We-eeleulat5 the risk (squared
,- error) for predictors based on estimators in the above families. While the
ordinary least squares predictor is admissible, a simulation study reveals
4that over regions of the parameter space substantial reduction in risk is
possible using estimators in these families. A simple preliminary procedure
based upon the vector of prediction values is given to select a -,good" esti-
mator from these families. It is apparent that in multiple prediction a
single choice of estimator need not be best.
*.2
S11ECURItY CLASSIFICATION OP THIS pAoE*W.(rDsf ,a
'!1 1
* /4'5-',4
>,"
'- .r*
Recommended