Chapter 5
Bayes Methods and Elementary Decision Theory

1. Elementary Decision Theory
2. Structure of the risk body: the finite case
3. The finite case: relations between Bayes, minimax, and admissibility
4. Posterior distributions
5. Finding Bayes rules
6. Finding minimax rules
7. Admissibility and inadmissibility
8. Asymptotic theory of Bayes estimators
Chapter 5
Bayes Methods and Elementary Decision Theory
1 Elementary Decision Theory
Notation 1.1. Let

• Θ ≡ the states of nature.
• A ≡ the action space.
• X ≡ the sample space of a random variable X with distribution P_θ.
• P : X × Θ → [0, 1]; P_θ(X = x) = probability of observing X = x when θ is true.
• L : Θ × A → R⁺; a loss function.
• d : A × X → [0, 1], d(a, x) = d(a|x) = probability of action a when X = x (a decision function).
• D ≡ {all decision functions}.
The risk for the rule d ∈ D when θ ∈ Θ is true is

    R(θ, d) = E_θ L(θ, d(·|X))
            = ∫_X ∫_A L(θ, a) d(da|X = x) P_θ(dx)
            = Σ_{i=1}^k L(θ, a_i) { Σ_{j=1}^m d(a_i, x_j) P_θ(X = x_j) }   in the discrete case.
We will call R : Θ × D → R⁺ the risk function. A decision rule d is inadmissible if there is a rule d′ such that R(θ, d′) ≤ R(θ, d) for all θ and R(θ, d′) < R(θ, d) for some θ. A decision rule d is admissible if it is not inadmissible.
Example 1.1 (Hypothesis testing). A = {0, 1}, Θ = Θ₀ ∪ Θ₁,

    L(θ, 0) = l₀ 1_{Θ₁}(θ),   L(θ, 1) = l₁ 1_{Θ₀}(θ).
Since A contains just two points, d(1|x) = 1 − d(0|x) ≡ φ(x); then

    P_θ(accept H₀) = E_θ d(0|X) = E_θ(1 − φ(X)),
    P_θ(reject H₀) = E_θ d(1|X) = E_θ φ(X),

and

    R(θ, d) = l₁ E_θ φ(X) 1_{Θ₀}(θ) + l₀ E_θ(1 − φ(X)) 1_{Θ₁}(θ).

The classical Neyman–Pearson hypothesis-testing philosophy bounds l₁α ≡ sup_{θ∈Θ₀} R(θ, d) and tries to minimize

    l₀(1 − β_d(θ)) ≡ R(θ, d)

for each θ ∈ Θ₁.
Example 1.2 (Estimation). A = Θ; typically Θ = Rˢ for some s (but sometimes it is useful to take Θ to be a more general metric space (Θ, d)). A typical loss function (often used for mathematical simplicity more than anything else) is

    L(θ, a) = K|θ − a|²   for some K.

Then the risk of a rule d is

    R(θ, d) = K E_θ |θ − d(·|X)|²
            = K E_θ |θ − d(X)|²   for a non-randomized rule d
            = K { Var_θ(d(X)) + [bias_θ(d(X))]² }   when s = 1.
Definition 1.1 A decision rule d_M is minimax if

    inf_{d∈D} sup_{θ∈Θ} R(θ, d) = sup_{θ∈Θ} R(θ, d_M).
Definition 1.2 A probability distribution Λ over Θ is called a prior distribution.
Definition 1.3 For any given prior Λ and d ∈ D, the Bayes risk of d with respect to Λ is

    R(Λ, d) = ∫_Θ R(θ, d) dΛ(θ)
            = Σ_{i=1}^l R(θ_i, d) λ_i   if Θ = {θ₁, . . . , θ_l}.
Definition 1.4 A Bayes decision rule with respect to Λ, d_Λ, is any rule satisfying

    R(Λ, d_Λ) = inf_{d∈D} R(Λ, d) ≡ the Bayes risk.
Example 1.3 Θ = {1, 2}.
Urn 1: 10 red balls, 20 blue balls, 70 green balls.
Urn 2: 40 red balls, 40 blue balls, 20 green balls.
One ball is drawn from one of the two urns. Problem: decide which urn the ball came from if the losses L(θ, a) are given by:

    θ \ a    1    2
      1      0   10
      2      6    0
Let d = (d_R, d_B, d_G) with d_x = probability of choosing urn 1 if color X = x is observed. Then

    R(1, d) = 10 P₁(action 2) = 10{ .1(1 − d_R) + .2(1 − d_B) + .7(1 − d_G) }

and, similarly,

    R(2, d) = 6{ .4 d_R + .4 d_B + .2 d_G }.
If the prior distribution on the urns is given by λ₁ = λ and λ₂ = 1 − λ, then the Bayes risk is

    R(Λ, d) = 10λ + (2.4 − 3.4λ) d_R + (2.4 − 4.4λ) d_B + (1.2 − 8.2λ) d_G.

This is minimized by choosing d_x = 1 if its coefficient is negative, 0 if its coefficient is positive. For example, if λ = 1/2, then the Bayes risk with respect to λ equals

    5 + .7 d_R + .2 d_B − 2.9 d_G,

which is minimized by d_R = d_B = 0, d_G = 1; i.e. the Bayes rule d_Λ with respect to λ = 1/2 is d_Λ = (0, 0, 1). Note that the Bayes rule is in fact a non-randomized rule. This gives the Bayes risk for λ = 1/2 as R(1/2, d_Λ) = 2.1.
I claim that the minimax rule is d_M = (0, 9/22, 1), which is a randomization of the two non-randomized rules d₂ = (0, 0, 1) and d₇ = (0, 1, 1). This is easily confirmed by computing R(1, d_M) = 240/110 = 24/11 = R(2, d_M). Here is a table giving all of the nonrandomized rules and their risks:

    rule      d₁    d₂    d₃    d₄    d₅    d₆    d₇    d₈
    d_R        0     0     0     1     1     1     0     1
    d_B        0     0     1     0     1     0     1     1
    d_G        0     1     0     0     0     1     1     1
    R(1, d)   10     3     8     9     7     2     1     0
    R(2, d)    0   1.2   2.4   2.4   4.8   3.6   3.6     6
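The urn calculations above are easy to check numerically. The sketch below (variable names are my own, not the notes') recomputes the risk points of the eight nonrandomized rules, the Bayes rule for λ = 1/2, and the equalized risks of the claimed minimax rule (0, 9/22, 1):

```python
# Risks from Example 1.3: R(1, d) = 10(.1(1-dR) + .2(1-dB) + .7(1-dG)),
#                         R(2, d) = 6(.4 dR + .4 dB + .2 dG).
from itertools import product
from fractions import Fraction as F

def risks(d):
    """Risk point (R(1, d), R(2, d)) of a rule d = (dR, dB, dG)."""
    dR, dB, dG = d
    r1 = 10 * (F(1, 10) * (1 - dR) + F(2, 10) * (1 - dB) + F(7, 10) * (1 - dG))
    r2 = 6 * (F(4, 10) * dR + F(4, 10) * dB + F(2, 10) * dG)
    return r1, r2

# Risk points of all 2^3 nonrandomized rules (the table above):
for d in product([0, 1], repeat=3):
    print(d, risks(d))

# Bayes rule for prior weight lam = 1/2 on urn 1:
lam = F(1, 2)
bayes = min(product([0, 1], repeat=3),
            key=lambda d: lam * risks(d)[0] + (1 - lam) * risks(d)[1])
print(bayes)            # (0, 0, 1), with Bayes risk 21/10 = 2.1

# The claimed minimax rule (0, 9/22, 1) equalizes the two risks:
r1, r2 = risks((F(0), F(9, 22), F(1)))
print(r1, r2)           # both 24/11
```

Exact rational arithmetic via `fractions.Fraction` makes the equalization at 24/11 exact rather than a floating-point coincidence.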
2 Structure of the Risk Body: the finite case
Suppose that

    Θ = {θ₁, . . . , θ_l},   X = {x₁, . . . , x_m},   A = {a₁, . . . , a_k}.
Definition 2.1 A set A ⊂ Rˡ is convex if, for all α ∈ [0, 1] and x, y ∈ A, αx + (1 − α)y ∈ A.
Lemma 2.1 Let α = (α₁, . . . , α_t) be a probability distribution (so α_i ≥ 0 for i = 1, . . . , t, and Σ_{i=1}^t α_i = 1). Then for any d₁, . . . , d_t ∈ D, Σ_{i=1}^t α_i d_i ∈ D.

Proof. Set d(a_l, x_j) = Σ_{i=1}^t α_i d_i(a_l, x_j). Then d(a_l, x_j) ≥ 0 and

    Σ_{l=1}^k d(a_l, x_j) = Σ_{i=1}^t α_i Σ_{l=1}^k d_i(a_l, x_j) = Σ_{i=1}^t α_i = 1.  □
We call

    R = {(R(θ₁, d), . . . , R(θ_l, d)) ∈ R₊ˡ : d ∈ D}

the risk body.
Theorem 2.1 The risk body is convex.
Proof. Let d₁, d₂ ∈ D denote the rules corresponding to R_i = (R(θ₁, d_i), . . . , R(θ_l, d_i)), i = 1, 2, and set d = αd₁ + (1 − α)d₂ ∈ D by the lemma. Then

    α R(θ_l, d₁) + (1 − α) R(θ_l, d₂) = Σ_{i=1}^k Σ_{j=1}^m L(θ_l, a_i){α d₁(a_i, x_j) + (1 − α) d₂(a_i, x_j)} p_{θ_l}(x_j)
                                      = R(θ_l, d),

so d ∈ D has risk point αR₁ + (1 − α)R₂ ∈ R.  □
Theorem 2.2 Every d ∈ D may be expressed as a convex combination of the nonrandomized rules.

Proof. Any d ∈ D may be represented as a k × m matrix of nonnegative real numbers whose columns add to 1. The nonrandomized rules are those whose entries are 0's and 1's. Set d(a_i, x_j) = d_{ij}, d = (d_{ij}), 1 ≤ i ≤ k, 1 ≤ j ≤ m.

There are kᵐ nonrandomized decision rules: call them δ⁽ʳ⁾ = (δ⁽ʳ⁾_{ij}), for 1 ≤ r ≤ kᵐ. Given d ∈ D, we want to find α_r ≥ 0 with Σ_{r=1}^{kᵐ} α_r = 1 so that

    Σ_{r=1}^{kᵐ} α_r δ⁽ʳ⁾ = d.

Claim: the weights

    α_r ≡ Π_{j=1}^m Π_{i=1}^k d_{ij}^{δ⁽ʳ⁾_{ij}} = Π_{j=1}^m { Σ_{i=1}^k d_{ij} δ⁽ʳ⁾_{ij} }

work. (This is easily proved by induction on m.)  □
Remark 2.1 The space of decision rules defined above, D, with d ∈ D being a probability distribution over actions given that X = x, corresponds to the behavioral decision rules as discussed by Ferguson (1967), page 25. This differs from Ferguson's collection D*, the randomized decision functions, which are probability distributions over the non-randomized rules.
3 The finite case: relations between Bayes, minimax, and admissibility

This section continues our examination of the special, but illuminating, case of a finite set Θ. In this case we can prove a number of results about Bayes and minimax rules and connections between them which carry over to more general problems with some appropriate reformulation.
Theorem 3.1 A minimax rule always exists.
Proof. If 0 ∉ R, then for some c > 0 the cube with all sides of length c in the positive quadrant does not intersect R. Let c increase until the cube intersects R; call this number c₀. Any decision rule with a risk point which intersects the c₀ cube is minimax. (Note that the equation for the boundary of a cube is max_i R(θ_i, d) = constant.)  □
Theorem 3.2 If λ = (λ₁, . . . , λ_l) is a prior, then Bayes decision rules have risk points on the hyperplane {x ∈ Rˡ : Σ_i λ_i x_i = c_λ}, where

    c_λ = inf{c ≥ 0 : the plane determined by Σ_i λ_i x_i = c intersects R}.

Proof. The Bayes risk is R(Λ, d) = Σ_{i=1}^l λ_i R(θ_i, d); minimizing it over d ∈ D means finding the smallest c for which the hyperplane Σ_i λ_i x_i = c touches R.  □
Lemma 3.1 An admissible rule is a Bayes rule for some prior λ.

Proof. See the picture!  □

Note that not every Bayes rule is admissible.
Theorem 3.3 Suppose that d₀ is Bayes with respect to λ = (λ₁, . . . , λ_l) and λ_i > 0, i = 1, . . . , l. Then d₀ is admissible.

Proof. Suppose that d₀ is not admissible; then there is a rule d better than d₀: i.e.

    R(θ_i, d) ≤ R(θ_i, d₀)   for all i,

with < for some i. Since λ_i > 0 for i = 1, . . . , l,

    R(Λ, d) = Σ_{i=1}^l λ_i R(θ_i, d) < Σ_{i=1}^l λ_i R(θ_i, d₀) = R(Λ, d₀),

contradicting the fact that d₀ is Bayes with respect to λ.  □
Theorem 3.4 If d_B ∈ D is Bayes for λ and has constant risk, then d_B is minimax.
Proof. Let r₀ be the constant risk. Assume d_B is not minimax and let d_M be a minimax rule (which exists). Then

    R(θ_i, d_M) ≤ max_{1≤j≤l} R(θ_j, d_M) < max_{1≤j≤l} R(θ_j, d_B) = r₀          (a)

and

    min_{d∈D} R(λ, d) = R(λ, d_B) = Σ λ_i R(θ_i, d_B) = r₀.                       (b)

But (a) yields

    R(λ, d_M) = Σ_i λ_i R(θ_i, d_M) < Σ_i λ_i r₀ = r₀,                            (c)

which contradicts (b).  □
Example 3.1 We return to Example 1.3 and consider it from the perspective of Theorem 3.4. When the prior is λ = (λ, 1 − λ) with λ = 6/11, we find that the Bayes risk is given by

    R(Λ, d) = 10λ + (2.4 − 3.4λ) d_R + (2.4 − 4.4λ) d_B + (1.2 − 8.2λ) d_G
            = 10 · (6/11) + (6/11) d_R + 0 · d_B − (36/11) d_G,

so all the rules d_Λ = (0, d_B, 1) are Bayes with respect to λ = 6/11. Can we find one with constant risk? The risks of these Bayes rules d_Λ are given by

    R(1, d_Λ) = 1 + 2(1 − d_B),
    R(2, d_Λ) = 2.4 d_B + 1.2,

and these are equal if d_B satisfies 3 − 2d_B = 1.2 + 2.4 d_B, and hence d_B = 9/22. For this particular Bayes rule, d_Λ = (0, 9/22, 1), and the risks are given by R(1, d_Λ) = 24/11 = R(2, d_Λ). Thus the hypotheses of Theorem 3.4 are satisfied, and we conclude that d_Λ = (0, 9/22, 1) is a minimax rule.
4 Posterior Distributions
Now we will examine Bayes rules more generally.
Definition 4.1 If θ ∼ Λ, a prior distribution over Θ, then the conditional distribution of θ given X, P(θ ∈ A | X), A ∈ B(Θ), is called the posterior distribution of θ. Write Λ(θ|x) = P(θ ≤ θ | X = x) for the conditional distribution function if Θ is Euclidean.

If Λ has density λ with respect to ν and P_θ has density p_θ with respect to µ, then

    λ(θ|x) = p(x|θ)λ(θ) / ∫_Θ p(x|θ′)λ(θ′) dν(θ′) = p(x|θ)λ(θ) / p(x)

is the posterior density of θ given X = x.
Definition 4.2 Suppose X has density p(x|θ), and λ = λ(·) is a prior density, λ ∈ P_Λ. If the posterior density λ(·|x) has the same form (i.e. λ(·|x) ∈ P_Λ for a.e. x), then λ is said to be a conjugate prior for p(·|θ).
Example 4.1 A. (Poisson–gamma). Suppose that (X|θ = θ) ∼ Poisson(θ), θ ∼ Gamma(α, β). Then

    λ(θ|x) = p(x|θ)λ(θ)/p(x) ∝ p(x|θ)λ(θ)
           = e^{−θ} (θˣ/x!) · (β^α/Γ(α)) θ^{α−1} e^{−βθ}
           = (β^α/(x! Γ(α))) θ^{α+x−1} e^{−(β+1)θ},

so (θ|X = x) ∼ Gamma(α + x, β + 1), and the gamma family is a conjugate prior for the Poisson.

B. (Normal mean–normal prior). (Example 1.3, Lehmann TPE, page 243). Suppose that (X|θ = θ) ∼ N(θ, σ²), θ ∼ N(µ, τ²). Then

    (θ|X = x) ∼ N( (µ/τ² + x/σ²)/(1/τ² + 1/σ²), 1/(1/τ² + 1/σ²) ).

Note that if X is replaced by X₁, . . . , X_n i.i.d. N(θ, σ²), then by sufficiency,

    (θ|X = x) ∼ N( (n/σ²)/(1/τ² + n/σ²) · x̄ + (1/τ²)/(1/τ² + n/σ²) · µ, 1/(1/τ² + n/σ²) ).
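The normal/normal update in part B is easy to express in code. The sketch below (the function name is mine) computes the posterior mean and variance by precision weighting; with a very diffuse prior the posterior mean essentially equals the sample mean:

```python
# Normal mean with normal prior (Example 4.1B): the posterior precision is the
# sum of the prior precision 1/tau2 and the data precision n/sigma2.
def normal_posterior(xbar, n, sigma2, mu, tau2):
    """Posterior of theta given n i.i.d. N(theta, sigma2) obs and a N(mu, tau2) prior."""
    prec = 1.0 / tau2 + n / sigma2                 # posterior precision
    mean = (mu / tau2 + n * xbar / sigma2) / prec  # precision-weighted average
    return mean, 1.0 / prec                        # (posterior mean, posterior variance)

# With a diffuse prior (tau2 large), the posterior is close to N(xbar, sigma2/n):
m, v = normal_posterior(xbar=2.0, n=25, sigma2=4.0, mu=0.0, tau2=1e6)
print(m, v)   # approximately 2.0 and 4/25 = 0.16
```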
Example 4.2 If (X|θ = θ) ∼ Binomial(n, θ), θ ∼ Beta(α, β), then

    λ(θ|x) = p(x|θ)λ(θ)/p(x) ∝ p(x|θ)λ(θ)
           = C(n, x) θˣ(1 − θ)^{n−x} · θ^{α−1}(1 − θ)^{β−1} Γ(α + β)/(Γ(α)Γ(β))
           = θ^{α+x−1}(1 − θ)^{β+n−x−1} · Γ(α + β)/(Γ(α)Γ(β)) · C(n, x),

so (θ|X = x) ∼ Beta(α + x, β + n − x).
Example 4.3 (Multinomial–Dirichlet). Suppose that (X|θ = θ) ∼ Multinomial_k(n, θ) and θ ∼ Dirichlet(α); i.e.

    λ(θ) = [Γ(α₁ + · · · + α_k) / Π_{i=1}^k Γ(α_i)] θ₁^{α₁−1} · · · θ_k^{α_k−1}

for θ_i ≥ 0, Σ_{i=1}^k θ_i = 1, α_i > 0. Then (θ|X = x) ∼ Dirichlet(α + x),

    E(θ) = α / Σ_{i=1}^k α_i,

and

    d_Λ(X) = (α + X)/(Σα_i + n) = [Σα_i/(Σα_i + n)] · α/Σα_i + [n/(Σα_i + n)] · X/n.
5 Finding Bayes Rules
For convex loss functions we can restrict attention to non-randomized rules; see Corollary 1.7.9, TPE, page 48. The following theorem gives a useful recipe for finding Bayes rules.
Theorem 5.1 Suppose that θ ∼ Λ, (X|θ = θ) ∼ P_θ, and L(θ, a) ≥ 0 for all θ ∈ Θ, a ∈ A. If
(i) there exists a rule d₀ with finite risk, and
(ii) for a.e. x there exists d_Λ(x) minimizing

    E{L(θ, d(·|x)) | X = x} = ∫_Θ L(θ, d(·|x)) λ(θ|x) dν(θ),

then d_Λ(·|x) is a Bayes rule.

Proof. For any rule d with finite risk,

    E{L(θ, d(X)) | X} ≥ E{L(θ, d_Λ(X)) | X}   a.s.                                (a)

by (ii). Hence

    R(Λ, d) = E E{L(θ, d(X)) | X} = E L(θ, d(X))
            ≥ E E{L(θ, d_Λ(X)) | X} = E L(θ, d_Λ(X)) = R(Λ, d_Λ).  □
Corollary 1 (Estimation with weighted squared-error loss). If Θ = A = R and L(θ, a) = K(θ)|θ − a|², then

    d_Λ(X) = E{K(θ)θ | X} / E{K(θ) | X} = ∫_Θ θ K(θ) dΛ(θ|X) / ∫_Θ K(θ) dΛ(θ|X).

When K(θ) = 1, then

    d_Λ(X) = ∫_Θ θ dΛ(θ|X) = E{θ|X} ≡ the posterior mean.

Proof. For an arbitrary (nonrandomized) rule d ∈ D,

    ∫ K(θ)|θ − d(x)|² dΛ(θ|x) = ∫ K(θ)|θ − d_Λ(x) + d_Λ(x) − d(x)|² dΛ(θ|x)
        = ∫ K(θ)|θ − d_Λ(x)|² dΛ(θ|x)
          + 2(d_Λ(x) − d(x)) ∫ K(θ){θ − d_Λ(x)} dΛ(θ|x)
          + (d_Λ(x) − d(x))² ∫ K(θ) dΛ(θ|x)
        ≥ ∫ K(θ)|θ − d_Λ(x)|² dΛ(θ|x)

with equality if d(x) = d_Λ(x).  □
Corollary 2 (Estimation with L₁ loss). If Θ = A = R and L(θ, a) = |θ − a|, then

    d_Λ(x) = any median of Λ(θ|x).

Corollary 3 (Testing with 0–1 loss). If A = {0, 1}, Θ = Θ₀ + Θ₁ (in the sense of disjoint union of sets), and L(θ, a_i) = l_i 1_{Θᵢᶜ}(θ), i = 0, 1, then any rule of the form

    d_Λ(x) = 1      if P(θ ∈ Θ₁|X = x) > (l₁/l₀) P(θ ∈ Θ₀|X = x),
           = γ(x)   if P(θ ∈ Θ₁|X = x) = (l₁/l₀) P(θ ∈ Θ₀|X = x),
           = 0      if P(θ ∈ Θ₁|X = x) < (l₁/l₀) P(θ ∈ Θ₀|X = x)

is Bayes with respect to Λ. Note that this reduces to a test of the Neyman–Pearson form when Θ_i = {θ_i}, i = 0, 1.
Proof. Let φ(x) = d(1|x). Then

    E{L(θ, φ(x)) | X = x} = ∫_Θ L(θ, φ(x)) dΛ(θ|x)
        = ∫_Θ {l₁ φ(x) 1_{Θ₀}(θ) + l₀(1 − φ(x)) 1_{Θ₁}(θ)} dΛ(θ|x)
        = l₁ φ(x) P(θ ∈ Θ₀|X = x) + l₀(1 − φ(x)) P(θ ∈ Θ₁|X = x)
        = l₀ P(θ ∈ Θ₁|X = x) + φ(x){l₁ P(θ ∈ Θ₀|X = x) − l₀ P(θ ∈ Θ₁|X = x)},

which is minimized by any rule of the form d_Λ.  □
Corollary 4 (Testing with linear loss). If Θ = R, A = {0, 1}, Θ₀ = (−∞, θ₀], Θ₁ = (θ₀, ∞), and

    L(θ, 0) = (θ − θ₀) 1_{Θ₁}(θ);   L(θ, 1) = (θ₀ − θ) 1_{Θ₀}(θ);

then

    d_Λ(x) = 1      if E(θ|X = x) > θ₀,
           = γ(x)   if E(θ|X = x) = θ₀,
           = 0      if E(θ|X = x) < θ₀

is Bayes with respect to Λ.

Proof. Again, by Theorem 5.1 it suffices to minimize

    E{L(θ, φ(x)) | X = x}
        = ∫_Θ {φ(x)(θ₀ − θ) 1_{(−∞,θ₀]}(θ) + (1 − φ(x))(θ − θ₀) 1_{(θ₀,∞)}(θ)} dΛ(θ|x)
        = ∫_Θ (θ₀ − θ) 1_{(−∞,θ₀]}(θ) dΛ(θ|x) + (1 − φ(x)){E(θ|X = x) − θ₀},

which is minimized for each fixed x by any rule of the form d_Λ.  □
Example 5.1 Suppose X ∼ Binomial(n, θ) with θ ∼ Beta(α, β). Thus (θ|X = x) ∼ Beta(α + x, β + n − x), so that

    E(θ) = α/(α + β),

and hence for L(θ, a) = (θ − a)² the Bayes rule is

    d_Λ(X) = E{θ|X} = (α + X)/(α + β + n)
           = [(α + β)/(α + β + n)] · α/(α + β) + [n/(α + β + n)] · X/n.

If the loss is L(θ, a) = (θ − a)²/{θ(1 − θ)} instead of squared-error loss, then, with B(α, β) ≡ Γ(α)Γ(β)/Γ(α + β),

    E{θK(θ)|X} = B(α + X, β + n − X − 1) / B(α + X, β + n − X),
    E{K(θ)|X}  = B(α + X − 1, β + n − X − 1) / B(α + X, β + n − X),

and hence the Bayes rule with respect to Λ for this loss function is

    d_Λ(X) = B(α + X, β + n − X − 1) / B(α + X − 1, β + n − X − 1)
           = γ_n(α, β) · (α − 1)/(α + β − 2) + (1 − γ_n(α, β)) · X/n,

where γ_n(α, β) = (α + β − 2)/(α + β + n − 2). Note that when α = β = 1, the Bayes estimator for this loss function becomes the familiar maximum likelihood estimator p̂ = X/n.
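A quick numeric check of the two Bayes rules just derived (helper names are mine, not the notes'): for squared-error loss the rule is the posterior mean (α + x)/(α + β + n), and for the weighted loss (θ − a)²/{θ(1 − θ)} the rule simplifies to (α + x − 1)/(α + β + n − 2), which equals the MLE x/n when α = β = 1.

```python
# Beta-binomial Bayes rules from Example 5.1, in exact arithmetic.
from fractions import Fraction as F

def bayes_sq(x, n, a, b):
    """Bayes rule under squared-error loss: the posterior mean."""
    return F(a + x, a + b + n)

def bayes_weighted(x, n, a, b):
    """Bayes rule under loss (theta-a)^2 / (theta(1-theta))."""
    return F(a + x - 1, a + b + n - 2)

n, x = 10, 4
print(bayes_sq(x, n, 1, 1))        # 5/12: posterior mean under the uniform prior
print(bayes_weighted(x, n, 1, 1))  # 2/5 = x/n: the MLE, as the example notes
```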
Example 5.2 Suppose that (X|θ) ∼ Multinomial_k(n, θ), and θ ∼ Dirichlet(α). Then

    E(θ) = α/Σα_i,

and for squared-error loss the Bayes rule is

    d_Λ(X) = E(θ|X) = (α + X)/(Σα_i + n).
Example 5.3 (Normal with normal prior). If (X|θ = θ) ∼ N(θ, σ²), θ ∼ N(µ, τ²), then

    (θ|X) ∼ N( (1/τ²)/(1/τ² + 1/σ²) · µ + (1/σ²)/(1/τ² + 1/σ²) · X, 1/(1/τ² + 1/σ²) ).

Consequently

    E(θ) = µ,

and, for squared-error loss, the Bayes rule is

    d_Λ(X) = E(θ|X) = (1/τ²)/(1/τ² + 1/σ²) · µ + (1/σ²)/(1/τ² + 1/σ²) · X.

This remains true if L(θ, a) = ρ(θ − a) where ρ is convex and even (see e.g. Lehmann, TPE, page 244).
The following example of the general set-up is sometimes referred to as a "classification problem". Suppose that Θ = {θ₁, . . . , θ_k} and A = {a₁, . . . , a_k} = Θ. Suppose that X ∼ P_i ≡ P_{θ_i}, i = 1, . . . , k, when θ_i is the true state of nature, where P_i has density p_i with respect to a σ-finite dominating measure µ. A simple loss function is given by

    L(θ_i, a_j) = 1{i ≠ j},   i = 1, . . . , k,  j = 1, . . . , k.

Given a ("multiple") decision rule d = (d(1|X), . . . , d(k|X)), the risk function is

    R(θ_i, d) = Σ_{j=1}^k L(θ_i, a_j) E_{θ_i}{d(j|X)} = Σ_{j≠i} E_{θ_i}{d(j|X)}
              = 1 − E_{θ_i}{d(i|X)}.

Suppose that λ = (λ₁, . . . , λ_k) is a prior distribution on Θ = {θ₁, . . . , θ_k}. Then the Bayes risk is

    R(λ, d) = 1 − Σ_{i=1}^k λ_i E_{θ_i}{d(i|X)}
            = probability of misclassification using d, when the distribution from which X is drawn is chosen according to λ.
Here is a theorem characterizing the class of Bayes rules in this setting.

Theorem 5.2 Any rule d for which

    d(i|X) = 0   whenever λ_i p_i(X) < max_j λ_j p_j(X),

for i = 1, . . . , k, is Bayes with respect to λ. Equivalently, since Σ_{i=1}^k d(i|X) = 1 for all X,

    d(i|X) = 1   if λ_i p_i(X) > λ_j p_j(X) for all j ≠ i.
Proof. Let d′ be any other rule. We want to show that

    R(λ, d′) − R(λ, d) ≥ 0

where d is any rule of the form given. But

    R(λ, d′) − R(λ, d) = Σ_{i=1}^k λ_i ∫ d(i|x) p_i(x) dµ(x) − Σ_{j=1}^k λ_j ∫ d′(j|x) p_j(x) dµ(x)
        = Σ_{i=1}^k Σ_{j=1}^k ∫ d(i|x) d′(j|x) {λ_i p_i(x) − λ_j p_j(x)} dµ(x)
        ≥ 0,

since whenever λ_i p_i(x) < λ_j p_j(x) it follows that d(i|x) = 0.  □
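The rule of Theorem 5.2 is just "pick the state with the largest prior-weighted density." A minimal sketch, with illustrative normal densities (the densities, priors, and function names are my choices, not from the notes):

```python
# Bayes classification rule of Theorem 5.2: d(i|x) = 1 for the i maximizing
# lambda_i * p_i(x). Two unit-variance normal classes for illustration.
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_classify(x, priors, densities):
    """Return the index i maximizing priors[i] * densities[i](x)."""
    scores = [lam * p(x) for lam, p in zip(priors, densities)]
    return max(range(len(scores)), key=scores.__getitem__)

densities = [lambda x: normal_pdf(x, 0.0, 1.0), lambda x: normal_pdf(x, 3.0, 1.0)]
priors = [0.5, 0.5]
print(bayes_classify(0.2, priors, densities))  # 0: closer to the mean of p_0
print(bayes_classify(2.9, priors, densities))  # 1: closer to the mean of p_1
```

With equal priors and equal variances this reduces to nearest-mean classification, which is why the two test points fall the way they do.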
Finally, here is a cautionary example.
Example 5.4 (Ritov–Wasserman). Suppose that (X₁, R₁, Y₁), . . . , (X_n, R_n, Y_n) are i.i.d. with a distribution described as follows. Let θ = (θ₁, . . . , θ_B) ∈ [0, 1]^B ≡ Θ, where B is large, e.g. 10¹⁰. Let δ = (δ₁, . . . , δ_B) be a vector of known numbers with 0 < ξ ≤ δ_j ≤ 1 − ξ < 1 for j = 1, . . . , B. Furthermore, suppose that:

(i) X_i ∼ Uniform{1, . . . , B}.
(ii) R_i ∼ Bernoulli(δ_{X_i}).
(iii) If R_i = 1, Y_i ∼ Bernoulli(θ_{X_i}); if R_i = 0, Y_i is missing (i.e. not observed).

Our goal is to estimate

    ψ = ψ(θ) = P_θ(Y₁ = 1) = Σ_{j=1}^B P(Y₁ = 1 | X₁ = j) P(X₁ = j) = (1/B) Σ_{j=1}^B θ_j.
Now the likelihood contribution of (X_i, R_i, Y_i) is

    f(X_i) f(R_i|X_i) f(Y_i|X_i, R_i) = (1/B) δ_{X_i}^{R_i} (1 − δ_{X_i})^{1−R_i} θ_{X_i}^{Y_i R_i} (1 − θ_{X_i})^{(1−Y_i)R_i},

and hence the likelihood for θ is

    L_n(θ) = Π_{i=1}^n (1/B) δ_{X_i}^{R_i} (1 − δ_{X_i})^{1−R_i} θ_{X_i}^{Y_i R_i} (1 − θ_{X_i})^{(1−Y_i)R_i}
           ∝ Π_{i=1}^n θ_{X_i}^{Y_i R_i} (1 − θ_{X_i})^{(1−Y_i)R_i}.
Thus

    l_n(θ) = Σ_{i=1}^n {Y_i R_i log θ_{X_i} + (1 − Y_i) R_i log(1 − θ_{X_i})}
           = Σ_{j=1}^B n_j log θ_j + Σ_{j=1}^B m_j log(1 − θ_j)

where

    n_j = #{i : Y_i = 1, R_i = 1, X_i = j},   m_j = #{i : Y_i = 0, R_i = 1, X_i = j}.

Note that n_j = m_j = 0 for most j since B ≫ n. Thus the MLE of most θ_j is not defined. Furthermore, for most θ_j the posterior distribution is the prior distribution (especially if the prior is a product distribution on Θ = [0, 1]^B). Thus both MLE and Bayes estimation fail.
Here is a purely frequentist solution: the Horvitz–Thompson estimator of ψ is

    ψ̂_n = (1/n) Σ_{i=1}^n (R_i/δ_{X_i}) Y_i.
Note that

    E(ψ̂_n) = E{ (R₁/δ_{X₁}) Y₁ }
            = E E{ (R₁/δ_{X₁}) Y₁ | X₁, R₁ } = E{ (R₁/δ_{X₁}) E(Y₁|X₁, R₁) }
            = E{ (R₁/δ_{X₁}) θ_{X₁} } = E E{ (R₁/δ_{X₁}) θ_{X₁} | X₁ } = E{ (θ_{X₁}/δ_{X₁}) E(R₁|X₁) }
            = E{ (θ_{X₁}/δ_{X₁}) δ_{X₁} } = E{θ_{X₁}}
            = B⁻¹ Σ_{j=1}^B θ_j = ψ(θ).
Thus ψ̂_n is an unbiased estimator of ψ(θ). Moreover,

    Var(ψ̂_n) = (1/n) { (1/B) Σ_{j=1}^B θ_j/δ_j − ψ(θ)² }.                          (1)

Exercise 5.1 Show that (1) holds, and hence that

    Var(ψ̂_n) ≤ 1/(nξ)

under the assumption that δ_j ≥ ξ > 0 for all 1 ≤ j ≤ B. [Hint: use the formula Var(Y) = E Var(Y|X) + Var[E(Y|X)] twice.]
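The unbiasedness of ψ̂_n is easy to see in simulation. The sketch below uses illustrative values of B, n, θ, and δ (all my choices, not from the notes), with B small only so the simulation runs quickly; the point is that averaging many replications of ψ̂_n lands close to ψ:

```python
# Simulation sketch of the Horvitz-Thompson estimator of Example 5.4.
import random

random.seed(0)
B, n_rep, n = 50, 2000, 200
theta = [random.random() for _ in range(B)]               # unknown success probs
delta = [0.2 + 0.6 * random.random() for _ in range(B)]   # known obs probs in [0.2, 0.8]
psi = sum(theta) / B                                      # the target parameter

def ht_estimate():
    total = 0.0
    for _ in range(n):
        x = random.randrange(B)                 # X ~ Uniform{1,...,B}
        r = random.random() < delta[x]          # R ~ Bernoulli(delta_X)
        if r:                                   # Y is observed only when R = 1
            y = random.random() < theta[x]      # Y ~ Bernoulli(theta_X)
            total += y / delta[x]               # the term R*Y/delta_X
    return total / n

est = sum(ht_estimate() for _ in range(n_rep)) / n_rep
print(est, psi)   # the average of the replications is close to psi
```

The variance bound 1/(nξ) from Exercise 5.1 (here 1/(200 · 0.2) = 0.025 per replication) explains why a couple of thousand replications already pin the average down tightly.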
6 Finding Minimax Rules
Definition 6.1 A prior Λ₀ for which R(Λ, d_Λ) is maximized is called a least favorable prior:

    R(Λ₀, d_{Λ₀}) = sup_Λ R(Λ, d_Λ).

Theorem 6.1 Suppose that Λ is a prior distribution on Θ such that

    R(Λ, d_Λ) = ∫_Θ R(θ, d_Λ) dΛ(θ) = sup_θ R(θ, d_Λ).                            (1)

Then:
(i) d_Λ is minimax.
(ii) If d_Λ is unique Bayes with respect to Λ, d_Λ is unique minimax.
(iii) Λ is least favorable.
Proof. (i) Let d be another rule. Then

    sup_θ R(θ, d) ≥ ∫_Θ R(θ, d) dΛ(θ)
                  ≥ ∫_Θ R(θ, d_Λ) dΛ(θ)   since d_Λ is Bayes w.r.t. Λ            (a)
                  = sup_θ R(θ, d_Λ)       by (1).

Hence d_Λ is minimax.

(ii) If d_Λ is unique Bayes, then > holds in (a), so d_Λ is unique minimax.

(iii) Let Λ′ be some other prior distribution. Then

    r_{Λ′} ≡ ∫_Θ R(θ, d_{Λ′}) dΛ′(θ) ≤ ∫_Θ R(θ, d_Λ) dΛ′(θ)   since d_{Λ′} is Bayes w.r.t. Λ′
           ≤ sup_θ R(θ, d_Λ)
           = R(Λ, d_Λ) ≡ r_Λ   by (1).  □

Corollary 1 If d_Λ is Bayes with respect to Λ and has constant risk, R(θ, d_Λ) = constant, then d_Λ is minimax.

Proof. If d_Λ has constant risk, then (1) holds.  □
Corollary 2 Let

    Θ_Λ ≡ {θ ∈ Θ : R(θ, d_Λ) = sup_{θ′} R(θ′, d_Λ)}

be the set of θ's where the risk of d_Λ assumes its maximum. Then d_Λ is minimax if Λ(Θ_Λ) = 1; equivalently, d_Λ is minimax if there is a set Θ_Λ with Λ(Θ_Λ) = 1 and R(θ, d_Λ) = sup_{θ′} R(θ′, d_Λ) for all θ ∈ Θ_Λ.
Example 6.1 Suppose X ∼ Binomial(n, θ) with θ ∼ Beta(α, β). Thus (θ|X = x) ∼ Beta(α + x, β + n − x), and as we computed in Section 5,

    d_Λ(X) = (α + X)/(α + β + n) = [(α + β)/(α + β + n)] · α/(α + β) + [n/(α + β + n)] · X/n.

Consequently,

    R(θ, d_Λ) = [n/(α + β + n)]² · θ(1 − θ)/n + [(α + nθ)/(α + β + n) − θ]²
              = [α² + (n − 2α(α + β))θ + ((α + β)² − n)θ²] / (α + β + n)²
              = α²/(α + β + n)²

if 2α(α + β) = n and (α + β)² = n. Solving these two equations yields α = β = √n/2. Thus for these choices of α and β the risk of the Bayes rule is constant in θ, and hence

    d_M(X) = [1/(1 + √n)] · 1/2 + [√n/(1 + √n)] · X/n

is Bayes with respect to Λ = Beta(√n/2, √n/2) and has risk

    R(θ, d_M) = 1/(4(1 + √n)²) ≤ θ(1 − θ)/n = R(θ, X/n)

if

    |θ − 1/2| ≤ (1/2) √(1 + 2√n)/(1 + √n) ≈ 1/(√2 n^{1/4}).

Hence Beta(√n/2, √n/2) is least favorable and d_M is minimax. Note that d_M is a consistent estimator of θ:

    d_M(X) →_p θ

as n → ∞, but its bias is of order n^{−1/2} (the bias equals (1/2 − θ)/(1 + √n)). Furthermore, d_M is a (locally) regular estimator of θ, but the limit distribution is not centered at zero if θ ≠ 1/2: under P_{θ_n} with θ_n = θ + t n^{−1/2},

    √n(d_M(X) − θ_n) →_d N(1/2 − θ, θ(1 − θ)).
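The constant-risk property of d_M can be checked directly by computing its mean squared error as variance plus squared bias (the function name is mine; the formulas are those of the example):

```python
# Risk of the minimax binomial rule d_M(X) = (X + sqrt(n)/2)/(n + sqrt(n)),
# i.e. the Bayes rule for the Beta(sqrt(n)/2, sqrt(n)/2) prior.
import math

def risk_dM(theta, n):
    """MSE of d_M at theta: variance + squared bias, in closed form."""
    a = b = math.sqrt(n) / 2
    denom = (a + b + n) ** 2
    var = n * theta * (1 - theta) / denom
    bias = (a + n * theta) / (a + b + n) - theta
    return var + bias ** 2

n = 100
r = [risk_dM(t, n) for t in (0.0, 0.25, 0.5, 0.9)]
print(r)                                       # all equal 1/(4(1+sqrt(n))^2) = 1/484
print(abs(r[0] - 1 / (4 * (1 + math.sqrt(n)) ** 2)) < 1e-12)   # True
print(risk_dM(0.5, n) < 0.5 * 0.5 / n)         # True: beats X/n near theta = 1/2
```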
If a least favorable prior does not exist, then we can still consider improper priors or limits of proper priors. Let {Λ_k} be a sequence of prior distributions, let d_k denote the Bayes estimator corresponding to Λ_k, and set

    r_k ≡ ∫_Θ R(θ, d_k) dΛ_k(θ).

Suppose that

    r_k → r < ∞   as k → ∞.                                                       (2)
Definition 6.2 The sequence of prior distributions {Λ_k} with Bayes risks {r_k} is said to be least favorable if r_Λ ≡ R(Λ, d_Λ) ≤ r for any prior distribution Λ.

Theorem 6.2 Suppose that {Λ_k} is a sequence of prior distributions with Bayes risks satisfying

    r_k → r,                                                                       (3)

and d is an estimator for which

    sup_θ R(θ, d) = r.                                                             (4)

Then:
A. d is minimax, and
B. {Λ_k} is least favorable.

Proof. A. Suppose d′ is any other estimator. Then

    sup_θ R(θ, d′) ≥ ∫ R(θ, d′) dΛ_k(θ) ≥ r_k

for all k ≥ 1. Hence

    sup_θ R(θ, d′) ≥ r = sup_θ R(θ, d)   by (4),

so d is minimax.

B. If Λ is any prior distribution, then

    r_Λ = ∫ R(θ, d_Λ) dΛ(θ) ≤ ∫ R(θ, d) dΛ(θ) ≤ sup_θ R(θ, d) = r

by (4).  □
Example 6.2 Let X₁, . . . , X_n be i.i.d. N(θ, σ²) given θ = θ, and suppose that θ ∼ N(µ, τ²) and that σ² is known. Then it follows from Example 5.3 that

    d_Λ(X) = E{θ|X} = (1/τ²)/(1/τ² + n/σ²) · µ + (n/σ²)/(1/τ² + n/σ²) · X̄_n

and

    R(Λ, d_Λ) ≡ r_Λ = E{θ − d_Λ(X)}²
             = E E{[θ − d_Λ(X)]² | X} = E Var{θ|X}
             = 1/(1/τ² + n/σ²) → σ²/n   as τ² → ∞,

while R(θ, X̄) = σ²/n for all θ ∈ Θ = R. Hence by Theorem 6.2, X̄ is a minimax estimator of θ.
The remainder of this section is aimed at extending minimaxity of estimators from smaller models to larger ones.
Lemma 6.1 Suppose that X ∼ P ∈ M ≡ {all P's on X}, and that ν : P ⊂ M → R is a functional (e.g. ν(P) = E_P(X) = ∫ x dP(x), ν(P) = Var_P(X), or ν(P) = Pf for some fixed function f). Suppose that d is a minimax estimator of ν(P) for P ∈ P₀ ⊂ P₁. If

    sup_{P∈P₀} R(P, d) = sup_{P∈P₁} R(P, d),

then d is also minimax for estimating ν(P), P ∈ P₁.

Proof. Suppose that d is not minimax for P₁. Then there exists a rule d′ with smaller maximum risk:

    sup_{P∈P₀} R(P, d′) ≤ sup_{P∈P₁} R(P, d′) < sup_{P∈P₁} R(P, d) = sup_{P∈P₀} R(P, d),

which contradicts the hypothesis that d is minimax for P₀.  □
Example 6.3 This continues the normal mean Example 6.2. Suppose that σ² is unknown. Thus, with Θ = {(θ, σ²) : θ ∈ R, 0 < σ² < ∞},

    sup_{(θ,σ²)∈Θ} R((θ, σ²), X̄_n) = sup_{(θ,σ²)∈Θ} σ²/n = ∞.

So, to get something reasonable, we need to restrict σ². Let

    P₀ = {N(θ, M) : θ ∈ R},
    P₁ = {N(θ, σ²) : θ ∈ R, 0 < σ² ≤ M}.

Then X̄_n is minimax for P₀ by our earlier calculation, P₀ ⊂ P₁, and

    sup_{P₁} R(P, X̄_n) = sup_{P₀} R(P, X̄_n) = M/n.

Thus X̄_n is a minimax estimator of θ for P₁.
Example 6.4 Let X₁, . . . , X_n be i.i.d. P ∈ P_µ where

    P_µ = {all probability measures P : E_P|X| < ∞},

and consider estimation of ν(P) = E_P(X) with squared-error loss for the families

    P_{bσ²} ≡ {P ∈ P_µ : Var_P(X) ≤ M < ∞},
    P_{br} ≡ {P ∈ P_µ : P(a ≤ X ≤ b) = 1}   for some fixed a, b ∈ R.

Then:
A. X̄ is minimax for P_{bσ²} by Example 6.3, since it is minimax for P₀ ≡ {N(θ, σ²) : θ ∈ R, 0 < σ² ≤ M} ⊂ P_{bσ²} ≡ P₁ and

    sup_{P∈P₀} R(P, X̄) = sup_{P∈P₁} R(P, X̄).

B. Without loss of generality suppose a = 0 and b = 1. Let

    P₁ = {P : P([0, 1]) = 1},
    P₀ = {P ∈ P₁ : P(X = 1) = p, P(X = 0) = 1 − p for some 0 < p < 1}.
For P₀ we know that the minimax estimator is

    d_M(X) = [1/(1 + √n)] · 1/2 + [√n/(1 + √n)] · X̄.

Now, with E_P X ≡ θ,

    R(P, d_M) = [1/(1 + √n)²] { Var_P(X) + (1/2 − θ)² }
              ≤ [1/(1 + √n)²] { θ − θ² + 1/4 − θ + θ² }

since 0 ≤ X ≤ 1 implies E_P X² ≤ E_P X ≡ θ,

              = (1/4)/(1 + √n)².

Thus

    sup_{P∈P₁} R(P, d_M) = sup_{P∈P₀} R(P, d_M),

and by Lemma 6.1, d_M is minimax.
Remark 6.1 The restriction σ² ≤ M in Example 6.3 is crucial; note that ν(P) = E_P(X) is discontinuous at every P in the sense that we can easily have P_m →_d P while ν(P_m) fails to converge to ν(P).

Remark 6.2 If Θ = [−M, M] ⊂ R, then X̄ is no longer minimax in the normal case; see Bickel (1981). The approximately least favorable densities for M → ∞ are

    λ_M(θ) = M⁻¹ cos²((π/2)(θ/M)) 1_{[−M,M]}(θ).
7 Admissibility and Inadmissibility
Our goal in this section is to establish some basic and simple results about admissibility and inadmissibility of some classical estimators. In particular, we will show that X̄ is admissible for the Gaussian location model for k = 1, but inadmissible for the Gaussian location model for k ≥ 3. The fundamental work in this area is that of Stein (1956) and James and Stein (1961). For an interesting expository paper, see Efron and Morris (1977); for more work on admissibility issues see e.g. Eaton (1992, 1997) and Brown (1971).
Theorem 7.1 Any unique Bayes estimator is admissible.

Proof. Suppose that d_Λ is unique Bayes with respect to Λ and is inadmissible. Then there exists an estimator d such that R(θ, d) ≤ R(θ, d_Λ), with strict inequality for some θ, and hence

    ∫_Θ R(θ, d) dΛ(θ) ≤ ∫_Θ R(θ, d_Λ) dΛ(θ),

which contradicts uniqueness of d_Λ. Thus d_Λ is admissible.  □
Lemma 7.1 If the loss function L(θ, a) is squared error (or is convex in a) and, with Q defined by Q(A) = ∫ P_θ(X ∈ A) dΛ(θ), a.e. Q implies a.e. P, then a Bayes rule with finite Bayes risk is unique a.e. P. [a.e. P means P(N) = 0 for all P ∈ P.]

Proof. See Lehmann and Casella, TPE, Corollary 4.1.4, page 229.  □
Example 7.1 Consider the Bayes estimator of a normal mean, d_Λ = p_n µ + (1 − p_n)X̄, p_n = (1/τ²)/(1/τ² + n/σ²). The Bayes risk is finite and a.e. Q implies a.e. P. Hence d_Λ is unique Bayes and admissible.
Theorem 7.2 If X is a random variable with mean θ and variance σ², then aX + b is inadmissible as an estimator of θ under squared-error loss if

(i) a > 1, or
(ii) a < 0, or
(iii) a = 1, b ≠ 0.
Proof. For any a, b the risk of the rule aX + b is

    R(θ, aX + b) = a²σ² + {(a − 1)θ + b}² ≡ ρ(a, b).

(i) If a > 1,

    ρ(a, b) ≥ a²σ² > σ² = ρ(1, 0),

so aX + b is dominated by X.

(ii) If a < 0, then (a − 1)² > 1 and

    ρ(a, b) ≥ {(a − 1)θ + b}² = (a − 1)²{θ + b/(a − 1)}²
            > {θ + b/(a − 1)}² = ρ(0, −b/(a − 1)).
(iii) If a = 1, b ≠ 0,

    ρ(1, b) = σ² + b² > σ² = ρ(1, 0),

so X + b is dominated by X.  □
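Cases (i) and (iii) of Theorem 7.2 are easy to illustrate numerically: ρ(a, b) exceeds ρ(1, 0) = σ² at every θ. (The helper name and the particular (a, b, θ) values below are my own choices.)

```python
# rho(a, b) = a^2 sigma^2 + ((a - 1) theta + b)^2, the risk of aX + b.
def rho(a, b, theta, sigma2=1.0):
    return a * a * sigma2 + ((a - 1) * theta + b) ** 2

thetas = [-3.0, -1.0, 0.0, 2.0, 5.0]

# (i) a > 1: risk is at least a^2 sigma^2 > sigma^2 = rho(1, 0) at every theta.
print(all(rho(1.5, 0.7, t) > rho(1, 0, t) for t in thetas))  # True

# (iii) a = 1, b != 0: risk is sigma^2 + b^2 > sigma^2 at every theta.
print(all(rho(1, 0.3, t) > rho(1, 0, t) for t in thetas))    # True
```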
Example 7.2 Thus aX + b is inadmissible for a < 0 or a > 1. When a = 0, d = b is admissible since it is the only estimator with zero risk at θ = b. When a = 1, b ≠ 0, d is inadmissible. What is left is X itself: is X admissible as the estimator of a normal mean in R? The following theorem answers this affirmatively.
Theorem 7.3 If X₁, . . . , X_n are i.i.d. N(θ, σ²), θ ∈ Θ = R, with σ² known, then X̄ is an admissible estimator of θ.

Proof (limiting Bayes method). Suppose that X̄ is inadmissible (and take σ = 1). Then there is an estimator d′ such that R(θ, d′) ≤ 1/n = R(θ, X̄) for all θ, with risk < 1/n for some θ. Now

    R(θ, d) = E_θ(θ − d(X))²

is continuous in θ for every d, and hence there exist ε > 0 and θ₀ < θ₁ so that

    R(θ, d′) < 1/n − ε   for all θ₀ < θ < θ₁.

Let

    r′_τ ≡ ∫_Θ R(θ, d′) dΛ_τ(θ)

where Λ_τ = N(0, τ²). With d_τ the Bayes rule for Λ_τ,

    r_τ ≡ ∫_Θ R(θ, d_τ) dΛ_τ(θ) = 1/(1/τ² + n) = τ²/(1 + nτ²),

so r_τ ≤ r′_τ. Thus

    (1/n − r′_τ)/(1/n − r_τ)
        = [ (1/(√(2π) τ)) ∫_{−∞}^{∞} {1/n − R(θ, d′)} e^{−θ²/(2τ²)} dθ ] / [ 1/n − τ²/(1 + nτ²) ]
        ≥ [ (1/(√(2π) τ)) ε ∫_{θ₀}^{θ₁} e^{−θ²/(2τ²)} dθ ] / [ 1/(n(1 + nτ²)) ]
        = [ n(1 + nτ²)/(√(2π) τ) ] ε ∫_{θ₀}^{θ₁} e^{−θ²/(2τ²)} dθ
        → ∞ · ε · (θ₁ − θ₀) = ∞   as τ → ∞.

Hence 1/n − r′_τ > 1/n − r_τ, i.e. r′_τ < r_τ, for all τ greater than some τ₀, which contradicts the fact that d_τ is Bayes with respect to Λ_τ with Bayes risk r_τ. Hence X̄ is admissible.  □
Proof (information inequality method). The risk of any rule d is

    R(θ, d) = E_θ(d − θ)² = Var_θ[d(X)] + b²(θ)
            ≥ [1 + b′(θ)]²/n + b²(θ)

since I(θ) = 1. Suppose that d is any estimator satisfying

    R(θ, d) ≤ 1/n   for all θ.

Then

    1/n ≥ b²(θ) + [1 + b′(θ)]²/n.                                                  (a)

But (a) implies that:
(i) |b(θ)| ≤ 1/√n for all θ; i.e. b is bounded.
(ii) b′(θ) ≤ 0, since 1 + 2b′(θ) + [b′(θ)]² ≤ 1.
(iii) There exist θ_i → ∞ such that b′(θ_i) → 0. [If b′(θ) ≤ −ε for all θ > θ₀, then b(θ) is not bounded.] Similarly, there exist θ_i → −∞ such that b′(θ_i) → 0.
(iv) b²(θ) ≤ 1/n − [1 + b′(θ)]²/n = −(2b′(θ) + [b′(θ)]²)/n.

Hence b(θ) ≡ 0, and R(θ, d) ≡ 1/n.  □
Theorem 7.4 (Stein's theorem) If k ≥ 3, then X̄ is inadmissible.

Remark 7.1 The sample mean is admissible when k = 2; see Stein (1956); Ferguson (1967), page 170.
Proof. Let g : Rᵏ → Rᵏ satisfy

    E| (∂/∂x_i) g_i(X̄) | < ∞,

and consider estimators of the form

    θ̂_n = X̄ + n⁻¹ g(X̄).

Now

    E_θ|X̄ − θ|² − E_θ|θ̂_n − θ|² = E_θ|X̄ − θ|² − E_θ|X̄ − θ + n⁻¹g(X̄)|²
        = −2n⁻¹ E_θ⟨X̄ − θ, g(X̄)⟩ − n⁻² E_θ|g(X̄)|².                               (a)

To proceed further we need an identity due to Stein.
Lemma 7.2 If X ∼ N(θ, σ²) and g is a function with E|g′(X)| < ∞, then

    σ² E g′(X) = E(X − θ)g(X).                                                     (1)

If X ∼ N_k(θ, σ²I) and g : Rᵏ → Rᵏ has E|(∂/∂x_i)g_i(X)| < ∞, then

    σ² E (∂/∂x_i) g_i(X) = E(X_i − θ_i) g_i(X),   i = 1, . . . , k.                (2)

Proof. Without loss of generality, suppose that θ = 0 and σ² = 1. Integration by parts gives

    E g′(X) = (1/√(2π)) ∫_{−∞}^{∞} g′(x) e^{−x²/2} dx
            = −(1/√(2π)) ∫_{−∞}^{∞} g(x) e^{−x²/2} (−2x/2) dx
            = E X g(X).

In applying integration by parts here, we have ignored the term uv|_{−∞}^{∞} = g(x)φ(x)|_{−∞}^{∞}. To prove that this does in fact vanish, it is easiest to apply Fubini's theorem (twice!) in a slightly devious way as follows: let φ(t) = (2π)^{−1/2} e^{−t²/2} be the standard normal density. Since φ′(t) = −tφ(t), we have both

    φ(x) = −∫_x^∞ φ′(t) dt = ∫_x^∞ t φ(t) dt

and

    φ(x) = ∫_{−∞}^x φ′(t) dt = −∫_{−∞}^x t φ(t) dt.

Therefore we can write

    E g′(X) = ∫_{−∞}^∞ g′(x) φ(x) dx
            = ∫_0^∞ g′(x) ∫_x^∞ t φ(t) dt dx − ∫_{−∞}^0 g′(x) ∫_{−∞}^x t φ(t) dt dx
            = ∫_0^∞ t φ(t) { ∫_0^t g′(x) dx } dt − ∫_{−∞}^0 t φ(t) { ∫_t^0 g′(x) dx } dt
            = ( ∫_0^∞ + ∫_{−∞}^0 ) {t φ(t)} [g(t) − g(0)] dt
            = ∫_{−∞}^∞ t g(t) φ(t) dt = E X g(X).

Here the third equality is justified by the hypothesis E|g′(X)| < ∞ and Fubini's theorem.

To prove (2), write X⁽ⁱ⁾ ≡ (X₁, . . . , X_{i−1}, X_{i+1}, . . . , X_k). Then

    σ² E (∂/∂x_i) g_i(X) = σ² E E{ (∂/∂x_i) g_i(X) | X⁽ⁱ⁾ }
                         = E E{ (X_i − θ_i) g_i(X) | X⁽ⁱ⁾ }   by (1)
                         = E (X_i − θ_i) g_i(X).  □
Proof of Theorem 7.4, continued. Since X̄ ∼ N_k(θ, n⁻¹I), using (2) in (a) yields

    E_θ|X̄ − θ|² − E_θ|θ̂_n − θ|² = −2n⁻² E_θ Σ_{i=1}^k (∂g_i/∂x_i)(X̄) − n⁻² E_θ|g(X̄)|².   (a)
Now let m : Rᵏ → R be twice differentiable, and set

    g(x) = ∇{log m(x)} = (1/m(x)) ( ∂m(x)/∂x₁, . . . , ∂m(x)/∂x_k ).

Thus

    (∂g_i/∂x_i)(x) = (1/m(x)) ∂²m(x)/∂x_i² − (1/m²(x)) (∂m(x)/∂x_i)²
                   = (1/m(x)) ∂²m(x)/∂x_i² − g_i(x)²,

and

    Σ_{i=1}^k (∂g_i/∂x_i)(x) = (1/m(x)) ∇²m(x) − |g(x)|².

Hence the right side of (a) is

    = n⁻² E_θ|g(X̄)|² − 2n⁻² E_θ{ (1/m(X̄)) ∇²m(X̄) } > 0                            (b)

if m(x) ≥ 0 and ∇²m ≤ 0, g ≠ 0 (i.e. m is superharmonic). Here is one example of such a function:

A. Suppose that m(x) = |x|^{−(k−2)} = {x₁² + · · · + x_k²}^{−(k−2)/2}. Then

    g(x) = ∇ log m(x) = −((k − 2)/|x|²) x

and ∇²m(x) = 0, so m is harmonic. Thus

    θ̂_n = ( 1 − (k − 2)/(n|X̄|²) ) X̄

and

    E_θ|X̄ − θ|² − E_θ|θ̂_n − θ|² = n⁻² E_θ|g(X̄)|² = ((k − 2)/n)² E_θ|X̄|⁻²
        = ((k − 2)/√n)² E|√n(X̄ − θ) + √nθ|⁻²
        = ((k − 2)/√n)² E|Z + √nθ|⁻²,   Z ∼ N_k(0, I),
        = ((k − 2)/n)² E|n^{−1/2}Z + θ|⁻² = O(n⁻²),           θ ≠ 0,
        = ((k − 2)²/n) · 1/(k − 2) = (k − 2)/n,               θ = 0,
since |Z|² ∼ χ²_k with E(1/χ²_k) = 1/(k − 2). Hence, at θ = 0,

    E₀|θ̂ − θ|² / E₀|X̄ − θ|² = 2/k < 1.

For general θ,

    R(θ, θ̂) = k/n − ((k − 2)²/n) E( 1/|Z + √nθ|² )
            = (k/n) ( 1 − ((k − 2)/k) E( (k − 2)/χ²_k(δ) ) )

since |Z + √nθ|² ∼ χ²_k(δ) with δ = n|θ|²/2. Thus

    R(θ, θ̂)/R(θ, X̄) = 1 − ((k − 2)/k) E( (k − 2)/χ²_k(δ) )
        = 2/k   when θ = 0,
        → 1   as n → ∞ for fixed θ ≠ 0,
        → 1   as |θ| → ∞ for fixed n.  □
Remark 7.2 Note that the James–Stein estimator

    θ̂_n = ( 1 − (k − 2)/(n|X̄|²) ) X̄

derived above is not regular at θ = 0: if θ_n = t n^{−1/2}, then √n(X̄ − θ_n) =_d Z ∼ N_k(0, I) under P_{θ_n}, so that

    √n(θ̂_n − θ_n) = √n(X̄ − θ_n) − [(k − 2)/|√n(X̄ − θ_n) + t|²] (√n(X̄ − θ_n) + t)
                   =_d Z − [(k − 2)/|Z + t|²] (Z + t),

which has a distribution depending on t.
Remark 7.3 It turns out that the James-Stein estimator θ̂_n is itself inadmissible; see Lehmann and Casella, TPE, pages 356-357, and Section 5.7, pages 376-389.
Remark 7.4 Another interesting function m is

m(x) = { |x|^{−(k−2)},   |x| ≥ √(k−2),
       { (k−2)^{−(k−2)/2} exp( [(k−2) − |x|²]/2 ),   |x| < √(k−2).

For this m we have

g(x) = ∇ log m(x) = { −((k−2)/|x|²) x,   |x| ≥ √(k−2),
                    { −x,   |x| ≤ √(k−2).
Another approach to deriving the James-Stein estimator is via an empirical Bayes approach. We know that the Bayes estimator for squared error loss when X ∼ N_k(θ, σ²I) and θ ∼ N_k(0, τ²I) ≡ Λ_τ is

d_Λ(X) = ( τ²/(σ² + τ²) ) X,

with Bayes risk

R(Λ_τ, d_Λ) = k / (1/σ² + 1/τ²) = k σ² τ² / (σ² + τ²).

Regarding τ as unknown and estimating it via the marginal distribution Q of X is a (parametric) empirical Bayes approach. Since the marginal distribution Q of X is N_k(0, (σ²+τ²)I), with density

q(x; τ²) = ( 1/√(2π(σ²+τ²)) )^k exp( − ‖x‖² / (2(σ²+τ²)) ),

the MLE of σ²+τ² is ‖X‖²/k, and the resulting MLE τ̂² of τ² is given by

τ̂² = ( ‖X‖²/k − σ² )⁺.

Thus

τ̂² / (σ² + τ̂²) = ( 1 − kσ²/‖X‖² )⁺,

and this leads to the following version of the (positive part) James-Stein estimator:

d_{EB,MLE}(X) = ( 1 − kσ²/‖X‖² )⁺ X.

If, instead of the MLE, we use an unbiased estimator of τ²/(σ²+τ²), then we get the positive-part James-Stein estimator

d_{JS}(X) = ( 1 − (k−2)σ²/‖X‖² )⁺ X.
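The two shrinkage rules are short to implement, and their risks can be compared by simulation. This is a sketch, not from the text: the dimension, the value of θ, σ² = 1, and the simulation sizes are illustrative choices of mine.

```python
import numpy as np

def d_eb_mle(x, sigma2=1.0):
    # (1 - k*sigma^2/||x||^2)_+ x : empirical Bayes, tau^2 estimated by its MLE
    k = x.size
    return max(0.0, 1.0 - k * sigma2 / (x @ x)) * x

def d_js(x, sigma2=1.0):
    # (1 - (k-2)*sigma^2/||x||^2)_+ x : positive-part James-Stein
    k = x.size
    return max(0.0, 1.0 - (k - 2) * sigma2 / (x @ x)) * x

rng = np.random.default_rng(1)
k, reps = 10, 20_000
theta = np.full(k, 0.5)                    # illustrative true mean
X = theta + rng.standard_normal((reps, k))
for est in (d_eb_mle, d_js):
    risk = np.mean([np.sum((est(x) - theta) ** 2) for x in X])
    print(est.__name__, round(risk, 3), "vs. risk of X:", k)
```

At this θ both shrinkage rules beat the unshrunk estimator X (whose risk is k) by a wide margin; the two rules differ only in shrinking by kσ² versus (k−2)σ².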
8 Asymptotic theory of Bayes estimators
We begin with a basic result for Bayes estimation of a Bernoulli probability θ.
Suppose that (Y₁, Y₂, …) | Θ = θ are i.i.d. Bernoulli(θ), and that the prior distribution Λ of Θ is an arbitrary distribution on (0,1). The Bayes estimator of θ with respect to squared error loss is the posterior mean d(Y_n) = d(Y₁, …, Y_n) = E(Θ | Y₁, …, Y_n). Now this sequence is a (uniformly integrable) martingale; this type of martingale is sometimes known as a Doob martingale. Hence by the martingale convergence theorem,

d(Y_n) = E(Θ | Y₁, …, Y_n) →_{a.s.} E(Θ | Y₁, Y₂, …).

To prove consistency, it remains only to show that E(Θ | Y₁, Y₂, …) = Θ.
To see this, we compute conditionally on Θ:

E( Σ_{i=1}^n Y_i | Θ ) = nΘ,   Var( Σ_{i=1}^n Y_i | Θ ) = nΘ(1−Θ).

Hence, with θ̂_n ≡ Ȳ_n,

E(θ̂_n − Θ)² ≤ 1/(4n),

and by Chebychev's inequality θ̂_n converges in probability to Θ. Therefore θ̂_{n_k} → Θ almost surely for some subsequence {n_k}. Hence Θ = θ̃ a.s., where θ̃ ≡ lim_k θ̂_{n_k} is measurable with respect to Y₁, Y₂, …. Thus we have

E(Θ | Y₁, Y₂, …) = E(θ̃ | Y₁, Y₂, …) = θ̃ = Θ   a.s. P_Λ,

where

P_Λ(Y ∈ A, Θ ∈ B) = ∫_B P_θ(Y ∈ A) dΛ(θ).

Therefore

P_Λ( d(Y_n) → Θ ) = ∫_{[0,1]} P_θ( d(Y_n) → θ ) dΛ(θ) = 1,

and this implies that

P_θ( d(Y_n) → θ ) = 1   a.e. Λ.

Thus the Bayes estimator d(Y_n) ≡ E(Θ | Y_n) is consistent for Λ-a.e. θ.
A general result of this type for smooth, finite-dimensional families was established by Doob (1948), and has been generalized by Breiman, Le Cam, and Schwartz (1964), Freedman (1963), and Schwartz (1965). See van der Vaart (1998), pages 149-151, for the general result. The upshot is that for smooth, finite-dimensional families the Bayes estimator is consistent at θ if and only if θ is in the support of Λ. However, the assumption of finite-dimensionality is important: Freedman (1963) gives an example of an inconsistent Bayes rule when the sample space is X = Z⁺ and Θ is the collection of all probabilities on X. The paper by Diaconis and Freedman (1986) on the consistency of Bayes estimates is an outgrowth of Freedman (1963).
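For a concrete Λ, the Doob martingale d(Y_n) = E(Θ | Y₁, …, Y_n) can be tracked in closed form. The sketch below uses a Beta prior, an illustrative choice of mine (the conjugate update E(Θ | Y₁, …, Y_m) = (a₀ + Σᵢ Yᵢ)/(a₀ + b₀ + m) is standard); the true θ, prior parameters, and horizon are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n = 0.3, 5000                 # illustrative true theta and sample size
y = rng.binomial(1, theta, size=n)
a0, b0 = 2.0, 5.0                    # Beta(2, 5) prior Lambda (illustrative)
m = np.arange(1, n + 1)
# d(Y_m) = E(Theta | Y_1, ..., Y_m) under the Beta prior
post_mean = (a0 + np.cumsum(y)) / (a0 + b0 + m)
print(post_mean[[9, 99, 999, -1]])   # martingale settling near the true theta
```

The early posterior means are pulled toward the prior mean a₀/(a₀+b₀), and the sequence then stabilizes near θ, as the consistency result predicts for any θ in the support of Λ.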
Asymptotic Efficiency of Bayes Estimators
Now we consider a more general situation, but still with a real-valued parameter θ. Let X₁, …, X_n be i.i.d. p_θ(x) with respect to µ, where θ ∈ Θ ⊂ R, and θ₀ ∈ Θ (an open interval) is true; write P for P_{θ₀}.

Assumptions:
B1: (a) - (f) of Theorem 2.6 (TPE, pages 440 - 441) hold. Thus it follows that

l_n(θ) − l_n(θ₀) = (θ − θ₀) l̇_θ(X; θ₀) − ½ (θ − θ₀)² { nI(θ₀) + R_n(θ) }   (1)

where n⁻¹R_n(θ) →_p 0; see problems 6.3.22 and 6.8.3, TPE, pages 503 and 514. We strengthen this to:
B2: for any ε > 0 there exists a δ > 0 such that

P( sup_{|θ−θ₀|≤δ} |n⁻¹R_n(θ)| ≥ ε ) → 0   as n → ∞.

B3: For any δ > 0 there exists an ε > 0 such that

P( sup_{|θ−θ₀|≥δ} n⁻¹( l_n(θ) − l_n(θ₀) ) ≤ −ε ) → 1.

B4: The prior density π of θ is continuous and positive for all θ ∈ Θ.
B5: E_π|θ| = ∫ |θ| π(θ) dθ < ∞.
Theorem 8.1 Suppose that π*(t|X) is the posterior density of √n(θ − T_n), where

T_n ≡ θ₀ + (1/I(θ₀)) { n⁻¹ l̇_θ(X; θ₀) }.

(i) If B1 - B4 hold, then

d_{TV}( π*(·|X), N(0, 1/I(θ₀)) ) = ∫ | π*(t|X) − √I(θ₀) φ( t√I(θ₀) ) | dt →_p 0.   (2)

(ii) If B1 - B5 hold, then

∫ (1 + |t|) | π*(t|X) − √I(θ₀) φ( t√I(θ₀) ) | dt →_p 0.

Theorem 8.2 If B1 - B5 hold and θ̃_n is the Bayes estimator with respect to π for squared error loss, then

√n( θ̃_n − θ₀ ) →_d N(0, 1/I(θ₀)),

so that θ̃_n is √n-consistent and asymptotically efficient.
Example 8.1 Suppose that X₁, …, X_n are i.i.d. N(θ, σ²) and that the prior on θ is N(µ, τ²). Then the posterior distribution is given by (θ|X) ∼ N( p_n X̄ + (1−p_n)µ, σ²_n/n ), where

p_n ≡ (n/σ²) / ( n/σ² + 1/τ² ),   σ²_n = n / ( n/σ² + 1/τ² ),

so that p_n → 1, √n(1−p_n) → 0, and σ²_n → σ². Note that

T_n = θ₀ + (1/(1/σ²)) · (X̄ − θ₀)/σ² = X̄,

and the Bayes estimator with squared error loss is

E(θ|X) = p_n X̄ + (1−p_n)µ.

Thus

( √n(θ − X̄) | X ) ∼ N( √n(1−p_n)(µ − X̄), σ²_n ) →_d N(0, σ²) = N(0, 1/I(θ₀))   a.s.

in agreement with theorem 8.1, while

√n( E(θ|X) − θ₀ ) = √n( p_n X̄ + (1−p_n)µ − θ₀ )
= p_n √n( X̄ − θ₀ ) + √n(1−p_n)(µ − θ₀)
=_d p_n N(0, σ²) + √n(1−p_n)(µ − θ₀) →_d N(0, σ²)

in agreement with theorem 8.2.
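The limits p_n → 1, σ²_n → σ², and the behavior of the posterior mean in Example 8.1 are easy to check numerically. The particular values of θ₀, σ, µ, τ below are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(3)
theta0, sigma, mu, tau = 1.0, 2.0, 0.0, 1.5   # illustrative values
for n in (10, 100, 10_000):
    x = rng.normal(theta0, sigma, size=n)
    pn = (n / sigma**2) / (n / sigma**2 + 1 / tau**2)
    sigma2_n = n / (n / sigma**2 + 1 / tau**2)   # -> sigma^2 as n grows
    post_mean = pn * x.mean() + (1 - pn) * mu
    post_var = sigma2_n / n                      # exact posterior variance
    print(n, round(pn, 5), round(post_mean, 4), round(n * post_var, 4))
```

As n grows the prior weight 1 − p_n vanishes, the posterior mean tracks X̄ (hence θ₀), and n times the posterior variance approaches σ² = 1/I(θ₀), which is the Bernstein-von Mises scaling of Theorem 8.1.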
Proof. We first suppose that theorem 8.1 is proved, and show that it implies theorem 8.2. Note that

√n( θ̃_n − θ₀ ) = √n( θ̃_n − T_n ) + √n( T_n − θ₀ ).   (a)

Since

√n( T_n − θ₀ ) →_d N(0, 1/I(θ₀)),   (b)

it suffices to show that

√n( θ̃_n − T_n ) →_p 0.   (c)

To prove (c), note that by definition of θ̃_n,

θ̃_n = ∫ θ π(θ|X) dθ = ∫ ( t/√n + T_n ) π*(t|X) dt,

and hence

√n( θ̃_n − T_n ) = ∫ t π*(t|X) dt − ∫ t √I(θ₀) φ( t√I(θ₀) ) dt.

It follows that

√n| θ̃_n − T_n | ≤ ∫ |t| | π*(t|X) − √I(θ₀) φ( t√I(θ₀) ) | dt →_p 0

as n → ∞ by theorem 8.1(ii). Thus (c) holds, and the proof is complete; it remains only to prove theorem 8.1.
Proof of theorem 8.1: (i) By definition of T_n,

π*(t|X) = π(T_n + n^{−1/2}t) exp( l(T_n + n^{−1/2}t) ) / ∫ π(T_n + n^{−1/2}u) exp( l(T_n + n^{−1/2}u) ) du
≡ e^{γ(t)} π(T_n + n^{−1/2}t) / C_n,

where

γ(t) ≡ l(T_n + n^{−1/2}t) − l(θ₀) − (1/(2nI(θ₀))) { l̇_θ(X; θ₀) }²   (d)

and

C_n ≡ ∫ e^{γ(u)} π(T_n + n^{−1/2}u) du.   (e)

Now if we can show that

J_{1n} ≡ ∫ | e^{γ(t)} π(T_n + n^{−1/2}t) − exp(−t²I(θ₀)/2) π(θ₀) | dt →_p 0,   (f)

then

C_n →_p ∫ exp(−t²I(θ₀)/2) π(θ₀) dt = π(θ₀) √(2π/I(θ₀)),   (g)

and the left side of (2) is J_n/C_n, where

J_n ≡ ∫ | e^{γ(t)} π(T_n + n^{−1/2}t) − C_n √I(θ₀) φ( t√I(θ₀) ) | dt.   (h)

Furthermore J_n ≤ J_{1n} + J_{2n}, where J_{1n} is defined in (f) and

J_{2n} ≡ ∫ | C_n √I(θ₀) φ( t√I(θ₀) ) − exp(−t²I(θ₀)/2) π(θ₀) | dt   (i)
= | C_n √(I(θ₀)/(2π)) − π(θ₀) | ∫ exp(−t²I(θ₀)/2) dt →_p 0

by (g). Hence it remains only to prove that (f) holds.
(ii) The same proof works with J′_{1n} replacing J_{1n}, where J′_{1n} has an additional factor of (1 + |t|). □
Lemma 8.1 The following identity holds:

γ(t) ≡ l(T_n + n^{−1/2}t) − l(θ₀) − (1/(2nI(θ₀))) { l̇_θ(X; θ₀) }²   (3)
= − I(θ₀) t²/2 − (1/(2n)) R_n(T_n + n^{−1/2}t) ( t + (1/I(θ₀)) l̇_θ(X; θ₀)/√n )².
Proof. This follows immediately from the definition of T_n, (1), and algebra. □
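For completeness, the algebra can be spelled out; this is a sketch using only (1) and the definition of T_n, in the notation of the text:

```latex
% Let s = T_n + n^{-1/2}t - \theta_0 and
% a = \dot{l}_\theta(X;\theta_0)/(\sqrt{n}\, I(\theta_0)),
% so that s = n^{-1/2}(t + a) and \dot{l}_\theta(X;\theta_0) = \sqrt{n}\, I(\theta_0)\, a.
\begin{aligned}
s\,\dot{l}_\theta(X;\theta_0) - \tfrac12 s^2\, nI(\theta_0)
  &= I(\theta_0)\, a(t+a) - \tfrac12 I(\theta_0)(t+a)^2
   = \tfrac12 I(\theta_0) a^2 - \tfrac12 I(\theta_0) t^2 \\
  &= \frac{\{\dot{l}_\theta(X;\theta_0)\}^2}{2 n I(\theta_0)} - \frac{I(\theta_0) t^2}{2}.
\end{aligned}
```

Substituting this into (1) and subtracting {l̇_θ(X; θ₀)}²/(2nI(θ₀)) leaves exactly −I(θ₀)t²/2 plus the remainder term −(1/(2n)) R_n(T_n + n^{−1/2}t)(t + a)², which is the claimed identity.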
Proof that J_{1n} →_p 0: Recall the definition of J_{1n} given in (f) of the proof of theorem 8.1. Divide the region of integration into:
(i) |t| ≤ M;
(ii) M ≤ |t| ≤ δ√n; and
(iii) δ√n < |t| < ∞.
For the region (i), the integral is bounded by 2M times the supremum of the integrand over the set |t| ≤ M, and by application of lemma 8.1 it is seen that this supremum converges to zero in probability if

sup_{|t|≤M} | (1/n) R_n(T_n + n^{−1/2}t) ( t + (1/I(θ₀)) l̇_θ(X; θ₀)/√n )² | →_p 0   (a)

and

sup_{|t|≤M} | π(T_n + n^{−1/2}t) − π(θ₀) | →_p 0.   (b)

Now π is continuous by B4 and T_n →_p θ₀ by B1, so π is uniformly continuous in a neighborhood of θ₀, and with probability tending to one

θ₀ − δ ≤ T_n − n^{−1/2}M ≤ T_n + n^{−1/2}t ≤ T_n + n^{−1/2}M ≤ θ₀ + δ   (c)

for |t| ≤ M, so (b) holds. Now n^{−1/2} l̇_θ(X; θ₀) is bounded in probability (by B1 and the central limit theorem), so it follows from (c) and B2 that (a) holds.
(ii) For the region M ≤ |t| ≤ δ√n it suffices to show that the integrand is bounded by an integrable function with probability close to one, since then the integral can be made small by choosing M sufficiently large. Since the second term of the integrand in J_{1n} is integrable, it suffices to find such an integrable bound for the first term. In particular, it can be shown that for every ε > 0 there exists a C = C(ε, δ) and an integer N = N(ε, δ) so that, for n ≥ N,

P( exp(γ(t)) π(T_n + n^{−1/2}t) ≤ C exp(−t²I(θ₀)/4) for all |t| ≤ δ√n ) ≥ 1 − ε;

this follows from the assumption B2; see TPE, pages 494-496.
(iii) The integral over the region |t| ≥ δ√n converges to zero by an argument similar to that for the region (ii), but now the assumption B3 comes into play; see TPE pages 495-496.
The proof for J′_{1n} requires only trivial changes using assumption B5. □
Remark 8.1 This proof is from Lehmann and Casella, TPE, pages 489 - 496. This type of theorem apparently dates back to Laplace (1820), and was later rederived by Bernstein (1917) and von Mises (1931). More general versions have been established by Le Cam (1958), Bickel and Yahav (1969), and Ibragimov and Has'minskii (1972). See the discussion on pages 488 and 493 of TPE. For a recent treatment along the lines of Le Cam (1958) that covers the case of Θ ⊂ R^d, see chapter 10 of van der Vaart (1998). For a version with the true measure Q generating the data not in the model P, see Hartigan (1983).

Remark 8.2 See Ibragimov and Has'minskii (1981) and Strasser (1981) for more on the consistency and efficiency of Bayes estimators.
Remark 8.3 Analogues of these theorems for nonparametric and semiparametric problems arecurrently an active research area. See Ghosal, Ghosh, and van der Vaart (2000), Ghosal and vander Vaart (2001), and Shen (2002).
Now we will continue the development for convex likelihoods started in section 4.7. The notation will be as in section 4.7, and we will make use of lemmas 4.7.1 and 4.7.2. Here, as in section 4.7, we will not assume that the distribution P governing the data is a member of the parametric model P = {P_θ : θ ∈ Θ ⊂ R^d}.
Here are the hypotheses we will impose.

C1. π(θ) ≤ C₁ exp(C₂|θ|) for all θ, for some positive constants C₁, C₂.

C2. π is continuous at θ₀.

C3. log p(x, θ₀ + t) − log p(x, θ₀) = ℓ̇(x)ᵀ t + R(x; t) is concave in t.

C4. E_P ℓ̇(X₁) ℓ̇(X₁)ᵀ ≡ K, E_P ℓ̇(X₁) = 0.

C5. E_P R(X₁; t) = −tᵀJt/2 + o(|t|²) and Var_P( R(X₁; t) ) = o(|t|²), where J is symmetric and positive definite.

C6. X₁, …, X_n are i.i.d. P.
Theorem 8.3 Suppose that C1 - C6 hold. Then the maximum likelihood estimator θ̂_n is √n-consistent for θ₀ = θ₀(P) and satisfies

√n( θ̂_n − θ₀ ) →_d N_d( 0, J⁻¹K(J⁻¹)ᵀ ).

Moreover the Bayes estimator θ̃_n = E{θ | X_n} satisfies

√n( θ̃_n − θ̂_n ) →_p 0,

and hence also

√n( θ̃_n − θ₀ ) →_d N_d( 0, J⁻¹K(J⁻¹)ᵀ ).
Proof. Let L_n(θ) ≡ ∏_{i=1}^n p(X_i, θ) denote the likelihood function. Define random convex functions A_n by

exp(−A_n(t)) = L_n( θ̂_n + t/√n ) / L_n( θ̂_n ).

By definition of the MLE, A_n achieves its minimum value of zero at t = 0. Then

θ̃_n = ∫ θ L_n(θ) π(θ) dθ / ∫ L_n(θ) π(θ) dθ
= θ̂_n + (1/√n) · [ ∫ t exp(−A_n(t)) π(θ̂_n + t/√n) exp(−C₂|θ̂_n|) dt ] / [ ∫ exp(−A_n(t)) π(θ̂_n + t/√n) exp(−C₂|θ̂_n|) dt ]

by the change of variable θ = θ̂_n + t/√n.
Claim: The random functions A_n converge in probability uniformly on compact sets to tᵀJt/2.

Proof. Let

A⁰_n(t) = −nP_n{ log p(x; θ₀ + t/√n) − log p(x; θ₀) }.

From the proof of theorem 4.7.6 (with h(x; θ) ≡ −log p(x; θ) and multiplication by −1),

A⁰_n(t) = U_nᵀ t + ½ tᵀJt + o_p(1),   (a)

and this holds uniformly in t in compact sets by Lemma 4.x.y. Note that

A_n(t) = A⁰_n( t + √n(θ̂_n − θ₀) ) − A⁰_n( √n(θ̂_n − θ₀) ).

Since √n(θ̂_n − θ₀) = O_p(1), for every ε > 0 there exists a compact set K ≡ K_ε such that P( √n(θ̂_n − θ₀) ∈ K ) ≥ 1 − ε. Hence by the uniformity in t in the convergence in (a),

A_n(t) = U_nᵀ( t + √n(θ̂_n − θ₀) ) − U_nᵀ( √n(θ̂_n − θ₀) )
+ ½ { ( t + √n(θ̂_n − θ₀) )ᵀ J ( t + √n(θ̂_n − θ₀) ) − √n(θ̂_n − θ₀)ᵀ J √n(θ̂_n − θ₀) } + o_p(1)
= U_nᵀ t + ½ tᵀJt + √n(θ̂_n − θ₀)ᵀ J t + o_p(1)
= ½ tᵀJt + o_p(1),

where the last line follows because √n(θ̂_n − θ₀) = −J⁻¹U_n + o_p(1). Since A_n is convex and converges pointwise in probability to tᵀJt/2, it converges uniformly in probability on compact subsets by lemma 4.7.1. □
Now we return to the main proof. Define λ_n ≡ inf_{|u|=1} A_n(u). It converges in probability to λ₀ ≡ inf_{|u|=1} uᵀJu/2 > 0. By the same argument used in lemma 4.7.2 it follows that

A_n(t) ≥ λ_n|t|   for |t| > 1.

Now the integrand in the numerator above converges in probability for each fixed t to

t exp(−tᵀJt/2) π(θ₀) exp(−C₂|θ₀|),

and similarly the integrand of the denominator converges for each fixed t to

exp(−tᵀJt/2) π(θ₀) exp(−C₂|θ₀|).

Furthermore the domination hypotheses of the following convergence lemma hold for both the numerator and denominator with dominating function

D(t) ≡ 2C₁ 1{|t| ≤ 1} + C₁ |t| exp(−λ₀|t|/2) 1{|t| > 1}.

Hence the ratio of integrals converges in probability to

∫ t exp(−tᵀJt/2) π(θ₀) exp(−C₂|θ₀|) dt / ∫ exp(−tᵀJt/2) π(θ₀) exp(−C₂|θ₀|) dt = 0.

□
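The sandwich variance J⁻¹K(J⁻¹)ᵀ in Theorem 8.3 can be seen in a small misspecified example. The model/data pairing below is my illustration, not from the text: we fit a Poisson(θ) working model (whose MLE is X̄, with θ₀(P) = E_P X) to shifted-geometric count data; here J = 1/θ₀ and K = Var_P(X)/θ₀², so the sandwich formula reduces to J⁻¹KJ⁻¹ = Var_P(X), which differs from the Poisson-model variance θ₀.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 400, 5000
p = 0.4                                    # geometric success prob (illustrative)
theta0 = (1 - p) / p                       # E_P X = 1.5 = theta_0(P)
varX = (1 - p) / p**2                      # Var_P X = 3.75 = J^{-1} K J^{-1} here
X = rng.geometric(p, size=(reps, n)) - 1   # counts on {0,1,2,...}, not Poisson
mle = X.mean(axis=1)                       # Poisson MLE = sample mean
z = np.sqrt(n) * (mle - theta0)
print(round(z.var(), 3), "vs sandwich variance", varX)
```

The empirical variance of √n(θ̂_n − θ₀) matches Var_P(X) = 3.75 rather than θ₀ = 1.5, as the robust (sandwich) limit predicts under misspecification.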
Lemma 8.2 Suppose that X_n(t), Y_n(t) are jointly measurable random functions X_n(t, ω), Y_n(t, ω) for (t, ω) ∈ K × Ω, where K ⊂ R^d is compact, that ν is a measure on R^d, and:
(i) Y_n(t) →_p Y(t) and X_n(t) →_p X(t) for ν-almost all t ∈ K, with |X_n(t)| ≤ Y_n(t) for all t;
(ii) ∫_K Y_n(t) dν(t) →_p ∫_K Y(t) dν(t), with |∫_K Y(t) dν(t)| < ∞ almost surely.
Then ∫_K X_n(t) dν(t) →_p ∫_K X(t) dν(t).

Proof. It suffices to show almost sure convergence for some further subsequence of any given subsequence. By convergence in probability for fixed t, and then dominated convergence,

H_n(ε) ≡ (P × ν)( {(ω, t) : |X_n(t, ω) − X(t, ω)| > ε} )
= ∫_K P( {ω : |X_n(t, ω) − X(t, ω)| > ε} ) dν(t) → 0

by the dominated convergence theorem, and similarly for {Y_n}. By replacing ε by ε_n ↓ 0 and extraction of subsequences we can find a subsequence {n′} so that the integrals ∫_K X_{n′} dν converge and, for some set N ⊂ K × Ω with (P × ν)(N) = 0, we get convergence for all (ω, t) ∈ N^c of both X_{n′} and Y_{n′}. Then ν({t : (ω, t) ∈ N}) = 0 for almost all ω. Hence by Fatou's lemma applied to Y_{n′} ± X_{n′} we deduce that

∫_K X_{n′}(t) dν(t) →_{a.s.} ∫_K X(t) dν(t).

□
Corollary 1 Suppose that:
(i) G_n(t) →_p G(t) for each fixed t ∈ R^d;
(ii) P( |G_n(t)| ≤ D(t) for all t ∈ R^d ) → 1;
(iii) ∫ D(t) dt < ∞.
Then ∫ G_n(t) dt →_p ∫ G(t) dt.
Proof. Let ε > 0; choose a compact set K = K_ε so that ∫_{K^c} D(t) dt < ε. This is possible since ∫ D(t) dt < ∞. Then, using |G(t)| ≤ D(t),

| ∫ G_n(t) dt − ∫ G(t) dt |
≤ | ∫_K G_n(t) dt − ∫_K G(t) dt | + ∫_{K^c} |G_n(t)| dt + ∫_{K^c} |G(t)| dt
≤ | ∫_K G_n(t) 1[|G_n(t)| ≤ D(t)] dt − ∫_K G(t) dt |
+ ∫_{K ∩ [|G_n(t)| > D(t)]} |G_n(t)| dt + ∫_{K^c ∩ [|G_n(t)| > D(t)]} |G_n(t)| dt
+ 2 ∫_{K^c} D(t) dt.

Now apply lemma 8.2 with Y_n(t) ≡ D(t) and

X_n(t) ≡ G_n(t) 1[|G_n(t)| ≤ D(t)].

Then X_n(t) →_p G(t) and the second hypothesis of the lemma holds easily. Hence ∫ X_n(t) dt →_p ∫ G(t) dt, so that the first term above →_p 0. The second and third terms converge in probability to zero because the set over which the integral is taken is empty with probability tending to one; and the last term is bounded by 2ε by choice of K. □