Ann. Inst. Statist. Math., Vol. 45, No. 1, 69-88 (1993)

ON THE ESTIMATION OF ENTROPY

PETER HALL¹ AND SALLY C. MORTON²

¹Centre for Mathematics and its Applications, Australian National University,
G.P.O. Box 4, Canberra A.C.T. 2601, Australia
and CSIRO Division of Mathematics and Statistics

²Statistical Research and Consulting Group, The RAND Corporation,
1700 Main Street, P.O. Box 2138, Santa Monica, CA 90407-2138, U.S.A.

    (Received May 7, 1991; revised April 14, 1992)

Abstract. Motivated by recent work of Joe (1989, Ann. Inst. Statist. Math., 41, 683-697), we introduce estimators of entropy and describe their properties. We study the effects of tail behaviour, distribution smoothness and dimensionality on convergence properties. In particular, we argue that root-n consistency of entropy estimation requires appropriate assumptions about each of these three features. Our estimators are different from Joe's, and may be computed without numerical integration, but it can be shown that the same interaction of tail behaviour, smoothness and dimensionality also determines the convergence rate of Joe's estimator. We study both histogram and kernel estimators of entropy, and in each case suggest empirical methods for choosing the smoothing parameter.

Key words and phrases: Convergence rates, density estimation, entropy, histogram estimator, kernel estimator, projection pursuit, root-n consistency.

1. Introduction

This paper was motivated by work of Joe (1989) on estimation of entropy. Our work has three main aims: elucidating the role played by Joe's key regularity condition (A); developing theory for a class of estimators whose construction does not involve numerical integration; and providing a concise account of the influence of dimensionality on convergence rate properties of entropy estimators. Our main results do not require Joe's (1989) condition (A), which asks that tail properties of the underlying distribution be ignorable. We show concisely how tail properties influence estimator behaviour, including convergence rates, for estimators based on both kernels and histograms. We point out that histogram estimators may be used to construct root-n consistent entropy estimators in p = 1 dimension, and that kernel estimators give root-n consistent entropy estimators in p = 1, 2 and 3 dimensions, but that neither type generally provides root-n consistent estimation beyond this range, unless (for example) the underlying distribution is compactly
supported, or is particularly smooth and bias-reduction techniques are employed. Joe (1989) develops theory for a root-n consistent estimator in the case p = 4, but he makes crucial use of his condition (A). Even for p = 1, root-n consistency of our estimators, or of the estimator suggested by Joe (1989), requires certain properties of the tails of the underlying distribution. Goldstein and Messer (1991) briefly mention the problem of entropy estimation, but like Joe they work under the assumption (A).

To further elucidate our results it is necessary to introduce a little notation. Let $X_1, X_2, \ldots, X_n$ denote a random sample drawn from a p-variate distribution with density f, and put $I = \int f \log f$, where the integral is assumed to converge absolutely. Then $-I$ denotes the entropy of the distribution determined by f. We consider estimation of I. Our estimators are motivated by the observation that $\hat{I} = n^{-1} \sum_{i=1}^{n} \log f(X_i)$ is unbiased for I, and is root-n consistent if $\int f (\log f)^2 < \infty$. More generally, suppose $p \ge 1$ and the tails of f decrease like $\|x\|^{-\alpha}$, say $f(x) = c_1 (c_2 + \|x\|^2)^{-\alpha/2}$ for $c_1, c_2 > 0$ and $\alpha > p$. (The latter condition is necessary to ensure that this f is integrable.) Then, defining $\hat{I}_k = n^{-1} \sum_{i=1}^{n} \log \hat{f}(X_i)$, we have

(3.2)    $\hat{I}_k = \hat{I} - \{C_1 (nh^p)^{-1+(p/\alpha)} + C_2 h^4\} + o_p\{(nh^p)^{-1+(p/\alpha)} + h^4\},$

for positive constants $C_1$ and $C_2$. The techniques of proof are very similar to those in Hall (1987), and so we shall not elaborate on the proof.
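As a small numerical illustration of the observation that $\hat{I} = n^{-1} \sum_i \log f(X_i)$ is unbiased and root-n consistent when $\int f(\log f)^2 < \infty$, the following sketch (added here for concreteness and not taken from the paper; the Student-t density with 5 degrees of freedom, the sample sizes and all function names are assumptions of the example) evaluates the estimator with the true log-density and checks that its standard deviation scales like $n^{-1/2}$.

```python
# Sketch only: the idealized estimator I_hat = n^{-1} * sum_i log f(X_i),
# evaluated with the true density f (here a Student-t with 5 d.f., chosen
# purely for illustration), to show unbiasedness and the root-n rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
f = stats.t(df=5)                       # p = 1 density with polynomial tails
I_true = -f.entropy()                   # I = int f log f is minus the entropy

def I_hat(n):
    x = f.rvs(size=n, random_state=rng)
    return np.mean(f.logpdf(x))         # unbiased for I

for n in (100, 400, 1600, 6400):
    reps = np.array([I_hat(n) for _ in range(200)])
    # bias stays near 0; sd * sqrt(n) stays roughly constant
    print(f"n = {n:5d}  bias = {reps.mean() - I_true:+.4f}  "
          f"sd*sqrt(n) = {reps.std(ddof=1) * np.sqrt(n):.3f}")
```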

Of course, $\hat{I}$ is unbiased and root-n consistent for I, with variance $n^{-1}\{\int f (\log f)^2 - I^2\}$. The second order term in (3.2) is

$J = C_1 (nh^p)^{-1+(p/\alpha)} + C_2 h^4,$

and is minimized by taking h to be of size $n^{-a}$, where

(3.3)    $a = (\alpha - p)/\{\alpha(p + 4) - p^2\}.$

Then J is of size $n^{-4a}$, which is of smaller order than $n^{-1/2}$ if and only if $1 \le p \le 3$ and $\alpha > p(8 - p)/(4 - p)$. When p = 1 this reduces to $\alpha > 7/3$, which is equivalent to existence of a moment of order higher than 4/3. (For example, a finite variance suffices.)
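For the record, here is the elementary balancing calculation behind (3.3) and the rate condition; it is added as a reading aid and uses only the displayed form of J.

```latex
% Balancing argument behind (3.3), using only the displayed form of J.
\[
  J(h) = C_1 (nh^p)^{-1+p/\alpha} + C_2 h^4 , \qquad h = n^{-a}
  \;\Longrightarrow\;
  J = C_1 n^{-(1-ap)(\alpha-p)/\alpha} + C_2 n^{-4a}.
\]
Equating the two exponents gives
\[
  (1-ap)\,\frac{\alpha-p}{\alpha} = 4a
  \;\Longleftrightarrow\;
  \alpha - p = a\{\alpha(p+4) - p^2\}
  \;\Longleftrightarrow\;
  a = \frac{\alpha-p}{\alpha(p+4)-p^2},
\]
which is (3.3). Then $J \asymp n^{-4a}$, and $n^{-4a} = o(n^{-1/2})$ if and only if
$8a > 1$, that is $8(\alpha-p) > \alpha(p+4)-p^2$, equivalently
$\alpha(4-p) > p(8-p)$, i.e.\ $\alpha > p(8-p)/(4-p)$ for $1 \le p \le 3$;
at $p = 1$ this is $\alpha > 7/3$.
```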

Recall from Subsection 2.2 that histogram estimators only allow root-n consistent estimation of entropy when p = 1. We have just seen that nonnegative kernel estimators extend this range to $1 \le p \le 3$, and so they do have advantages. The case p = 2 is of practical interest, since practitioners of exploratory projection pursuit sometimes wish to project a high-dimensional distribution into two, rather than one, dimensions. As noted above, in the case of a density whose tails decrease like $\|x\|^{-\alpha}$ we need

$\alpha > 2(8 - 2)/(4 - 2) = 6$

if we are to get $\hat{I}_k = \hat{I} + o_p(n^{-1/2})$ in p = 2 dimensions. This corresponds to the existence of a moment higher than the fourth.

Since $C_1$ and $C_2$ in formula (3.2) are both positive, a simple, practical, empirical rule for choosing the bandwidth is to select h so as to maximize $\hat{I}_k = \hat{I}_k(h)$. Now, it may be proved that (3.2) is available uniformly over h's in any set $\mathcal{H}_n$ such that $\mathcal{H}_n \subseteq (n^{-1+\delta}, n^{-\delta})$ for some $0 < \delta < 1/2$ and $\#\mathcal{H}_n = O(n^C)$ for some $C > 0$. If the maximization is taken over a rich set of such h's then $\hat{I}_k = \hat{I} + O_p(n^{-4a})$, where a is given by (3.3), and so $\hat{I}_k = \hat{I} + o_p(n^{-1/2})$ if $1 \le p \le 3$ and $\alpha > p(8 - p)/(4 - p)$.
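A minimal sketch of this empirical rule in p = 1, assuming a leave-one-out kernel estimate with a Student-t kernel and a geometric grid of bandwidths inside $(n^{-1+\delta}, n^{-\delta})$; the data-generating density, the choice $\delta = 0.1$ and the grid size are illustrative assumptions, not prescriptions from the paper.

```python
# Sketch of the bandwidth rule: evaluate the leave-one-out kernel estimate
# I_hat_k(h) on a grid of h's and keep the maximizer (p = 1 case).
import numpy as np
from scipy import stats
from scipy.special import logsumexp

rng = np.random.default_rng(1)
x = stats.t(df=5).rvs(size=400, random_state=rng)   # heavy-tailed sample (assumed)
n = x.size
log_K = stats.t(df=3).logpdf                        # heavy-tailed Student-t kernel

def I_hat_k(h):
    """Leave-one-out kernel entropy estimate: n^{-1} sum_i log fhat_i(X_i)."""
    log_k = log_K((x[:, None] - x[None, :]) / h)    # log K{(X_i - X_j)/h}
    np.fill_diagonal(log_k, -np.inf)                # drop the j = i term
    log_fhat = logsumexp(log_k, axis=1) - np.log((n - 1) * h)
    return log_fhat.mean()

delta = 0.1
grid = np.geomspace(n ** (-1 + delta), n ** (-delta), num=40)
values = [I_hat_k(h) for h in grid]
h_star = grid[int(np.argmax(values))]
print(f"selected bandwidth h = {h_star:.3f}, I_hat_k(h) = {max(values):.4f}")
```

Maximizing over h is sensible here because both $C_1$ and $C_2$ in (3.2) are positive, so over- and under-smoothing both depress $\hat{I}_k(h)$.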

In principle, the estimator $\tilde{I}_k$ may be constructed without using "leave-one-out" methods. If we define

$\tilde{f}(x) = (nh^p)^{-1} \sum_{j=1}^{n} K\{(x - X_j)/h\},$

    then an appropriate entropy estimator is given by

$\tilde{I}_k = n^{-1} \sum_{i=1}^{n} \log \tilde{f}(X_i) = n^{-1} \sum_{i=1}^{n} \log\{(1 - n^{-1}) \hat{f}(X_i) + (nh^p)^{-1} K(0)\}.$

Here, as noted above, it is essential that the kernel have appropriately heavy tails; for example, K could be a Student's t density.
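The algebra linking the two constructions can be checked directly: because $\tilde{f}(X_i)$ includes the point $X_i$ itself, it equals $(1 - n^{-1})\hat{f}(X_i) + (nh^p)^{-1}K(0)$ when $\hat{f}$ is the leave-one-out estimate normalized by $(n-1)h^p$. The snippet below (p = 1, with data, kernel and bandwidth chosen arbitrarily for illustration) verifies this numerically.

```python
# Check (p = 1) that I_tilde_k computed from the all-points estimate f_tilde
# equals n^{-1} sum_i log{(1 - 1/n) fhat(X_i) + K(0)/(n h)}, with fhat the
# leave-one-out estimate normalized by (n-1)h.  Data, kernel and h are
# arbitrary choices for this illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = stats.t(df=5).rvs(size=200, random_state=rng)
n, h = x.size, 0.4
K = stats.t(df=3).pdf                                # heavy-tailed kernel

k_mat = K((x[:, None] - x[None, :]) / h)             # K{(X_i - X_j)/h}
f_tilde = k_mat.sum(axis=1) / (n * h)                # all-points estimate at X_i
f_hat_loo = (k_mat.sum(axis=1) - K(0.0)) / ((n - 1) * h)  # leave-one-out estimate

I_tilde_k = np.log(f_tilde).mean()
I_via_loo = np.log((1 - 1 / n) * f_hat_loo + K(0.0) / (n * h)).mean()
print(I_tilde_k, I_via_loo)                          # identical up to rounding
```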

Formulae (3.1) and (3.2) continue to hold in this case, except that the constant $C_1$ is no longer positive. Compare formula (2.1), which is also for the case of an estimator that is not constructed by the "leave-one-out" method. Thus, the bandwidth selection argument described in the previous paragraph is not appropriate. A penalty term should be subtracted before attempting maximization, much as in the case described in Section 2.

3.2 Simulation study

This subsection describes a simulation study of the behaviour of our kernel estimator of negative entropy, $\hat{I}_k$. It is similar to the previous simulation study of the histogram estimator presented in Subsection 2.5, and its interpretation is subject


estimator examples, again as expected. The bias is positive for the most heavy-tailed distribution, the Student's t with three degrees of freedom, perhaps due to the fact that higher-order terms are having a large effect on the expansion (3.1).

Two bivariate examples are presented. Both are bivariate normals; in the first, components are independent, and in the second, the correlation coefficient is 0.8. In each case, the true negative entropy is known. The kernel estimator performs well in both examples, given the small sample size. However, the computational work required to calculate the distance between every pair of points makes the kernel estimator intractable for exploratory projection pursuit. The binning performed in the histogram estimator reduces the work required in the p = 1 case from $O(n^2)$ to $O(m^2)$, where m is the number of bins. Unfortunately this approach cannot be used in the p = 2 case, as discussed in Section 2.
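The $O(n^2)$-versus-$O(m^2)$ point can be illustrated with a rough sketch; the binned approximation below is introduced only for this example and is not the Section 2 histogram estimator. Once the sample is reduced to m bin counts, a pairwise kernel sum over the data is replaced by a count-weighted double sum over bin centres.

```python
# Rough illustration of the cost reduction from binning in p = 1:
# O(n^2) pairwise kernel evaluations versus O(m^2) after binning.
# The binned density values are an approximation used only for this sketch.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = np.sort(stats.norm.rvs(size=2000, random_state=rng))
n, h, m = x.size, 0.3, 100
K = stats.t(df=3).pdf                                 # heavy-tailed kernel again

# exact: O(n^2) kernel evaluations
f_exact = K((x[:, None] - x[None, :]) / h).sum(axis=1) / (n * h)

# binned: O(n) to bin the data, then O(m^2) kernel evaluations on bin centres
counts, edges = np.histogram(x, bins=m)
centres = 0.5 * (edges[:-1] + edges[1:])
f_binned = (K((centres[:, None] - centres[None, :]) / h) * counts).sum(axis=1) / (n * h)

# compare at the data points via interpolation of the binned values
max_err = np.max(np.abs(f_exact - np.interp(x, centres, f_binned)))
print(f"max absolute discrepancy: {max_err:.2e}")
```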

Acknowledgements

The authors would like to thank the referees for their helpful comments and suggestions.

REFERENCES

Friedman, J. H. (1987). Exploratory projection pursuit, J. Amer. Statist. Assoc., 76, 817-823.
Goldstein, L. and Messer, K. (1991). Optimal plug-in estimators for nonparametric functional estimation, Ann. Statist. (to appear).
Hall, P. (1987). On Kullback-Leibler loss and density estimation, Ann. Statist., 15, 1491-1519.
Hall, P. (1989). On polynomial-based projection indices for exploratory projection pursuit, Ann. Statist., 17, 589-605.
Hall, P. (1990). Akaike's information criterion and Kullback-Leibler loss for histogram density estimation, Probab. Theory Related Fields, 85, 449-466.
Hill, B. M. (1975). A simple general approach to inference about the tail of a distribution, Ann. Statist., 3, 1163-1174.
Huber, P. J. (1985). Projection pursuit (with discussion), Ann. Statist., 13, 435-525.
Joe, H. (1989). Estimation of entropy and other functionals of a multivariate density, Ann. Inst. Statist. Math., 41, 683-697.
Jones, M. C. and Sibson, R. (1987). What is projection pursuit? (with discussion), J. Roy. Statist. Soc. Ser. A, 150, 1-36.
Morton, S. C. (1989). Interpretable projection pursuit, Ph.D. Dissertation, Stanford University, California.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis, Chapman and Hall, London.
Vasicek, O. (1976). A test for normality based on sample entropy, J. Roy. Statist. Soc. Ser. B, 38, 54-59.
Zipf, G. K. (1965). Human Behaviour and the Principle of Least Effort: An Introduction to Human Ecology (Facsimile of 1949 edn.), Hafner, New York.