Statistical Papers 48, 295-304 (2007) Statistical Papers © Springer-Verlag 2007

A multivariate version of Gini's rank association coefficient

Javad Behboodian¹, Ali Dolati¹, Manuel Úbeda-Flores²

¹ Department of Statistics, College of Sciences, Shiraz University, 71454 Shiraz, Iran; (e-mail: [email protected])

² Departamento de Estadística y Matemática Aplicada, Universidad de Almería, Carretera de Sacramento s/n, La Cañada de San Urbano 04120, Almería, Spain; (e-mail: [email protected])

Received: June 17, 2004; revised version: February 22, 2005

Abstract

In this paper, we introduce a multivariate generalization of the population version of Gini's rank association coefficient, answering an open question posed in [4]. We also study some properties of this version, present the corresponding results for the sample statistic, and provide several examples.

AMS classification: Primary 62H05; Secondary 62H20.

Keywords: Copulas; Gini's coefficient; Multivariate association.

1 Introduction

The problem of association between random variables has been widely studied in the literature. Recently, the development of the theory of copulas has had a great impact on the study of non-parametric measures of association for continuous random variables. Many measures of bivariate association have been proposed in terms of copulas according to the concepts of concordance and discordance. In the multivariate setting, there are multivariate analogues of some well-known bivariate measures of association (such as Kendall's tau, Spearman's rho, Blomqvist's beta and Spearman's footrule coefficient), based on the probability of concordance alone and expressed in terms of copulas (see [4] and [5]). Our purpose is to develop a multivariate population version of Gini's rank association coefficient, answering a question posed in [4]. For a random sample $\{(X_i, Y_i)\}_{i=1}^{m}$ from a continuous bivariate distribution, with corresponding vectors of ranks $(R_1, R_2, \ldots, R_m)$ and $(S_1, S_2, \ldots, S_m)$, this index is defined by

$$g = \frac{1}{\lfloor m^2/2 \rfloor} \sum_{i=1}^{m} \left( |R_i + S_i - m - 1| - |R_i - S_i| \right), \qquad (1.1)$$

where $\lfloor t \rfloor$ denotes the integer part of $t$. It was first discussed by Corrado Gini [1], who called it the indice di cograduazione semplice.
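As a quick illustration of (1.1), the following Python sketch computes the sample coefficient from two data vectors. The function names `ranks` and `gini_gamma` are hypothetical, and the sketch assumes there are no ties, as holds almost surely for continuous data.

```python
def ranks(values):
    """Return 1-based ranks of `values`, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r


def gini_gamma(x, y):
    """Sample Gini rank association coefficient, formula (1.1)."""
    m = len(x)
    if m != len(y) or m < 2:
        raise ValueError("x and y must have the same length m >= 2")
    r, s = ranks(x), ranks(y)
    # Sum of |R_i + S_i - m - 1| - |R_i - S_i| over the sample.
    total = sum(abs(ri + si - m - 1) - abs(ri - si) for ri, si in zip(r, s))
    # m * m // 2 is the integer part of m^2 / 2.
    return total / (m * m // 2)
```

With identical orderings the two rank vectors coincide and the coefficient attains its maximum; with reversed orderings it attains its minimum.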

We now review the concept of a copula. Let $n \ge 2$ be a natural number. The term $n$-dimensional copula (briefly, $n$-copula) refers to a multivariate distribution function whose $n$ univariate margins are uniform on $\mathbb{I}$ ($= [0, 1]$). The importance of copulas in statistics is due to the following result (Sklar's theorem): the joint distribution function $H$ of a set of $n$ continuous random variables $X_1, X_2, \ldots, X_n$ with univariate margins $F_1, F_2, \ldots, F_n$ can be expressed in the form $H(\mathbf{x}) = C(F_1(x_1), F_2(x_2), \ldots, F_n(x_n))$ for all $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ in $[-\infty, \infty]^n$, in terms of an $n$-copula $C$ that is uniquely determined on $\mathrm{Ran}F_1 \times \mathrm{Ran}F_2 \times \cdots \times \mathrm{Ran}F_n$. Let $\mathbf{u} = (u_1, u_2, \ldots, u_n)$ be a point in $\mathbb{I}^n$, and let $\Pi^n(\mathbf{u}) = u_1 u_2 \cdots u_n$ denote the $n$-copula of independent random variables. For every $\mathbf{u}$ in $\mathbb{I}^n$, any $n$-copula $C$ satisfies

$$W^n(\mathbf{u}) = \max\left( \sum_{i=1}^{n} u_i - n + 1,\; 0 \right) \le C(\mathbf{u}) \le \min(u_1, u_2, \ldots, u_n) = M^n(\mathbf{u}).$$

For every $n \ge 2$, $M^n$ is an $n$-copula; however, $W^n$ is an $n$-copula if and only if $n = 2$. For a complete survey of copulas, see [3].
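The three reference functions and the bounds above can be written down directly. The following Python sketch uses the hypothetical names `W_n`, `Pi_n` and `M_n`, and spot-checks the inequality $W^n \le \Pi^n \le M^n$ at random points; it is an illustration, not a proof.

```python
import math
import random


def W_n(u):
    """Lower bound W^n(u) = max(sum(u) - n + 1, 0)."""
    return max(sum(u) - len(u) + 1, 0.0)


def Pi_n(u):
    """Independence copula: product of the coordinates."""
    return math.prod(u)


def M_n(u):
    """Upper bound M^n(u) = min(u)."""
    return min(u)


def bounds_hold(n, trials=1000, seed=0, tol=1e-12):
    """Check W^n <= Pi^n <= M^n at `trials` random points of [0, 1]^n."""
    rng = random.Random(seed)
    for _ in range(trials):
        u = [rng.random() for _ in range(n)]
        if not (W_n(u) <= Pi_n(u) + tol and Pi_n(u) <= M_n(u) + tol):
            return False
    return True
```

The small tolerance `tol` only guards against floating-point rounding.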

We will use the following notation and terminology. We denote by $A^n$ the function defined by $(M^n + W^n)/2$. We define the survival function $\bar{K}$ of a measurable function $K : \mathbb{I}^n \to \mathbb{I}$ by

$$\bar{K}(\mathbf{u}) = 1 + \sum_{k=1}^{n} (-1)^k \sum_{1 \le i_1 < i_2 < \cdots < i_k \le n} K_{i_1 i_2 \ldots i_k}(u_{i_1}, u_{i_2}, \ldots, u_{i_k}), \qquad (1.2)$$

where the functions on the right-hand side are the appropriate lower margins of $K$. If $C$ is an $n$-copula, $\hat{C}$ denotes the survival copula of $C$, which is given by $\hat{C}(\mathbf{u}) = \bar{C}(\mathbf{1} - \mathbf{u})$ for every $\mathbf{u}$ in $\mathbb{I}^n$.

Finally, let $\mathbf{U}$ and $\mathbf{V}$ be two vectors of random variables uniform on $\mathbb{I}$ with respective $n$-copulas $C_1$ and $C_2$, and let $Q'_n(C_1, C_2)$ denote the probability of concordance between $\mathbf{U}$ and $\mathbf{V}$, which in terms of copulas can be expressed as

$$Q'_n(C_1, C_2) = \int_{\mathbb{I}^n} \left( C_1(\mathbf{u}) + \bar{C}_1(\mathbf{u}) \right) dC_2(\mathbf{u}).$$

This function can be extended to the case in which one of $C_1$ or $C_2$ is a measurable function from $\mathbb{I}^n$ to $\mathbb{I}$. We will use this definition later, especially when $C_1$ is $W^n$ or $A^n$.
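The inclusion-exclusion sum in (1.2) is easy to mistype, so a small Python sketch may help. It assumes that each lower margin of $K$ is obtained by setting the unused arguments equal to 1, which is the case for copulas and for $M^n$, $W^n$ and $A^n$; the function names are hypothetical.

```python
from itertools import combinations


def survival(K, u):
    """Survival function of K at u, following (1.2).

    Assumption: the margin K_{i1...ik} is K with all other arguments set to 1.
    """
    n = len(u)
    total = 1.0
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            v = [1.0] * n
            for i in idx:
                v[i] = u[i]
            total += (-1) ** k * K(v)
    return total


def M(u):
    return min(u)


def W(u):
    return max(sum(u) - len(u) + 1, 0.0)


def A(u):
    """A^n = (M^n + W^n) / 2."""
    return 0.5 * (M(u) + W(u))
```

For example, `survival(M, u)` returns `1 - max(u)` up to rounding, and in two dimensions `A(u) + survival(A, u)` agrees with $1 - u - v + 2A^2(u, v)$.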

2 A multivariate population version of Gini's coefficient

In the bivariate case, the population version of Gini's coefficient, which we denote by $\gamma_2$ (or $\gamma_2(C)$ for any 2-copula $C$), is given by

$$\gamma_2(C) = 2 \int_{\mathbb{I}^2} \left( |u + v - 1| - |u - v| \right) dC(u, v),$$

or, equivalently,

$$\gamma_2(C) = 8 \int_{\mathbb{I}^2} A^2(u, v)\, dC(u, v) - 2$$

(for details, see [3]). Since $A^2(u, v) + \bar{A}^2(u, v) = 1 - u - v + 2A^2(u, v)$ for every $(u, v)$ in $\mathbb{I}^2$, we have that

$$Q'_2(A^2, C) = 2 \int_{\mathbb{I}^2} A^2(u, v)\, dC(u, v) \quad \text{and} \quad Q'_2(A^2, \Pi^2) = 2 \int_{\mathbb{I}^2} A^2(u, v)\, d\Pi^2(u, v) = \frac{1}{2};$$

whence $\gamma_2$ can be rewritten as a function of probabilities of concordance in the following manner:

$$\gamma_2(C) = 4 \left( Q'_2(A^2, C) - Q'_2(A^2, \Pi^2) \right).$$

Using the above motivation, we define a multivariate version of Gini's coefficient associated with an $n$-copula $C$ as

$$\gamma_n(C) = \alpha \left( Q'_n(A^n, C) - Q'_n(A^n, \Pi^n) \right), \qquad (2.1)$$

where $\alpha$ is a constant such that $\gamma_n(M^n) = 1$. Our purpose is to give a more explicit expression for (2.1), but first we need three preliminary lemmas.

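Lemma 2.1. For any $n \ge 2$, we have that

$$\int_{\mathbb{I}^n} \bar{W}^n(\mathbf{u})\, dM^n(\mathbf{u}) = 1 - \sum_{i=1}^{n} \frac{1}{2i}.$$

Proof. Using (1.2), we have that

$$\begin{aligned}
\int_{\mathbb{I}^n} \bar{W}^n(\mathbf{u})\, dM^n(\mathbf{u}) &= \int_{\mathbb{I}^n} dM^n(\mathbf{u}) - n \int_{\mathbb{I}^n} u_1\, dM^n(\mathbf{u}) + \cdots + (-1)^n \int_{\mathbb{I}^n} W^n(\mathbf{u})\, dM^n(\mathbf{u}) \\
&= 1 - \frac{n}{2} + \binom{n}{2} \int_{1/2}^{1} (2u_1 - 1)\, du_1 - \cdots + (-1)^n \int_{(n-1)/n}^{1} (n u_1 - n + 1)\, du_1 \\
&= 1 + \sum_{i=1}^{n} (-1)^i \binom{n}{i} \frac{1}{2i} = 1 - \sum_{i=1}^{n} \frac{1}{2i}.
\end{aligned}$$

For the last equality, see formula (1.45) in [2].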
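The combinatorial identity used in the last step of the proof of Lemma 2.1, namely $\sum_{i=1}^{n} (-1)^i \binom{n}{i} \frac{1}{2i} = -\sum_{i=1}^{n} \frac{1}{2i}$, can be spot-checked in exact rational arithmetic. The following Python sketch does so; the function names are hypothetical, and agreement for finitely many $n$ is of course only a sanity check.

```python
from fractions import Fraction
from math import comb


def alternating_form(n):
    """1 + sum_{i=1}^{n} (-1)^i * C(n, i) / (2 i), as an exact fraction."""
    return 1 + sum(Fraction((-1) ** i * comb(n, i), 2 * i) for i in range(1, n + 1))


def harmonic_form(n):
    """1 - sum_{i=1}^{n} 1 / (2 i), as an exact fraction."""
    return 1 - sum(Fraction(1, 2 * i) for i in range(1, n + 1))
```

For $n = 2$ both expressions equal $1/4$, which matches the direct computation $\int_0^1 \max(1 - 2t, 0)\, dt = 1/4$.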

Lemma 2.2. Let $(u_1, u_2, \ldots, u_n)$ be a point in $\mathbb{I}^n$. Then, for all $n \ge 2$,

$$\int_{1-u_1}^{1} \int_{2-u_1-u_2}^{1} \cdots \int_{n-1-\sum_{i=1}^{n-1} u_i}^{1} \left( \sum_{i=1}^{n} u_i - n + 1 \right) du_n\, du_{n-1} \cdots du_2 = \frac{u_1^n}{n!}.$$

Proof. We prove the result using induction. First, assume that $n = 2$; then $\int_{1-u_1}^{1} (u_1 + u_2 - 1)\, du_2 = u_1^2/2$. Now, suppose that the result is true for $k = 2, 3, \ldots, n - 1$, and consider the case $k = n$. Let $I$ denote the integral

$$I = \int_{2-u_1-u_2}^{1} \cdots \int_{n-1-\sum_{i=1}^{n-1} u_i}^{1} \left( \sum_{i=1}^{n} u_i - n + 1 \right) du_n\, du_{n-1} \cdots du_3,$$

and let $1 - u_1^* = 2 - u_1 - u_2$. Then, by hypothesis, we have that

$$I = \int_{1-u_1^*}^{1} \cdots \int_{n-2-u_1^*-\sum_{i=3}^{n-1} u_i}^{1} \left( u_1^* + \sum_{i=3}^{n} u_i - n + 2 \right) du_n\, du_{n-1} \cdots du_3 = \frac{(u_1^*)^{n-1}}{(n-1)!}.$$

Hence

$$\int_{1-u_1}^{1} I\, du_2 = \int_{1-u_1}^{1} \frac{(u_1 + u_2 - 1)^{n-1}}{(n-1)!}\, du_2 = \frac{u_1^n}{n!},$$

which completes the proof. □

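Lemma 2.3. For any $n \ge 2$, we have that

$$\int_{\mathbb{I}^n} \bar{W}^n(\mathbf{u})\, d\Pi^n(\mathbf{u}) = \sum_{i=0}^{n} (-1)^i \binom{n}{i} \frac{1}{(i+1)!}.$$

Proof. Using Lemma 2.2, we have that

$$\begin{aligned}
\int_{\mathbb{I}^n} \bar{W}^n(\mathbf{u})\, d\Pi^n(\mathbf{u}) &= \int_{\mathbb{I}^n} d\Pi^n(\mathbf{u}) - n \int_{\mathbb{I}^n} u_1\, d\Pi^n(\mathbf{u}) + \binom{n}{2} \int_{\mathbb{I}^n} \max(u_1 + u_2 - 1, 0)\, d\Pi^n(\mathbf{u}) - \cdots + (-1)^n \int_{\mathbb{I}^n} W^n(\mathbf{u})\, d\Pi^n(\mathbf{u}) \\
&= 1 - \binom{n}{1} \frac{1}{2!} + \binom{n}{2} \frac{1}{3!} - \cdots + (-1)^n \binom{n}{n} \frac{1}{(n+1)!},
\end{aligned}$$

whence the result follows. □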

Using Lemmas 2.1 and 2.3, we obtain the following expressions:

$$a_n = \int_{\mathbb{I}^n} \left( A^n(\mathbf{u}) + \bar{A}^n(\mathbf{u}) \right) d\Pi^n(\mathbf{u}) = \frac{1}{n+1} + \frac{1}{2(n+1)!} + \sum_{i=0}^{n} (-1)^i \binom{n}{i} \frac{1}{2(i+1)!} \qquad (2.2)$$

and

$$b_n = \int_{\mathbb{I}^n} \left( A^n(\mathbf{u}) + \bar{A}^n(\mathbf{u}) \right) dM^n(\mathbf{u}) = 1 - \sum_{i=1}^{n-1} \frac{1}{4i}. \qquad (2.3)$$

By induction, it is easy to check that $b_n - a_n \ne 0$ for all $n \ge 2$. If $\alpha = 1/(b_n - a_n)$ in (2.1), then we have a measure $\gamma_n$ (or $\gamma_n(C)$ for any $n$-copula $C$) given by

$$\gamma_n(C) = \frac{1}{b_n - a_n} \left( \int_{\mathbb{I}^n} \left( A^n(\mathbf{u}) + \bar{A}^n(\mathbf{u}) \right) dC(\mathbf{u}) - a_n \right), \qquad (2.4)$$

where $a_n$ and $b_n$ are given by (2.2) and (2.3), respectively. $\gamma_n$ is a multivariate population version of Gini's coefficient such that $\gamma_n(C) = 0$ if $C = \Pi^n$ (case of independence) and $\gamma_n(C) = 1$ if $C = M^n$ (case of perfect dependence) for each $n \ge 2$. Note that, for $n = 2$, the expression in (2.4) reduces to the bivariate Gini's coefficient.
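To make the constants concrete, the following Python sketch evaluates $a_n$ and $b_n$ in exact arithmetic, reading (2.2) as $a_n = \frac{1}{n+1} + \frac{1}{2(n+1)!} + \sum_{i=0}^{n} (-1)^i \binom{n}{i} \frac{1}{2(i+1)!}$ and (2.3) as $b_n = 1 - \sum_{i=1}^{n-1} \frac{1}{4i}$. The function names `a_n`, `b_n` and `alpha` are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def a_n(n):
    """Constant a_n of (2.2): integral of A^n + its survival function w.r.t. Pi^n."""
    tail = sum(
        Fraction((-1) ** i * comb(n, i), 2 * factorial(i + 1)) for i in range(n + 1)
    )
    return Fraction(1, n + 1) + Fraction(1, 2 * factorial(n + 1)) + tail


def b_n(n):
    """Constant b_n of (2.3): the same integral taken w.r.t. M^n."""
    return 1 - sum(Fraction(1, 4 * i) for i in range(1, n))


def alpha(n):
    """Normalizing constant 1 / (b_n - a_n) appearing in (2.4)."""
    return 1 / (b_n(n) - a_n(n))
```

For $n = 2$ this gives $a_2 = 1/2$, $b_2 = 3/4$ and $\alpha = 4$, in agreement with the bivariate expression $\gamma_2(C) = 4(Q'_2(A^2, C) - Q'_2(A^2, \Pi^2))$; for $n = 3$ and $n = 4$ it gives $(a_3, b_3) = (1/4, 5/8)$ and $(a_4, b_4) = (1/8, 13/24)$.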

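Remark 2.1. Using (1.2) and the fact that, as is easy to check, for any $n$-copula $C$,

$$\int_{\mathbb{I}^n} M^n(\mathbf{u})\, dC(\mathbf{u}) = \int_{\mathbb{I}^n} \bar{C}(\mathbf{u})\, dM^n(\mathbf{u}) \quad \text{and} \quad \int_{\mathbb{I}^n} \bar{M}^n(\mathbf{u})\, dC(\mathbf{u}) = \int_{\mathbb{I}^n} C(\mathbf{u})\, dM^n(\mathbf{u}),$$

another expression equivalent to (2.4) is

$$\gamma_n(C) = \frac{1}{2(b_n - a_n)} \left[ \int_0^1 \left( C(t, t, \ldots, t) + \bar{C}(t, t, \ldots, t) \right) dt + \int_{\mathbb{I}^n} W^n(\mathbf{u})\, dC(\mathbf{u}) + 1 + \sum_{k=1}^{n} (-1)^k \sum_{1 \le i_1 < i_2 < \cdots < i_k \le n} \int_{\mathbb{I}^n} \max\left( \sum_{j=1}^{k} u_{i_j} - k + 1,\; 0 \right) dC(\mathbf{u}) - 2a_n \right].$$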

The average of the pairwise Gini's coefficients of an $n$-copula $C$ is given by

$$\gamma_{\mathrm{ave},n}(C) = \binom{n}{2}^{-1} \sum_{1 \le i < k \le n} 2 \int_{\mathbb{I}^2} \left( |u + v - 1| - |u - v| \right) dC_{ik}(u, v), \qquad (2.5)$$

where $C_{ik}$, $1 \le i < k \le n$, denotes the $(i, k)$ bivariate margin of $C$. We now show that in the trivariate case our generalization is just the average of the pairwise coefficients.

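Theorem 2.4. Let $(U, V, W)$ be a vector of random variables uniform on $\mathbb{I}$ with 3-copula $C$, and let $\gamma_{UV}$, $\gamma_{UW}$ and $\gamma_{VW}$ denote Gini's $\gamma_2$ for the three bivariate margins of $C$. Then,

$$\gamma_3(C) = (\gamma_{UV} + \gamma_{UW} + \gamma_{VW})/3.$$

Proof. Since $A^3(u, v, w) + \bar{A}^3(u, v, w) = 1 - u - v - w + A^3(u, v, 1) + A^3(u, 1, w) + A^3(1, v, w)$ for every $(u, v, w)$ in $\mathbb{I}^3$, where $A^3(u, v, 1) = A^2(u, v)$ and similarly for the other two margins, from (2.4) with $a_3 = 1/4$ and $b_3 = 5/8$ we have that

$$\begin{aligned}
\gamma_3(C) &= \frac{8}{3} \left( \int_{\mathbb{I}^3} \left( A^3(u, v, w) + \bar{A}^3(u, v, w) \right) dC(u, v, w) - \frac{1}{4} \right) \\
&= \frac{1}{3} \left( 8 \int_{\mathbb{I}^3} A^3(u, v, 1)\, dC(u, v, w) - 2 + 8 \int_{\mathbb{I}^3} A^3(u, 1, w)\, dC(u, v, w) - 2 + 8 \int_{\mathbb{I}^3} A^3(1, v, w)\, dC(u, v, w) - 2 \right),
\end{aligned}$$

whence the result follows.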

However, the generalization is not the pairwise average in higher dimensions. The following example, four-dimensional for the sake of simplicity, shows this fact.

Example 2.1. Consider the $n$-copula $C$ given by $C(\mathbf{u}) = W^2(u_1, u_2)\, u_3 \cdots u_n$ for every $\mathbf{u}$ in $\mathbb{I}^n$. Then, after some elementary algebra, we obtain that $\gamma_4(C) = -7/40$. Now, observe that $C$ has all its bivariate margins equal to $\Pi^2$ except one, which is $W^2$, whence it is easy to conclude that $\gamma_{\mathrm{ave},4}(C) = -1/6$.
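The value of $\gamma_4(C)$ in Example 2.1 can be checked approximately by simulation. The Python sketch below draws from $C$ for $n = 4$ (taking $U_2 = 1 - U_1$, with $U_3$ and $U_4$ independent uniforms), averages $A^4 + \bar{A}^4$ over the draws and normalizes as in (2.4), using $a_4 = 1/8$ and $b_4 = 13/24$ as obtained from (2.2) and (2.3). The function names, the sample size and the seed are illustrative assumptions.

```python
import random
from itertools import combinations


def A(u):
    """A^k(u) = (min(u) + max(sum(u) - k + 1, 0)) / 2 for a point u of length k."""
    return 0.5 * (min(u) + max(sum(u) - len(u) + 1, 0.0))


def A_plus_survival(u):
    """A^n(u) plus its survival function (1.2).

    The k-dimensional margin of A^n is A^k on the selected coordinates.
    """
    n = len(u)
    total = 1.0 + A(u)
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            total += (-1) ** k * A([u[i] for i in idx])
    return total


def gamma4_example_mc(n_samples=20000, seed=0):
    """Monte Carlo estimate of gamma_4 for C(u) = W^2(u1, u2) * u3 * u4."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n_samples):
        t = rng.random()
        u = [t, 1.0 - t, rng.random(), rng.random()]  # (U1, U2) has copula W^2
        acc += A_plus_survival(u)
    a4, b4 = 1 / 8, 13 / 24
    return (acc / n_samples - a4) / (b4 - a4)
```

With a few tens of thousands of draws the estimate should lie close to $-7/40 = -0.175$, and visibly away from the pairwise average $-1/6$ only for larger sample sizes, since the two values differ by less than $0.01$.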

To finish this section, we compute our multivariate version of Gini's coefficient for several

known families of n-copulas.

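Example 2.2. Let $C_\lambda$ be the $n$-copula given by $C_\lambda(\mathbf{u}) = \lambda M^n(\mathbf{u}) + (1 - \lambda) \Pi^n(\mathbf{u})$ for all $\mathbf{u}$ in $\mathbb{I}^n$, with $\lambda \in \mathbb{I}$. $C_\lambda$ belongs to a multivariate version of the Fréchet family of copulas (see [3]). Then, from (2.4), we have that

$$\begin{aligned}
\gamma_n(C_\lambda) &= \frac{1}{b_n - a_n} \left( \int_{\mathbb{I}^n} \left( A^n(\mathbf{u}) + \bar{A}^n(\mathbf{u}) \right) dC_\lambda(\mathbf{u}) - a_n \right) \\
&= \frac{1}{b_n - a_n} \left( \lambda \int_{\mathbb{I}^n} \left( A^n(\mathbf{u}) + \bar{A}^n(\mathbf{u}) \right) dM^n(\mathbf{u}) + (1 - \lambda) \int_{\mathbb{I}^n} \left( A^n(\mathbf{u}) + \bar{A}^n(\mathbf{u}) \right) d\Pi^n(\mathbf{u}) - a_n \right) \\
&= \frac{1}{b_n - a_n} \left( \lambda b_n + (1 - \lambda) a_n - a_n \right) = \lambda.
\end{aligned}$$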

The following example illustrates that, in some sense, the version $\gamma_n$ can improve on $\gamma_{\mathrm{ave},n}$.

Example 2.3. Let $C_\delta$ be the $n$-copula given by $C_\delta(\mathbf{u}) = \prod_{i=1}^{n} u_i \left[ 1 + \delta \prod_{i=1}^{n} (1 - u_i) \right]$ for $\mathbf{u}$ in $\mathbb{I}^n$, with $\delta \in [-1, 1]$. $C_\delta$ belongs to the Farlie-Gumbel-Morgenstern family of $n$-copulas (see [3]). From (2.4), we have that

$$\begin{aligned}
\gamma_n(C_\delta) &= \frac{1}{b_n - a_n} \left( \int_{\mathbb{I}^n} \left( A^n(\mathbf{u}) + \bar{A}^n(\mathbf{u}) \right) dC_\delta(\mathbf{u}) - a_n \right) \\
&= \frac{1}{b_n - a_n} \left( \int_{\mathbb{I}^n} \left( A^n(\mathbf{u}) + \bar{A}^n(\mathbf{u}) \right) \left( 1 + \delta \prod_{i=1}^{n} (1 - 2u_i) \right) du_1\, du_2 \cdots du_n - a_n \right).
\end{aligned}$$

For the case $n = 4$, after some elementary algebra, we obtain that $\gamma_4(C_\delta) = 32\delta/4725$. Observe that if $(U, V, W, Z)$ is a vector of continuous random variables with 4-copula $C_\delta$, then any three of these four random variables are mutually independent (note also that $\gamma_{\mathrm{ave},4}(C_\delta) = 0$); however, all four are not unless $\delta = 0$.
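One way to cross-check the value for $n = 4$ in exact arithmetic is sketched below in Python. It relies on a reduction that is not spelled out in the text and should be read as a working assumption: because $\int_0^1 (1 - 2u)\, du = 0$, every lower-dimensional margin in (1.2) integrates to zero against $\prod_{i=1}^{4} (1 - 2u_i)$, so that only $M^4$ and $W^4$ contribute, and $A^4$ and $\bar{A}^4$ contribute equally. The $M^4$ part is a Beta integral, and the $W^4$ part reduces (after the substitution $v_i = 1 - u_i$) to Dirichlet integrals over the unit simplex. The function name is hypothetical, and $a_4 = 1/8$, $b_4 = 13/24$ are taken from (2.2) and (2.3).

```python
from fractions import Fraction
from math import comb, factorial


def fgm_gamma4_per_delta():
    """Return gamma_4(C_delta) / delta for the FGM 4-copula, as an exact fraction."""
    n = 4
    # Integral of M^4(u) * prod(1 - 2 u_i) over [0, 1]^4:
    # equals the Beta integral of (t (1 - t))^4, i.e. 4! 4! / 9!.
    m_part = Fraction(factorial(n) ** 2, factorial(2 * n + 1))
    # Integral of W^4(u) * prod(1 - 2 u_i): with v = 1 - u, expand prod(1 - 2 v_i)
    # and use the Dirichlet integral of (1 - sum v) * (k of the v_i) = 1 / (5 + k)!.
    w_part = sum(
        Fraction(comb(n, k) * (-2) ** k, factorial(n + 1 + k)) for k in range(n + 1)
    )
    # (A^4 + survival of A^4) integrates to (M + W)/2 + (M + W)/2 = m_part + w_part.
    integral = m_part + w_part
    a4, b4 = Fraction(1, 8), Fraction(13, 24)
    # The delta-free part of the density integrates to a4 and cancels in (2.4).
    return integral / (b4 - a4)
```

Under these assumptions the two partial integrals are $1/630$ and $1/810$, and the function returns $32/4725$.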

3 Sample version

Let $\{(X_{1j}, X_{2j}, \ldots, X_{nj})\}_{j=1}^{m}$ be a random sample of size $m$ from a vector $(X_1, X_2, \ldots, X_n)$ of continuous random variables whose associated $n$-copula is $C$, with corresponding ranks $\{(R_{1j}, R_{2j}, \ldots, R_{nj})\}_{j=1}^{m}$. Then the sample version of (2.4) is given by

$$g_n = \frac{\dfrac{1}{m} \sum_{j=1}^{m} \left[ A_m^n(R_{1j}, R_{2j}, \ldots, R_{nj}) + \bar{A}_m^n(R_{1j}, R_{2j}, \ldots, R_{nj}) \right] - c_{m,n}}{d_{m,n} - c_{m,n}}, \qquad (3.1)$$

where

$$A_m^n(R_{1j}, R_{2j}, \ldots, R_{nj}) = \frac{\min(R_{1j}, R_{2j}, \ldots, R_{nj}) + \max[R_{1j} + \cdots + R_{nj} - (n-1)(m+1),\, 0]}{2}$$

and

$$\bar{A}_m^n(R_{1j}, R_{2j}, \ldots, R_{nj}) = m + 1 + \sum_{k=1}^{n} (-1)^k \sum_{1 \le i_1 < i_2 < \cdots < i_k \le n} \frac{1}{2} \left( \min(R_{i_1 j}, R_{i_2 j}, \ldots, R_{i_k j}) + \max[R_{i_1 j} + \cdots + R_{i_k j} - (k-1)(m+1),\, 0] \right),$$

and $c_{m,n}$ and $d_{m,n}$ are two constants such that $g_n = 1$ when the ranks coincide (perfect dependence), and $g_n = 0$ when the ranks have a natural order (independence case), namely $j_1, j_2, \ldots, j_n$. Using these conditions, we obtain that

$$c_{m,n} = \frac{1}{m^n} \sum_{j_1=1}^{m} \sum_{j_2=1}^{m} \cdots \sum_{j_n=1}^{m} \left[ A_m^n(j_1, j_2, \ldots, j_n) + \bar{A}_m^n(j_1, j_2, \ldots, j_n) \right]$$

and

$$d_{m,n} = \frac{1}{m} \sum_{j=1}^{m} \left[ A_m^n(j, j, \ldots, j) + \bar{A}_m^n(j, j, \ldots, j) \right].$$
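The statistic (3.1) can be evaluated by brute force for small $m$ and $n$. The Python sketch below follows the definitions of $A_m^n$, $\bar{A}_m^n$, $c_{m,n}$ and $d_{m,n}$ literally; it assumes that the input is already a list of $m$ rank tuples without ties, and the function names are hypothetical. The cost of $c_{m,n}$ grows like $m^n$, so this is only meant for illustration.

```python
from itertools import combinations, product


def A_m(r, m):
    """A_m^k for a tuple r of k ranks taken from {1, ..., m}."""
    k = len(r)
    return 0.5 * (min(r) + max(sum(r) - (k - 1) * (m + 1), 0))


def A_plus_survival_m(r, m):
    """A_m^n(r) plus its sample survival function."""
    n = len(r)
    total = (m + 1) + A_m(r, m)
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            total += (-1) ** k * A_m([r[i] for i in idx], m)
    return total


def g_n(rank_rows):
    """Sample coefficient (3.1); rank_rows is a list of m tuples (R_1j, ..., R_nj)."""
    m, n = len(rank_rows), len(rank_rows[0])
    stat = sum(A_plus_survival_m(r, m) for r in rank_rows) / m
    # c_{m,n}: average over all m^n rank combinations (independence case).
    c = sum(
        A_plus_survival_m(j, m) for j in product(range(1, m + 1), repeat=n)
    ) / m ** n
    # d_{m,n}: average over coinciding ranks (perfect dependence).
    d = sum(A_plus_survival_m((j,) * n, m) for j in range(1, m + 1)) / m
    return (stat - c) / (d - c)
```

For $n = 2$ the result agrees with (1.1), and for $n = 3$ it agrees with the average of the three pairwise values, which gives a convenient consistency check on an implementation.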

Observe that, as is easy to check, for $n = 2$ we have

$$A_m^2(R_{1j}, R_{2j}) + \bar{A}_m^2(R_{1j}, R_{2j}) = \frac{m+1}{2} + \frac{|R_{1j} + R_{2j} - (m+1)| - |R_{1j} - R_{2j}|}{2};$$

then

$$c_{m,2} = \frac{1}{m^2} \sum_{j_1=1}^{m} \sum_{j_2=1}^{m} \left( \frac{m+1}{2} + \frac{|j_1 + j_2 - (m+1)| - |j_1 - j_2|}{2} \right) = \frac{m+1}{2}$$

and

$$d_{m,2} = \frac{1}{m} \sum_{j=1}^{m} \left( \frac{m+1}{2} + \frac{|2j - (m+1)|}{2} \right) = \frac{m+1}{2} + \frac{\lfloor m^2/2 \rfloor}{2m},$$

and replacing in (3.1), we obtain (1.1).

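For $n = 3$, since

$$A_m^3(R_{1j}, R_{2j}, R_{3j}) + \bar{A}_m^3(R_{1j}, R_{2j}, R_{3j}) = \frac{1}{4} \left( m + 1 + \sum_{1 \le i < k \le 3} \left( |R_{ij} + R_{kj} - (m+1)| - |R_{ij} - R_{kj}| \right) \right),$$

we have that

$$c_{m,3} = \frac{1}{4m^3} \sum_{j_1=1}^{m} \sum_{j_2=1}^{m} \sum_{j_3=1}^{m} \left( m + 1 + \sum_{1 \le i < k \le 3} \left( |j_i + j_k - (m+1)| - |j_i - j_k| \right) \right) = \frac{m+1}{4}$$

and

$$d_{m,3} = \frac{m+1}{4} + \frac{3 \lfloor m^2/2 \rfloor}{4m},$$

and replacing in (3.1), we obtain $g_3$ as the pairwise average of the $g_2$'s.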

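For the case $n = 4$, if $\mathbf{R}_j = (R_{1j}, R_{2j}, R_{3j}, R_{4j})$, after some elementary calculations we have

$$\begin{aligned}
A_m^4(\mathbf{R}_j) + \bar{A}_m^4(\mathbf{R}_j) = {} & \frac{1}{4} \sum_{i=1}^{4} R_{ij} + \frac{1}{4} \sum_{1 \le i < k \le 4} \left( |R_{ij} + R_{kj} - (m+1)| - |R_{ij} - R_{kj}| \right) \\
& - \frac{1}{4} \sum_{1 \le i < k < l \le 4} \left( |R_{ij} + R_{kj} + R_{lj} - 2(m+1)| + 2 \min(R_{ij}, R_{kj}, R_{lj}) \right) \\
& + \frac{|R_{1j} + R_{2j} + R_{3j} + R_{4j} - 3(m+1)|}{2} + \min(R_{1j}, R_{2j}, R_{3j}, R_{4j});
\end{aligned}$$

thus,

$$\begin{aligned}
c_{m,4} = {} & \frac{m+1}{2} - \frac{1}{4m^4} \sum_{j_1=1}^{m} \sum_{j_2=1}^{m} \sum_{j_3=1}^{m} \sum_{j_4=1}^{m} \sum_{1 \le i < k < l \le 4} \left( |j_i + j_k + j_l - 2(m+1)| + 2 \min(j_i, j_k, j_l) \right) \\
& + \frac{1}{m^4} \sum_{j_1=1}^{m} \sum_{j_2=1}^{m} \sum_{j_3=1}^{m} \sum_{j_4=1}^{m} \left( \frac{|j_1 + j_2 + j_3 + j_4 - 3(m+1)|}{2} + \min(j_1, j_2, j_3, j_4) \right)
\end{aligned}$$

and

$$d_{m,4} = \frac{3 \lfloor m^2/2 \rfloor}{2m} + \frac{1}{2m} \sum_{j=1}^{m} \left( |4j - 3(m+1)| - 2|3j - 2(m+1)| \right).$$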

Similar expressions can be found in higher dimensions. Finally, note that the statistic (3.1) is an alternative to the average pairwise Gini's coefficient which, using the above notation, is given by

$$g_{\mathrm{ave},n} = \binom{n}{2}^{-1} \frac{1}{\lfloor m^2/2 \rfloor} \sum_{1 \le i < k \le n} \sum_{j=1}^{m} \left( |R_{ij} + R_{kj} - (m+1)| - |R_{ij} - R_{kj}| \right).$$

Observe that this is an estimate of (2.5).
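For comparison with (3.1), the average pairwise statistic can be computed directly from the ranks. The Python sketch below reads it as the mean, over the $\binom{n}{2}$ pairs of coordinates, of the bivariate coefficient (1.1); the function name `g_average` is hypothetical and the input is assumed to be a list of $m$ rank tuples without ties.

```python
from itertools import combinations


def g_average(rank_rows):
    """Average of the pairwise sample Gini coefficients (1.1) over all pairs i < k."""
    m, n = len(rank_rows), len(rank_rows[0])
    pairs = list(combinations(range(n), 2))
    total = sum(
        abs(r[i] + r[k] - (m + 1)) - abs(r[i] - r[k])
        for i, k in pairs
        for r in rank_rows
    )
    # Divide by the number of pairs and by the integer part of m^2 / 2.
    return total / (len(pairs) * (m * m // 2))
```

Unlike (3.1), this quantity only involves two coordinates at a time, so its cost grows quadratically in $n$ rather than exponentially.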

4 Discussion

We have defined a multivariate population version of Gini's coefficient in terms of probabilities of concordance. Of course, one could define the generalization using other ideas, for example by looking at $\ell_1$ distances. In the bivariate case, the coefficient is a scaled $\ell_1$ distance between the main diagonal ($x = y$) and the orthogonal set ($x + y - 1 = 0$). For the trivariate case, one could look at $\ell_1$ distances between the main diagonal ($x = y = z$) and the orthogonal set (which is now the plane $x + y + z = 3/2$). It would be easy to generalize this idea to higher dimensions.

On the other hand, it is known that for any $n$-copula $C$ the lower bound for the multivariate versions of Kendall's tau and Blomqvist's beta in [4] and [5], respectively, is attained when at least one of the bivariate margins of $C$ is $W^2$ (see [5]), and this bound, which is best-possible, is given by $-1/(2^{n-1} - 1)$. The lower bound of these two measures for the 4-copula given in Example 2.1 is $-1/7$, which is greater than $-7/40$, the value of $\gamma_4(C)$. It is an open question to determine the best-possible lower bound for $\gamma_n(D)$ for any $n$-copula $D$.


Acknowledgements. The authors would like to thank an anonymous referee for his/her valuable suggestions on an earlier version of this paper. The first and second authors thank the research council of Shiraz University for support. The third author also thanks the Ministerio de Ciencia y Tecnología (Spain) and FEDER for support under the research project BFM2003-06522.

References

[1] Gini, C. (1914). L'Ammontare e la composizione della ricchezza delle nazioni. Bocca, Torino.

[2] Gould, H.W. (1972). Combinatorial Identities. Morgantown Printing and Binding Co., W. Va.

[3] Nelsen, R.B. (1999). An Introduction to Copulas. Springer, New York.

[4] Nelsen, R.B. (2002). Concordance and copulas: A survey. In: C. Cuadras, J. Fortiana, J.A. Rodriguez (Eds.), Distributions with Given Marginals and Statistical Modelling. Kluwer Academic Publishers, Dordrecht, pp. 169-178.

[5] Úbeda-Flores, M. (2005). Multivariate versions of Blomqvist's beta and Spearman's footrule. Ann. Inst. Statist. Math. In press.