Statistical Papers 48, 295-304 (2007) Statistical Papers © Springer-Verlag 2007
A multivariate version of Gini's rank association coefficient
Javad Behboodian 1, Ali Dolati 1, Manuel Úbeda-Flores 2
1 Department of Statistics, College of Sciences, Shiraz University, 71454 Shiraz, Iran; (e-mail: [email protected])
2 Departamento de Estadística y Matemática Aplicada, Universidad de Almería, Carretera de Sacramento s/n, La Cañada de San Urbano 04120, Almería, Spain; (e-mail: [email protected])
Received: June 17, 2004; revised version: February 22, 2005
Abstract
In this paper, we introduce a multivariate generalization of the population version of Gini's
rank association coefficient, giving a response to this open question posed in [4]. We also study
some properties of this version, present the corresponding results for the sample statistic, and
provide several examples.
AMS classification: Primary 62H05; Secondary 62H20.
Keywords: Copulas; Gini's coefficient; Multivariate association.
1 Introduction
The problem of association between random variables has been extensively studied in the
literature. Recently, the development of the theory of copulas has had a great impact on the study
of non-parametric measures of association in the case of continuous random variables. Many
measures of bivariate association have been proposed in terms of copulas according to the concepts
of concordance and discordance. In the multivariate setting, we can find multivariate analogues
of some well-known bivariate measures of association (such as Kendall's tau, Spearman's
rho, Blomqvist's beta and Spearman's footrule coefficient), based on the probability of
concordance alone, in terms of copulas (see [4] and [5]). Our purpose is to develop a
multivariate population version of Gini's rank association coefficient, providing a response to
this question posed in [4]. For a random sample $\{(X_i, Y_i)\}_{i=1}^{m}$ from a continuous bivariate
distribution, with corresponding vectors of ranks $(R_1, R_2, \ldots, R_m)$ and $(S_1, S_2, \ldots, S_m)$, this
index is defined by
$$g = \frac{1}{\lfloor m^2/2 \rfloor} \sum_{i=1}^{m} \bigl( |R_i + S_i - m - 1| - |R_i - S_i| \bigr), \tag{1.1}$$
where $\lfloor t \rfloor$ denotes the integer part of $t$, and was first discussed by Corrado Gini [1], who called
it the indice di cograduazione semplice.
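As a minimal sketch of how (1.1) can be evaluated, the following Python fragment computes $g$ from two samples without ties. The helper names `ranks` and `gini_gamma` are illustrative choices, not part of the paper.

```python
def ranks(values):
    """Return the ranks 1..m of a sequence, assuming there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r


def gini_gamma(x, y):
    """Sample Gini rank association coefficient g of equation (1.1)."""
    m = len(x)
    r, s = ranks(x), ranks(y)
    total = sum(abs(ri + si - m - 1) - abs(ri - si) for ri, si in zip(r, s))
    # m * m // 2 is the integer part of m^2 / 2
    return total / (m * m // 2)
```

For identical orderings the first sum equals $\lfloor m^2/2 \rfloor$ and the second vanishes, so $g = 1$; for reversed orderings the roles are exchanged and $g = -1$.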
We now review the concept of a copula. Let $n \ge 2$ be a natural number. The term
($n$-dimensional) copula (briefly, $n$-copula) refers to a multivariate distribution function whose
$n$ univariate margins are uniform on $\mathbb{I}$ ($= [0, 1]$). The importance of copulas in statistics is
due to the following result: the joint distribution function $H$ of a set of $n$ continuous
random variables $X_1, X_2, \ldots, X_n$ with univariate margins $F_1, F_2, \ldots, F_n$ can be expressed in
the form $H(\mathbf{x}) = C(F_1(x_1), F_2(x_2), \ldots, F_n(x_n))$ for all $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ in $[-\infty, \infty]^n$, in
terms of an $n$-copula $C$ that is uniquely determined on $\mathrm{Ran}F_1 \times \mathrm{Ran}F_2 \times \cdots \times \mathrm{Ran}F_n$. Let
$\mathbf{u} = (u_1, u_2, \ldots, u_n)$ be a point in $\mathbb{I}^n$, and let $\Pi^n(\mathbf{u}) = u_1 u_2 \cdots u_n$ denote the $n$-copula of
independent random variables. For every $\mathbf{u}$ in $\mathbb{I}^n$, any $n$-copula $C$ satisfies
$$W^n(\mathbf{u}) = \max\Bigl(\sum_{i=1}^{n} u_i - n + 1,\, 0\Bigr) \le C(\mathbf{u}) \le \min(u_1, u_2, \ldots, u_n) = M^n(\mathbf{u}).$$
For every $n \ge 2$, $M^n$ is an $n$-copula; however, $W^n$ is an $n$-copula if and only if $n = 2$. For a complete survey of
copulas, see [3].
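The three reference functions and the bounds above can be sketched in a few lines of Python. The function names and the grid-based check `bounds_hold` are illustrative assumptions; the check only samples a finite grid and is not a proof.

```python
from itertools import product
from math import prod


def W(u):
    """Lower bound W^n(u) = max(u_1 + ... + u_n - n + 1, 0)."""
    return max(sum(u) - len(u) + 1, 0.0)


def Pi(u):
    """Independence copula: product of the coordinates."""
    return prod(u)


def M(u):
    """Upper bound M^n(u) = min(u_1, ..., u_n)."""
    return min(u)


def bounds_hold(C, n, steps=5, tol=1e-12):
    """Check W^n <= C <= M^n on a regular grid of [0, 1]^n."""
    grid = [i / steps for i in range(steps + 1)]
    return all(W(u) - tol <= C(u) <= M(u) + tol
               for u in product(grid, repeat=n))
```

For instance, `bounds_hold(Pi, 3)` samples $6^3$ grid points and confirms the inequality for the independence copula at each of them.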
We will use the following notation and terminology. We denote by $A^n$ the function defined by
$(M^n + W^n)/2$. We define the survival function $\bar{K}$ of a measurable function $K : \mathbb{I}^n \to \mathbb{I}$ by
$$\bar{K}(\mathbf{u}) = 1 + \sum_{k=1}^{n} (-1)^k \sum_{1 \le i_1 < i_2 < \cdots < i_k \le n} K_{i_1 i_2 \ldots i_k}(u_{i_1}, u_{i_2}, \ldots, u_{i_k}), \tag{1.2}$$
where the functions on the right-hand side are the appropriate lower-dimensional margins of $K$. If $C$ is an
$n$-copula, $\hat{C}$ denotes the survival copula of $C$, which is given by $\hat{C}(\mathbf{u}) = \bar{C}(\mathbf{1} - \mathbf{u})$, for every
$\mathbf{u}$ in $\mathbb{I}^n$.
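The inclusion-exclusion sum (1.2) is straightforward to evaluate numerically. The sketch below assumes that `K` is a function accepting a tuple of any length and that its $k$-dimensional margins are given by the same rule applied to the sub-vector; this holds for $M^n$, $W^n$, $A^n$ and $\Pi^n$, but it is an assumption of this example, not a general fact about copulas.

```python
from itertools import combinations


def M(u):
    return min(u)


def W(u):
    return max(sum(u) - len(u) + 1, 0.0)


def A(u):
    """A^n = (M^n + W^n) / 2."""
    return 0.5 * (M(u) + W(u))


def survival(K, u):
    """Survival function of K at u via equation (1.2).

    Assumes the margin of K on coordinates (i_1, ..., i_k) is K applied
    to the sub-vector (u_{i_1}, ..., u_{i_k}).
    """
    n = len(u)
    total = 1.0
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            total += (-1) ** k * K(tuple(u[i] for i in idx))
    return total
```

As a check, `survival(M, u)` returns $1 - \max(u_1, \ldots, u_n)$, the probability that a single uniform variable exceeds every coordinate of $\mathbf{u}$.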
Finally, let $\mathbf{U}$ and $\mathbf{V}$ be two vectors of uniform $\mathbb{I}$ random variables with respective
$n$-copulas $C_1$ and $C_2$, and let $Q'_n(C_1, C_2)$ denote the probability of concordance between $\mathbf{U}$ and
$\mathbf{V}$, which in terms of copulas can be expressed as
$$Q'_n(C_1, C_2) = \int_{\mathbb{I}^n} \bigl( C_1(\mathbf{u}) + \bar{C}_1(\mathbf{u}) \bigr)\, dC_2(\mathbf{u}).$$
This function can be extended to the case in which one of $C_1$ or $C_2$ is a measurable function
from $\mathbb{I}^n$ to $\mathbb{I}$. We will use this definition later, especially when $C_1$ is $W^n$ or $A^n$.
2 A multivariate population version of Gini's coefficient
In the bivariate case, the population version of Gini's coefficient, which we denote by $\gamma_2$
(or $\gamma_2(C)$ for any 2-copula $C$), is given by
$$\gamma_2(C) = 2 \iint_{\mathbb{I}^2} \bigl( |u + v - 1| - |u - v| \bigr)\, dC(u, v),$$
or, equivalently,
$$\gamma_2(C) = 8 \iint_{\mathbb{I}^2} A^2(u, v)\, dC(u, v) - 2$$
(for details, see [3]). Since $A^2(u, v) + \bar{A}^2(u, v) = 1 - u - v + 2A^2(u, v)$ for every $(u, v)$ in $\mathbb{I}^2$,
we have that
$$Q'_2(A^2, C) = 2 \iint_{\mathbb{I}^2} A^2(u, v)\, dC(u, v) \quad\text{and}\quad Q'_2(A^2, \Pi^2) = 2 \iint_{\mathbb{I}^2} A^2(u, v)\, d\Pi^2(u, v) = \frac{1}{2};$$
whence $\gamma_2$ can be rewritten as a function of probabilities of concordance in the following
manner:
$$\gamma_2(C) = 4\bigl(Q'_2(A^2, C) - Q'_2(A^2, \Pi^2)\bigr).$$
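The two integral expressions for the bivariate coefficient are not equal pointwise; they agree only after integration against a copula. A short derivation, offered as a sketch that uses only $\min(u,v) = (u + v - |u - v|)/2$, $\max(t, 0) = (t + |t|)/2$ and the uniform margins of $C$, is:

```latex
\begin{align*}
8A^2(u,v) - 2 &= 4\min(u,v) + 4\max(u+v-1,\,0) - 2 \\
              &= 2(u+v) - 2|u-v| + 2(u+v-1) + 2|u+v-1| - 2 \\
              &= 2\bigl(|u+v-1| - |u-v|\bigr) + 4(u+v-1), \\
\iint_{\mathbb{I}^2} 4(u+v-1)\, dC(u,v) &= 4\left(\tfrac12 + \tfrac12 - 1\right) = 0 .
\end{align*}
```

The last line holds because each coordinate is uniform on $[0, 1]$ under any 2-copula, so the linear remainder integrates to zero and the two forms coincide.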
Using the above motivation, we define a multivariate version of Gini's coefficient associated
with an $n$-copula $C$ as
$$\gamma_n(C) = \alpha \bigl( Q'_n(A^n, C) - Q'_n(A^n, \Pi^n) \bigr), \tag{2.1}$$
where $\alpha$ is a constant such that $\gamma_n(M^n) = 1$. Our purpose is to give a more explicit expression
for (2.1), but first we need three preliminary lemmas.
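Lemma 2.1. For any $n \ge 2$, we have that
$$\int_{\mathbb{I}^n} \bar{W}^n(\mathbf{u})\, dM^n(\mathbf{u}) = 1 - \sum_{i=1}^{n} \frac{1}{2i}.$$
Proof. Using (1.2), we have that
$$\begin{aligned}
\int_{\mathbb{I}^n} \bar{W}^n(\mathbf{u})\, dM^n(\mathbf{u}) &= \int_{\mathbb{I}^n} dM^n(\mathbf{u}) - n \int_{\mathbb{I}^n} u_i\, dM^n(\mathbf{u}) + \binom{n}{2} \int_{\mathbb{I}^n} W^2(u_i, u_j)\, dM^n(\mathbf{u}) - \cdots + (-1)^n \int_{\mathbb{I}^n} W^n(\mathbf{u})\, dM^n(\mathbf{u}) \\
&= 1 - \frac{n}{2} + \binom{n}{2} \int_{1/2}^{1} (2u_i - 1)\, du_i - \cdots + (-1)^n \int_{(n-1)/n}^{1} (n u_i - n + 1)\, du_i \\
&= 1 + \sum_{i=1}^{n} (-1)^i \binom{n}{i} \frac{1}{2i} = 1 - \sum_{i=1}^{n} \frac{1}{2i}.
\end{aligned}$$
For the last equality, see formula (1.45) in [2].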
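The final step of this proof rests on the identity $\sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k} / k = \sum_{i=1}^{n} 1/i$. A quick exact check with rational arithmetic is sketched below; the function names are illustrative.

```python
from fractions import Fraction
from math import comb


def lemma_2_1_alternating(n):
    """1 + sum_{k=1}^{n} (-1)^k C(n, k) / (2k), as an exact fraction."""
    return 1 + sum(Fraction((-1) ** k * comb(n, k), 2 * k)
                   for k in range(1, n + 1))


def lemma_2_1_closed(n):
    """1 - sum_{i=1}^{n} 1 / (2i), as an exact fraction."""
    return 1 - sum(Fraction(1, 2 * i) for i in range(1, n + 1))
```

Both functions return the same fraction for every $n$ tried; for $n = 2$ the common value is $1/4$.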
Lemma 2.2. Let $(u_1, u_2, \ldots, u_n)$ be a point in $\mathbb{I}^n$. Then, for all $n \ge 2$,
$$\int_{1-u_1}^{1} \int_{2-u_1-u_2}^{1} \cdots \int_{n-1-\sum_{i=1}^{n-1} u_i}^{1} \Bigl( \sum_{i=1}^{n} u_i - n + 1 \Bigr)\, du_n\, du_{n-1} \cdots du_2 = \frac{u_1^n}{n!}.$$
Proof. We prove the result using induction. First, assume that $n = 2$; then
$\int_{1-u_1}^{1} (u_1 + u_2 - 1)\, du_2 = u_1^2/2$. Now, suppose that the result is true for $k = 2, 3, \ldots, n - 1$, and consider the
case $k = n$. Let $I$ denote the integral
$$\int_{2-u_1-u_2}^{1} \cdots \int_{n-1-\sum_{i=1}^{n-1} u_i}^{1} \Bigl( \sum_{i=1}^{n} u_i - n + 1 \Bigr)\, du_n\, du_{n-1} \cdots du_3,$$
and let $1 - u_1^* = 2 - u_1 - u_2$. Then, by hypothesis, we have that
$$I = \int_{1-u_1^*}^{1} \cdots \int_{n-2-u_1^*-\sum_{i=3}^{n-1} u_i}^{1} \Bigl( u_1^* + \sum_{i=3}^{n} u_i - n + 2 \Bigr)\, du_n\, du_{n-1} \cdots du_3 = \frac{(u_1^*)^{n-1}}{(n-1)!}.$$
Hence
$$\int_{1-u_1}^{1} I\, du_2 = \int_{1-u_1}^{1} \frac{(u_1 + u_2 - 1)^{n-1}}{(n-1)!}\, du_2 = \frac{u_1^n}{n!},$$
which completes the proof. □
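Lemma 2.3. For any $n \ge 2$, we have that
$$\int_{\mathbb{I}^n} \bar{W}^n(\mathbf{u})\, d\Pi^n(\mathbf{u}) = \sum_{i=0}^{n} (-1)^i \binom{n}{i} \frac{1}{(i+1)!}.$$
Proof. Using Lemma 2.2, we have that
$$\begin{aligned}
\int_{\mathbb{I}^n} \bar{W}^n(\mathbf{u})\, d\Pi^n(\mathbf{u}) &= \int_{\mathbb{I}^n} d\Pi^n(\mathbf{u}) - n \int_{\mathbb{I}^n} u_i\, d\Pi^n(\mathbf{u}) + \binom{n}{2} \int_{\mathbb{I}^n} \max(u_i + u_j - 1, 0)\, d\Pi^n(\mathbf{u}) - \cdots + (-1)^n \int_{\mathbb{I}^n} W^n(\mathbf{u})\, d\Pi^n(\mathbf{u}) \\
&= 1 - \binom{n}{1} \frac{1}{2!} + \binom{n}{2} \frac{1}{3!} - \cdots + (-1)^n \frac{1}{(n+1)!},
\end{aligned}$$
whence the result follows. □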
Using Lemmas 2.1 and 2.3, we obtain the following expressions:
$$a_n = \int_{\mathbb{I}^n} \bigl( A^n(\mathbf{u}) + \bar{A}^n(\mathbf{u}) \bigr)\, d\Pi^n(\mathbf{u}) = \frac{1}{n+1} + \frac{1}{2(n+1)!} + \sum_{i=0}^{n} (-1)^i \binom{n}{i} \frac{1}{2(i+1)!} \tag{2.2}$$
and
$$b_n = \int_{\mathbb{I}^n} \bigl( A^n(\mathbf{u}) + \bar{A}^n(\mathbf{u}) \bigr)\, dM^n(\mathbf{u}) = 1 - \sum_{i=1}^{n-1} \frac{1}{4i}. \tag{2.3}$$
By induction, it is easy to check that $b_n - a_n \ne 0$ for all $n \ge 2$. If $\alpha = 1/(b_n - a_n)$ in (2.1),
then we have a measure $\gamma_n$ (or $\gamma_n(C)$ for any $n$-copula $C$) given by
$$\gamma_n(C) = \frac{1}{b_n - a_n} \Bigl( \int_{\mathbb{I}^n} \bigl( A^n(\mathbf{u}) + \bar{A}^n(\mathbf{u}) \bigr)\, dC(\mathbf{u}) - a_n \Bigr), \tag{2.4}$$
where $a_n$ and $b_n$ are given by (2.2) and (2.3), respectively. $\gamma_n$ is a multivariate population
version of Gini's coefficient such that $\gamma_n(C) = 0$ if $C = \Pi^n$ (case of independence) and
$\gamma_n(C) = 1$ if $C = M^n$ (case of perfect dependence) for each $n \ge 2$. Note that, for $n = 2$, the
expression in (2.4) reduces to the bivariate Gini's coefficient.
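The normalizing constants in (2.2) and (2.3) can be tabulated exactly. The sketch below takes $a_n = \frac{1}{n+1} + \frac{1}{2(n+1)!} + \frac{1}{2}\sum_{i=0}^{n} (-1)^i \binom{n}{i} \frac{1}{(i+1)!}$ and $b_n = 1 - \sum_{i=1}^{n-1} \frac{1}{4i}$; the second formula is our reading of (2.3), obtained by combining Lemma 2.1 with $\int M^n\,dM^n = \int \bar{M}^n\,dM^n = 1/2$ and $\int W^n\,dM^n = 1/(2n)$, and should be treated as an assumption. The function names are illustrative.

```python
from fractions import Fraction
from math import comb, factorial


def a_n(n):
    """Constant (2.2): integral of A^n + its survival function w.r.t. Pi^n."""
    alt = sum(Fraction((-1) ** i * comb(n, i), factorial(i + 1))
              for i in range(n + 1))
    return Fraction(1, n + 1) + Fraction(1, 2 * factorial(n + 1)) + alt / 2


def b_n(n):
    """Constant (2.3) as read here: 1 - sum_{i=1}^{n-1} 1 / (4i)."""
    return 1 - sum(Fraction(1, 4 * i) for i in range(1, n))


def alpha(n):
    """Scaling constant 1 / (b_n - a_n) used in (2.4)."""
    return 1 / (b_n(n) - a_n(n))
```

With these formulas, `alpha(2)` equals 4 and `alpha(3)` equals 8/3, which are the factors appearing in the bivariate expression and in the proof of Theorem 2.4.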
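Remark 2.1. Using (1.2), and the fact that (as is easy to check) for any $n$-copula $C$,
$\int_{\mathbb{I}^n} M^n(\mathbf{u})\, dC(\mathbf{u}) = \int_{\mathbb{I}^n} C(\mathbf{u})\, dM^n(\mathbf{u})$ and $\int_{\mathbb{I}^n} \bar{M}^n(\mathbf{u})\, dC(\mathbf{u}) = \int_{\mathbb{I}^n} \bar{C}(\mathbf{u})\, dM^n(\mathbf{u})$, another
expression equivalent to (2.4) is:
$$\gamma_n(C) = \frac{1}{2(b_n - a_n)} \Bigl[ \int_0^1 \bigl( C(t, \ldots, t) + \bar{C}(t, \ldots, t) \bigr)\, dt + \int_{\mathbb{I}^n} W^n(\mathbf{u})\, dC(\mathbf{u}) + 1 + \sum_{k=1}^{n} (-1)^k \sum_{1 \le i_1 < i_2 < \cdots < i_k \le n} \int_{\mathbb{I}^n} \max\Bigl( \sum_{j=1}^{k} u_{i_j} - k + 1,\, 0 \Bigr)\, dC(\mathbf{u}) - 2a_n \Bigr].$$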
The average of the pairwise Gini's coefficients of an $n$-copula $C$ is given by
$$\gamma_{\mathrm{ave},n}(C) = \binom{n}{2}^{-1} \sum_{1 \le i < k \le n} \iint_{\mathbb{I}^2} 2\bigl( |u + v - 1| - |u - v| \bigr)\, dC_{ik}(u, v), \tag{2.5}$$
where $C_{ik}$, $1 \le i < k \le n$, denotes the $(i, k)$ bivariate margin of $C$. We now show that in the
trivariate case our generalization is just the average of the pairwise coefficients.
Theorem 2.4. Let $(U, V, W)$ be a vector of uniform $\mathbb{I}$ random variables with 3-copula $C$,
and let $\gamma_{UV}$, $\gamma_{UW}$ and $\gamma_{VW}$ denote Gini's $\gamma_2$ for the three bivariate margins of $C$. Then,
$$\gamma_3(C) = (\gamma_{UV} + \gamma_{UW} + \gamma_{VW})/3.$$
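Proof. Since $A^3(u, v, w) + \bar{A}^3(u, v, w) = 1 - u - v - w + A^3(u, v, 1) + A^3(u, 1, w) + A^3(1, v, w)$ for every $(u, v, w)$ in $\mathbb{I}^3$, from (2.4), we have that
$$\begin{aligned}
\gamma_3(C) &= \frac{8}{3} \Bigl( \iiint_{\mathbb{I}^3} \bigl( A^3(u, v, w) + \bar{A}^3(u, v, w) \bigr)\, dC(u, v, w) - \frac{1}{4} \Bigr) \\
&= \frac{1}{3} \Bigl( 8 \iiint_{\mathbb{I}^3} A^3(u, v, 1)\, dC(u, v, w) - 2 + 8 \iiint_{\mathbb{I}^3} A^3(u, 1, w)\, dC(u, v, w) - 2 + 8 \iiint_{\mathbb{I}^3} A^3(1, v, w)\, dC(u, v, w) - 2 \Bigr),
\end{aligned}$$
whence the result follows, since $A^3(u, v, 1) = A^2(u, v)$ and similarly for the other two margins.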
However, the generalization is not the pairwise average in higher dimensions. The following
four-dimensional example, chosen for the sake of simplicity, shows this fact.
Example 2.1. Consider the $n$-copula $C$ given by $C(\mathbf{u}) = W^2(u_1, u_2) u_3 \cdots u_n$, for every $\mathbf{u}$ in
$\mathbb{I}^n$. Then, for $n = 4$, after some elementary algebra, we obtain that $\gamma_4(C) = -7/40$. Now, observe that
$C$ has all its bivariate margins equal to $\Pi^2$ except one, which is $W^2$, whence it is easy to conclude
that $\gamma_{\mathrm{ave},4}(C) = -1/6$.
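The value $-7/40$ can be checked numerically. The sketch below is one possible check, not the authors' computation: it uses the fact that under $W^2(u_1, u_2) u_3 u_4$ we have $U_2 = 1 - U_1$ almost surely with $U_3$, $U_4$ independent uniforms, approximates the integral in (2.4) by a midpoint rule over $(u_1, u_3, u_4)$, and plugs in $a_4 = 1/8$ and $b_4 = 13/24$ (values obtained from (2.2) and from our reading of (2.3)). All function names are illustrative.

```python
from itertools import combinations, product


def A(v):
    """A^k(v) = (min(v) + max(sum(v) - k + 1, 0)) / 2 for a k-vector v."""
    k = len(v)
    return 0.5 * (min(v) + max(sum(v) - k + 1, 0.0))


def A_plus_survival(u):
    """A^n(u) plus its survival function, expanded with (1.2)."""
    n = len(u)
    total = 1.0 + A(u)
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            total += (-1) ** k * A([u[i] for i in idx])
    return total


def gamma4_example_2_1(N=30):
    """Midpoint-rule approximation of gamma_4 for W^2(u1, u2) * u3 * u4.

    N should be even so that the kink of min(u1, 1 - u1) at 1/2 falls
    on a cell boundary.
    """
    mids = [(i + 0.5) / N for i in range(N)]
    acc = 0.0
    for u1, u3, u4 in product(mids, repeat=3):
        acc += A_plus_survival((u1, 1.0 - u1, u3, u4))  # u2 = 1 - u1
    integral = acc / N ** 3
    a4, b4 = 1 / 8, 13 / 24
    return (integral - a4) / (b4 - a4)
```

The result should come out close to $-0.175$, compared with the pairwise average $-1/6 \approx -0.167$.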
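To finish this section, we compute our multivariate version of Gini's coefficient for several
known families of $n$-copulas.
Example 2.2. Let $C_\lambda$ be the $n$-copula given by $C_\lambda(\mathbf{u}) = \lambda M^n(\mathbf{u}) + (1 - \lambda)\Pi^n(\mathbf{u})$ for all $\mathbf{u}$
in $\mathbb{I}^n$, with $\lambda \in \mathbb{I}$. $C_\lambda$ belongs to a multivariate version of the Fréchet family of copulas (see
[3]). Then, from (2.4), we have that
$$\begin{aligned}
\gamma_n(C_\lambda) &= \frac{1}{b_n - a_n} \Bigl( \int_{\mathbb{I}^n} \bigl( A^n(\mathbf{u}) + \bar{A}^n(\mathbf{u}) \bigr)\, dC_\lambda(\mathbf{u}) - a_n \Bigr) \\
&= \frac{1}{b_n - a_n} \Bigl( \lambda \int_{\mathbb{I}^n} \bigl( A^n(\mathbf{u}) + \bar{A}^n(\mathbf{u}) \bigr)\, dM^n(\mathbf{u}) + (1 - \lambda) \int_{\mathbb{I}^n} \bigl( A^n(\mathbf{u}) + \bar{A}^n(\mathbf{u}) \bigr)\, d\Pi^n(\mathbf{u}) - a_n \Bigr) \\
&= \frac{1}{b_n - a_n} \bigl( \lambda b_n + (1 - \lambda) a_n - a_n \bigr) = \lambda.
\end{aligned}$$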
The following example illustrates that, in some sense, the version $\gamma_n$ can improve on
$\gamma_{\mathrm{ave},n}$.
Example 2.3. Let $C_\delta$ be the $n$-copula given by $C_\delta(\mathbf{u}) = \prod_{i=1}^{n} u_i \bigl[ 1 + \delta \prod_{i=1}^{n} (1 - u_i) \bigr]$ for $\mathbf{u}$ in $\mathbb{I}^n$,
with $\delta \in [-1, 1]$. $C_\delta$ belongs to the Farlie-Gumbel-Morgenstern family of $n$-copulas (see [3]).
From (2.4), we have that
$$\begin{aligned}
\gamma_n(C_\delta) &= \frac{1}{b_n - a_n} \Bigl( \int_{\mathbb{I}^n} \bigl( A^n(\mathbf{u}) + \bar{A}^n(\mathbf{u}) \bigr)\, dC_\delta(\mathbf{u}) - a_n \Bigr) \\
&= \frac{1}{b_n - a_n} \Bigl( \int_{\mathbb{I}^n} \bigl( A^n(\mathbf{u}) + \bar{A}^n(\mathbf{u}) \bigr) \Bigl( 1 + \delta \prod_{i=1}^{n} (1 - 2u_i) \Bigr)\, du_1\, du_2 \cdots du_n - a_n \Bigr).
\end{aligned}$$
For the case $n = 4$, after some elementary algebra, we obtain that $\gamma_4(C_\delta) = 32\delta/4725$. Observe
that if $(U, V, W, Z)$ is a vector of continuous random variables with 4-copula $C_\delta$, then any
three of these four random variables are mutually independent (note also that $\gamma_{\mathrm{ave},4}(C_\delta) = 0$);
however, all four are not unless $\delta = 0$.
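One way to check the four-dimensional value is sketched below; the reduction is ours and is not taken from the paper. Against the factor $\prod_{i=1}^{4} (1 - 2u_i)$, every term of $A^4 + \bar{A}^4$ that omits at least one coordinate integrates to zero, so only $2A^4 = M^4 + W^4$ contributes. Writing $\min(\mathbf{u})$ as $\int_0^1 \prod_i \mathbf{1}(u_i > t)\, dt$ gives $\int M^4 \prod (1 - 2u_i)\, d\mathbf{u} = \int_0^1 t^4 (1-t)^4\, dt = 4!\,4!/9!$, and the substitution $v_i = 1 - u_i$ followed by Dirichlet integrals gives $\int W^4 \prod (1 - 2u_i)\, d\mathbf{u} = \sum_{k=0}^{4} \binom{4}{k} (-2)^k / (k+5)!$. The constants $a_4 = 1/8$ and $b_4 = 13/24$ are assumed as in (2.2) and our reading of (2.3).

```python
from fractions import Fraction
from math import comb, factorial


def fgm_gamma4_slope():
    """Coefficient of delta in gamma_4 for the 4-dimensional FGM copula."""
    # Integral of M^4 * prod(1 - 2 u_i): Beta(5, 5) = 4! 4! / 9!
    i_M = Fraction(factorial(4) * factorial(4), factorial(9))
    # Integral of W^4 * prod(1 - 2 u_i) via Dirichlet integrals on the simplex
    i_W = sum(Fraction(comb(4, k) * (-2) ** k, factorial(k + 5))
              for k in range(5))
    a4, b4 = Fraction(1, 8), Fraction(13, 24)
    return (i_M + i_W) / (b4 - a4)
```

The function returns the exact fraction 32/4725, so the coefficient is small but nonzero, whereas every pairwise coefficient vanishes.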
3 Sample version
Let $\{(X_{1j}, X_{2j}, \ldots, X_{nj})\}_{j=1}^{m}$ be a random sample of size $m$ from a vector $(X_1, X_2, \ldots, X_n)$
of continuous random variables whose associated $n$-copula is $C$, with corresponding ranks
$\{(R_{1j}, R_{2j}, \ldots, R_{nj})\}_{j=1}^{m}$. Then the sample version of (2.4) is given by
$$g_n = \frac{\dfrac{1}{m} \displaystyle\sum_{j=1}^{m} \bigl[ A_m^n(R_{1j}, R_{2j}, \ldots, R_{nj}) + \bar{A}_m^n(R_{1j}, R_{2j}, \ldots, R_{nj}) \bigr] - c_{m,n}}{d_{m,n} - c_{m,n}}, \tag{3.1}$$
where
$$A_m^n(R_{1j}, R_{2j}, \ldots, R_{nj}) = \frac{\min(R_{1j}, R_{2j}, \ldots, R_{nj}) + \max[R_{1j} + \cdots + R_{nj} - (n-1)(m+1),\, 0]}{2}$$
and
$$\bar{A}_m^n(R_{1j}, R_{2j}, \ldots, R_{nj}) = m + 1 + \sum_{k=1}^{n} (-1)^k \sum_{1 \le i_1 < i_2 < \cdots < i_k \le n} \frac{\min(R_{i_1 j}, R_{i_2 j}, \ldots, R_{i_k j}) + \max[R_{i_1 j} + \cdots + R_{i_k j} - (k-1)(m+1),\, 0]}{2},$$
and $c_{m,n}$ and $d_{m,n}$ are two constants such that $g_n = 1$ when the ranks coincide (perfect
dependence), and $g_n = 0$ when the ranks have a natural order (independence case), namely
$j_1, j_2, \ldots, j_n$. Using these conditions, we obtain that
$$c_{m,n} = \frac{1}{m^n} \sum_{j_1=1}^{m} \sum_{j_2=1}^{m} \cdots \sum_{j_n=1}^{m} \bigl[ A_m^n(j_1, j_2, \ldots, j_n) + \bar{A}_m^n(j_1, j_2, \ldots, j_n) \bigr]$$
and
$$d_{m,n} = \frac{1}{m} \sum_{j=1}^{m} \bigl[ A_m^n(j, j, \ldots, j) + \bar{A}_m^n(j, j, \ldots, j) \bigr].$$
Observe that, as is easy to check, for $n = 2$ we have that
$$A_m^2(R_{1j}, R_{2j}) + \bar{A}_m^2(R_{1j}, R_{2j}) = \frac{m+1}{2} + \frac{|R_{1j} + R_{2j} - (m+1)| - |R_{1j} - R_{2j}|}{2};$$
then
$$c_{m,2} = \frac{1}{m^2} \sum_{j_1=1}^{m} \sum_{j_2=1}^{m} \Bigl( \frac{m+1}{2} + \frac{|j_1 + j_2 - (m+1)| - |j_1 - j_2|}{2} \Bigr) = \frac{m+1}{2}$$
and
$$d_{m,2} = \frac{1}{m} \sum_{j=1}^{m} \Bigl( \frac{m+1}{2} + \frac{|2j - (m+1)|}{2} \Bigr) = \frac{m+1}{2} + \frac{\lfloor m^2/2 \rfloor}{2m},$$
and replacing in (3.1), we obtain (1.1).
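The statistic (3.1) can be computed directly from its definition for small $m$ and $n$. The sketch below is a brute-force illustration under two assumptions: the factor $1/2$ is applied to each bracket in both rank-based functions (which is what the identity for $n = 2$ requires), and $c_{m,n}$ is obtained by summing over all $m^n$ rank combinations, which is only practical for small samples. The function names are illustrative.

```python
from itertools import combinations, product


def A_m(r, m):
    """Rank analogue of A^k for a k-vector of ranks r."""
    k = len(r)
    return 0.5 * (min(r) + max(sum(r) - (k - 1) * (m + 1), 0))


def A_bar_m(r, m):
    """Rank analogue of the survival function of A^n, following (1.2)."""
    n = len(r)
    total = m + 1
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            total += (-1) ** k * A_m([r[i] for i in idx], m)
    return total


def kernel(r, m):
    return A_m(r, m) + A_bar_m(r, m)


def g_n(rank_rows):
    """Sample coefficient (3.1); rank_rows holds m tuples (R_1j, ..., R_nj)."""
    m = len(rank_rows)
    n = len(rank_rows[0])
    # c: average of the kernel over all m^n rank combinations
    c = sum(kernel(j, m) for j in product(range(1, m + 1), repeat=n)) / m ** n
    # d: average of the kernel when all n ranks coincide
    d = sum(kernel((j,) * n, m) for j in range(1, m + 1)) / m
    mean = sum(kernel(r, m) for r in rank_rows) / m
    return (mean - c) / (d - c)
```

For example, with ranks $(1,2,3,4,5)$ and $(2,1,4,3,5)$, `g_n([(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)])` agrees with the value $8/12$ given by (1.1).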
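For $n = 3$, since
$$A_m^3(R_{1j}, R_{2j}, R_{3j}) + \bar{A}_m^3(R_{1j}, R_{2j}, R_{3j}) = \frac{1}{4} \Bigl( m + 1 + \sum_{1 \le i < k \le 3} \bigl( |R_{ij} + R_{kj} - (m+1)| - |R_{ij} - R_{kj}| \bigr) \Bigr),$$
we have that
$$c_{m,3} = \frac{1}{m^3} \sum_{j_1=1}^{m} \sum_{j_2=1}^{m} \sum_{j_3=1}^{m} \frac{1}{4} \Bigl( m + 1 + \sum_{1 \le i < k \le 3} \bigl( |j_i + j_k - (m+1)| - |j_i - j_k| \bigr) \Bigr) = \frac{m+1}{4}$$
and
$$d_{m,3} = \frac{m+1}{4} + \frac{3 \lfloor m^2/2 \rfloor}{4m},$$
and replacing in (3.1), we obtain $g_3$ as the pairwise average of the $g_2$'s.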
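For the case $n = 4$, if $\mathbf{R}_4 = (R_{1j}, R_{2j}, R_{3j}, R_{4j})$, after some elementary calculations we
have
$$\begin{aligned}
A_m^4(\mathbf{R}_4) + \bar{A}_m^4(\mathbf{R}_4) ={}& \frac{1}{4} \sum_{i=1}^{4} R_{ij} + \frac{1}{4} \sum_{1 \le i < k \le 4} \bigl( |R_{ij} + R_{kj} - (m+1)| - |R_{ij} - R_{kj}| \bigr) \\
&- \frac{1}{4} \sum_{1 \le i < k < l \le 4} \bigl( |R_{ij} + R_{kj} + R_{lj} - 2(m+1)| + 2\min(R_{ij}, R_{kj}, R_{lj}) \bigr) \\
&+ \frac{|R_{1j} + R_{2j} + R_{3j} + R_{4j} - 3(m+1)|}{2} + \min(R_{1j}, R_{2j}, R_{3j}, R_{4j});
\end{aligned}$$
thus,
$$\begin{aligned}
c_{m,4} ={}& \frac{m+1}{2} - \frac{1}{4m^4} \sum_{j_1=1}^{m} \sum_{j_2=1}^{m} \sum_{j_3=1}^{m} \sum_{j_4=1}^{m} \sum_{1 \le i < k < l \le 4} \bigl( |j_i + j_k + j_l - 2(m+1)| + 2\min(j_i, j_k, j_l) \bigr) \\
&+ \frac{1}{m^4} \sum_{j_1=1}^{m} \sum_{j_2=1}^{m} \sum_{j_3=1}^{m} \sum_{j_4=1}^{m} \Bigl( \frac{|j_1 + j_2 + j_3 + j_4 - 3(m+1)|}{2} + \min(j_1, j_2, j_3, j_4) \Bigr)
\end{aligned}$$
and
$$d_{m,4} = \frac{3 \lfloor m^2/2 \rfloor}{2m} + \frac{1}{2m} \sum_{j=1}^{m} \bigl( |4j - 3(m+1)| - 2\,|3j - 2(m+1)| \bigr).$$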
Similar expressions can be found in higher dimensions. Finally, note that the statistic (3.1)
is an alternative to the average pairwise Gini's coefficient which, using the above notation, is
given by
$$g_{\mathrm{ave},n} = \binom{n}{2}^{-1} \frac{1}{\lfloor m^2/2 \rfloor} \sum_{1 \le i < k \le n} \sum_{j=1}^{m} \bigl( |R_{ij} + R_{kj} - (m+1)| - |R_{ij} - R_{kj}| \bigr).$$
Observe that this is an estimate of (2.5).
4 Discussion
We have defined a multivariate population version of Gini's coefficient in terms of
probabilities of concordance. Of course, one could define the generalization using other ideas, for
example, by looking at $\ell_1$ distances. In the bivariate case, the coefficient is a scaled $\ell_1$ distance
between the main diagonal ($x = y$) and the orthogonal set ($x + y - 1 = 0$). For the trivariate
case, one could look at $\ell_1$ distances between the main diagonal ($x = y = z$) and the orthogonal
set (which is now the plane $x + y + z = 3/2$). It would be easy to generalize this idea to
higher dimensions.
On the other hand, it is known that for any $n$-copula $C$ the lower bound for the multivariate
versions of Kendall's tau and Blomqvist's beta in [4] and [5], respectively, is attained when
at least one of the bivariate margins of $C$ is $W^2$ (see [5]), and this bound, which is
best-possible, is given by $-1/(2^{n-1} - 1)$. The lower bound of these two measures for the 4-copula
given in Example 2.1 is $-1/7$, which is greater than $-7/40$, the value of $\gamma_4(C)$. It is an open
question to find the best-possible lower bound for $\gamma_n(D)$ for any $n$-copula $D$.
Acknowledgements The authors would like to thank an anonymous referee for his/her valuable suggestions
on an earlier version of this paper. The first and second authors thank the research council
of Shiraz University for support. The third author also thanks the Ministerio de Ciencia y
Tecnología (Spain) and FEDER, for support under the research project BFM2003-06522.
References
[1] Gini, C. (1914). L'Ammontare e la composizione della ricchezza delle nazioni. Bocca,
Torino.
[2] Gould, H.W. (1972). Combinatorial Identities. Morgantown Printing and Binding Co., W.
Va.
[3] Nelsen, R.B. (1999). An Introduction to Copulas. Springer, New York.
[4] Nelsen, R.B. (2002). Concordance and copulas: A survey. In: C. Cuadras, J. Fortiana, J.A.
Rodriguez (Eds.), Distributions with Given Marginals and Statistical Modelling. Kluwer
Academic Publishers, Dordrecht, pp. 169-178.
[5] Úbeda-Flores, M. (2005). Multivariate versions of Blomqvist's beta and Spearman's
footrule. Ann. Inst. Statist. Math. In press.