7

AGeometrical - perso.univ-lyon2.fr

  • Upload
    others

  • View
    11

  • Download
    0

Embed Size (px)

Citation preview

Page 1: AGeometrical - perso.univ-lyon2.fr
Page 2: AGeometrical - perso.univ-lyon2.fr

A Geometrical Relat.ional Madelfor Data Analysis

Yves.Schektrnan! and Rafik Abdesselarrr'

l' i bis, rue des Potiers, 31000 Toulouse, France2 GEMMA-LERE, Faculté de Sciences Economiques et de Gestion,

Université de Caen, 14032 Caen, France

Abstract. 'Theproposed model aIlo~s analyses which are more powerful than Fac-torial Discriminant or Corresponde~ce: Analyses; it may be considered as a usefulcomplement to MultivariateAnalysisof Variance, Comparing to MANOVA, statis-tics are not carried out fromvariables, but from statistical uni ts. The statisticalunit space, linked to the variable space by an isometry, contains two orthogonalsubspa~é's assaciated to mean-and residual values as weiL ,

Finally, lnertiaof configurations of points in the unitspace, used in particu-lar to determine factorial aXes,can measure either symmetrical or dissymmetticalassociation coefficients from explanatory variables to independent variables,

1 Introduction

ln the geometrical Relational Model proposed, vectors represent statisticalunits (s.u), for example individuals. We can describe different configurationsof s.u. points, located in a maximum of four subspaces,associated respec-tively to independent variables, explanatory variables, in particular durnmyvariables associated ta the levelsof acontrolled factor, .mean and residualvariables. The Relational Modelis of sorne interest because the distance ins.u, space is Relational (Schektman (1978)), Le. taking into account relation-ships observed between variables. '

_ Section 2 is concerned with a brief description of Relational Distances.Sorne proofs of the suitability of this choice are given in section 3. We show,insection 4, that the a priori general Relational Model can be simplified andwe givesorne propertiesof practical i~terest. "

To shorten this paper, notations arenot described as saon as we are ableto understand them, without .difficulty, by analogy with similar notationsdefined above. We do the same for sorne properties and we give references forsorne proofs which have already been published.

2, Relaticinal distances<. »,

{xi} and {zk} being twosets of zero mean variables, observed on the samepopulation of statistical units (s.u.), let us denote:

Page 3: AGeometrical - perso.univ-lyon2.fr

él60 Schektrnan and Abdesselam A Geornetrical Relational Model for Data Analysis 361

- Ex [resp. Ez] the subspace of the s.u. space E = Ex œ E; œ ... identified,by the canonical injection lnx [resp.lnz], to an euclidean vector spaceEx [resp. Ez] whose dimension is equal to tire number of ~ariables {xi}

/ [resp. {zk}], . .- lmt X c Ex [resp. Imt Z C Ez] the image of the mapping transpose of

the linear mapping X[resp. Z] defined by the matrix, denote X [resp. Z],whose elements of the jth [resp. kth] column are the. values of variablei [ "] . . -Ô:»x resp. Z,.· ....

- Xi E Ex [resp. Zi E E~]thes.u. vector whose coordinates are the elementsof ith row of matrix X [resp. Z], ,

- N", = {x;} C Ex [resp. Nz = {Zi} C Bz] the configuration of s.u. pointsassociated to the rows or' matrix X [resp. Z], ..

- Mx [resp. Mz] an .euclidean distance .in space Ex [resp. Ez],D the diagenal euclidean. distance, whose.elements are the weights at-tached to s.u., in .the variable space denoted F,. '" . .{Àj(x)} [resp. {>.k(Z)}), {Cj(x) EEx} [resp.{c·k(Z)E Ez}land{Ci(x) =[Àj (X)]-1/2 X M",cj(x) E F} [resp. {Ck(z) E F}] respectively the principalinertia moments and a basis of principal vectors of N", [resp. Nz], and thecorresponding normalized principalcornponènts, with respect to the pairof distances ( Mx , D ) [resp. ( u, ,D )],

- P; [resp. P;] the orthogonal projection operator; in space Ex [resp. Ez],onto the sth principal axis spanned by cs(x) [resp. cs(z)].

Note that the use of generalized inverse is necessary when variables {zk} are,in particular, the zero .mean dummy variables associated to the levels pf a .factor because dimension of lmt Z equals dimE~ -1 thus VzMz is singular.

Lemrna l'

Lèt Ux = XM",[(V",Mi)I/2j+ and U; = ZMz[(VzMz)I/2]+.

al) (Vsf)...(x)f. 0) U,,[cs(x)] = C8(x) ; KerU", = (ImtX).l ,ImU", = ImX.a:2) u; is apartiàl isometry from Imt.xc E", onto ImX C F.b) Same propertlee for Uz' •ln the following property, U is the partial linéar mapping defined by

çv'(t = luxer) + lnz(s) f r E Ex, sE 'Bz)) U(t) = Ux(r) + U~(s) E F.

Definition

M is a Relational (serni-) Distance in s.u, 'spacé E,with respect to the setsof variables {xi} and {zk}, if and only if

a) M[Inx(ci(x)), lnz(ck(z))j = D[Ci(x), çk(z)] if Ài(X)Àk(Z) f. Of"- (1)b) M[lnx(u), lnz(v)] = a if u E (Irn'X') ': or if v E (lmt Z).l. .•

For convenience we shall use indifferently tz or Inxf u) in th~ followirrg. It isshown (Schektman (1978)-(1994») and Croquette (1980») the following prop-erty 1 and lemma 1. '

Property 2

.Sêt:'~h!!'iOÙO~in,g:'!assertions:. .

a) (V(h, t2) E (Imînx' X Œl l~lnz t z)2) M(tl, t2) = D[U(t!), U(t2)].b) The image, via U, of each pair of canonical variables carried out from

(Imlnx tX, lmlnz t Z, M) isa pair of canonical variables carried out from(IrriX, ImZ, D), moreover the corresponding canonical correlation coef-ficients are equal. .

c) The restriction of M to Imlnx' X œ lmlnz tZ is an euclidean distance.

Wè have: Al «(1) {;:}(a) <=> (b),El (a) =} (ImX n ImZ = {O} {;:}(c)).

Property 1

Given that tlnxMlnx = M", and tlnzMlnz = Mz, where (Mx, Mz) is a pairof euclidean distances, M is a Relational (semi-) Distancewith respect to thesets of variables {xi} and {zk}, if and only if

ilnxMlnz = M",[(VxMx)I/2l+ Vxz Mz[(Vz Mz) 1/2]+ (2)where '

- \Ix = iXDX, Vz = tZDZ, Vu = tXDZ, .- for t equal x or z, [(VtMdl/2]+ = L [Às(t)]-1/2 Pt' is the Moore-

{S/À,(t);iO}

Penrose generalized inverse of (FtMt)I/2, weighted by Mt. •

Proof

Al (a) =} (1) : obvious as lemma 1-a1-b holds.(1) =} (a) : starting from expressions of tl and t2 in a basis of principalvectors, then, by developing M(tl, t2), using (1), -êqualittes of norms andangles for corresponding, Cj and Ci, and finally lemrna l-a l-b, it cornesthe developmentof D[U(tl), U(t2)] telatively to principal' vectors.(a) =} (b) : obvious as lernma 1-a2-b holds, reasoning by absurdo(b) =* (a) : similar proof as for (1) =>- (a)), but starting from expressionsof tl and h in a basis of canonical variables.

Bj d'iveb that Dis an euclidean distance, bilinearity, symmetry and posi-tivity of the restriction of M follow immediately from (a); furthermore,giyen that U is a linear mapping, IrnX n ImZ· = {a} is equivalent to "Uis à -partial bijection from Imlnx' X œ lmlnz tZ onto ImX œ ImZ" ,

~equivalent ta (Vt E (lmlnx tX œ Imlnz tZ, t f. 0) * U(t) f. 0)and finally, according to the properties of M, equivalent to (c), because(a) =} (U(t) # 0 {;:} M(t,t)f. 0). •

It cornes that if (1), Le. property 2-a, holds th en U is a partial isometry fromIrn lnx' X a1 Imlnzt Z onto ImX ffi IrnZ if and only if IrnX n ImZ = {a}.

Page 4: AGeometrical - perso.univ-lyon2.fr

%2 Schektrnan and Abdesselam A Geometrical Relational Model for Data' Analysis 363

3 Furtdamental results

- Pz the orthogonal projection operator onto E; C E(Pz is defined be-cause the restriction of Mto Ez is an euclidean dist~nce (Schektmanand Abdesselam (2000))), .

- P: [resp. P:] the orthogonal projection operator onto.the sth canonicalaxis, carr ied out from (Imlnxt'Xi; Irnlnzt Z, M),' belongingto Ex [resp;.Ez),

- N; = {Pz (Xi) 1 Xi E N,;}C Ex,c.,E,il: = {P;(Xi)[ Xi E Nx}CE", CE,;

- [[N;] [resp. I[N;J] the inertia of N; [resp. 1V;laccording to its centre ofgravit y (origin of E),

- Ps the sth canonical correlation coefficient carried out fra/TI(lm~, ~ll!.Z,D),- Qz the orthogonal projection operator onto ImZ, ,... ..

ln property 3-a, we ~ive a statisti~al'and ge~metricat constructionof .N:,configuration of points which plays a fundalllentai role 'in our p.ppro~c~.·

[[N;] synthesizes, in term of inertia, the classical syrrimetrical~sociationindices. Moreover, the principal axes ofN;ancl thecorresponding principalcomponents arevaccordlng iothe types ofvariables {xi} and {zk}, thoseof Factorial Correspondence Analysis (Beniecri(1982)) or Faétorial Discrim-inant Analysis -:

b) If Mx is the unit matrix then [(N;) = 'Ei 1\ Qz(xi) 112 (property 3-c)and [(Nx) = 'Ei Il xi 112: soleN;) 1[(N",) synthesizes, in terms of inertia,the classical dissymmetrical association coefficients (Goodman-Kruskall -r,Stewart-lové coefficient (1968)). These properties are used, in particular, todefine a Factorial Dissymmetrical Correspondence Analysis (Abdesselam andSchektman(1996)) .

Let us den ote:

4 Relational model

Note 1

Let vs give sorne more fundamental results (Schektman(1987)) which .illus-trate the significant information contained inN;. .'.

a) If Mx = (t X DX)+, where "+" denotes the Moore-Penros~ generalizedinverse, th en I(iI:] = 1 and consequently [[N;] = L. P; (property 3~b): so

4.1 Qiéneral model .

. {xi} and {yk} being respectively independent' vadabÎes and explanatory vari-ables, let: .

- {gi = Q,,(xj)} called mean variables if {yk} are the zero mean dummyvariables associated to the levels of afactor, or fitted variables otherwise,

- {ri = xi - gi} called residual variables, J

where Qy is the orthogonal projection operator ontolmY CF.Of course, we define for variables {yk}, {gi} and {ri}, the same notations

EiJ, ImY, P", Q", ... , as defined in sections 2and 3 but for variables {zk}

We have the following classical results :

- variables {gi}'arid {ri}ar~ zero ~~~ns .. '- ImG C ImY , ImR,.L ImY , IrnX C ImG ffi ImR. (3)- Vry = Vrg = 0, Vg = Vxg == Vgx, Vgv = Vxv" Vr = lin = Vrx = V", - Vg.

As for variables {xi}, a configuration of s.u. points, denoted,iNg [resp. Nr],

isassociated to variables {yi} [resp. {ri}]. _Theproposed Relational Model must satisfy the following hypotheses:

Hl) E= e; ffi s, ffi s, ffi Er.. H2) .M Îs a relational (semi-) distance in E for each of the six pairs of sets of

variables defined justabove.H3) Distances in spacesEgand Er are equal to euclidean distance Mx in Ez,:

it isindeed reasonable to "see' Ngand N; in the same way as Nz:

According to Note 1, Ev is.,the "explanatory" subspace upon which we shallproject s.u. {Xi}' So the nature of euclidean (semi-) distance in E" is of noimportance; however, we shall optfor the Moore-Penrose generalized inverseof V"' denoted Vy+, for its use simplifies calculations. Note that we can optfor the chi-square distance if {yk} are associated to a factor.

Property 3

If M is relational for variables {xi} and {zk} then

a) Pz(x;) = LP:P;(Xi) with Il P:P;(Xi) Il,= P. Il P;(Xi) II·s .' '.' .

b) I[N;] = 'Ep;I[iI;].s

c) [lN;] = LÀj(X) Il Qz[Ci(x)] W = LÀi(x) Il Pz[Cj(x)] Wj i

= 'E (M"']ii' D(Qz(xi), Qz(xi')](i,n

where [Mx]jj' is the (j,j') element of matrix Mx.

Proof

As Nx C lm -x, we have Inx(~i) = 'Es- P;Inx(Xi) then,,(a) follows fromprojective property of cauonicalaxes and property 2-A. As canonical axesare orthogonal~obviously (a) ::} (b). (c) is shown in (Schektman(i9g4)). _

Page 5: AGeometrical - perso.univ-lyon2.fr

364 Schektman and AbdesselamA Geometrical Relational Model for Data Analysis 365

Property 4

a) M is an euclidean semi-distance; its restriction to Imlngta ffi Irnlrir! Ris a distance.

b) e, ..LEr.

PxPg1nx = Inx[(V:r.Mx)1/2J+ Vg Mx [(vx Mx)1/2J+ using (6);

then adding P",(Pg + Pr)Inx = Inx, since Vr + Vg = Vx .

Proof

Therefore [[X - (Pg + Pr)(X) [[2 = M[x,x]- M[x, (Pg + Pr)(X)] = 0

for. M[x, (Pg + Pr)(X)] = M[x,Px(Pg + Pr)(x)] = M[x,x]. •a) These results follow from (3) and property 2-B.b) Using (2), it follows that Vrg = 0 =:} 'tlnrMlng = O. •

Accordingtolemma 2-b, euclidean representations of Ng and Ni are identi-cal, sa the Relational Madel can be simplified by taking E = Ex EIl Eg ffi Er .Rence variables {yk} only serve to calculate variables {gi}, and Eg replacesEv' Notice that Eg is of a richer nature than Ey since Eg :J Ng• This sim pli-ficationcan be confirmed analytically: for, it follows

4.2 Simplified. model

Lemma 2

a) (Vg E Eg)b) (Vx E Ex)c) (Vx E Ex)

l[g-Py(g)II=O.[[ Pg(x) - Pv(x) [[ = O.[1 x - (Pg + Pr)(X) I[ = o.

• from (6) that the principal components and principal inertia momentsassociated to principal axes of Ng, are characteristic elements ofX t[(V",Mx)1/2]+ t(VgMx)1/2 Mx (VgM",)1/2 [(V",M",)1/2)+ t X Dequal to X Mx [(VxMx)1/2)+ l'oMx [(VxMx)1/2]+ tX D . (9)

Il from (7) that the corresponding operator, but with N!i, isXMx [(VxM"Y/2]+ v., Vy+Vy", M",[(VxMx)1/2)+ t X D equal to expres-sion (9) sinee Vyx = Vyg and (5).

Proof

a) Wè have Pglny = IngM;l tlngMlny= Ing[(VgM",)1/2]+Vgy Vy+[(Vy Vy+)1/2]+ using (2)= Ing[(V: M )lj2]+v: v-v v:+9 . x. .... . 9V y y y= Ing[(VgMx)t/2'I+Vg~V/ .

Pylng = InyVvgMx[(VgMx)1/2J+ . (4)

PgPylng = Ing[(Vg Mx)l/2]+ Vgv Vy+VygMx[(VgM;r)1/2J+ .TIngv: v:+v. - taDyv:+tYDG - 'coo G ~ 'coc - v: (5)gy v yg - 'y - y . - - s :

[[9 - Py(g) W = M[g, g)- M[g, Py(g)] = 0M[g,Py(g)]= M[Pg(g),Py(g)] = M[g,PgPy(g)] == M[g,g).

According to lemma 2-c euclidean represe'fit:ations of Nx and Ng+r == {Pg(Xi)+Pr(Xi) / Xi E Nx}are identical; so, tlle M~del can be once more simplified bytaking E = Eg ffi Er and replacing Nx by Ng+r.

Sirnilarly

It follows Note 2[or

Thus[or

Using (6) and (8), the two following partitioned matrJces,

b) We have Pglnx = Ing[(VgMx)1/2]+ VJ~Mx[(VxM,z:)1/21+= Ing(VgMx)1/2 [(VxM"Y/2)+ since Vg", = Vg. (6)

Using (G), (5) and Vyg= Vy"" it followsPyPglnx = InyVyx Mx [cYx Mx)1/2]+ = Pv1nx . (7)

thus 1/ Pg(x) - Py(x) [1 = Il Pg(x) - PVPg(x) Il = 0 using (a)

X t([(V:M)1/2J+) "((VgM",)1/2), d (Mx 0)x x . (VrMx)1/2 . an OMx

are respectively associated to Ng+r and to distanee M in E = Eg EB Er.

4.3 Saille properties

c) We have PxIllg = Inx[(VxMx)1/2]+(VgMx)1/2 since V",g == VgPxIllr = IllX[(V",M",)1/2]+ (VrMx}1/2 sinee V:i:r= VrPrIllX = Inr(VrM:i;)1/2 [(VxMx)1/2]+ since Vr'" = Vr. (8)

Pruper-ty 5

It follows PxPr1nx = Inx[cYxMx))/2]+ VrMx[(ViMx)1/2J+

Principal axes and principal inertia moments of Ni and Ng [resp. N; andNr] are identical; moreover , the principal components associated to principalaxes of 'Ng [resp. N;] belong to.Irn.X.

Page 6: AGeometrical - perso.univ-lyon2.fr

3GG Schektrnan and Abdesselarn A Ceometrical Relational Model for Data Analysis 367

Proof Pro of

It lollo\Vs from lernma 4 and lemma H. that Pg[c.(x)] = UOQg[C8(X)] andPr[c.(x)] = U*Qr[CS(x)]. Moreover,aS ImG 1. ImR, e, 1. Er and accord-ing to lernma 3-a1-b thus U' is a partial isometry from IITlG œ ImR ontoImIng 'c œ ImInr tR.. •

lt follows from (G) that principal axes and principal inertia moments of Nfare characteristic elements of

(1!gM"Y/2 [(V",.i\{,,)1/2]+tXDX M",[(V",M"Y/2]+ (Vg~",)1/2 = VgM",

and the property of associated principal components is a consequence of (9).Same proofs for N; and Nç, use in particular (8). _ Note 3

ImUr = ImR and ImUg = ImG. (10)

. .' .

According to Notel, symmetrical (Benzecri(1982») or dissymmetrical (Ab-desselarn and Schektman(1996» Correspondance Analyses, with simultane-OUS representationof tnodalities of'variables, are equivalent ta Principal Corn-ponent Analysis (PCA) of(N% UNg), suitable distances being chosen in Eg.So, the Relational Model leads us naturally to enrich the results, providedby these analyses, with those of PCA of (N; U Nr);

Let Ug = GMx[(VgM",)1/2J+, U: == [(VgM",)l/ZJ+ tGD, Ur = RMx[(VrM.Y/2J+and U: = [(VrM",)1/2]+ t RD. Obviously, Ur ane!.Ug have the sarne properties(lem ma 1) as Ux; in particular .

It is easy to show the following lemma 3 (Schektman (1994». 5 Conclusion

IngU:QgUX = PgInx and InrU:QrUx = Prlnx.

Notes 1 and 3 clearly describe that Relational Model is a formaI tool useful(i) to synthesize weil known FactorialAnalyses; (ii) toenrich provided resultswith those extracted frorri résidual configurations ofs.u. poin~s, alld (iii) toextend the areaof these analysesto dissymmetrical association coefficients.Concerning this latter new approach, which is often more appropriateto the

. observed reality, yOUCÇl.nfindcriteria, a tooI, an exarnple and references in(Abdesselam and Schektrnan (1996» ta know, in particular, how ta choose areasonabledissyrnmetricalassociation coefficient and for what benefits.

The fundamental utility of the Relational Model is to propose the orthog-onal decomposition Xi = Pg(Xi) + Pr(Xi) of each s.u. vector, according tomean and residual subspaces. Thus we hold in the s.u. space E what classi-cally exils for each variable xi = Qg (xi) + Qr(xi) = ~i +ri, in the variablespace F. Moreover, E = Eg ffi Er and F being linkëd by an isometry,wecan enrich the representation of e.u, points on principal planes, with respecteither to fittecl (or mean) variables or residual variables, with' elements of F,as indicated in Property 6.

These results may be useful, in MANOVA, if we really must try, for ourself,to understand, with more details, variations observed on data. ln this case,we shall notice that

• design matrix Y can either correspond to dumrny variables associated tothe lèvels of a factor or be deduced from a mill hypothesis on parameters,

• ûsing Property 3-c, I[N%] = Li ÀJCx) Il Qg[Gi(x)]IIZ = trace[VgMx].. c> . .

Notice that I[N1J is equal to Pillai criteria if Mx = V.+.Obviously, the Relational Model can be alsoof pratical interest in cluster-

ing or classification, where explanatory variables {yk} are quantitatives andvariables {xi} are dummy variables.

Lemma 3

al) U; is a partial isornetry from ImG onto ImtG.a2) U: is the Moore-Penrose generalized inverse of Ug, weighted by the pair

of distances (Mx,D).b) Same properties for U:. _Lemma 4

Praaf

IngUiQgUx = IngUiUgUiUx = IngUiUx (Jemma 3-a2 and (10»= Ing[(VgMx)1/2]+ tGDXMx[(VxMx)1/2]+= IngM;l tIngMlnx = PgInx using (2).

Similar pro of for the second expression. _

Let U' be the partiallinear mapping defined by(V(t = u+ w 1 u E IrnG, W E ImR» U*(t) := IngU:(u) -1- InrU:(w) E E.

Praperty 6J

E = EgEBEr can be enriched withthe images, via U', of Qg[C' (x)], Q,.[Gs (x)],gi = Qg (xi), ri = Qr(Xj

) and xi = Ls a~CS(x) Le. respectively with Pg[csex)J,Pr[csex)], Ls a~Pg[cs(x)], L. a{Pr[cs(x)] and L. a{[Pg + PrJc.(X).

Page 7: AGeometrical - perso.univ-lyon2.fr

368 Schektman and Abdesselam

Finally, as explanatory subspaee Im)" in the variable space, proved -itsutility for independent variables, we hope thata large scale use of eorre-sponding subspace lm ty (or lm tG), in the Relaaîonal Model.wil! prove thesame, but for statistieal unitsor individu aIs.

References

ABDESSELAM, R~ and SCHEKTMAN, Y.(1996) : Une Analyse Factorieiie'del'Association Dissymtrique entre deux Variables Qualitatives.'lievue ,deStatistique Applique. Paris, XLIV(2), 5-34. , _ _ _,-, _,' "

BENZECRI, J.~. (l!J82) : L'Analyse des donnes.: L'Analyse des Correspondances.Edition Dunod, Paris. -"" : _" _-- _" ~, ' ", ,-_

CROQUETTE, A. (1980) : QuelquèsRsultats Synthtiquesen Analysedes DonnesMultidimensionnelles : Optlrnalltej Mtriques Effets Relatlonpels.' 'I'hseSdmecycle, Universit Toulouse III.' -"" ' -

SCHEKTMAN, Y. (1978) : Contribution la Mesure en Facteurs dans les SciencesExprimentales et ' la Mise en Oeuvre Automatique des Calculs Statistiques.'Thse d'Etat, Universit Toulouse III. . -' ' " ' " , -

SCHEInMAN, Y. ,(1987) : ,A General EUc1ideanAppi-oacl;Cor Measurlng andDescribing Associations between, SeveralSetaofVarlables.Tni.C. Hayashi andal. (Eds.): Recent Deuelopments inclustering"and pata, Analysis. AcademiePress.Trie, Tokyo, 37~48. -, - ,"

SCHEKTMAN, Y. (1994) : Proprits des ProduIts Scalaires Rel~tionnel~.Not~ In-terne, DIEM~LEMME, 'Universit Tciiilouséni(47 pages).'" ,

SCHEKTMAN, Y. and ABDESSELAM, 11,;'(2000)':' Semi-Produits" ScalairesRelationneis. Note Interne, GEMMA-LERE, Universit de 'Caen (42 pages).

STEWART, D: an-d LOVE, W: (1968): Ageneràl canonlcalcorrelation index.'Psychological Bull, 70, 160-163. '