
MACROS FOR DETERMINING THE OPTIMAL CLUSTERS IN LARGE DATA SETS

B. GERARDIN
J.L. MOLLIERE
Électricité de France

Introduction

In most cases, before using a clustering technique, the user has no prior idea of the number of clusters which will give the best differentiation of his data.

This unknown number may correspond to some real and hidden structure of the data which the user may find very desirable to discover.

Most often, the degree of partitioning has no prior meaning before the user gives an interpretation of the different clusters that the method will produce. However, the user is very interested in summarizing his data in the best possible way; that is to say, in finding a compromise between a good degree of differentiation and a not too high number of clusters.

A usual way of determining this optimal number of clusters is to look at the squared multiple correlation R2 (the sum of squares between all clusters divided by the total sum of squares) for partitions corresponding to different numbers of clusters. The plot of R2 against the number of clusters may suffice to find the desired number of clusters.
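In symbols, writing B for the between-cluster sum of squares and T for the total sum of squares:

$$R^2 = \frac{B}{T}, \qquad 0 \le R^2 \le 1$$

Since R2 can only increase with the number of clusters, it is the shape of the plot, rather than the raw value, that indicates the desired number.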

In SAS 82.3 a new criterion, the cubic clustering criterion (CCC), is calculated, which provides very clear and useful information (particularly from a statistical point of view).

In either case (R2 or CCC), it is necessary to compute as many clusterings as there are different numbers of clusters required for plotting the R2 or the CCC.

Here lies the advantage of hierarchical methods (illustrated by PROC CLUSTER) which, in a single execution, give as many clusterings as there are levels in the hierarchy. However, PROC CLUSTER of SAS 82.3 quickly becomes very CPU-time consuming as the number of observations increases and must hence be restricted to data sets which are not too large (no more than some hundreds of observations).

On the other hand, the very efficient new procedure FASTCLUS of SAS 82.3 may also prove expensive when used repeatedly to give clusterings with different numbers of clusters.

It is easy to combine the respective advantages of these two PROCs (and of some others) in three MACROS, which will be described further on; they will help the user to get the clustering he is looking for, and provide him with supplementary information about the structure of his data.


A. METHODOLOGY OF MACROS MCLAS1, MCLAS2, MCLAS

I. Choosing the number of clusters

MACRO MCLAS1

1) The principle of mixed clustering.

On large data sets, a useful methodology consists first in summarizing the observations into a large enough number of clusters (100 may be a standard value) and then applying a hierarchical clustering technique to aggregate these groups.

This procedure keeps the advantage of the hierarchical method for showing an optimal number of clusters, and overcomes the difficulty of the too high initial number of observations by first clustering them, with a non-hierarchical method, into a smaller number of clusters. This number is a parameter of the procedure; it must be high enough in order not to impose a prior partitioning on the data.

MACRO MCLAS1 implements this mixed clustering technique, combining the use of PROC FASTCLUS for the initial partitioning into a fixed number of clusters (100 is the default value) and of PROC CLUSTER for the second step, where Ward's method is used.

PROC TREE then produces a tree diagram showing the "heights"¹ at which clusters join in the hierarchy, which may graphically be very clear for cutting the tree at the optimal level. For finding this optimum, two plots will also be very useful: the first plotting the CCC criterion and the second the semi-partial R2 (or decrease in R2 caused by joining the clusters) against the number of clusters. The last plot in fact contains the same information as that displayed on the tree diagram.
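For orientation, here is a condensed sketch of this pipeline (the full macro is reproduced in Appendix 1); the data set FACTORS and the variables F1-F5 are hypothetical names standing for the user's input:

PROC FASTCLUS DATA=FACTORS OUT=CLASSIF1 MEAN=FASTMEAN   /* step 1: summarize into   */
              MAXC=100 MAXITER=10 SHORT;                /* 100 preliminary clusters */
   VAR F1-F5;
PROC CLUSTER DATA=FASTMEAN METHOD=WARD OUTTREE=ARBRE;   /* step 2: Ward's method on */
   VAR F1-F5;                                           /* the 100 cluster centres  */
   FREQ _FREQ_;                                         /* cluster sizes from FASTCLUS */
   RMSSTD _RMSSTD_;                                     /* within-cluster standard deviations */
PROC TREE DATA=ARBRE;                                   /* step 3: tree diagram,     */
   HEIGHT _SPRSQ_;                                      /* heights = semi-partial R2 */
RUN;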

The two different approaches for determining the number of clusters are then that by the R2 and that by the CCC (which is itself linked to the R2).

Figures 1, 2 and 3 show the diagrams produced by MACRO MCLAS1 on real data (population 1 of the examples of part B).


Figure 1. Plot of the CCC (cubic clustering criterion) against the number of clusters, showing 6 clusters. Population 1

Figure 2. Plot of the number of clusters against the semi-partial R2, showing 6 clusters. Population 1


Figure 3. Tree diagram showing 6 clusters. Population 1

2) Measuring the stability of the results

As is well known, non-hierarchical algorithms always give, for a fixed number of clusters, only local solutions to the general problem of maximizing the R2 coefficient. In some algorithms the local optimum obtained depends essentially on the choice of the initial "seeds" (or "kernels", or "centres") of the different clusters, which are often randomly chosen among the observations.

The method used in PROC FASTCLUS, directly inspired by MacQueen's k-means algorithm, is much less sensitive to a random choice, since the initial seeds are in fact searched for by the algorithm itself during the first iteration, in such a way that they guarantee a good initialization. However, the optimum obtained still depends on the rank order of the observations in the input file.

A natural question is then to ask about the stability of the results when the initial order of the observations is changed. This question is answered in MACRO MCLAS1 (and MCLAS).

To get a new result with the same clustering procedure (FASTCLUS), a new file is generated in which the observations are ranked following a random rearrangement of their ranks in the initial file. This is done using the random number function UNIFORM.

MACRO MCLAS1 may be used on the same data with two options of random rearrangement of the initial file: repeatable (the random sequence will always be the same in different executions) or non-repeatable (random sequence initialized from the computer clock).
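A minimal sketch of such a rearrangement follows (the %RANDOM macro of Appendix 1 implements the version actually used; FACTORS is again a hypothetical data set name):

DATA SHUFFLED;                        /* attach a uniform random key to each observation */
   SET FACTORS;
   K1=UNIFORM(12345);                 /* a positive seed is repeatable; UNIFORM(0) draws */
RUN;                                  /* a non-repeatable, clock-initialized sequence    */
PROC SORT DATA=SHUFFLED OUT=FACTORS;  /* reorder the file on the random key */
   BY K1;
RUN;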

The hierarchy obtained in MCLAS1 may be kept in two output SAS data sets, making it possible to completely define the clusterings at any level of the tree, on the initial observations and variables.

The practical importance of finding the right number of clusters requires that the method give some measure of its own robustness, which is proposed here in MACRO MCLAS1.


II. The Optimisation phase (fixed number of clusters)

MACRO MCLAS2

MACRO MCLAS1 normally gives an optimum number of clusters. There remains the problem of getting the best possible clusters, i.e. maximizing the R2 criterion. MACRO MCLAS1 gives a very good starting solution, consisting of the partition corresponding to the optimum level of the hierarchical tree.

Optimization can be achieved either with MACRO MCLAS2, using PROC FASTCLUS on the centres of the previous clustering, or with MACRO MCLAS2B, using a new SAS procedure (written in FORTRAN 77), PROC ZTRANS, which applies the exchange method due to S. Regnier (Paris, 1964-1966). It can easily be proved that such an exchange method, performing every transfer of an observation from one cluster to another that makes the R2 criterion increase, is able to improve the optimum obtained by the iterative algorithm of FASTCLUS, the converse not being possible.
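PROC ZTRANS itself is not reproduced here. Purely as an illustration of the exchange principle, a single pass of such a transfer algorithm could be sketched in SAS/IML (a facility later than SAS 82.3), for a data matrix X (n x p) and a vector C of cluster labels in 1..K:

PROC IML;
START EXCHANGE(C, X, K);
   N=NROW(X);
   G=J(K,NCOL(X),0);  M=J(K,1,0);              /* centroids and cluster sizes */
   DO I=1 TO N;
      G[C[I],]=G[C[I],]+X[I,];  M[C[I]]=M[C[I]]+1;
   END;
   G=G/M;
   DO I=1 TO N;                                /* one pass over the observations */
      A=C[I];
      IF M[A]>1 THEN DO;                       /* never empty a cluster */
         OUT=(M[A]/(M[A]-1))*SSQ(X[I,]-G[A,]); /* within-SS saved by removing I from A */
         DO B=1 TO K;
            IF B^=A THEN DO;
               INC=(M[B]/(M[B]+1))*SSQ(X[I,]-G[B,]);  /* within-SS added by joining B */
               IF INC<OUT THEN DO;             /* net decrease of within-SS, i.e. R2 rises */
                  G[A,]=(M[A]*G[A,]-X[I,])/(M[A]-1);
                  G[B,]=(M[B]*G[B,]+X[I,])/(M[B]+1);
                  M[A]=M[A]-1;  M[B]=M[B]+1;
                  C[I]=B;  A=B;
                  OUT=(M[A]/(M[A]-1))*SSQ(X[I,]-G[A,]);
               END;
            END;
         END;
      END;
   END;
   RETURN(C);
FINISH;
QUIT;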

These two MACROS must be executed after MACRO MCLAS1 (or MCLAS). They use the output data sets of MACRO MCLAS1 as input data sets and only require the user to specify the number of clusters he has chosen at the end of the previous step (MACRO MCLAS1).

They produce an output SAS data set which includes the final clustering variable. This data set has the structure of the OUT data set of PROC FASTCLUS.
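For concreteness, a hypothetical invocation of the pair (with the factors F1-F5 on a data set FACTORS, and 6 clusters chosen from the tree) would be:

%MCLAS1(ENTREE=FACTORS, VAR=F1-F5, RAND=12345);
%MCLAS2(VAR=F1-F5, NCLASSE=6);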

III. On the comparison of different clusterings

(fixed number of clusters).

MACRO MCLAS

The sequence which we have just shown, of running MACRO MCLAS1 to determine the number of clusters and then MACRO MCLAS2 to optimize the clusters at the optimal level of the hierarchical tree, could seem to solve the user's practical problem completely.

Of course a solution has been found, and probably a good enough one, at very low cost. However, it may be very useful, and in some cases necessary, to go further.

The main result at the end of this step of our investigation is the indication, given by MACRO MCLAS1, of the optimal number of clusters.

On the other hand, the optimized clusters obtained by MACRO MCLAS2 must only be considered as one solution, corresponding to a local optimum of the optimization problem. MACRO MCLAS will provide a means for getting nearer to the global optimum.

Based on theoretical results by E. Diday (1974), this approach will define the new notion of "strong pattern" which, indirectly, will also prove useful for studying the structure of the population of observations.



1) Different trials using FASTCLUS

(fixed number of clusters).

Direct use of PROC FASTCLUS is now possible, since the number of clusters has been determined. MACRO MCLAS will run five executions of PROC FASTCLUS in the same conditions, only changing the order of the observations in the input file (which, in each trial, is randomly generated from the initial order).

Of course the five clusterings obtained will be different. This is due to the fact that the solution given by PROC FASTCLUS is always a local optimum depending on the initialization. In fact only the best solution, maximizing the R2 criterion, is of interest, and perhaps the optimized clustering given by MACRO MCLAS2 was better than any of those obtained here.

The real interest of these repeated trials is rather to let us evaluate how much the FASTCLUS procedure depends on the initialization, i.e. to measure its robustness with regard to our particular data. Good stability of the results means that the data are strongly structured. On the other hand, great instability corresponds to a population of observations that is not easy to cluster, that is, weakly structured.

A natural way of measuring the differences between different partitions into the same number of classes is to consider the "strong patterns" resulting from these partitions; the definition and practical interest of this notion will now be outlined.

2) Considering the "strong patterns"

The "strong patterns" resulting from several partitions P1, P2, ..., Pn are the clusters forming a new partition, the intersection² of P1, P2, ..., Pn.

As a consequence of this definition, the observations of the same "strong pattern" are always together in the same cluster in each of the different partitions considered.

Strong patterns therefore seem very adequate for revealing, inside the population, stable parts (large strong patterns) and unstable parts (small strong patterns); the case where all strong patterns are small corresponds to a population having no real structure, and thus unable to be clustered.
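As a sketch of this intersection, suppose the cluster variables C1-C5 (hypothetical names, one per partition) have been merged onto one data set PARTS: observations sharing the same five labels then form one strong pattern.

PROC SORT DATA=PARTS;
   BY C1-C5;
DATA STRONG;
   SET PARTS;
   BY C1-C5;
   IF FIRST.C5 THEN PATTERN+1;    /* new label combination = new strong pattern */
PROC FREQ DATA=STRONG;
   TABLES PATTERN;                /* sizes of the strong patterns */
RUN;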

In MACRO MCLAS, all the strong patterns resulting from the five partitions are calculated and their numbers of observations printed. Emphasis is put on two quantities which seem characteristic enough of the structure of the data:

a) the total number of strong patterns. The higher this number, the weaker the structure of the

population.


b) the weight, in the total population, of the strong patterns of minimum importance (i.e. those whose number of observations is at least 2.5 % of the total number of observations). This weight is expressed as a percentage.

The partition corresponding to the strong patterns is kept in an output SAS data set of MACRO MCLAS (analogous to the OUT data set of PROC FASTCLUS).

IV. Clustering the strong patterns

MACRO MCLAS

1) Some theoretical results (E. Diday, 1974).

P* will represent a partition of the set E of all the observations, corresponding to a local optimum (as given by PROC FASTCLUS).

Let us consider the strong patterns A_1, ..., A_m obtained from (q-1) partitions P*1, ..., P*(q-1) into k clusters, and let P*q be a new partition (the classes of which are P_1^q, ..., P_k^q).

Let

$$P(j \mid i) = \frac{\operatorname{card}(A_i \cap P_j^q)}{\operatorname{card}(A_i)}$$

This quantity expresses the probability that an element is in P_j^q, knowing that it is in A_i.

One can measure the information carried by P*q knowing A_1, ..., A_m:

$$I(P^{*q} \mid P^{*1}, \dots, P^{*q-1}) = -\sum_{i=1}^{m} \frac{\operatorname{card} A_i}{\operatorname{card} E} \sum_{j=1}^{k} P(j \mid i) \log P(j \mid i)$$

If the partition of the strong patterns A_1, ..., A_m is thinner³ than the partition P*q, the information brought forward by P*q is zero (since P(j | i) = 1 or 0).

The invariance of the strong patterns is assured for n = q if, for all n > q, one has I(P*n | P*1, ..., P*(n-1)) = 0.

We can then note that the number

$$J(k) = \sum_{q=2}^{n} I(P^{*q} \mid P^{*1}, \dots, P^{*q-1})$$

gives an idea of the value of the choice of the number k of classes requested; the smallest value of J(k) corresponds to the best choice of k.

Theorem

If I(P*n | P*1, ..., P*(n-1)) = 0 for all n such that q ≤ n ≤ N, where N is the maximum number of local optima, then the partition of the strong patterns P*1 ∩ ... ∩ P*q is thinner than the partition corresponding to the global optimum.

2) Using the information I

In MACRO MCLAS the number q of partitions is fixed at five. This number will rarely be high enough (only if the clusters are very disjoint) for verifying I(P*q | P*1, ..., P*(q-1)) = 0.


Nevertheless it will be interesting to look at the decrease of the information brought by P*q, from q = 2 to q = 5. These information quantities are therefore computed and printed in MACRO MCLAS, together with their sum J(k). This last number may be very useful if there is still some doubt about the number of classes: it is easy to run MACRO MCLAS for the two numbers in balance, say k1 and k2, and choose of the two values that for which J(k) is minimum.

An important ability of MACRO MCLAS is that it can be used not only for determining the strong patterns of the first five partitions, but also for bringing the information of five new partitions to already existing strong patterns (resulting from n previous partitions). For this, the MACRO only requires the partition of the existing strong patterns as a variable on an input data set. The information quantity I(P*q | P*1, ..., P*(q-1)) is then calculated for the values q = n+1, ..., q = n+5.
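A sketch of this computation, with hypothetical names (PATTERN holding the strong pattern and CQ the new partition's cluster variable, both on a data set STRONG), uses the identity I = -Σij (nij/n) log(nij/ni) and PROC SQL (again a facility later than SAS 82.3):

PROC SQL;
   CREATE TABLE CELL AS                /* nij = card(Ai inter Pjq) */
   SELECT PATTERN, CQ, COUNT(*) AS NIJ
   FROM STRONG
   GROUP BY PATTERN, CQ;

   CREATE TABLE TERMS AS
   SELECT A.PATTERN,
          -(A.NIJ/(SELECT COUNT(*) FROM STRONG))
            * LOG(A.NIJ/B.NI) AS TERM
   FROM CELL AS A,
        (SELECT PATTERN, SUM(NIJ) AS NI
         FROM CELL GROUP BY PATTERN) AS B
   WHERE A.PATTERN=B.PATTERN;

   SELECT SUM(TERM) AS I_Q             /* the information I(P*q | P*1, ..., P*(q-1)) */
   FROM TERMS;
QUIT;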

3) Hierarchical clustering

Another important result expressed by the theorem of paragraph 1) is that the optimal clustering (corresponding to the global optimum) can be obtained as a partition P where every class of P is the union of some strong patterns. Of course the theorem refers to the strong patterns resulting from a number of partitions q such that I(P*q | P*1, ..., P*(q-1)) = 0.

Even if one is far from this value of q, it is worthwhile trying to cluster the strong patterns in the best possible way into k classes. In doing this, there is a chance of getting nearer to the global optimum.

The last problem is then to get a good clustering of the strong patterns.

MACRO MCLAS carries out a hierarchical clustering using PROC CLUSTER with Ward's method. An important remark is that the hierarchical tree (with the R2 and CCC criteria) built on the strong patterns should not be used for determining the optimal number of clusters in the population of observations.

However, when MACRO MCLAS1 does not give any significant results, the hierarchical tree of the strong patterns can bring useful supplementary information for determining the optimal number of clusters.

The condition for the initial partitions not to impose a prior structure on the hierarchical clustering of the strong patterns is that the strong patterns should be numerous enough. This will be the case when the population of observations cannot easily be clustered.

On the other hand, it cannot be proved that the partition obtained at the level of k classes of the hierarchy of the strong patterns necessarily improves the best of the initial partitions into k classes. However, after optimization in MCLAS2, the final partition in most cases shows a better value of the R2 criterion than the initial partitions.


On the contrary, the exchange clustering method applied to the strong patterns, and initialized with their positions in the classes of the best of the initial partitions, necessarily guarantees an increase of the R2 criterion of that partition. This suggests using PROC ZTRANS (which carries out this transfer algorithm) instead of PROC CLUSTER for clustering the strong patterns when the number of clusters is known; this methodology is implemented in MACRO MCLAS2B.

Coming back to the printed outputs of MACRO MCLAS, it seemed interesting to compare the aggregations of the strong patterns (especially those with a high number of observations) in the hierarchical tree with their clustering inside the initial partitions. Therefore cross-tables of the strong patterns with each of the initial partitions are produced by MACRO MCLAS.

V. Selecting a more stable part of the population of observations

A relative instability of the results often appears when running MACRO MCLAS1 or MACRO MCLAS several times. For example, as far as the number of clusters is concerned, some executions will give significant results (a local maximum on the plot of the CCC in agreement with the indication of the semi-partial R2), confirming the existence of one (sometimes more than one) optimal value, while other executions will be much less informative, giving indications which are too weak.

Obtaining significant results may even in some cases be very difficult. This is clearly due to the structure of the population of observations, which may or may not be easy to cluster. Obviously a population structured in quite disjoint clusters would show great stability of the results obtained by MACRO MCLAS1 or MCLAS, whatever the initialization might be.

As was pointed out in paragraph III 2), strong patterns with a high number of observations, if they exist, will correspond to a strongly structured part of the whole population.

Having established, from several trials, that the unstable part (strong patterns with a small number of observations) always consists of approximately the same set of observations, it is suggested that these observations be removed from the population.

Running the clustering MACROS again, on the stable part of the initial population, should lead to more stable results and then clearly give an optimal number of clusters.

Unstable observations would then be classified into the clusters obtained on the stable part of the population.
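A sketch of this final assignment: PROC FASTCLUS can take the cluster means of the stable part as fixed seeds and, with MAXITER=0 and REPLACE=NONE, simply allocate each remaining observation to the nearest centre (data set and variable names hypothetical):

PROC FASTCLUS DATA=UNSTABLE SEED=STABLEMEANS OUT=ASSIGNED
              MAXC=6 MAXITER=0 REPLACE=NONE;   /* keep the given seeds; assign only */
   VAR F1-F5;
RUN;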

The strong patterns thus prove very useful for revealing the observations (or groups of observations) to which the clustering procedure is the most sensitive.


B. EXAMPLES

The same population will be clustered on two different sets of questions, the first set structuring the population strongly, whereas on the second set no prior structure is present. The observations consist of 1095 individuals questioned in a survey on the themes of the energy crisis, nuclear power stations, and the image, in public opinion, of the national electricity producing company (Électricité de France). The variables considered in the first example are a homogeneous set of questions about nuclear power stations and nuclear energy, which are all answered on a common scale with five modalities: high agreement, mean agreement, low agreement, no agreement at all, and no response.

In the second example, the variables are questions intended to analyze the image of the company in public opinion. They form a much less homogeneous set of variables, essentially qualitative (no graduated scale of response) and with various numbers of modalities.

In the two cases the correspondence analysis method has been used to summarize the set of qualitative answers in a small number of quantitative coordinates on the principal components (five in the first population, five in the second one). These new variables, called factors, are those considered for the clustering of the individuals.

In example 1, for the population questioned on the nuclear theme, correspondence analysis revealed a Guttman scale of the responses, which were all highly correlated. A unique and common scale, graduating the intensity of agreement with each of the assertions proposed, is sufficient to differentiate the individuals. This strong unique dimension will induce a very pronounced typology, approximately corresponding to the different degrees of agreement on the common scale, with extra clusters for degrees of indifference.

On the other hand, the correspondence analysis of the population questioned on the company-image theme shows at least four different dimensions. This second example, for the clustering MACROS, does not contain a strong prior structure, as example 1 did along a unique dimension.

It will be interesting to test the methodology developed by MACROS MCLAS1 and MCLAS on these two examples.


Example 1: Nuclear theme

Results from MACRO MCLAS1, corresponding to ten different random initializations, are compiled in the following table:

Trial n°   Optimal number of     Optimal number of     R2 value at the   CCC value at the
           clusters by the CCC   clusters by the R2    optimal level     optimal level
 1         (9)                   5                     -                 -
 2         6                     (7)                   0.6468            12.65
 3         9                     6 (9)                 -                 -
 4         -                     5                     -                 -
 5         6                     5 (6)                 0.6210            12.73
 6         6                     5 (6)                 0.6329            16.14
 7         7                     7                     0.6373            17.42
 8         (9)                   5                     -                 -
 9         6                     (6)                   0.6203            12.53
10         (8)                   8                     -                 -


For the CCC criterion only true local maxima are indicated. A number of clusters between parentheses

indicates no local maximum but only a stage of the curve at this level.

For the R2 criterion the choice is made at the level preceding a "significant" increase of the semi-partial R2 at the next node, which means that the R2 decreased substantially more after this merging of two classes than after the preceding ones. The numbers labelled on the curve are the numbers of clusters after this merging of two classes, which causes a decrease of the R2 equal to the coordinate of the node on the horizontal semi-partial R2 axis.

Such an interpretation can be made, for example, on figure 2. When two choices are possible, the second one is

put between parentheses in the table.

In the case of non-agreement of the two criteria, it is recommended to choose the value indicated by the CCC, if it corresponds to a true local maximum; the R2 criterion is of course more subjective.

The main results of MACRO MCLAS1, from this table, are:

- the strong indication of the number of 6 clusters (local maxima observed in four cases out of 10, and generally in good agreement with the R2 criterion);

- the weak presumption of the existence of 7 clusters;

- the bad inference that could have been made of 5 clusters, considering only the R2 criterion. This partition into five classes is a rather crude one, corresponding to the number of modalities of the different questions; it is the first natural clustering of the population on the common scale: high agreement, mean agreement, low agreement, no agreement at all, and no response;

- a question about the number of 9 clusters, because of its repeated occurrence, although this number does not give a true local maximum. In other respects, this number has been shown to have some pertinence if a finer partition is desired.

Results from MACRO MCLAS are now collected in the two following tables, the first for initial partitions of 6

clusters, the second for initial partitions of 7 clusters.


Initial partitions of 6 clusters

Trial n°   Optimal number   Optimal number   R2 at the   CCC at the   Number of   Number of observations   Maximum R2 of the    Information
           of clusters by   of clusters by   6-cluster   6-cluster    strong      in important "strong     initial partitions   J
           the CCC          the R2           level       level        patterns    patterns"
 1         7                7                0.6407      13.47        31          1024                     0.6612               missing
 2         7                7                0.6498      15.38        27          1055                     0.6615               0.324
 3         7                7                0.6443      14.32        40          1005                     0.6613               0.417
 4         7                7                0.6482      15.11        29          1038                     0.6616               0.348
 5         7                7                0.6293      11.18        44          1039                     0.6612               0.20

Initial partitions of 7 clusters

Trial n°   Optimal number   Optimal number   R2 at the   CCC at the   Number of   Number of observations   Maximum R2 of the    Information
           of clusters by   of clusters by   7-cluster   7-cluster    strong      in important "strong     initial partitions   J
           the CCC          the R2           level       level        patterns    patterns"
 1         8                8                0.6715      15.90        48          939                      0.6867               0.453
 2         8                8                0.6708      16.21        45          901                      0.6878               0.402
 3         (7)              (7)              (0.6843)    (18.60)      (29)        1047                     0.6860               (0.109)
 4         9                9                0.6740      18.10        47          899                      0.6875               0.381
 5         8                8                0.6740      16.91        40          968                      0.6875               0.279


Several remarks can be made:

- the number k of clusters of the initial partitions very often induces an optimal level of the hierarchy of the strong patterns at (k + 1) clusters. In the case of a structured population of observations with a small number of strong patterns, the hierarchical tree of these strong patterns must not be used for choosing the optimal number of clusters in the population;

- the best clustering of the strong patterns into k clusters is obtained when the optimal level of the hierarchy built from the initial partitions of k clusters also corresponds to k clusters. It can also be noted in this case (trial n° 3 for k = 7) the small number of strong patterns (29 is the smallest value of the five trials);

- bad results for the clustering of the strong patterns into k clusters often correspond to a high number of strong patterns, especially of those with a small number of observations;

- another criterion of the best clustering of the strong patterns is the lowest value of the cumulated information J.

Confirmation of the optimal number of 6 clusters can be given by MACRO MCLAS. Comparing the values of the function J(k) (cumulated information brought by successive partitions into k classes), it is found that J(6) ≈ 0.47 < J(7) ≈ 0.92 (the information J being calculated for the intersection of 15 initial partitions).

Example 2: Company-image theme

The same methodology was applied, using first MACRO MCLAS1 to suggest values of the optimal number of clusters, then MACRO MCLAS to settle the choice and give the final clustering.

With respect to the population of example 1, this new population, as could be expected, will show some

different results.

Figure 4 shows the plot of the CCC by MACRO MCLAS1, giving very negative values of the CCC, which means that there are no significant clusters in the population.


Figure 4. Plot of the CCC against the number of clusters, showing no significant clusters. Population 2

Four trials by MACRO MCLAS1 have been run, the results of which are the following:

Trial n°   Optimal number of     Optimal number of     R2 value at the       CCC value at the
           clusters by the CCC   clusters by the R2    level of 6 clusters   level of 6 clusters
 1         (10)                  5 (7)                 0.4402                -21.42
 2         (9)                   6 (5)                 0.4462                -20.31
 3         -                     5                     0.4604                -17.53
 4         (9)                   7 (5)                 0.4457                -20.38


No clear indication emerging from this table, the number of six clusters was chosen, and five trials by MACRO MCLAS were executed with initial partitions of 6 clusters; the results are presented below.

Initial partitions of 6 clusters

Trial n°   Optimal number   Optimal number   R2 at the   CCC at the   Number of   Number of observations   Maximum R2 of the    Information
           of clusters by   of clusters by   6-cluster   6-cluster    strong      in important "strong     initial partitions   J
           the CCC          the R2           level       level        patterns    patterns"
 1         6                6                0.4976      -10.18       74          853                      0.5159               0.749
 2         6                6                0.5055      -8.23        (?)         965                      0.5160               (?)
 3         7                7                0.4952      -11.57       109         719                      0.5159               0.882
 4         9 (7)            6 (9)            0.4842      -13.60       96          852                      0.5159               0.864
 5         -                5 (6)            0.4550      -14.14       119         237                      0.5042               1.287

Compared with population 1, we can note the high number of strong patterns obtained.

Another interesting remark (also true for population 1) is that clustering the strong patterns gives a much better value of the CCC than clustering the observations (MACRO MCLAS1). The best value of -8.23 obtained here must be compared to the best value of -17.53 by MACRO MCLAS1.

However these values are not at all satisfactory for determining an optimal number of clusters.

Further investigations must be made, using the strong patterns, to remove observations or groups of observations, in order to increase the CCC values of the clusters obtained on the remaining population.

Only selecting, for the new population, the important strong patterns obtained from a large enough number of partitions will definitely guarantee better values of the criteria.


Conclusion

The three MACROS MCLAS1, MCLAS2, MCLAS give the user many elements for choosing the optimal number of clusters and then determining the best clusters.

The first basic idea, developed in MACRO MCLAS1, is that of mixed clustering, combining the use of non-hierarchical and hierarchical techniques.

A second basic notion, due to E. Diday, is that of strong patterns, used in MACRO MCLAS, which brings valuable information for determining the number of clusters, getting nearer to the globally optimal clustering, and knowing the structure of the population.

The user has the possibility to run trials with different random initializations, either to measure the stability of the indications about the number of clusters, in MACRO MCLAS1, or to explore a larger set of good local optima, in MACRO MCLAS.

When difficulties arise in getting some certainty about the optimal number of clusters, or in obtaining stability of the clusters, it is possible, using strong patterns, to detect observations or groups of observations which are sources of instability.

These MACROS are tools for helping the user analyze his data, either quickly, using only MACRO MCLAS1, or in more detail, using MACRO MCLAS afterwards, which will give better solutions. In both cases MACRO MCLAS2 is still able to improve the clusters obtained.


APPENDIX 1

/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
 * MACRO %MCLAS1                                                   *
 * Macro variables to pass:                                        *
 *   ENTREE   = input data set, by default the last one created    *
 *   VAR      = list of numeric variables           <==== REQUIRED *
 *   RAND     = value determining the random draw   <==== REQUIRED *
 *   MAXC     = number of clusters, 100 by default                 *
 *   ARBRE    = data set created by PROC CLUSTER                   *
 *   CLASSIF1 = data set created by PROC FASTCLUS                  *
 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
%MACRO MCLAS1(ENTREE=_LAST_, VAR=, RAND=, MAXC=100,
              ARBRE=ARBRE, CLASSIF1=CLASSIF1);

/* call the header of macro MCLAS1 */
%INCLUDE FICHIER(MCLAS1T);
TITLE EXECUTION DE LA MACRO SAS MCLAS1;

/* random rearrangement of the input observations */
%INCLUDE FICHIER(RANDOM);
%RANDOM(ENTREE=&ENTREE, RAND=&RAND);

/* step 1: preliminary partition into &MAXC clusters */
PROC FASTCLUS DATA=&ENTREE OUT=&CLASSIF1 MEAN=FASTMEAN SHORT
              MAXC=&MAXC MAXITER=10;
   VAR &VAR;

/* step 2: hierarchical clustering (Ward) of the cluster centres */
PROC CLUSTER DATA=FASTMEAN METHOD=WARD OUTTREE=&ARBRE SIMPLE PRINT=&MAXC;
   TITLE3 EXECUTION DE LA CLASSIFICATION HIERARCHIQUE;
   VAR &VAR;
   FREQ _FREQ_;
   RMSSTD _RMSSTD_;

DATA TREE;
   SET &ARBRE;
   IF _NCL_ < 40;

PROC PLOT DATA=TREE(RENAME=(_CCC_=CCC _NCL_=NCL));
   TITLE3 GRAPHE DU CRITERE EN FONCTION DU NOMBRE DE CLASSES;
   PLOT CCC*NCL=NCL;

PROC PLOT DATA=TREE(RENAME=(_RSQ_=R2 _NCL_=NCL));
   TITLE3 GRAPHE DU NOMBRE DE CLASSES EN FONCTION DU R2;
   PLOT NCL*R2=NCL;

PROC PLOT DATA=TREE(RENAME=(_SPRSQ_=VR2 _NCL_=NCL));
   TITLE3 GRAPHE DU NOMBRE DE CLASSES EN FONCTION DU SEMI-PARTIAL R2;
   PLOT NCL*VR2=NCL;

PROC TREE DATA=&ARBRE;
   TITLE3 IMPRESSION DE L'ARBRE HIERARCHIQUE;
   HEIGHT _SPRSQ_;
%MEND MCLAS1;


/* ********************************************************************
 * SAS macro that perturbs the initial order of the observations of   *
 * the input data set by a random draw from a uniform probability     *
 * distribution.                                                      *
 *   ENTREE : last data set created, here the data set containing    *
 *            the factors on the individuals                         *
 *   RAND   : parameter determining the type of random draw used     *
 *            (repeatable or not); by default no random draw is      *
 *            performed                                              *
 * ****************************************************************** */
%MACRO RANDOM(ENTREE=_LAST_, RAND=);
DATA TAB;
   SET &ENTREE END=FIN;
   IF FIN THEN DO;
      NOBS=PUT(_N_,8.);       /* total number of observations */
   END;
   CALL SYMPUT('NN',NOBS);
RUN;
%LET N=%EVAL(&NN);
DATA RANDO;
%IF %LENGTH(&RAND) NE 0 %THEN %DO;
   DO J=1 TO &N;
      K1=INT(UNIFORM(&RAND)*&N)+1;   /* random rank in 1..N */
      OUTPUT RANDO;
   END;
   KEEP K1;
DATA TAB;
   MERGE TAB RANDO;
PROC SORT DATA=TAB OUT=&ENTREE;
   BY K1;
%END;
%MEND RANDOM;


APPENDIX 2

/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
 * MACRO MCLAS2                                                    *
 * Macro variables to pass:                                        *
 *   ENTREE   = data set output by MCLAS1                          *
 *   ARBRE    = data set created by %MCLAS1, by default ARBRE      *
 *   NCLASSE  = level at which the tree is cut    <==== REQUIRED   *
 *   CLASSIF1 = data set created by %MCLAS1, by default CLASSIF1   *
 *   CLASSIF2 = final data set from PROC FASTCLUS,                 *
 *              by default CLASSIF2                                *
 *   CLASFF   = cluster variable, by default CLUSTER               *
 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
%MACRO MCLAS2(ENTREE=CLASSIF1, ARBRE=ARBRE, VAR=, NCLASSE=,
              CLASSIF1=CLASSIF1, CLASSIF2=CLASSIF2, CLASFF=CLUSTER);

/* header of macro MCLAS2 */
%INCLUDE FICHIER(MCLAS2T);
TITLE EXECUTION DE LA MACRO MCLAS2;

/* cut the tree at the &NCLASSE-cluster level */
PROC TREE DATA=&ARBRE LEVEL=&NCLASSE NOPRINT OUT=TREEOUT;
   TITLE3 IMPRESSION DE L'ARBRE HIERARCHIQUE;

DATA NTREE;
   TITLE3 COUPURE DE L'ARBRE AU NIVEAU DE &NCLASSE CLASSES;
   LENGTH ACLUS 4;
   SET TREEOUT;
   CCLUS=SUBSTR(_NAME_,3,3);   /* _NAME_ is 'CLnnn': extract the cluster number */
   ACLUS=CCLUS;
   KEEP ACLUS CLUSTER;

PROC SORT DATA=NTREE OUT=NTREE;
   BY ACLUS;

PROC SORT DATA=&CLASSIF1 OUT=ENTREE(RENAME=(&CLASFF=ACLUS));
   BY &CLASFF;

DATA SEEDO(KEEP=CLUSTER &VAR);
   MERGE ENTREE NTREE;
   BY ACLUS;

/* recover the seeds (cluster means) with PROC SUMMARY */
PROC SUMMARY DATA=SEEDO;
   TITLE3 RECUPERATION DES NOYAUX PAR LA PROCEDURE SUMMARY;
   CLASS CLUSTER;
   VAR &VAR;
   OUTPUT OUT=SEED MEAN=&VAR;

PROC PRINT DATA=SEED(FIRSTOBS=2 RENAME=(_FREQ_=EFFECTIF CLUSTER=CLASSE));
   TITLE3 IMPRESSION DES CLASSES AVANT LA DERNIERE CLASSIFICATION;
   ID CLASSE;
   VAR EFFECTIF;

/* final optimization by FASTCLUS, seeded with the cluster centres */
PROC FASTCLUS DATA=&ENTREE SEED=SEED(FIRSTOBS=2) OUT=&CLASSIF2
              MAXC=&NCLASSE MAXITER=10;
   TITLE3 EXECUTION DE LA DERNIERE CLASSIFICATION;
   VAR &VAR;
%MEND MCLAS2;


ENDNOTES

1- At any level, Ward's method joins the classes A and B minimizing

$$\frac{n_A n_B}{n_A + n_B} \, \frac{\lVert G_A - G_B \rVert^2}{T}$$

where G_A, G_B are the centres of the classes, n_A, n_B the numbers of observations of the classes, and T the total sum of squares. The height of each node (cluster) in the tree will be equal to this pseudo-distance, called the semi-partial R2.

2- The intersection of two partitions is the set of the parts obtained by taking the intersection of each class of one with all the classes of the other.

3- A partition P is said to be thinner than a partition P' of E if every class of P' is the union of classes of P.

REFERENCES

1- P. Collomb, Classification par transfert (1977). Électricité de France, Direction des Études et Recherches, Note HI/2578-02.

2- E. Diday, Optimization in non-hierarchical clustering (1974). Pattern Recognition, Pergamon Press, Vol. 6, pp. 17-33.

3- SAS User's Guide: Statistics (1982). SAS Institute, Inc., Cary, N.C.

4- SAS Technical Report A-108: Cubic Clustering Criterion (1983). SAS Institute, Inc., Cary, N.C.

5- W.F. de la Vega, M. Renaud, S. Regnier, Techniques de la classification automatique (1964-1966). Distributed by the Centre de Calcul de la Maison des Sciences de l'Homme, Paris.
