
Pattern Recognition Letters 18 (1997) 133-142


Self-organizing neural networks for spatial data

G. Phanendra Babu *

Real World Computing Partnership, Novel Function ISS Laboratory, Institute of Systems Science, National University of Singapore, Kent Ridge, Singapore 0511

* E-mail: [email protected].

Received 18 September 1996

Abstract

In this paper, we present a self-organizing neural network approach for spatial data visualization and spatial data indexing. Spatial data is typically used to represent multi-dimensional objects. Generally, for efficient processing such as indexing and retrieval, each multi-dimensional object is represented by an isothetic minimum bounding rectangle. Direct visualization of these multi-dimensional rectangles, denoting spatial objects, is not possible if the number of dimensions exceeds three. Many linear and non-linear mapping techniques have been proposed in the literature for mapping point data, i.e., data that are points in multi-dimensional space. These approaches map points in higher-dimensional space to lower-dimensional space. Making use of these point data mapping approaches is a computationally intensive task, as the number of points to be mapped is very large. In this paper, we propose a Kohonen self-organizing neural network approach for clustering spatial data. Cluster prototypes associated with nodes in the network are mapped into lower dimensions for data visualization using a non-linear mapping technique. We explain the applicability of this approach for efficient indexing of spatial data. © 1997 Elsevier Science B.V.

Keywords: Spatial data; Clustering; Kohonen network; Indexing

1. Introduction

Pattern classification methods can be categorized into three broad classes: statistical methods, symbolic methods and neural network methods. These methods are in turn classified into supervised and unsupervised (clustering) methods (Duda and Hart, 1973). In statistical methods, supervised techniques try to find class boundaries such that new patterns can be classified correctly. Generally, a classification problem is formulated as an optimization problem, such as minimizing the classification error, and is solved using different approaches. The class label of each pattern is assumed to be known a priori. Some of the classification approaches require knowledge of the underlying probability distribution of the patterns. In contrast, clustering approaches partition the patterns into disjoint clusters such that any two patterns in a cluster are more similar than patterns in different clusters. Clustering approaches involve defining a similarity measure, which quantifies the closeness between patterns, and a criterion function, which quantifies the quality of clusters (Anderberg, 1973; Jain and Dubes, 1988). The optimal partition is characterized by the global optimum of the criterion function. Symbolic approaches address the issue of learning classification rules from training samples in order to classify new patterns. Several symbolic learning approaches have been proposed in the literature (Michalski and Stepp, 1983; Cheng and Fu, 1985; Fisher, 1987).

On the other hand, neural network approaches are fundamentally parallel processing approaches and adopt their basic working methodology from the neuronal activities in the brain. These approaches facilitate the use of many small processors working together to solve difficult classification and optimization problems. Based on the type of network architecture and the neuron updating principle, these approaches are divided into feedforward, feedback and self-organizing networks (Lippman, 1987). In feedforward networks, a pattern is fed to the input and the associated class label is expected to be produced as output. These input-output associations are learned by adjusting the weights of the network connections. In feedback networks, the initial state of the neurons is obtained from the input pattern, and the states of the neurons are updated collectively, converging to a final state which is treated as the output. Self-organizing or competitive networks can be related to clustering approaches in statistical pattern recognition. These approaches build model representatives (cluster centers) either incrementally (ART) (Carpenter and Grossberg, 1987) or simultaneously (SOM) (Kohonen, 1982) from the input patterns. Some of the neural network approaches proposed in the literature include the multilayer perceptron (Rumelhart and McClelland, 1986), the Hopfield network (Hopfield, 1982), the ART network (Carpenter and Grossberg, 1987) and Kohonen's self-organizing neural network (Kohonen, 1982).

All these classification techniques assume that patterns are points in multi-dimensional space. One possible way of extending these techniques to spatial data (Samet, 1990) is to decompose the spatial data into a set of point data, i.e., represent a rectangle with its set of vertices, and employ existing mapping techniques (Jain and Dubes, 1988). But mapping large volumes of data is very computationally intensive: for N hyper-rectangular objects in d-dimensional space, it requires mapping $2^d N$ points into 2-dimensional space. In contrast, similarity measures can be defined between spatial patterns and the clustering of these patterns can be performed directly. By using existing mapping techniques, cluster prototypes or representatives can then be mapped into lower dimensions for visualizing the spatial data distribution (Laurini and Thompson, 1992). This approach maps spatial data as point data in lower-dimensional space, giving an easy way to understand their arrangement in higher-dimensional space. In addition, cluster representatives can be used for efficient spatial data joins in spatial databases (Abel and Chin Ooi, 1993).

In this paper, we investigate the development of Kohonen's self-organizing neural network for spatial data clustering, called the Self-Organizing Map for Spatial Data (SOMSD). Applications of the proposed approach to spatial data visualization and spatial data indexing are presented. The organization of the paper is as follows. Section 2 deals with spatial data representation and a similarity measure between spatial data and point data. Section 3 presents the SOMSD approach and the algorithm, along with examples. Section 4 demonstrates spatial data visualization, and spatial data indexing with SOMSD is described in Section 5. Conclusions follow in the last section.

2. Spatial data

A point datum in d-dimensional space is represented by a d-dimensional vector. The representation of a spatial data item, in contrast, is application dependent. For example, a spatial object (spatial data item) can be represented by the list of vertices of a polygon; see Fig. 1. A simple representation for indexing and retrieval of spatial objects is the minimum bounding rectangle (MBR) (Samet, 1990). A multi-dimensional object is covered with an MBR and the two opposite corner vertices of the rectangle are used to represent it. This gives a fixed-size representation and an easy way to handle objects while preserving their extent and location. Most existing spatial data indexing techniques, e.g. R-trees (Guttman, 1984), make use of this MBR representation for indexing spatial objects. Let $x_p^d$ and $x_s^d$ be point and spatial objects in d-dimensional space: $x_p^d$ is represented by a d-dimensional vector and $x_s^d$ by a 2d-dimensional vector holding two opposite corners of the MBR. These concepts are illustrated in Fig. 1.
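For concreteness, here is a small sketch (not from the paper; the helper name `mbr_of` and the corner-concatenation layout are our assumptions) of how a polygon's vertex list reduces to the fixed-size 2d-vector MBR representation:

```python
import numpy as np

def mbr_of(vertices):
    """Isothetic minimum bounding rectangle of a spatial object given as
    a list of d-dimensional vertices; returned as a 2d-vector (x_s^d)
    holding the lower-left corner followed by the upper-right corner."""
    v = np.asarray(vertices, dtype=float)
    return np.concatenate([v.min(axis=0), v.max(axis=0)])

# A 2-D polygon becomes a 4-dimensional MBR vector.
polygon = [(1.0, 2.0), (4.0, 1.0), (5.0, 3.0), (2.0, 4.0)]
x_s = mbr_of(polygon)   # [1, 1, 5, 4]: corners (1, 1) and (5, 4)
```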

To make use of self-organizing networks for clustering spatial data, we need to define a similarity measure between point and spatial data items. Each cluster representative is a point vector. As a spatial object is represented by an MBR, we have to define a distance measure, $D(x_p^d, x_s^d)$, between $x_p^d$ and $x_s^d$. There are many possible distance measures.


Fig. 1. Minimum bounding rectangles for different shapes. (Polygon = {A, B, C, D, E, F, G, H, I}, MBR = {D1, B1}; circle = {C, R}, MBR = {D2, B2}.)

We define the distance measure as the average of the sum of the distances between the point and each of the vertices of the MBR. This distance measure is given below:

$$D(x_p^d, x_s^d) = \frac{1}{2^d} \sum_{v_1,\ldots,v_d \in \{0,1\}} \Big( \big|x_p^d(1) - v_1 x_s^d(1) - (1-v_1)\, x_s^d(1+d)\big|^p + \cdots + \big|x_p^d(d) - v_d x_s^d(d) - (1-v_d)\, x_s^d(2d)\big|^p \Big)^{1/p}, \tag{1}$$

where p denotes the norm used between the point and a vertex. This distance measure computes the individual distances between the point and each of the vertices, and the sum is averaged. The $2^d$ possible vertices are generated by varying the values of $v_1, \ldots, v_d$ over $\{0, 1\}$. If p is set to 2, the distance measure reduces to the Euclidean distance. The vertices of the MBR are used in the distance measure as they play an important role in the human perception of an MBR. Other distance measures are detailed in (Babu, 1994). We make use of this distance measure in SOMSD. In the following section, we present the SOMSD approach.
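Before moving on, Eq. (1) transcribes directly into a short sketch (the helper name `mbr_point_distance` is ours; the MBR layout follows the `mbr_of` sketch above, lower corner first, then upper corner):

```python
import itertools
import numpy as np

def mbr_point_distance(point, x_s, p=2):
    """Eq. (1): the p-norm distance from `point` to each of the 2^d
    vertices of the MBR x_s, averaged over all 2^d vertices."""
    point = np.asarray(point, dtype=float)
    d = point.size
    low, high = x_s[:d], x_s[d:]
    total = 0.0
    # v_j = 1 picks the lower-corner coordinate, v_j = 0 the upper one.
    for v in itertools.product((1, 0), repeat=d):
        vertex = np.where(np.array(v) == 1, low, high)
        total += np.sum(np.abs(point - vertex) ** p) ** (1.0 / p)
    return total / 2 ** d
```

With p = 2, each term inside the sum is the Euclidean distance from the point to one vertex of the rectangle.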

3. SOMSD

Kohonen's self-organizing neural network is a connected network of small processing units called nodes. Each node is connected to a set of neighboring nodes, forming a grid in the selected number of dimensions. The arrangement of nodes in the network can be either rectangular, see Fig. 2, or hexagonal. Each node is associated with a vector which acts as its representative. Kohonen's neural network associates each input pattern with a representative output, i.e., a node. This approach can be seen as model building, i.e., forming cluster representatives, which in a way can be viewed as vector quantization minimizing the quantization error (Linde et al., 1980). Nodes in the network are initialized with random vectors in d dimensions. Each input pattern is assigned to the nearest node, and the vectors associated with that node and nearby nodes are updated so that they move towards the input pattern. One interesting feature of Kohonen's network is that spatially nearby patterns are associated with nearby models in the network. This facilitates knowing the data associations in higher dimensions by mapping the models into lower dimensions using data projection techniques.

Most of the experiments assume the data are feature vectors, i.e., point data, and model feature vectors are learned by training the network with the input patterns. The notation followed in this paper is presented below.

Let $x = \{l_1, l_2, \ldots, l_d, r_1, r_2, \ldots, r_d\}$ be a pattern representing an MBR with lower left corner point $\{l_1, l_2, \ldots, l_d\}$ and upper right corner point $\{r_1, r_2, \ldots, r_d\}$.

Let $x_1, x_2, \ldots, x_N$ be N patterns in $R^{2d}$ space. Let $c_1, c_2, \ldots, c_n$ be n centers in $R^d$ space.


Fig. 2. An example of a network with rectangular node connections (SOM with rectangular topology; Neighbourhood(o) = {o1, o2, o3, o4}).

Let $C_1, C_2, \ldots, C_n$ be n clusters, each containing the patterns assigned to the associated node.

The outline of Kohonen's network approach, SOM, is presented below.

1. Select initial representatives, $c_1, c_2, \ldots, c_n$, for the n nodes in the network, randomly.
2. For each input pattern $x_i$, $1 \le i \le N$, perform the following steps.
   (a) Determine a node, l, in the network such that $\mathrm{Dist}(c_l, x_i) = \min_q \mathrm{Dist}(c_q, x_i)$.
   (b) Update the node representatives in the network:
   $$c_k(t+1) = c_k(t) + \delta(l, k, t)\,\alpha(t)\,(x_i - c_k(t)), \quad \forall k,$$
   where t denotes the iteration number, and $\delta(l, k, t)$ and $\alpha(t)$ denote the neighborhood gain and gain function, respectively.
3. If $\alpha(t+1)$ is 0.0 then terminate, else go to Step 2.

A typically used neighborhood gain function is the Gaussian,
$$\delta(l, k, t) = \exp\big(-|l - k|^2 / 2\sigma_t^2\big),$$
where $\sigma_t$ decreases as time increases. Many variations of the neighborhood gain function have been proposed in the literature to suit particular requirements. This neighborhood function spreads the impact of an input sample over the neighboring nodes, with an effect that decreases as $|l - k|$ increases; it is often called a "bell-shaped" kernel. Both $\sigma_t$ and $\alpha(t)$ are suitably decreasing functions of time. Note that the variance, $\sigma_t$, decreases with time, thus reducing the neighborhood dependency. We select $\sigma_t$ and $\alpha(t)$ as exponentially decaying functions, given by
$$\sigma_t = \sigma_0 (\sigma_f / \sigma_0)^{t/t_f},$$
$$\alpha(t) = \alpha(0) (\alpha_f / \alpha(0))^{t/t_f} - 0.00001,$$
where $\sigma_0$ and $\sigma_f$ are the initial and final variances, and $\alpha(0)$ and $\alpha(t_f)$ are the initial and final gain values. Here $t_f$ is the maximum number of iterations for which the network is updated. In our experiments, $\sigma_0$ and $\sigma_f$ are set to 0.5n and 0.1n, and $\alpha(0)$ and $\alpha_f$ are set to 1.0 and 0.00001. The value of $\alpha(t)$ thus decreases from 1.0 to 0.0.
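As a minimal sketch of the outline above (our naming; we take $|l - k|$ to be the distance between node coordinates on the rectangular grid, and use the parameter settings quoted in the text):

```python
import numpy as np

def train_som(X, grid=(5, 5), t_f=50, seed=0):
    """Plain SOM as outlined above: nearest-node winner, Gaussian
    neighborhood gain, exponentially decaying sigma_t and alpha(t)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    n = rows * cols
    # Settings from the text: sigma from 0.5n to 0.1n, alpha from 1.0 towards 0.
    sigma0, sigmaf, alpha0, alphaf = 0.5 * n, 0.1 * n, 1.0, 1e-5
    # Step 1: random initial representatives in the pattern hyperspace.
    c = rng.uniform(X.min(axis=0), X.max(axis=0), size=(n, X.shape[1]))
    coords = np.array([(i, j) for i in range(rows) for j in range(cols)], float)
    for t in range(t_f):
        sigma_t = sigma0 * (sigmaf / sigma0) ** (t / t_f)
        alpha_t = alpha0 * (alphaf / alpha0) ** (t / t_f)
        for x in X:
            # Step 2(a): winning node l minimizes Dist(c_q, x_i).
            l = int(np.argmin(np.sum((c - x) ** 2, axis=1)))
            # Step 2(b): Gaussian neighborhood gain delta(l, k, t).
            delta = np.exp(-np.sum((coords - coords[l]) ** 2, axis=1)
                           / (2.0 * sigma_t ** 2))
            c += (delta * alpha_t)[:, None] * (x - c)
    return c
```

Since $\alpha(t)$ is nearly zero by the last iteration, running the loop for $t_f$ iterations plays the role of the termination test in Step 3.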

This SOM creates models, i.e., node representatives, which partition the feature space; see Fig. 3. The input data is partitioned into disjoint clusters such that each pattern is assigned to its nearest model. The cluster, $C_i$, associated with the ith model is obtained by

$$C_i = \{x_j \mid \mathrm{Dist}(c_i, x_j) = \min_q \mathrm{Dist}(c_q, x_j)\}.$$

These clusters are disjoint.

This approach can be extended to develop an SOMSD that generates feature maps for spatial data. For spatial data, we assume the models are point vectors. Using the distance measure defined above, we assign spatial data items to the nearest nodes, and nodes are updated using the centroid of the MBR. The canonical steps in SOMSD are given below.

1. Select initial representatives, $c_1, c_2, \ldots, c_n$ ($c_i \in R^d$), for the n nodes in the network, randomly.
2. For each input pattern $x_{si}^d$, $1 \le i \le N$ ($x_{si}^d \in R^{2d}$), perform the following steps.
   (a) Determine a node, l, in the network such that $D(c_l, x_{si}^d) = \min_q D(c_q, x_{si}^d)$.
   (b) Update the node representatives in the network:
   $$c_k(t+1) = c_k(t) + \delta(l, k, t)\,\alpha(t)\,(h_i - c_k(t)), \quad \forall k,$$
   where t denotes the iteration number, and $\delta(l, k, t)$ and $\alpha(t)$ denote the neighborhood gain and gain function, respectively. Here $h_i(j) = \frac{1}{2}\big(x_{si}^d(j) + x_{si}^d(j+d)\big)$, $j = 1, \ldots, d$, is the centroid of the MBR associated with pattern i.
3. If $\alpha(t+1)$ is 0.0 then terminate, else go to Step 2.

Fig. 3. An example of space partition by three cluster centers.

Fig. 4. SOM for data set DATA1.

A node representative is moved towards the centroid of the MBR, as this is equivalent to taking the average movement of the node representative towards each of the vertices; this is verified in Eq. (2) below.

Fig. 5. Data assignment to nodes for DATA1 (5 × 5 node matrix and node loads).

Fig. 6. SOM (5 × 5) for data set DATA2.

Averaging the vertex vectors of an MBR recovers its centroid:
$$h - c = \tfrac{1}{2}\big(x_s^d(j) + x_s^d(j+d)\big)_{j=1,\ldots,d} - c = \frac{1}{2^d} \sum_{v_1,\ldots,v_d \in \{0,1\}} \Big(v_1 x_s^d(1) + (1-v_1)\,x_s^d(1+d), \; \ldots, \; v_d x_s^d(d) + (1-v_d)\,x_s^d(2d)\Big) - c. \tag{2}$$
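A sketch of SOMSD under the same assumptions as the SOM sketch above: winners are chosen with the `mbr_point_distance` routine of Eq. (1), and representatives move towards the MBR centroid $h_i$, as in Eq. (2):

```python
def train_somsd(X_mbr, grid=(5, 5), t_f=50, seed=0):
    """SOMSD: winners are chosen with the point-to-MBR distance D() of
    Eq. (1); representatives are pulled towards the MBR centroid h_i."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    n = rows * cols
    d = X_mbr.shape[1] // 2                    # patterns live in R^{2d}
    sigma0, sigmaf, alpha0, alphaf = 0.5 * n, 0.1 * n, 1.0, 1e-5
    c = rng.uniform(X_mbr[:, :d].min(axis=0), X_mbr[:, d:].max(axis=0),
                    size=(n, d))
    coords = np.array([(i, j) for i in range(rows) for j in range(cols)], float)
    for t in range(t_f):
        sigma_t = sigma0 * (sigmaf / sigma0) ** (t / t_f)
        alpha_t = alpha0 * (alphaf / alpha0) ** (t / t_f)
        for x_s in X_mbr:
            # Step 2(a): nearest node under D(c_q, x_s) of Eq. (1).
            l = int(np.argmin([mbr_point_distance(c[q], x_s) for q in range(n)]))
            # Step 2(b): update towards the centroid h of the MBR.
            h = 0.5 * (x_s[:d] + x_s[d:])
            delta = np.exp(-np.sum((coords - coords[l]) ** 2, axis=1)
                           / (2.0 * sigma_t ** 2))
            c += (delta * alpha_t)[:, None] * (h - c)
    return c
```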

Fig. 7. SOM (15 × 15) for data set DATA2.

Fig. 9. SOM for data set DATA3.

3.1. Selection of initial seeds

The initial node representatives, $c_1, c_2, \ldots, c_n$, may be selected in several ways. In fact, the quality of the clusters obtained depends on the choice of these initial representatives, or seeds. One approach is to select them randomly in the hyperspace of the patterns; another is to use heuristics, such as selecting well-separated pattern centers. We have used random selection of these representatives in all our experiments. Further investigation is required to analyse the behaviour of the proposed approach with respect to changes in the seed selection method (Babu and Murty, 1993; Babu et al., 1997).

4. Spatial data visualization

The self-organizing map, projected into two or three dimensions, makes the visualization of spatial data possible. We demonstrate the results obtained on three selected data sets. The first data set, DATA1, consists of four clusters of rectangles in two-dimensional space. This data set is used to form a self-organizing map using SOMSD with a 5 × 5 network configuration. The map obtained, along with the data, is shown in Fig. 4. Each cluster consists of 25 rectangles, in the ranges 1-25, 26-50, 51-75 and 76-100. The data assignment to the nodes is shown in Fig. 5. Each node in the network is represented by a box in the rectangular grid, and all assigned patterns are listed in the respective box. The second data set, DATA2, is selected in such a way that the spatial data distribution is uniform in two-dimensional space. The self-organizing maps obtained with two different networks, of configurations 5 × 5 and 15 × 15, are shown in Figs. 6 and 7, respectively. Observe the organization of the map with respect to the data distribution. Associating each node with the number of data items assigned to it facilitates knowing how the data is organized. The third data set, DATA3, consists of four clusters in three dimensions and is shown in Fig. 8 by projecting it onto each of the three two-dimensional planes. The self-organizing map, projected into two dimensions by Sammon's method (Sammon, 1969), is shown in Fig. 9, which reveals the data spread in three dimensions. The data assignments to nodes in the network are shown in Fig. 10. In these three data sets, the rectangles are almost non-overlapping and spread out. In order to find how the map forms for highly overlapping rectangles, we consider a fourth data set, DATA4, which consists of 10 overlapping rectangles in two dimensions; see Fig. 11. The spatial data assignments to nodes are given in Fig. 12. Note that the spatial data organization perceptually agrees with the map formed.
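For completeness, a compact sketch of Sammon's stress minimization by plain gradient descent (Sammon, 1969); the step size and iteration count are our assumptions, not values from the paper:

```python
def sammon_map(P, n_iter=500, lr=0.3, seed=0):
    """Project the rows of P (n x d) into 2-D by gradient descent on
    Sammon's stress (Sammon, 1969)."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    # Pairwise distances in the original space (clipped to avoid division by 0).
    dstar = np.sqrt(((P[:, None, :] - P[None, :, :]) ** 2).sum(-1))
    dstar = np.maximum(dstar, 1e-12)
    np.fill_diagonal(dstar, 1.0)               # diagonal never enters the stress
    scale = dstar[np.triu_indices(n, 1)].sum()
    y = rng.normal(scale=1e-2, size=(n, 2))
    for _ in range(n_iter):
        diff = y[:, None, :] - y[None, :, :]
        d = np.maximum(np.sqrt((diff ** 2).sum(-1)), 1e-12)
        # dE/dy_i = (2/scale) * sum_j (d_ij - d*_ij)/(d*_ij d_ij) * (y_i - y_j)
        w = (d - dstar) / (dstar * d)
        np.fill_diagonal(w, 0.0)
        y -= lr * (2.0 / scale) * (w[:, :, None] * diff).sum(axis=1)
    return y
```

In this pipeline, the node representatives returned by `train_somsd` would be passed as `P` to obtain a 2-D view of the kind shown in Fig. 9.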

In this approach, the number of patterns assigned to each node may not be the same. In fact, some nodes can be empty, i.e., no patterns are assigned to them, while other nodes can be overloaded if the distribution of patterns forming clusters is uneven. Further enhancements to SOMSD can be undertaken to solve these problems by using a selective multi-resolution approach (Sabourin and Mitiche, 1993), which expands the number of nodes incrementally.


Fig. 8. Data projected on each of the three axes (x, y and z).

Fig. 10. Data assignment to nodes for DATA3 (node matrix and node loads).

Fig. 11. Data set DATA4.

Fig. 12. Data assignment to nodes for DATA4 (5 × 5 node matrix and node loads).

In the following section, we present an application of SOMSD to spatial data indexing.

5. Spatial data indexing with SOMSD

Indexing of spatial data is generally carried out with range trees, in particular R-tree data structures, as they handle range queries effectively. These R-trees are completely balanced trees guaranteeing an access time of logarithmic order, O(C log N), where N is the size of the spatial data and C is a constant. All terminal nodes point to spatial data. Each non-terminal node in the tree has a corresponding MBR which encloses all of its children's MBRs. The root of the R-tree has at least two children, and the number of children of each non-terminal node lies in the range $[B_f/2, B_f]$, where $B_f$ is the branching factor. As an R-tree does not consider the data distribution as a whole, it forms large MBRs (for nodes at the top levels of the tree) with a lot of empty space in them; see Fig. 13 for an example of an R-tree. This empty space leads to traversing the tree unnecessarily, or through multiple paths, to find out whether a given rectangle is present. The more empty space there is (because of spatially separated clusters of data), the more unnecessary tree traversal occurs.

Fig. 13. An example of an R-tree.

Instead, we can cluster the data, find the cluster identifiers and use them in indexing the data. In fact, the empty space covered by the MBRs associated with intermediate nodes depends on the order of data insertion into the R-tree (see (Samet, 1990) for more details on R-trees). Spatial indexing with SOMSD is performed as follows.

1. Generate clusters $C_1, \ldots, C_n$ using SOMSD.
2. For each cluster i, generate an associated R-tree, $R_i$-tree, using the data in $C_i$.
3. Eliminate empty R-trees.
4. Index the R-trees using the cluster models, i.e., representatives, with any one of the point data indexing techniques, such as k-d-trees, k-d-B-trees and B-trees (Samet, 1990).

Fig. 14. An example of a hybrid tree (nodes with zero load, loaded nodes, node-associated R-trees, and pointers to spatial objects).
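A sketch of Steps 1-4 under stated assumptions: we reuse the SOMSD sketches above, stand in for the per-cluster R-trees with plain MBR lists (no R-tree library is assumed), and index the surviving representatives with SciPy's k-d-tree, one of the point indexing options named in Step 4:

```python
from scipy.spatial import cKDTree

def build_hybrid_index(X_mbr, grid=(5, 5)):
    """Steps 1-4: cluster with SOMSD, attach the members of each
    non-empty cluster to its node, drop empty nodes, and index the
    surviving node representatives with a k-d-tree."""
    c = train_somsd(X_mbr, grid=grid)
    n, d = c.shape
    clusters = [[] for _ in range(n)]
    for x_s in X_mbr:                          # Step 1: assign to nearest node
        l = int(np.argmin([mbr_point_distance(c[q], x_s) for q in range(n)]))
        clusters[l].append(x_s)
    # Steps 2-3: one "R-tree" per non-empty cluster (here, just its MBR list).
    keep = [q for q in range(n) if clusters[q]]
    rtrees = [clusters[q] for q in keep]
    # Step 4: point index over the representatives of the non-empty nodes.
    return cKDTree(c[keep]), rtrees
```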

This hybrid tree approach first focuses on finding the nearby cluster and then uses the associated R-tree to find the data item. The queries typically handled by range trees are range queries: for a given input rectangle, find all spatial data that intersect with it. In the limiting case, a range query turns into a point query. One important property of self-organizing maps is that the data overlapping with a given query rectangle will be assigned to nearby nodes, i.e., connected nodes in the network. An example of a hybrid tree structure is presented in Fig. 14. One possible method of indexing the network nodes that correspond to constructed R-trees is agglomerative clustering of the node representatives (Spath, 1980). The index tree structure consists of a set of non-terminal and terminal nodes. Terminal nodes are simply the root nodes of the corresponding R-trees, and also maintain the minimum enclosing rectangle, whereas non-terminal nodes are the cluster representatives of their children. Note that all nodes with empty R-trees are eliminated in this index creation. Let the number of non-empty nodes be P, the associated clusters be $Q_1^1, \ldots, Q_P^1$, where $\bigcup_i Q_i^1 = \bigcup_j C_j$, and the branching factor be f. Each $Q_i^1$ has a corresponding cluster representative, $q_i^1$.

1. Set l = 1 and $P_1 = P$.
2. Partition $q_1^l, \ldots, q_{P_l}^l$ into $P_{l+1} = P_l/f$ clusters such that each cluster has f members. Let these clusters be $Q_1^{l+1}, \ldots, Q_{P_{l+1}}^{l+1}$.
3. Compute the associated cluster centers $q_1^{l+1}, \ldots, q_{P_{l+1}}^{l+1}$.
4. Set l = l + 1 and repeat Steps 2 and 3 until $P_l \le f$.

The above procedure forms the nodes of the index tree by clustering the network node representatives, and is repeated until a complete tree is formed. Each node in the index tree maintains a list of its children and the associated centroids; the storage space required for each non-terminal node is $O(d \cdot f)$. The index tree is traversed as follows. For a given range query, i.e., a rectangle, start from the root and find the child nearest to it using the distance measure D(). This process of traversal by finding the nearest child continues until a terminal node is reached. That terminal node points to the root of an R-tree, which is then traversed using an R-tree traversal technique (see (Guttman, 1984; Samet, 1990) for more details). It should be noted that once a terminal node is reached, we also need to examine the topologically nearby terminal nodes in the network, to identify other nodes that may hold spatial data intersecting the query rectangle. The computational complexity of searching the index tree for the nearest node is $O(\log_f P)$, while that of searching the R-tree depends on the number of spatial data items assigned to that node. If the number of spatial data items assigned to each of the P nodes equals N/P, the total complexity is $O(\log_f P + \mathrm{const} \cdot \log_{r_f}(N/P))$, where $r_f$ is the branching factor of the R-tree and const is a constant. In the worst, skewed case, i.e., N - P + 1 items assigned to one node and the remaining P - 1 items assigned to the other P - 1 nodes, the time complexity becomes $O(\log_f P + \mathrm{const} \cdot \log_{r_f} N)$.
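A sketch of this traversal on the structure returned by `build_hybrid_index` above; the linear scan of the leaf's MBR list stands in for the R-tree search, and probing the topologically nearby terminal nodes is omitted for brevity:

```python
def range_query(index, rtrees, query_mbr, d):
    """Descend to the nearest terminal node -- here a single k-d-tree
    lookup on the query centroid -- then search that node's MBR list
    for rectangles intersecting the query rectangle."""
    centroid = 0.5 * (query_mbr[:d] + query_mbr[d:])
    _, leaf = index.query(centroid)            # nearest cluster representative
    hits = []
    for x_s in rtrees[leaf]:
        # Two isothetic rectangles intersect iff they overlap on every axis.
        if np.all(x_s[:d] <= query_mbr[d:]) and np.all(query_mbr[:d] <= x_s[d:]):
            hits.append(x_s)
    return hits
```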

In such cases, we need to develop an approach for making the number of spatial data items assigned to the nodes in the network more equal. This can be carried out by split and merge techniques; we briefly outline such a method here, as a detailed discussion is outside the scope of this paper. The load of a node is the number of spatial data items assigned to that node, and we need a procedure for node load balancing. Here, balancing means an equal load on all non-empty nodes of the network. It is performed by moving spatial patterns from overloaded nodes to underloaded nodes; a node is called overloaded if its load is more than N/P and underloaded if its load is less than N/P. The procedure is as follows.

1. Find all overloaded nodes in the network.
2. For each overloaded node (from left to right and top to bottom), perform the following steps.
3. Reassign the excess, farthest patterns to the nearest right or bottom neighboring nodes.
4. Recompute the new centers due to the reassignments.

This procedure is given for two-dimensional maps with rectangular connectivity; it can be extended to higher-dimensional networks.
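One concrete reading of Steps 1-4, as a sketch (the right-then-bottom spill order and the distance-based "farthest" ordering follow the text; the helper names are ours):

```python
def balance_loads(c, clusters, grid):
    """Move the farthest excess patterns of each overloaded node to its
    right (or bottom) neighbor, then recompute the affected centers."""
    rows, cols = grid
    n_items = sum(len(cl) for cl in clusters)
    target = n_items / max(1, sum(1 for cl in clusters if cl))   # N/P
    for q in range(len(clusters)):             # left-to-right, top-to-bottom
        excess = len(clusters[q]) - int(target)
        if excess <= 0:
            continue
        r, s = divmod(q, cols)
        nb = q + 1 if s + 1 < cols else q + cols   # right, else bottom neighbor
        if nb >= len(clusters):
            continue
        # Reassign the patterns farthest from this node's representative.
        clusters[q].sort(key=lambda x_s: mbr_point_distance(c[q], x_s))
        for _ in range(excess):
            clusters[nb].append(clusters[q].pop())
        # Recompute the two affected centers from their MBR centroids.
        d = c.shape[1]
        for node in (q, nb):
            if clusters[node]:
                cents = [0.5 * (x[:d] + x[d:]) for x in clusters[node]]
                c[node] = np.mean(cents, axis=0)
    return c, clusters
```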

6. Conclusions

In this paper, a self-organizing neural network approach for spatial data clustering and visualization has been proposed. The proposed approach, SOMSD, makes use of a distance measure defined between point data and spatial data (isothetic rectangles). The approach forms clusters of spatial data such that the organization of the spatial data in higher dimensions can be visualized by projecting the cluster prototypes, or models, into two dimensions using point data mapping techniques. Four different data sets were considered for illustration. An application of SOMSD to spatial data indexing was highlighted and discussed. Further work is in progress on utilizing SOMSD for efficient processing of spatial data joins in spatial databases.

References

Abel, D. and B. Chin Ooi, Eds. (1993). Advances in Spatial Databases, Lecture Notes in Computer Science, Third Internat. Symposium, Singapore. Springer, Berlin.

Anderberg, M.R. (1973). Cluster Analysis for Applications. Academic Press, London.

Babu, G.P. (1994). Spatial data clustering. Technical report, Institute of Systems Science, National University of Singapore.

Babu, G.P. and M.N. Murty (1993). A near-optimal initial seed value selection in K-means algorithm using a genetic algorithm. Pattern Recognition Letters 14 (10), 763-769.

Babu, G.P., M.N. Murty and S.S. Keerthi (1997). Stochastic connectionist approach for pattern clustering. IEEE Trans. Syst. Man Cybernet., to appear.

Carpenter, G. and S. Grossberg (1987). A massively parallel architecture for a self-organizing neural pattern recognition machine. Comput. Vision Graphics Image Process. 37, 54-115.

Cheng, Y. and K.S. Fu (1985). Conceptual clustering in knowledge organization. IEEE Trans. Pattern Anal. Machine Intell. 7, 592-598.

Duda, R.O. and P.E. Hart (1973). Pattern Classification and Scene Analysis. Wiley, New York.

Fisher, D.H. (1987). Knowledge acquisition via incremental conceptual clustering. Machine Learning 2, 139-172.

Guttman, A. (1984). R-trees: A dynamic index structure for spatial searching. In: Proc. ACM SIGMOD, 47-57.

Hopfield, J.J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Nat. Acad. Sci. 79, 2554-2558.

Jain, A.K. and R.C. Dubes (1988). Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ.

Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biol. Cybernet. 43, 59-69.

Laurini, R. and D. Thompson (1992). Fundamentals of Spatial Information Systems. Academic Press, London.


Linde, Y., A. Buzo and R.M. Gray (1980). An algorithm for vector quantizer design. IEEE Trans. Commun. 28, 84-95.

Lippman, R.P. (1987). An introduction to computing with neural networks. IEEE Acoust. Speech Signal Process., 4-22.

Michalski, R.S. and R.E. Stepp (1983). Automated construction of classifications: conceptual clustering versus numerical taxonomy. IEEE Trans. Pattern Anal. Machine Intell. 5, 396-410.

Rumelhart, D.E. and J.L. McClelland (1986). Parallel Distributed Processing, Vols. I and II. MIT Press, Cambridge, MA.

Sabourin, M. and A. Mitiche (1993). Modeling and classification of shape using a Kohonen associative memory with selective multi-resolution. Neural Networks 6, 275-283.

Samet, H. (1990). The Design and Analysis of Spatial Data Structures. Addison-Wesley, Reading, MA.

Sammon, J.W. (1969). A nonlinear mapping for data structure analysis. IEEE Trans. Comput. 18, 401-409.

Spath, H. (1980). Cluster Analysis Algorithms. Ellis Horwood, Chichester, UK.