To appear in Information Science in Spring 2005.


A Database Clustering Methodology and Tool

Tae-Wan Ryu
Department of Computer Science

California State University, Fullerton
Fullerton, California 92834

[email protected]

Christoph F. Eick
Department of Computer Science

University of Houston
Houston, Texas 77204-3010

[email protected]

Abstract

Clustering is a popular data analysis and data mining technique. However, applying traditional clustering algorithms directly to a database is not straightforward due to the fact that a database usually consists of structured and related data; moreover, there might be several object views of the database to be clustered, depending on a data analyst’s particular interest. Finally, in many cases, there is a data model discrepancy between the format used to store the database to be analyzed and the representation format that clustering algorithms expect as their input. These discrepancies have been mostly ignored by current research.

This paper focuses on identifying those discrepancies and on analyzing their impact on the application of clustering techniques to databases. We are particularly interested in the question of how clustering algorithms can be generalized to become more directly applicable to real-world databases. The paper introduces methodologies, techniques, and tools that serve this purpose. We propose a data set representation framework for database clustering that characterizes objects to be clustered through sets of tuples, and introduce preprocessing techniques and tools to generate object views based on this framework. Moreover, we introduce bag-oriented similarity measures and clustering algorithms that are suitable for our proposed data set representation framework. We also demonstrate that our approach is capable of dealing with relationship information commonly found in databases through bag-oriented clustering. Finally, we argue that our bag-oriented data representation framework is more suitable for database clustering than the commonly used flat file format and produces better-quality clusters.

Keywords and Phrases: database clustering, preprocessing in KDD, data mining, data model discrepancy, similarity measures for bags.

1 Introduction

Current technologies for collecting data such as scanners and other data collection tools have

generated a huge amount of data, and the volume of data is growing rapidly every year.

Database systems provide tools and an environment that manage and access the large volume of

data systematically and efficiently. However, extracting useful knowledge from databases is very

difficult without additional computer assistance and more powerful analytical tools. In general,

there is a significant gap between data generation and data understanding. Consequently,


automatic and powerful analytical tools for discovering useful and interesting patterns in databases are desirable. Knowledge discovery in databases (KDD) is such a generic approach to analyzing and extracting useful knowledge from databases using fully automated techniques. Recently, many techniques and tools [HK01] have been proposed for this purpose. Popular KDD tasks include classification, data summarization, dependency modeling, and deviation detection.

The focus of this paper is database clustering. The goal of database clustering is to take a

database that stores information concerning a particular type of objects (e.g., customers or

purchases) and identify subgroups of those objects, such that objects belonging to the same

subgroup are very similar to each other, and such that objects belonging to different subgroups

are quite different from each other.

Figure 1. Example of Database Clustering

Suppose that a restaurant owner has a database that contains customer information and he wants

to obtain a better understanding of his main customer groups for marketing purposes. In order to

accomplish this goal, as depicted in Figure 1, the restaurant database will first be preprocessed

for clustering and a clustering algorithm is applied to the preprocessed data set; for example the

algorithm might reveal that there are three clusters in the customer database. Finally,

characteristic knowledge that summarizes each cluster can be generated, telling the restaurant


owner that his major customer groups are young people that come at midnight, white collar

people that come for dinner, and retirees that come for lunch. This knowledge will definitely be useful for marketing purposes and for designing his menu.

The paper is organized as follows. Section 2 introduces the different steps that have to be

taken when clustering a database, and explains how database clustering is different from

traditional flat file data clustering. Based on the discussion of Section 2, Section 3 introduces a

“new” data set representation framework for database clustering that characterizes objects

through sets of tuples. Moreover, preprocessing techniques for generating object views based on

this framework are introduced. In Section 4 similarity measures for our bag-oriented knowledge

representation framework are introduced. Section 5 introduces the architecture and the

components of a database clustering environment we developed. Moreover, the problems of

generalizing traditional clustering algorithms for database clustering will be addressed in this

section. Section 6 reviews the related literature and Section 7 summarizes the findings and

contributions of the paper.

2 Database Clustering

2.1 Steps of Database Clustering

Because database clustering has not been discussed much in the literature, we think it is useful to first discuss its different steps. In general, we consider

database clustering to be an activity that is conducted by passing through the following seven

steps:

(1) Define Object-View

(2) Select Relevant Attributes

(3) Generate Suitable Input Format for the Clustering Tool

(4) Define Similarity Measure

(5) Select Parameter Settings for the Chosen Clustering Algorithm

(6) Run Clustering Algorithm

(7) Characterize the Computed Clusters

The first three steps of the suggested database clustering methodology center on preprocessing

the database and on generating a data set that can be processed by the employed clustering

algorithm(s). In these steps, a decision has to be made about which objects in the database (databases usually contain multiple types of objects) and which of their properties will be used for the


purpose of clustering; moreover, the relevant information has to be converted to a format that

can be processed by the selected clustering tool(s). In the fourth step similarity measures for the

objects to be clustered have to be defined. Finally, in steps 5-7 the clustering algorithm has to be

run, and summaries of the obtained clusters are generated.

2.2 Differences between Database Clustering and Ordinary Clustering

Data collections are stored in many different formats such as flat files, relational or object-

oriented databases. The flat file format is the simplest and most frequently used format in the

traditional data analysis area. When using flat file format, data objects (e.g., records, cases,

examples) are represented through vectors in the n-dimensional space, each of which describes

an object, and the object is characterized by n attributes, each of which has a single value.

Almost all existing data analysis and data mining tools, such as clustering tools, inductive

learning tools, and statistical analysis tools, assume that data sets to be analyzed are represented

in a flat file format. The well-known inductive learning environment C4.5 [Quin93], similar decision-tree-based rule induction algorithms [Domi96], conceptual clustering algorithms such as

COBWEB [Fish87], AutoClass [Chee96], ITERATE [Bisw95], statistical packages, etc. make

this assumption.

Due to the fact that databases are more complex than flat files, database clustering faces

additional problems that do not exist when clustering flat files; these problems include:

- Databases contain objects that belong to different types; consequently, it has to be defined which objects in the database need to be clustered.
- Databases contain 1:1, 1:n, and n:m relationships between objects of the same and different types.
- The definition of object similarity is more complex due to the presence of bags of values (or related information) that characterize an object.
- Attributes of objects have different types, which makes the selection of an appropriate similarity measure more difficult.

The first two problems will be analyzed in more detail in the next two subsections; the third and fourth problems will be addressed in Section 4.

2.3 Support for Object Views for Database Clustering


Because databases usually contain objects belonging to different classes, there can be several

ways of viewing a database depending on what classes of objects need to be clustered. To

illustrate the problems of database clustering, let us use the following simple relational database

that consists of a Customer and a Purchase table; a particular state of this database is shown in

Figure 2 (a). The underlined attributes in each relation represent the primary key in the relation.

It is not possible to directly apply a clustering algorithm to a relational database, such as the

one that is depicted in Figure 2 (a). Before a clustering algorithm can be applied to a database it

has to be determined what classes of objects should be clustered: should customers or purchases

be clustered? After it has been decided which objects have to be clustered, in the next step

relevant attributes have to be associated with the particular objects to be clustered. The

availability of preprocessing tools that facilitate the generation of such object-views is highly

desirable for database clustering, because generating such object-views manually can be quite

time consuming.

2.4 Problems with Relationships

In general, a relational database usually consists of several related relations (or of related classes

when using the object-oriented model), which frequently describe many-to-one and many-to-many relationships between objects. For example, let us assume that we are interested in

clustering the customers belonging to the relational database that was depicted in Figure 2 (a). It

is obvious that the attributes found in the Customer relation alone are not sufficient to

accomplish this goal, because many important characteristics of persons are found in other

“related” relations, such as the Purchase relation. Prior to clustering customers, the relevant

information has to be extracted from the relational database and associated with each customer

object. We call a data structure that stores the results of this process an object view. An example

of such an object view is depicted in Figure 2 (c). The depicted data set was generated by

grouping related tuples into a unique object (based on cid). The attributes p.pgid, p.ptype, and p.amount are called related attributes with respect to the Customer relation because they had to be imported from a foreign relation, the Purchase relation in this particular case.

(a) A data collection consisting of two relations, Customer and Purchase. The underlined attributes are the keys in each relation; cid (customer id) is a foreign key in the relation Purchase. oid is an order id, pgid is a product group id, and ptype is a payment type (e.g., 1 for cash, 2 for credit card, and 3 for check). The cardinality ratio between the two relations is 1:n.

Customer
cid  name   age  gender
1    Johny  43   M
2    Andy   21   F
3    Post   67   M
4    Jenny  35   F

Purchase
oid  pgid  cid  ptype  amount  date
1    p1    1    1      400     02-10-96
1    p2    1    1      70      02-10-96
1    p3    1    1      200     02-10-96
2    p2    2    2      390     02-23-96
3    p3    2    3      100     03-03-96
4    p1    3    1      30      03-03-96

(b) A single-valued data set created by performing an outer join on cid. The related attributes (from the Purchase relation) have the prefix p; for example, p.pgid is a pgid in the Purchase relation.

cid  name   age  gender  p.oid  p.pgid  p.ptype  p.amount  p.date
1    Johny  43   M       1      p1      1        400       02-10-96
1    Johny  43   M       1      p2      1        70        02-10-96
1    Johny  43   M       1      p3      1        200       02-10-96
2    Andy   21   F       2      p2      2        390       02-23-96
2    Andy   21   F       3      p3      3        100       03-03-96
3    Post   67   M       4      p1      1        30        03-03-96
4    Jenny  35   F       null   null    null     null      null

(c) A multi-valued data set created by grouping related tuples into an object. For example, the three tuples that characterize Johny are grouped into one "Johny" object using separate bags for his product groups, payment types, and the amount spent on each product group.

cid  name   age  gender  p.pgid      p.ptype  p.amount
1    Johny  43   M       {p1,p2,p3}  {1,2,3}  {400,70,200}
2    Andy   21   F       {p2,p3}     {2,3}    {390,100}
3    Post   67   M       p1          1        30
4    Jenny  35   F       null        null     null

(d) A single-valued data set created by averaging the multi-valued attributes in (c). For symbolic multi-valued attributes such as p.pgid and p.ptype, we picked the first value in the bag (arbitrarily), since averages cannot be calculated.

cid  name   age  gender  p.pgid  p.ptype  p.amount
1    Johny  43   M       p1      1        223
2    Andy   21   F       p2      2        245
3    Post   67   M       p1      1        30
4    Jenny  35   F       null    null     null

Figure 2. Various representations of a data set consisting of two related relations

In general, as the example shows, object views frequently contain bags of values if the relationship cardinality between the two relations is 1:n. Note that in a relational database, n:m relationships are typically decomposed into two 1:n relationships. Unlike a set, a bag allows for duplicate elements, but the elements must take values from the same domain. For example, the bag {400, 70, 200} for the amount attribute might represent three purchases of 400, 70, and 200 dollars by the customer "Johny". Ryu and Eick [Ryu98c] call a data set such as the one in Figure 2 (c) a multi-valued data set and use the term single-valued data set for the traditional flat files in (a) or (b). They use curly brackets to represent a bag of values when the cardinality of the bag is greater than one (e.g., {1,2,3}), null to denote an empty bag, and give just the element if the bag has exactly one element.

Most traditional similarity measures for single-valued attributes cannot deal with multi-valued


attributes such as p.pgid, p.ptype, and p.amount. Measuring similarity between bags

of values requires group similarity measures. For example, how do we compute similarity

between a pair of objects, “Andy” and “Post” for a multi-valued attribute p.amount,

{390,100}:30, or between “Andy” and “Johny”, {390,100}:{400,70,200}? One simple idea may

be to replace the bag of values for multi-valued attributes by a single value by applying certain

aggregate function (e.g., average, sum or count), as depicted in Figure 2 (d). Another alternative

would be to use an outer join with respect to cid attribute to obtain a single-valued data set, as

depicted in Figure 2 (b).

The problem with the first approach is that by applying the aggregate function frequently

valuable information may be lost. For example, if the average purchase amount is used to

replace the bag of individual purchase amounts, this approach does not consider other potentially

relevant information, such as total amount, and the number of purchases, in computing

similarity. Another problem is that aggregate functions are only applicable to numerical

attributes. Using aggregate functions for symbolic attributes, such as the attributes pgid or ptype in the example database, does not make sense at all. In summary, the approach of replacing a bag of values by a single value faces serious technical difficulties.

If we look at the single-valued data set in Figure 2 (b), which was generated by using an outer join, we observe a different problem. A clustering algorithm would treat each tuple in the obtained single-valued data set as a separate object (e.g., Johny's 3 purchases would be considered to be different objects rather than data related to the customer "Johny"), which means that no longer would the 4 customers be clustered, but rather the 7 tuples; obviously, if our goal is to cluster customers, clustering purchases instead is quite confusing.


3 A Data Set Representation Framework for Database Clustering

In the following, a data set representation framework for database clustering is proposed;

similarity measures that are suitable in the context of the proposed framework will then be

introduced in Section 4. In general, the framework consists of the following mechanisms:

- An object identification mechanism that defines what classes of objects will be clustered and how those objects will be uniquely identified.
- Mechanisms to define modular units have to be provided; each modular unit represents a particular perspective of the objects to be clustered, and the similarity of different modular units is measured independently. In the context of the relational data model, modular units are defined as procedures that associate a bag of tuples with a given object. Using this framework, objects to be clustered are characterized by a set of bags of tuples, one bag for each modular unit.
- The similarity between two objects is measured as a weighted sum of the similarities of all its modular units. To be able to do that, a weight and a (bag) similarity measure have to be provided for each modular unit.

Figure 3. An example of the bag-oriented clustering framework:

cid  Modular Unit 1  Modular Unit 2                  Modular Unit 3
     (Age, Gender)   (Pid, Amount)                   (Sum(amount), Date)
1    (43, M)         {(P1,400), (P2,70), (P3,200)}   {(670, 2/10/96)}
2    (21, F)         {(P2,390), (P3,100)}            {(390, 2/23/96), (100, 3/3/96)}
3    (67, M)         {(P1,30)}                       {(30, 3/3/96)}
4    (35, F)         {}                              {}

To illustrate this framework, let us assume that we are still interested in clustering customers. In

this case the attribute cid of the relation Customer that uniquely identifies customers serves as

our object identification mechanism. After the object identification mechanism has been

selected, relevant attributes to define similarity between customers have to be selected. In the

particular case, we assume that we consider the customer’s age/gender information, the amount

of money they spend on various product groups, and the customer’s daily spending pattern to be


relevant for defining customer similarity. In the next step, modular units to measure customer

similarity have to be defined. In this particular example, we identify three modular units each of

which characterizes customers through a set of tuples. For example, the customer with cid 1 is

characterized as a 43 years old male, who spent 400, 70, and 200 dollars on product groups p1,

p2, and p3, and who purchased all his goods in a single day of the reporting period, spending

total 670 dollars. There are different approaches to define modular units. When the relational

data model is used, modular units can be defined using SQL queries that associate customers (using cid) with a set of tuples that are specific to the modular unit. In the example depicted in Figure 3, the following three SQL queries associate customers with the characteristic knowledge with respect to each modular unit:

Modular Unit 1 := SELECT cid, age, gender FROM Customer;

Modular Unit 2 := SELECT Customer.cid, pgid, amount FROM Customer, Purchase WHERE Customer.cid = Purchase.cid;

Modular Unit 3 := SELECT Customer.cid, SUM(amount), date FROM Customer, Purchase WHERE Customer.cid = Purchase.cid GROUP BY Customer.cid, date;
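To make this concrete, the following Python sketch (our illustration; the actual tool described in Section 5 is implemented in Visual Basic) groups the tuples returned by such a modular-unit query into one bag per customer. sqlite3 and the file name example.db are stand-ins for the actual DBMS.

import sqlite3
from collections import defaultdict

def modular_unit_bags(conn, query):
    """Run a modular-unit query whose first column is the object id
    and group the remaining columns into one bag per object."""
    bags = defaultdict(list)
    for row in conn.execute(query):
        bags[row[0]].append(tuple(row[1:]))
    return dict(bags)

# Hypothetical database file holding the Customer and Purchase tables:
conn = sqlite3.connect("example.db")
unit2 = modular_unit_bags(conn,
    "SELECT Customer.cid, pgid, amount "
    "FROM Customer, Purchase WHERE Customer.cid = Purchase.cid")
# unit2[1] -> [('p1', 400), ('p2', 70), ('p3', 200)] for customer Johny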

As we have seen throughout the discussions of the last two sections, many different object

views can be constructed from a given database. There are “simple” object views based on flat

file format, such as those in Figure 2 (b) and Figure 2 (d); in this section, a more complicated

scheme for defining object views has been introduced that characterizes objects through sets of

bags of tuples. We claim that this data set representation framework is more suitable for database

clustering, and will present arguments to support our claim in Section 5.

When following the proposed methodology, object views based on the definition of modular

units are constructed. In the next step similarity measures have to be defined with respect to the

chosen object view; this is the subject of the next section.


4 Similarity Measures for Database Clustering

In the previous section, we introduced a data set representation framework for database

clustering. In this section, we will introduce several similarity measures that are suitable for the

proposed framework.

As discussed earlier, in the proposed framework each object to be clustered is described

through a set of bags of tuples—one bag for each modular unit. In the case of single-valued data

sets each bag degenerates to a single tuple. When defining object similarity for this framework

we assume that a similarity measure is used to evaluate object similarity with respect to a

particular modular unit. Object-similarity itself is measured as the weighted sum of the similarity

of its modular units. More formally:

Let

- $O$ be the set of objects to be clustered,
- $a, b \in O$,
- $m_i : O \rightarrow X_i$ denote the function that computes the bag of tuples of the $i$th modular unit,
- $s_i$ denote the similarity function for the $i$th modular unit,
- $w_i$ denote the weight for the $i$th modular unit.

Based on these definitions, the similarity between two objects $a$ and $b$ can be defined as follows:

$S(a, b) = \sum_{i=1}^{n} w_i \, s_i(m_i(a), m_i(b))$    (0)

where $n$ is the number of modular units.
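As a minimal sketch (our notation, under the assumption that per-unit bag similarity functions and weights have already been chosen), formula (0) can be coded directly:

def object_similarity(a_bags, b_bags, sims, weights):
    """Formula (0): weighted sum of per-modular-unit bag similarities.

    a_bags, b_bags -- lists of bags, one bag per modular unit
    sims           -- list of bag similarity functions, one per unit
    weights        -- list of weights, one per unit
    """
    return sum(w * s(x, y)
               for w, s, x, y in zip(weights, sims, a_bags, b_bags))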

Figure 4 illustrates how the similarity measure is computed between two objects, Object_a and Object_b, from their modular units in our similarity framework.

Figure 4. Similarity framework: Object_a and Object_b are each decomposed into modular units 1, 2, ..., n; the per-unit similarities are combined using the weights w_1, w_2, ..., w_n.

There are many similarity metrics and concepts proposed in the literature from a variety of

disciplines including engineering, science [Ande73, Ever93, Jain88, Wils97] and psychology


[Ashb88, Shep62]. In this paper, we broadly categorize attributes into quantitative and qualitative types, introduce existing similarity measures based on these two types, and generalize them to cope with the special characteristics of our framework.

4.1 Similarity Measures for Quantitative Types

A class of distance functions, known as Minkowski metric, is the most popularly used

dissimilarity function for the quantitative attributes. It is defined as follows:

$d_r(a,b) = \left( \sum_{i=1}^{m} |a_i - b_i|^r \right)^{1/r}, \quad r \geq 1$    (1)

where $a$ and $b$ are two objects with $m$ quantitative attributes, $a = (a_1, \ldots, a_m)$ and $b = (b_1, \ldots, b_m)$. For $r = 2$, it is the Euclidean metric, $d_2(a,b) = \sqrt{\sum_{i=1}^{m} (a_i - b_i)^2}$; for $r = 1$, it is the city-block (also known as taxicab or Manhattan) metric, $d_1(a,b) = \sum_{i=1}^{m} |a_i - b_i|$; and for $r = \infty$, it is the dominance metric, $d_\infty(a,b) = \max_i |a_i - b_i|$. The Euclidean metric is the most commonly used similarity function of the Minkowski family. Wilson and Martinez [Wils97] discuss many other distance functions and their properties.
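For concreteness, a small sketch of the Minkowski family (1); r = 2, r = 1, and r = math.inf give the Euclidean, city-block, and dominance metrics, respectively:

import math

def minkowski(a, b, r=2):
    """Minkowski distance (1) between attribute vectors a and b, r >= 1."""
    if math.isinf(r):                       # dominance metric
        return max(abs(x - y) for x, y in zip(a, b))
    return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1.0 / r)

minkowski([43, 400], [21, 390])             # Euclidean (r = 2)
minkowski([43, 400], [21, 390], r=1)        # city-block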

One simple way to measure the similarity between modular units in our similarity framework is to substitute group means for the $i$th attribute of an object in the formulae for inter-object measures such as the Euclidean distance, the city-block distance, or the squared Mahalanobis distance [Jain88]. For example, suppose that group $A$ has the mean vector $\bar{a} = [\bar{a}_1, \bar{a}_2, \ldots, \bar{a}_m]$ and group $B$ has the mean vector $\bar{b} = [\bar{b}_1, \bar{b}_2, \ldots, \bar{b}_m]$; then the Euclidean distance between the two groups can be defined as

$d(A,B) = \sqrt{\sum_{i=1}^{m} (\bar{a}_i - \bar{b}_i)^2}$    (2)

The other approach is to measure the distance between their closest or furthest members,

one from each group, which is known as nearest neighbor or furthest neighbor distance

[Ever93]. This approach is used in hierarchical clustering algorithms such as single-linkage and

complete-linkage. The main problems with these two approaches are that the similarity is

insensitive to the quantitative variance and that it does not account for the cardinality of

elements in a group.


Another approach, known as group average, can be used to measure inter-group similarity.

In this approach, similarity between groups is measured by taking the average of all the inter-

object measures for those pairs of objects for which objects in the pair are in different groups.

For example, the average dissimilarity between group A and B can be defined as

$d(A,B) = \frac{1}{n} \sum_{i=1}^{n_a} \sum_{j=1}^{n_b} d(a_i, b_j)$    (3)

where $n = n_a n_b$ is the total number of object pairs, $n_a$ and $n_b$ are the numbers of objects in groups $A$ and $B$, respectively, and $d(a_i, b_j)$ is the dissimilarity function for a pair of objects $a_i \in A$, $b_j \in B$. Note that the dissimilarity function (usually a distance function)

can be easily converted into a similarity function by reciprocating it.
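A sketch of the group-average measure (3) for two bags of numeric values; the pairwise distance defaults to the absolute difference, and converting the result into a similarity (e.g., by reciprocating it) is left to the caller:

def group_average_distance(A, B, d=lambda x, y: abs(x - y)):
    """Formula (3): average of all pairwise distances between
    the members of bag A and the members of bag B."""
    if not A or not B:
        return float("inf")                 # an empty bag carries no information
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))

group_average_distance([390, 100], [30])    # Andy vs. Post on p.amount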

4.2 Similarity Measures for Qualitative Types

Two coefficients, the Matching coefficient and Jaccard’s coefficient, are the most commonly

used similarity measures for qualitative type of attributes [Ever93, Jain88]. The Matching

coefficient is the ratio of the number of features the two objects have in common, to the total

number of features. Jaccard’s coefficient is the Matching coefficient that excludes negative

matches. For example, for objects described by $m$ binary features, let $m_{11}$ be the number of features present in both objects; $m_{00}$ the number of features absent from both (the negative matches); and $m_{01}$ and $m_{10}$ the numbers of features present in one object but not the other. Then, the Matching coefficient and Jaccard's coefficient are defined as $(m_{11}+m_{00})/m$ and $m_{11}/(m - m_{00})$, respectively. There can be other coefficient variants giving more weight to either matching or mismatching features, depending on the accepted practice.
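As a small illustration (our code, not the paper's), both coefficients can be computed from equal-length binary feature vectors:

def matching_and_jaccard(x, y):
    """Matching and Jaccard coefficients for equal-length binary
    feature vectors x and y (values 0 or 1)."""
    m = len(x)
    m11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    m00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    matching = (m11 + m00) / m
    jaccard = m11 / (m - m00) if m > m00 else 1.0
    return matching, jaccard

matching_and_jaccard([1, 1, 0, 0], [1, 0, 0, 1])   # -> (0.5, 0.333...)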

The above coefficient measures can be extended to multi-valued qualitative attributes.

Restle [Rest59] has investigated the concepts of distance and ordering on sets. There are several

other set-theoretical models of similarity proposed [Ashb88, Tver77]. Tversky [Tver77]

proposed his contrast model and ratio model that generalize several set-theoretical similarity

models proposed at that time. Tversky considers objects as sets of features instead of geometric

points in a metric space. To illustrate his models, let a and b be two objects, and ma and mb

denote the sets of features associated with the objects a and b respectively. Tversky proposed the

following similarity measure, called the contrast model:

$S(a,b) = \theta f(m_a \cap m_b) - \alpha f(m_a - m_b) - \beta f(m_b - m_a)$    (4)

for some $\theta, \alpha, \beta \geq 0$; $f$ is a set operator (usually the set cardinality is used). Here, $m_a \cap m_b$ represents the features that are common to both $a$ and $b$; $m_a - m_b$, the features that belong to $a$ but not to $b$; and $m_b - m_a$, the features that belong to $b$ but not to $a$. In the previous models, the similarity

between objects was determined only by their common features, or only by their distinctive

features. In the contrast model, the similarity of a pair of objects is expressed as a weighted difference of the measures of their common and their distinctive features. The following similarity measure represents the ratio model:

$S(a,b) = \frac{f(m_a \cap m_b)}{f(m_a \cap m_b) + \alpha f(m_a - m_b) + \beta f(m_b - m_a)}, \quad \alpha, \beta \geq 0$    (5)

In the ratio model, the similarity value is normalized to the range $[0, 1]$. The ratio model generalizes a wide variety of similarity models that are based on matching coefficients for qualitative attributes, as well as several other set-theoretical models of similarity [Eisl59]. For example, if $\alpha = \beta = 1$, then $S(a,b)$ becomes $f(m_a \cap m_b)/f(m_a \cup m_b)$, the set form of the matching coefficients discussed in Section 4.2. Note that the set in Tversky's model is a crisp

set. Santini et al. [Santi96] extend Tversky’s model to cope with fuzzy sets.
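The ratio model (5) also carries over naturally to the bags of our framework if f is taken to be the multiset cardinality; a sketch using collections.Counter (with alpha = beta = 1 it reduces to the Jaccard-style coefficient mentioned above):

from collections import Counter

def tversky_ratio(bag_a, bag_b, alpha=1.0, beta=1.0):
    """Ratio model (5) with f = multiset cardinality."""
    ma, mb = Counter(bag_a), Counter(bag_b)
    common = sum((ma & mb).values())        # f(ma intersect mb)
    only_a = sum((ma - mb).values())        # f(ma - mb)
    only_b = sum((mb - ma).values())        # f(mb - ma)
    denom = common + alpha * only_a + beta * only_b
    return common / denom if denom else 1.0

tversky_ratio([1, 2, 3], [2, 3])            # Johny vs. Andy on p.ptype -> 2/3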

Wilson and Martinez [Wils97] discuss the Value Difference Metric (VDM) introduced by

Stanfill and Waltz (1986) and propose the Heterogeneous Value Difference Metric (HVDM) for

handling nominal attributes. Gibson et al. [Gibs98] introduce a sophisticated approach to handle

the similarity measure arising from the co-occurrence of values in a data set using an iterative

method for assigning and propagating weights on the qualitative values. Their approach may

handle a limited form of transitive similarity, e.g., if $O_a$ is similar to $O_b$ and $O_b$ is similar to $O_c$, then $O_a$ is considered to be similar to $O_c$.

4.3 Similarity Measures for Mixed Types

In many real world problems, we often encounter a data set with a mixture of attribute types.

Specifically, if algorithms are to be applied to databases, it may not be sensible to assume a

single type of attributes since data can be generated from multiple tables with different

properties in a given database.

A similarity measure proposed by Gower [Gowe71] is particularly useful for data with mixed

types of attributes. This measure is defined as:


$S(a,b) = \frac{\sum_{i=1}^{m} w_i \, s_i(a_i, b_i)}{\sum_{i=1}^{m} w_i}$    (6)

where $a$ and $b$ are two objects with $m$ attributes, $a = (a_1, \ldots, a_m)$ and $b = (b_1, \ldots, b_m)$. In this formula, $s_i(a_i, b_i)$ is the normalized similarity index, in the range $[0, 1]$, between the objects $a$ and $b$ as measured for the $i$th attribute, and $w_i$ is a weight for the $i$th attribute. The similarity index $s_i(a_i, b_i)$ can be any appropriate function among the similarity measures defined in Sections 4.1 and 4.2, depending on attribute types or applications.

Higher weights are assigned to more important attributes. As the reader may have already observed, our approach to assessing object similarity, defined in formula (0), relies on Gower's similarity measure and associates similarity measures $s_i$ with modular units that represent different facets

Wilson and Martinez [Wils97] introduce a comprehensive similarity measure called,

HVDM, IVDM, and the WVDM for handling mixed types of attributes. The Gower’s similarity

framework to deal with mixed types of attributes can be extended to Wilson and Martinez’s

framework by adding appropriate similarity measures for each type of attribute defined in

HVDM.

4.4 Support for the Contextual Assessment of Similarity

The similarity measures we introduced so far do not take into consideration that attributes are

frequently interpreted in the context of other attributes. For example, let us consider the data set in Figure 2, in which customer "Johny" made purchases in three product groups, p1 for $400, p2 for $70, and p3 for $200, and customer "Andy" spent $390 on product group p2 and $100 on product group p3. If the product group ids p1, p2, and p3 stand for "TV", "Fruit", and "Jewelry" respectively, it might not be sensible to compute the similarity of the purchase amount attribute between "Johny" and "Andy" without considering the type of product they bought, because purchases of fruit might not be considered similar to purchases of TV-related products, even if the amounts spent on the individual purchases are similar. That is, the similarity of the amount

attribute needs to be evaluated in the context of the product attribute. In the following, we will

introduce a new similarity measure for this purpose.


Let us assume that the similarity of attribute $\alpha$ has to be evaluated in the context of attribute $\beta$, which we denote by $\alpha|\beta$. Then we can define the similarity between two objects having attributes $\alpha$ and $\beta$ as the similarity of attribute $\alpha$ with respect to attribute $\beta$. The new similarity function is defined as follows:

$s_{\alpha|\beta}(a,b) = \frac{1}{k} \sum_{j=1}^{k} \delta_\beta^{(j)} \, s_\alpha^{(j)}$    (7)

where $\delta_\beta$ is a matching function for the context attribute $\beta$, $s_\alpha$ is a similarity function for the attribute $\alpha$, and $k$ is the number of elements in a bag. The value of $\delta_\beta$ is 1 for a qualitative attribute if both objects take the same value for the attribute $\beta$; otherwise, $\delta_\beta$ is 0 (i.e., no matching values). For a quantitative attribute, the value of $\delta_\beta$ is between 0 and 1 (i.e., a normalized distance value) and represents the degree of relevancy between the two objects on the attribute $\beta$. Note that the contextual relationship between $\alpha$ and $\beta$ is not commutative (e.g., $s_{\alpha|\beta} \neq s_{\beta|\alpha}$). In addition, we can theoretically expand $\alpha$ and $\beta$ to conjunctive or disjunctive lists of attributes. Accordingly, the general form of $\alpha|\beta$ can be

$\alpha_1 \wedge \alpha_2 \wedge \ldots \wedge \alpha_p \mid \beta_1 \; op \; \beta_2 \; op \ldots op \; \beta_n$,

where $op$ is either $\wedge$ or $\vee$, and $p$ and $n$ are the numbers of attributes involved in the similarity computation between two objects. However, since the similarity between two objects is computed attribute by attribute over the selected list of attributes, this can be rewritten as $\alpha \mid \beta_1 \; op \; \beta_2 \; op \ldots op \; \beta_n$. Some examples of contextual relationships are $\alpha|\beta$, $\alpha \mid \beta_1 \wedge \beta_2 \wedge \ldots \wedge \beta_n$, $\alpha \mid \beta_1 \vee \beta_2 \vee \ldots \vee \beta_n$, and so on. So, in the case of $\alpha \mid \beta_1 \wedge \beta_2$, the value of $\delta$ for a qualitative attribute is 1 when both objects take the same values for both attributes $\beta_1$ and $\beta_2$. In this definition, the information from the related multi-valued attributes is combined in an orderly way to give a similarity value. This similarity measure is embedded into our similarity framework.

Figure 5 illustrates how the similarity is computed under the contextual assessment; in particular, it shows how $\delta^{(k)}$ and $s^{(k)}$ are used when computing $s_{amount|pgid}$ between the example objects "Johny" and "Andy".

Objects  Product Id (beta)                  Amount (alpha)
Johny    TV (p1), Fruit (p2), Jewelry (p3)  400, 70, 200
Andy     Fruit (p2), Jewelry (p3)           390, 100

Matching values: $\delta^{(1)}$ (TV) = 0, $\delta^{(2)}$ (Fruit) = 1, $\delta^{(3)}$ (Jewelry) = 1.

Assuming the city-block metric, the normalized similarity indices can be computed as follows: $s^{(1)} = 0.0$, $s^{(2)} = 0.18$, $s^{(3)} = 0.5$.

The similarity between Johny and Andy for amount in the context of product Id is then $s_{amount|pgid}(Johny, Andy) = (0 \cdot 0.0 + 1 \cdot 0.18 + 1 \cdot 0.5)/2 = 0.34$.

Figure 5. Two objects with attributes that carry contextual information, and an example of the similarity computation

Note that the proposed contextual similarity is not designed to find sequential patterns like PrefixSpan [Pei01] or to measure transitive similarity [Gibs98], but to take valid contextual information into account in the similarity computation.
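The following sketch shows one plausible reading of formula (7) and reproduces the Figure 5 example; normalizing the amount similarity by the larger of the two amounts is our assumption, chosen because it yields the indices 0.18 and 0.5 used in Figure 5:

def contextual_similarity(a, b):
    """One reading of s_{alpha|beta} (7): a and b are bags of
    (beta, alpha) pairs, e.g. (product group, amount)."""
    b_index = dict(b)                       # beta value -> alpha value
    k = min(len(a), len(b)) or 1            # size of the smaller bag
    total = 0.0
    for beta, alpha in a:
        if beta in b_index:                 # delta = 1 on a beta match
            x, y = alpha, b_index[beta]
            total += 1.0 - abs(x - y) / max(x, y)   # normalized amount similarity
    return total / k

johny = [("TV", 400), ("Fruit", 70), ("Jewelry", 200)]
andy  = [("Fruit", 390), ("Jewelry", 100)]
contextual_similarity(johny, andy)          # ~0.34, as in Figure 5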

5 Architecture of Database Clustering System

Figure 6 depicts the architecture of our database clustering system that we are currently

developing. The system consists of three major tools: a data preparation tool, a clustering tool,

and a similarity measure tool. The data preparation tool is used to generate an object view from a

relational database based on the user’s requirements. The clustering tool guides the user to

choose an appropriate clustering algorithm for an application, from the library of clustering

algorithms that contains various algorithms such as nearest-neighbor, hierarchical clustering, etc.

Once a clustering algorithm has been selected, the similarity measure tool will assist the user in

constructing an appropriate similarity measure for his/her application and the chosen clustering

algorithm. While constructing the similarity measure, the system inquires about the types, weights, and other characteristics of attributes, offering alternatives and

choices to the user, if more than one similarity measure seems to be appropriate.

[Figure 6 shows a user interface on top of the data preparation tool, the similarity measure tool (backed by a library of similarity measures, type and weight information, and default choices and domain information), and the clustering tool (backed by a library of clustering algorithms); the data preparation tool derives an object view from the DBMS, and the clustering tool returns a set of clusters.]

Figure 6. Architecture for database clustering

In the case that the user does not provide the necessary information, default assumptions are

made based on the types of attributes (e.g., Euclidean distance is chosen for the quantitative

types and Tversky’s ratio model is our default choice for the qualitative types). The range value

information for quantitative type of attributes can be easily retrieved from a given data set by

scanning the column vector of quantitative attributes. The range value information is used to

normalize the similarity index. Normalizing the similarity index is important in combining

similarity values of all attributes with possibly different types. Finally, the clustering tool takes

the constructed similarity measure and the object view as its input and returns a set (or a

hierarchy) of object clusters as its output.

5.1 A Framework to Generate Object Views from Databases

[Figure 7 shows the user interface collecting the user's interests and objectives (database name, data set of interest, object attribute(s), and selected attributes) from a structured database, and producing either a bag-based object view, consumed by our generalized clustering algorithms, or a flat file-based object view, consumed by conventional clustering algorithms, as processed data.]

Figure 7. A framework for generating object views

Figure 7 illustrates the proposed framework for generating object views from a database. One of

the key ideas of the proposed research for dealing with the problems raised in Section 2.2 is to

develop a semi-automatic data preparation tool that generates object views from a (relational)

database based on the user’s interests and objectives. The tool basically automates the first three

steps of the database clustering methodology that was introduced in Section 2.1. The tool will be

interactive so that the user can define his/her object-view and the relevant attributes; based on

these inputs an object-view will be automatically generated by the tool. In order to generate an

object view from a database, our approach is first to enter a database name, to select a table

called a data set of interest, object attribute(s), and selected attributes. A data set of interest is

an anchor table for other related tables the user is interested in for clustering. The object

attribute(s) (e.g., usually a key attribute in a data set of interest) define the object-view of the

particular clustering task. An object in relational database is defined as a collection of tuples that

have the same value for all object attribute(s). The set of tuples is viewed to describe the same

object. Consequently, when generating object view, information in tuples that agree in the object

attributes should be combined into a single object in the format shown in Figure 2 (c), whereas

those that do not agree are represented as different objects in the generated object view. The

selected attributes are attributes in all the related tables the user has chosen. Although the tool

can generate an object-view in conventional flat-file format for conventional clustering

algorithms, the main format of the object-view in our approach is bag-based.


Figure 9 shows our implemented interface for the data preparation tool to generate an

object view from a relational database. We used Visual Basic to implement this tool. Using the

information provided by the user through the interface, the algorithm to generate an object-view works as follows: given the database name and the data set of interest, the attributes of the data set of interest are first extracted from the database; next, the related attributes are selected by joining (usually an outer join) with the related tables; finally, the object attribute(s) are selected from these attributes, and the object-view is created by grouping the tuples with the same values for the object attribute(s) into one object with bags of values for the related attributes [Ryu98c and Zehu98 give a more detailed description of the algorithm].

Figure 9. Interface for data preparation tool
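The grouping step itself can be expressed compactly; the following pandas sketch (our illustration with the column names of Figure 2, not code from the Visual Basic tool) produces the bag-based format of Figure 2 (c):

import pandas as pd

def build_object_view(anchor, related, object_attr, selected):
    """Outer-join the anchor table with a related table, then group the
    joined tuples into one object per object-attribute value, collecting
    the selected related attributes into bags."""
    joined = anchor.merge(related, on=object_attr, how="left")
    bags = (joined.groupby(object_attr)[selected]
                  .agg(lambda col: list(col.dropna()) or None))
    return anchor.set_index(object_attr).join(bags)

# Hypothetical usage with the Figure 2 tables:
# view = build_object_view(customers, purchases, "cid",
#                          ["pgid", "ptype", "amount"])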

5.2 Features of the Clustering Tool

Figure 10 shows the class diagram for our clustering tool in UML (Unified Modeling Language,

which is a notational language for software design and architecture [Mart97]). The class diagram


describes the developed classes, attributes, operations, and the relationships among classes.

GetAnalysisInfo class receives basic information from the user such as the name of the selected

data set, the attributes of interest, their data types, and the chosen similarity measure that

will be applied to the selected data set. ReadDataSetObjects class reads the selected data set.

Similarity Measure class defines our similarity measure. For the similarity measure in this

implementation, we chose the average dissimilarity measure for quantitative attributes and Tversky’s ratio model for qualitative attributes, together with the contextual assessment of

similarity. Clustering class defines a clustering algorithm that uses the similarity measure

defined in Similarity Measure class.

Figure 10. Class diagram for the clustering tool

For the clustering algorithm, we chose the Nearest-neighbor algorithm, which is a partitioning

clustering method. In the nearest-neighbor algorithm, two objects are considered similar and are

put in the same cluster if they are neighbors or share neighbors. In this algorithm, the first object $o_1$ from a data set $D = \{o_1, o_2, o_3, \ldots, o_n\}$, which is going to be partitioned into $K$ clusters, is assigned to a cluster $C_1$. For each remaining object $o_i$, the nearest neighbor among the objects already assigned to clusters is selected; $o_i$ is then assigned to the cluster $C_J$ of that nearest neighbor if the distance between $o_i$ and the found nearest neighbor is at most $t$ ($t$ is a threshold of the nearest-neighbor algorithm, selected by the user). Otherwise, the object $o_i$ is assigned to a new cluster $C_R$. This step is repeated until all objects in the analyzed data set are assigned to clusters. When using the nearest-neighbor algorithm, the user should provide the threshold value to be used in the clustering process. The threshold value sets the condition under which two objects can be grouped together in the same cluster. Consequently, the threshold value affects the number of generated clusters.

As the value of the threshold increases, fewer clusters are generated.
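A compact sketch of this procedure; dist can be any dissimilarity over the bag-based objects (for example, one derived from formula (0) by reciprocation), and t is the user-supplied threshold:

def nearest_neighbor_clustering(objects, dist, t):
    """Assign each object to the cluster of its nearest already-clustered
    neighbor if that distance is at most t; otherwise open a new cluster."""
    clusters = []                            # each cluster is a list of objects
    for o in objects:
        best, best_d = None, None
        for c in clusters:
            d = min(dist(o, member) for member in c)
            if best_d is None or d < best_d:
                best, best_d = c, d
        if best is not None and best_d <= t:
            best.append(o)                   # join the nearest cluster
        else:
            clusters.append([o])             # start a new cluster
    return clusters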

We selected the Nearest-neighbor algorithm because it is directly applicable for clustering

the proposed bag-based object-view. Other algorithms that compute similarity directly between

two objects are also applicable to our framework. However, generalizing a clustering algorithm such as K-means is not trivial, because of difficulties in computing centroids for clusters of

objects that are characterized by sets of bags of values.

Figure 11. Clustering tool



Figure 11 illustrates our implemented interface of the clustering tool. In order to cluster an

object-view generated by the data preparation tool, the user needs to select a tab for a flat-file based or bag-based data set, the attributes, the types of the attributes, the corresponding weights, the threshold value, and the output of the clustering.

5.3 Database Clustering Examples

We used two relational databases: the Movies database, available in the UC-Irvine data set archive [UCIML04], and an online customer database received from a local company [Ryu02]. In these experiments, we generated three different data sets for each database using three different data representation formats: a single-valued data set and an average-valued data set for the conventional representation, and a multi-valued data set for the proposed representation, like the formats shown in Figure 2, in order to see whether the data set representation affects the quality of clusters. For the clustering algorithm, we chose the nearest-neighbor algorithm. For the similarity measure, we used formula (0) based on Gower’s similarity function (6). For the multi-valued data set,

the function consists of the formula (3) for quantitative attributes, (5) for qualitative attributes,

and (7) for the contextual assessment. For the single-valued data set and average-valued data set,

the function consists of the formula (1) (Euclidean metric) for quantitative attributes and (5) for

qualitative attributes, but not the formula (7). However, other similarity measures can be also

incorporated into the proposed framework depending on the user’s choice.

The Movies database holds information about the movies that have been developed since

1900, such as their titles, types, directors, producers, and years of release. Table 1 lists the selected attributes, their types, and the assigned weights for the Movies database.


Attribute Name  Properties                    Assigned Weight
Film_Id         Single-valued, Qualitative    0
Year            Single-valued, Quantitative   0.2
Director        Single-valued, Qualitative    0.6
Category        Multi-valued, Qualitative     0.7
Awards          Multi-valued, Qualitative     0.5

Table 1. Movies database

The key attribute of this data set is “Film_Id”. All the attributes in this data set except for the attribute Year (we are not very interested in the year information) are of qualitative type. The attributes Year and Director are single-valued; Category and Awards are multi-valued. The

empirically selected threshold value for the clustering algorithm in this experiment is 0.36

[Sala00].

The number of clusters generated by each technique is shown in Table 2. The same

clustering algorithm, with the same similarity framework but slightly different similarity formulas, was applied to the different data sets. We did not generate the average-valued data set for the Movies database, because most attributes in the database are symbolic and cannot be easily converted into a meaningful quantitative scale.

Approach                 Number of Clusters
Single-valued approach   136
Average-valued approach  N/A
Multi-valued approach    130

Table 2. Number of clusters generated by each approach

Both the single-valued approach and the multi-valued approach produced the same clusters for

the single-valued objects. This is not surprising since the multi-valued approach is basically a

generalization of the single-valued approach. The number of clusters generated by the multi-valued approach is smaller than that generated by the single-valued approach.


Film_id                              Category    Director    Awards  Year
Asf10, T:I’ll be Home for Christmas  Comd        D:Sanford   Null    1998
Asf8, T:A Very Brady Sequel          Comd        D:Sanford   Null    1996
Atk10, T:Hilary and Jackie           BioP, Comd  D:A.Tucker  Null    1998
Atk12, T:Map of Love                 BioP, Comd  D:A.Tucker  Null    1999

Table 3. Some objects in cluster-A from the multi-valued approach

Moreover, as we expected, in the clustering result of the single-valued approach the same objects with multiple values were grouped into different clusters. Obviously, there is no such problem in the multi-valued approach. Some objects of a cluster generated by the multi-valued approach are shown in Table 3. In this cluster, there are four different objects with similar properties. Note that the attribute Category has the highest weight. As Tables 4 and 5 illustrate, some objects appear in two different clusters generated by the single-valued approach even though they belong to the same cluster shown in Table 3.


Film_id                     Category  Director    Awards  Year
Atk10, T:Hilary and Jackie  BioP      D:A.Tucker  Null    1998
Atk12, T:Map of Love        BioP      D:A.Tucker  Null    1999

Table 4. Some objects in cluster-B from the single-valued approach

Film_id                              Category  Director    Awards  Year
Asf10, T:I’ll be Home for Christmas  Comd      D:Sanford   Null    1998
Asf8, T:A Very Brady Sequel          Comd      D:Sanford   Null    1996
Atk10, T:Hilary and Jackie           Comd      D:A.Tucker  Null    1998
Atk12, T:Map of Love                 Comd      D:A.Tucker  Null    1999

Table 5. Some objects in cluster-C from the single-valued approach

For example, the two objects, “Atk10, T: Hilary and Jackie” and “Atk12, T:Map of Love” are

grouped into different clusters by the single-valued approach. Note that these objects appear in

both clusters in the single-valued approach. This clustering result may confuse data analysts.

We could not compare the quality of clusters for each data set since no class information for the

Movies database is available.

In the second experiment, we used an online customer database for a local Internet

company that sells popular climate control products such as portable heaters, window air conditioners, etc. The size of this data set is 25,221 records after eliminating redundant,

incomplete, and inconsistent data. Ryu and Chang [Ryu02] have studied this database to identify

the characteristics of customers using decision tree [Quin93] and association rule mining

[Agra93] approaches. They found three major groups of customers for the company as shown in

Figure 12.


Figure 12. Three customer groups with higher buying tendency. The vertical axis represents the percentage of buyers.

So, the attributes shown in Table 6 were selected and their weights assigned based on the analysis result by Ryu and Chang [Ryu02].

Attribute Name  Properties                    Assigned Weight
CustID          Single-valued, Qualitative    0
Age             Single-valued, Quantitative   0.7
Ethnic group    Single-valued, Qualitative    0.7
Amount          Multi-valued, Quantitative    0.6
PayType         Multi-valued, Qualitative     0.6
City            Single-valued, Qualitative    0.8
State           Single-valued, Qualitative    0.8

Table 6. Online customer database

Again, in this experiment, we want to see whether the clusters generated by each data set

representation approach are compatible with the previous analysis result. The clustering result is shown in Table 7. The multi-valued approach generates fewer clusters than the other approaches do. However, the number of clusters generated by each approach is much larger than

the three groups shown in Figure 12.

Approach                 Number of Clusters
Single-valued approach   251
Average-valued approach  95
Multi-valued approach    81

Table 7. Number of clusters generated by each approach

So we manually examined the contents of each cluster and found that many clusters can be

eventually merged into the three groups shown in Figure 12. This job was much easier for the

clusters generated by the multi-valued approach. However, for the clusters generated by the


single-valued approach, it was very difficult since the same objects with multi-values appear in

different clusters. For example, Table 8 shows some objects in cluster-A generated by the single-valued approach.

CustID  Age  Ethnic group  Amount  PayType  City         State
12001   27   A             25      Credit   Brooklyn     NY
13100   30   A             30      Credit   Newark       NJ
12200   33   A             50      Credit   Los Angeles  CA
13200   29   B             55      Credit   Bronx        NY

Table 8. Some objects in cluster-A generated by the single-valued approach

Table 9 shows some objects assigned to other clusters generated by the single-valued approach. As we can see, the customers 12001 and 13200 appear again in Table 9: each is represented as two different objects assigned to different clusters, although each should have been grouped into either cluster-A or cluster-B.

CustID  Age  Ethnic group  Amount  PayType  City       State
12001   27   A             280     Paypal   Brooklyn   NY
12005   30   B             125     Paypal   Sunnyside  NY
13200   29   B             280     Paypal   Bronx      NY
13200   29   B             235     Paypal   Bronx      NY

Table 9. Some objects in cluster-B generated by the single-valued approach

There are 157 customers (out of 5,271 customers) grouped into more than one cluster, like customers 12001 and 13200. The average-valued approach and the multi-valued approach do not create this type of confusion. However, the clustering result of the average-valued approach is not as accurate as that of the multi-valued approach, or even that of the single-valued approach. One possible reason is that, in the average-valued approach, the mapping from qualitative to quantitative data, or the representative value used for a qualitative attribute when such a mapping is not possible (the first value picked from the tuples of an object), might be inappropriate (see the example format in Figure 2 (d)). In summary, the overall quality of the clusters generated by the multi-valued approach is better than that of the other approaches. In addition, analyzing the clustering result generated by the multi-valued approach is much easier.

Intuitively, one might expect the run-time of the multi-valued approach to be longer than that of the other approaches because of the additional computation. However, the overall run-times, including preprocessing, were not very different for the three approaches. This may be because the multi-valued approach deals with far fewer records during clustering than the single-valued approach. The average-valued approach requires additional preprocessing time.

6 Related Work on Structural Data Analysis

In this section, we review the literature on approaches for dealing with structural data sets. We

categorize those approaches into two general groups: the data set conversion approach, which converts a database into a single flat data set without modifying the data mining methods, and the method generalization approach, which generalizes data mining algorithms so that they can directly

be applied to structured objects. We also discuss the previously proposed database clustering

algorithms.

6.1 Data Set Conversion Approach

In order to convert a structured data set into a single flat file, related data sets are usually joined

and various aggregate functions and/or generalization operators have to be applied to remove

multi-valued attributes (for example, by averaging sets of values or by storing generalizations)

before data mining techniques are applied to the given data set. Conventional data mining

techniques are then applied to the “flattened” data set without any need for generalization. Many

existing statistical analysis techniques and data mining techniques employ this approach

[Agra93, Shek96, Bisw95, Han96, Haim97].

Nishio et al. [Nish93] proposed generalized operators that can be applied to convert a set of

values into a higher-level of concept description that can encompass the set of values for data

mining techniques in an object-oriented database framework. For example, a set of values,

{tennis, soccer, volleyball}, can be generalized as a single value of the higher-level concept


“sports”. They categorize attributes into several types, such as single-valued attribute, set-valued

attribute, list-valued attribute, and structure-valued attribute; and they propose the generalization

mechanisms for each category of attribute.

Applying a generalization operator to the related values may be a reasonable idea, since the

generalized values for an attribute may preserve the related information. However, it may not always be possible to generalize a set of values into a correct and consistent high-level concept description, especially for quantitative attributes, since there can be several ways to generalize the same set of values. Moreover, in many application domains suitable generalization

hierarchies for symbolic attributes are not available. Gibson [Gib00] and similarly Ganti

[Gan99] introduce novel formalizations of a cluster for categorical attributes and propose

clustering algorithms for data sets with categorical attributes.

DuMouchel et al. [DuMo99] proposed a methodology that "squashes" flat files using statistical approaches, mainly to resolve the scalability problem of data mining. The methodology consists of three steps: grouping, momentizing, and generating. These steps describe a squashing pipeline in which the original data set is partitioned into mutually exclusive groups; within each group a series of low-order moments is computed; and these moments are finally passed to a routine that generates pseudo-data that accurately reproduces the moments. They claim that the squashed data set keeps the structure of the original data set.
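
The momentizing step can be caricatured as follows (a toy rendering of the published idea, not the authors' implementation; the random data, the grouping, and the moment order are hypothetical):

import numpy as np

def momentize(data, group_ids, order=2):
    """Compute low-order moments (count, mean of x, mean of x^2, ...)
    per group; a generator would later produce pseudo-data that
    reproduces exactly these moments."""
    moments = {}
    for g in np.unique(group_ids):
        x = data[group_ids == g]
        moments[g] = [len(x)] + [float(np.mean(x ** k))
                                 for k in range(1, order + 1)]
    return moments

rng = np.random.default_rng(0)
values = rng.normal(size=100)          # hypothetical numeric attribute
groups = rng.integers(0, 3, size=100)  # mutually exclusive groups
print(momentize(values, groups))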

6.2 Method Generalization Approach

The other way to cope with structured data sets is to generalize existing data mining methods so that they can perform data mining tasks in structured domains. A few approaches that directly represent structured data sets using more complex data structures, and that generalize data mining techniques for those data structures, have been proposed in the literature [Gold95, Haye78, Step86, Mich83, Wass85, Thom91, Kett95, Mana91, Mcke96, Hold94, Biss92, Kiet94, Kauf96]. In this section, we only review the approaches we consider most relevant to database clustering.

Goldberg and Senator [Gold95] restructure databases for data mining through consolidation and link formation. Consolidation relates identifiers present in a database to a set of real-world entities (RWEs) that are not uniquely identified in the database. This process can be viewed as a transformation of representation from the identifiers present in the original database to the RWEs. Link formation constructs structured relationships between consolidated RWEs through identifiers and events explicitly represented in the database. Both consolidation and link formation may thus be interpreted as transformations of representation from the identifiers originally present in a database to the RWEs of interest. McKearney and Roberts [Mcke96] produce a single data set for data mining by generating a query after analyzing dependencies between attributes and relationships between data sets (e.g., relations or classes). This is somewhat similar to our approach, except that our approach employs sets of queries (and not a single query) to construct modular units for similarity assessment.

LABYRINTH [Thom91] is a system that extends the well-known conceptual clustering system COBWEB [Fish87] to structured domains. It forms a concept hierarchy incrementally and integrates many interesting features of earlier systems, such as incremental, probabilistic, and unsupervised learning, as well as the handling of relationships and components. For example, it learns probabilistic concepts and decomposes objects into sets of components to constrain matching, like MERGE [Wass85]. LABYRINTH can make effective generalizations by using a more powerful structured representation language.

Ketterlin et al. [Kett95] also generalize COBWEB to cope with complex (or composite) objects, i.e., objects that are related to many other objects (their components) through 1:n or n:m relationships in a structured database. The basic idea of their system is to characterize a cluster of composite objects using component clustering; that is, the components are clustered first, leading to a hierarchy of component clusters, and the composite objects are then clustered.
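
The component-first idea might be sketched as follows (a loose analogy only: we use k-means from scikit-learn for both steps, whereas Ketterlin et al. build a COBWEB-style hierarchy; the data, the cluster counts, and the histogram encoding of composite objects are our own assumptions):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical components (e.g., purchase tuples), each tagged with the
# id of the composite object (e.g., customer) that owns it.
components = np.array([[1.0, 2.0], [1.1, 1.9],   # owned by customer 0
                       [8.0, 9.0], [7.9, 9.2],   # owned by customer 1
                       [1.0, 2.1], [8.1, 9.1]])  # owned by customer 2
owner = np.array([0, 0, 1, 1, 2, 2])

# Step 1: cluster the components themselves.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(components)

# Step 2: describe each composite object by the histogram of its
# components over the component clusters, then cluster the composites.
histograms = np.array([np.bincount(labels[owner == o], minlength=2)
                       for o in np.unique(owner)])
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(histograms))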

The systems KBG [Biss92] and KLUSTER [Kiet94] both employ high-level languages (first-order logic and description logic, respectively). Both systems build a DAG (directed acyclic graph) of clusters instead of a hierarchy.

Ribeiro et al. [Ribe95, Kauf96] extend the discovery system INLEN [Kauf91] to discover knowledge in multiple data sets. In order to discover knowledge across multiple data sets, they include information on the primary and foreign keys of the target data sets (e.g., relations); that is, keys serve as links across the data sets, as is the case in our approach. INLEN first discovers knowledge in each database or relation; the discovered knowledge is then associated with related information using the foreign-key information; finally, all knowledge discovered for each database is integrated into a single knowledge base.


Gibson et al. [Gibs98] proposed an approach for clustering categorical data based on dynamical systems. The approach captures a similarity measure arising from the co-occurrence of values in a data set, using an iterative method for assigning and propagating weights over the qualitative values. Ganti et al. [Gant99] proposed an improved approach called CACTUS for categorical data clustering; it is based on inter-attribute and intra-attribute summaries, which are used to compute "candidate" clusters that can then be validated to determine the actual set of clusters.

6.3 Database Clustering Algorithms

Several clustering algorithms for large databases have been proposed, such as CLARANS [Ng94], DBSCAN [Este96], BIRCH [Zhan96], and STING [Wang97]. However, most of these algorithms are targeted at spatial databases, not at structural databases such as business-oriented relational or object-oriented databases. Moreover, like many conventional clustering algorithms, they also make the one-tuple-one-object assumption.

Bradley et al. [Brad98] proposed a scalable clustering framework for large databases based

on identifying regions of the data that are compressible, regions that must be maintained in

memory, and regions that are discardable. The framework focuses on the scalability of database

clustering algorithms.

7 Summary and Conclusion

In this paper, methodologies, techniques, and tools for clustering databases were introduced. One critical problem of database clustering is the data model discrepancy between the representation format used to store the target data and the input format that clustering algorithms expect. In most databases, data are stored in several tables or classes, and related information is represented as relationships among those tables or classes, while most traditional clustering algorithms assume that the input data are stored in a single flat file. Based on this observation, we showed that the traditional flat file format is not appropriate for storing related information, since it restricts each attribute in a data set to a single value, whereas once related objects in related tables or classes are combined, objects are frequently characterized by bags of values (or tuples). We proposed a better data representation format that relies on bags of tuples and modular units for database clustering, and introduced similarity measures that are suitable for the proposed representation framework. Moreover, we reviewed the features of a database clustering environment that employs the proposed representation framework.

We proposed a unified framework of similarity measures to cope with mixed-type attributes that may have sets or bags of values, and with the object relationships commonly found in databases. Most of the similarity measures that we recommend were introduced in the literature a long time ago; however, we also introduced a new similarity measure that allows attribute similarity to be defined in the context of other attributes.

We performed experiments clustering different types of data sets using the nearest-neighbor algorithm and analyzed the results. In these experiments, we conducted a cluster analysis for a data set represented in various formats, including the single-valued, average-valued, and multi-valued formats, to assess the effectiveness of the proposed framework. Based on our analysis, we found that the proposed multi-valued data representation approach produced clearer (better-quality) clustering results than the traditional data representation approaches (single-valued or average-valued data sets). Interpreting the clustering results generated by the proposed approach is also easier. In the clustering results generated by the traditional approach, as we expected, the same objects with multiple values are grouped into different clusters, which may confuse data analysts.

Future Work and Issues


Although we claim that the introduced knowledge representation framework is useful for representing related information, there are still several issues that have not yet been analyzed or understood in sufficient detail. In general, the proposed representation framework applies well to clustering algorithms that compute the similarity directly between pairs of objects; nearest-neighbor clustering and DBSCAN [Este96] are algorithms that belong to this category. For other clustering algorithms, such as K-means or COBWEB, a major modification would be required in order to cluster object views based on bags of tuples.
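
To see why pairwise-similarity algorithms are the natural fit, consider the following minimal sketch (the Jaccard-style bag measure and the payment-method bags are illustrative stand-ins of our own, not the similarity measures defined earlier in the paper):

from collections import Counter
from itertools import combinations

def bag_similarity(a, b):
    """Bag overlap: size of multiset intersection over multiset union."""
    ca, cb = Counter(a), Counter(b)
    inter = sum((ca & cb).values())
    union = sum((ca | cb).values())
    return inter / union if union else 1.0

# Hypothetical object views: each customer is a bag of payment methods.
views = {12001: ["Paypal", "Visa", "Visa"],
         13200: ["Paypal", "Paypal"],
         14005: ["Visa", "Amex"]}

# The O(n^2) pairwise similarity matrix that algorithms such as
# nearest-neighbor clustering or DBSCAN consume directly.
sim = {(i, j): bag_similarity(views[i], views[j])
       for i, j in combinations(views, 2)}
print(sim)

Any algorithm that consumes such a matrix can use the bag representation unchanged; K-means, by contrast, would additionally need a notion of a "mean bag", which is why it requires the major modification mentioned above.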

Generalizing decision tree algorithms such as C4.5 to cope with structural information appears to be another challenging problem. One way to approach it would be to generalize the decision tree learning algorithm so that it can be applied to bags of values (or tuples). Such a generalization could reuse the preprocessing techniques described in this paper, and would make it possible to apply concept learning algorithms directly to databases that consist of multiple related data sets, such as relational databases, which is currently not possible.

Another subject that needs to be investigated further is the scalability of the nearest-neighbor clustering framework we proposed. When running our prototype system, we observed that, as the number of objects and/or the amount of information associated with a particular object grows, the construction of the object similarity matrix becomes a performance bottleneck for our clustering algorithm. We believe that, in order to obtain better scalability, special, efficient data structures for object views need to be developed that facilitate the construction of the object views and the similarity computations for pairs of objects.


Bibliography

[Ande73] M.R. Anderberg, Cluster Analysis for Applications, Academic Press, New York, 1973.
[Agra93] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, In Proc. ACM SIGMOD, pp. 207-216, 1993.
[Ashb88] F.G. Ashby, N.A. Perrin, Toward a unified theory of similarity and recognition, Psychological Review 95(1), pp. 124-150, 1988.
[Biss92] G. Bisson, Conceptual clustering in a first order logic representation, In Proc. of the Tenth European Conference on Artificial Intelligence, John Wiley & Sons, 1992.
[Bisw95] G. Biswas, J. Weinberg, C. Li, ITERATE: A conceptual clustering method for knowledge discovery in databases, In Innovative Applications of Artificial Intelligence in the Oil and Gas Industry, B. Braunschweig, R. Day (Ed.), 1995.
[Brad98] P.S. Bradley, U. Fayyad, C. Reina, Scaling clustering algorithms to large databases, In Proc. of the 4th Int'l Conf. on Knowledge Discovery and Data Mining (KDD-98), New York, 1998.
[Chee96] P. Cheeseman, J. Stutz, Bayesian classification (AutoClass): Theory and results, In Advances in Knowledge Discovery and Data Mining, U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Ed.), AAAI/MIT Press, Cambridge, MA, pp. 153-180, 1996.
[Domi96] P. Domingos, Linear-time rule induction, In Proc. of the 2nd Int'l Conf. on Knowledge Discovery and Data Mining, Portland, Oregon, 1996.
[DuMo99] W. DuMouchel, C. Volinsky, T. Johnson, C. Cortes, D. Pregibon, Squashing flat files flatter, In Proc. of the Fifth ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (KDD-99), San Diego, California, 1999.
[Este96] M. Ester, H-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, In Proc. of the 2nd Int'l Conf. on Knowledge Discovery and Data Mining, Portland, Oregon, 1996.
[Ever93] B.S. Everitt, Cluster Analysis, Edward Arnold, co-published by Halsted Press, an imprint of John Wiley & Sons Inc., 3rd edition, 1993.
[Fish87] D. Fisher, Knowledge acquisition via incremental conceptual clustering, Machine Learning 2, pp. 139-172, 1987.
[Gib00] D. Gibson, J. Kleinberg, P. Raghavan, Clustering categorical data: An approach based on dynamical systems, VLDB Journal 8(3-4), pp. 222-236, 2000.
[Gant99] V. Ganti, J. Gehrke, R. Ramakrishnan, CACTUS - Clustering categorical data using summaries, In Proc. of the Fifth ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (KDD-99), San Diego, California, pp. 73-83, 1999.
[Gibs98] D. Gibson, J. Kleinberg, P. Raghavan, Clustering categorical data: An approach based on dynamical systems, In Proc. of the 24th Int'l Conf. on Very Large Databases, New York, 1998.
[Gowe71] J.C. Gower, A general coefficient of similarity and some of its properties, Biometrics 27, pp. 857-872, 1971.
[Haim97] I.J. Haimowitz, O. Gur-Ali, H. Schwarz, Integrating and mining distributed customer databases, In Proc. of the 3rd Int'l Conf. on Knowledge Discovery and Data Mining, Newport Beach, California, 1997.
[Han96] J. Han, Y. Fu, W. Wang, J. Chiang, W. Gong, K. Koperski, D. Li, Y. Lu, A. Rajan, N. Stefanovic, B. Xia, O.R. Zaiane, DBMiner: A system for mining knowledge in large relational databases, In Proc. of the 2nd Int'l Conf. on Knowledge Discovery and Data Mining, Portland, Oregon, 1996.
[Hart75] J.A. Hartigan, Clustering Algorithms, John Wiley & Sons, Inc., 1975.
[Haye78] F. Hayes-Roth, J. McDermott, An interference matching technique for inducing abstractions, Communications of the ACM 21, pp. 401-410, 1978.
[Han01] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
[Hold94] L.B. Holder, D.J. Cook, S. Djoko, Substructure discovery in the SUBDUE system, In Proc. of the AAAI-94 Workshop on Knowledge Discovery in Databases (KDD-94), Seattle, Washington, 1994.
[UCIML04] http://www.ics.uci.edu/AI/ML/Machine-Learning.html, 2004.
[Jain88] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ, 1988.
[Jarv73] R.A. Jarvis, E.A. Patrick, Clustering using a similarity measure based on shared near neighbors, IEEE Transactions on Computers C22, pp. 1025-1034, 1973.
[Kauf91] K.A. Kaufman, R.S. Michalski, L. Kerschberg, Mining for knowledge in databases: Goals and general description of the INLEN system, In Knowledge Discovery in Databases, AAAI/MIT Press, Cambridge, MA, 1991.
[Kauf96] K.A. Kaufman, R.S. Michalski, A method for reasoning with structured and continuous attributes in the INLEN-2 multistrategy knowledge discovery system, In Proc. of the 2nd Int'l Conf. on Knowledge Discovery and Data Mining, Portland, Oregon, 1996.
[Kett95] A. Ketterlin, P. Gancarski, J.J. Korczak, Conceptual clustering in structured databases: A practical approach, In Proc. of the 1st Int'l Conf. on Knowledge Discovery and Data Mining, Montreal, Quebec, 1995.
[Kiet94] J.-U. Kietz, K. Morik, A polynomial approach to the constructive induction of structural knowledge, Machine Learning 14, pp. 193-217, 1994.
[Lu78] S.Y. Lu, K.S. Fu, A sentence-to-sentence clustering procedure for pattern analysis, IEEE Transactions on Systems, Man and Cybernetics SMC-8, pp. 381-389, 1978.
[Mana91] M. Manago, Y. Kodratoff, Induction of decision trees from complex structured data, In Knowledge Discovery in Databases, AAAI/The MIT Press, pp. 289-306, 1991.
[Mart97] F. Martin, S. Kendall, UML Distilled: Applying the Standard Object Modeling Language, Addison Wesley Longman Inc., 1997.
[Mich83] R.S. Michalski, R.E. Stepp, Learning from observation: Conceptual clustering, In Machine Learning: An Artificial Intelligence Approach, Morgan Kaufmann, San Mateo, CA, pp. 331-363, 1983.
[Ng94] R.T. Ng, J. Han, Efficient and effective clustering methods for spatial data mining, In Proc. of the 20th Int'l Conf. on Very Large Data Bases, Santiago, Chile, pp. 144-155, 1994.
[Nish93] S. Nishio, H. Kawano, J. Han, Knowledge discovery in object-oriented databases: The first step, In Proc. of the AAAI-93 Workshop on Knowledge Discovery in Databases (KDD-93), Washington, 1993.
[Pei01] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, M-C. Hsu, PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth, In Proc. of the 17th Int'l Conf. on Data Engineering, Heidelberg, Germany, 2001.
[Quin93] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[Ribe95] J.S. Ribeiro, K. Kaufmann, L. Kerschberg, Knowledge discovery from multiple databases, In Proc. of the 1st Int'l Conf. on Knowledge Discovery and Data Mining, Montreal, Quebec, Canada, 1995.
[Ryu98a] T.W. Ryu, C.F. Eick, Discovering discriminant characteristic queries from databases through clustering, In Proc. of the Fourth Int'l Conf. on Computer Science and Informatics (CS&I'98), Research Triangle Park, NC, 1998.
[Ryu98b] T.W. Ryu, Discovery of Characteristic Knowledge in Databases using Cluster Analysis and Genetic Programming, Ph.D. Dissertation, Department of Computer Science, University of Houston, Houston, 1998.
[Ryu98c] T.W. Ryu, C.F. Eick, Similarity measures for multi-valued attributes for database clustering, In Proc. of the Conference on Smart Engineering System Design: Neural Networks, Fuzzy Logic, Evolutionary Programming, Data Mining and Rough Sets (ANNIE'98), St. Louis, Missouri, 1998.
[Ryu02] T.W. Ryu, W-Y. Chang, Customer analysis using decision tree and association rule mining, In Proc. of the Int'l Conf. on Smart Engineering System Design: Neural Networks, Fuzzy Logic, Evolutionary Programming, Artificial Life, and Data Mining (ANNIE'02), ASME Press, St. Louis, Missouri, 2002.
[Sala00] H. Salameh, Nearest-Neighbor Clustering Algorithm for Relational Databases, Master of Science Thesis, Department of Computer Science, California State University, Fullerton, 2000.
[Shek96] E.C. Shek, R.R. Muntz, E. Mesrobian, K. Ng, Scalable exploratory data mining of distributed geoscientific data, In Proc. of the 2nd Int'l Conf. on Knowledge Discovery and Data Mining, Portland, Oregon, 1996.
[Shep62] R.N. Shepard, The analysis of proximities: Multidimensional scaling with an unknown distance function, Part I, Psychometrika 27, pp. 125-140, 1962.
[Stan86] C. Stanfill, D. Waltz, Toward memory-based reasoning, Communications of the ACM 29, pp. 1213-1228, 1986.
[Step86] R.E. Stepp, R.S. Michalski, Conceptual clustering: Inventing goal-oriented classifications of structured objects, In Machine Learning: An Artificial Intelligence Approach 2, Morgan Kaufmann, San Mateo, CA, pp. 471-498, 1986.
[Thom91] K. Thompson, P. Langley, Concept formation in structured domains, In Concept Formation: Knowledge and Experience in Unsupervised Learning, D.H. Fisher, M. Pazzani, P. Langley (Ed.), Morgan Kaufmann, 1991.
[Tver77] A. Tversky, Features of similarity, Psychological Review 84(4), pp. 327-352, 1977.
[Wang97] W. Wang, J. Yang, R.R. Muntz, STING: A statistical information grid approach to spatial data mining, In Proc. of VLDB, pp. 186-195, 1997.
[Wass85] K. Wasserman, Unifying Representation and Generalization: Understanding Hierarchically Structured Objects, Doctoral Dissertation, Department of Computer Science, Columbia University, New York, 1985.
[Wils97] D.R. Wilson, T.R. Martinez, Improved heterogeneous distance functions, Journal of Artificial Intelligence Research 6, pp. 1-34, 1997.
[Zhan96] T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: An efficient data clustering method for very large databases, In Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, Montreal, Canada, pp. 103-114, 1996.
[Zehu98] W. Zehua, Design and Implementation of a Tool to Extract Structural Information from Relational Databases, Master of Science Thesis, Department of Computer Science, University of Houston, Houston, 1998.