Database Clustering and Summary Generation

Tae-Wan Ryu and Christoph F. Eick

Department of Computer Science

University of Houston

Talk Organization
  Database Clustering
  Problems of Database Clustering
  Extended Data Sets
  Similarity Measures for Sets and Bags
  An Architecture for Database Clustering
  Summary and Conclusion

Similarity Measures For Multi-valued Attributes for Database Clustering

General KDD Steps

Data sources -> [Select/preprocess] -> Selected/preprocessed data -> [Transform] -> Transformed data -> [Data mine] -> Extracted information -> [Interpret/Evaluate/Assimilate] -> Knowledge

Figure label: Data preparation

Research Goal

To develop methodologies, techniques, and tools to create summaries from databases using cluster analysis and genetic programming.

Our approach
  Partition the database into groups of similar objects using cluster analysis.
  Find commonalities that the objects belonging to each group share, using genetic programming.

Database Summary Generation Steps and Example

< Steps >
Database -> Database Clustering -> Groups of similar objects -> Summary Generation -> Summaries describing the commonalities within each group

< Example >
Restaurant database -> Clusters
Figure labels: Young, White collar, Retired; Midnight, Dinner, Lunch

An Example Schema Diagram

Relations and attributes
  Marriage    (hssn, wssn, mdate, numkid)
  Employee    (name, ssn, address, sex, salary, superssn, dno)
  Department  (dnum, dname)
  Dept_loc    (dnum, dloc)
  Project     (pname, pnum, ploc, dnum)
  Works_on    (essn, pno, hours)

Links and cardinality ratios
  husband, 1:n     wife, 1:n
  ehusband, n:1    ewife, n:1
  superv, n:1
  works_for, n:1   works_loc, 1:n
  works_on, 1:n    project, n:1
  control, 1:n

Preprocessing for Database Clustering

Preparing input data sets for clustering
  Appropriate data selection and preparation from a database is an important task.

Key Problems
  How to support a user's viewpoint, including attribute selection
  Data model discrepancy between the storage format and the input format that clustering algorithms assume
  How to cope with structural information, especially 1:n and n:m relationships

Input Format for Data Mining Algorithms

Data Format for Input Data Sets
  Single flat file format (basically, the data set has to be stored as a single(!) relation)
  Complex and structured formats

Problem: Almost all existing data mining and clustering approaches assume that the input data set is in single flat file format.

An Example Database to Illustrate the Problems with Relationship Information in Database Clustering

(a) An example Personal relational database

Person

ssn        name   age  sex
111111111  Johny  43   M
222222222  Andy   21   F
333333333  Post   67   M
444444444  Jenny  35   F

Purchase

ssn        location   ptype  amount  date
111111111  Warehouse  1      400     02-10-96
111111111  Grocery    2      70      05-14-96
111111111  Mall       3      200     12-24-96
222222222  Mall       2      300     12-23-96
222222222  Grocery    3      100     06-22-96
333333333  Mall       1      30      11-05-96

(b) A joined table from the Person and Purchase relations

name   age  sex  ptype  amount  location
Johny  43   M    1      400     Mall
Johny  43   M    2      70      Grocery
Johny  43   M    3      200     Warehouse
Andy   21   F    2      300     Mall
Andy   21   F    3      100     Grocery
Post   67   M    1      30      Mall
Jenny  35   F    null   null    null

ptype (payment type): 1 for cash, 2 for credit, and 3 for check. The cardinality ratio between Person and Purchase is 1:n.

Existing Approaches

Applying aggregate functions or generalization operators to convert a multi-valued attribute into a single-valued attribute.

Problems
  The user has to make a critical decision (e.g., which aggregate function to use?).
  Valuable related information may be lost.
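The loss of information can be illustrated with a small sketch. It collapses the purchase amounts of the Person/Purchase example with two aggregate functions; the choice of sum and mean, and the extra customer named "Hypothetical", are assumptions made only for this illustration.

```python
from statistics import mean

# Purchase amounts per person, taken from the Purchase relation of the example.
amounts = {
    "Johny": [400, 70, 200],
    "Andy": [300, 100],
    "Post": [30],
}

# A made-up customer (not in the example) with one single large purchase.
amounts["Hypothetical"] = [670]


def aggregate(values, func):
    """Collapse a multi-valued attribute into a single value."""
    return func(values)


totals = {name: aggregate(v, sum) for name, v in amounts.items()}
averages = {name: aggregate(v, mean) for name, v in amounts.items()}

# Under sum, Johny (three purchases) and the hypothetical customer
# (one purchase) become indistinguishable; under mean they differ.
print(totals["Johny"], totals["Hypothetical"])
print(averages["Johny"], averages["Hypothetical"])
```

Which aggregate is appropriate depends on the analysis goal, which is exactly the decision the slide says is pushed onto the user.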

Extended Data Sets

Joined table

name   age  sex  ptype  amount  location
Johny  43   M    1      400     Mall
Johny  43   M    2      70      Grocery
Johny  43   M    3      200     Warehouse
Andy   21   F    2      100     Mall
Andy   21   F    3      100     Grocery
Post   67   M    1      30      Mall
Jenny  35   F    null   null    null

A converted table with a bag of values

name   age  sex  p.ptype  p.amount      p.location
Johny  43   M    {1,2,3}  {400,70,200}  {Mall, Grocery, Warehouse}
Andy   21   F    {2,3}    {100,100}     {Mall, Grocery}
Post   67   M    1        30            Mall
Jenny  35   F    null     null          null

How to measure similarity between bags of values? Group similarity measures are needed.
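The conversion from a joined table to a table with bags of values can be sketched as follows. This is a minimal illustration, not the authors' tool: the function name `to_extended` and the convention that the first three columns are the single-valued attributes are assumptions.

```python
# Rows of the joined table: (name, age, sex, ptype, amount, location).
# None stands for null.
joined = [
    ("Johny", 43, "M", 1, 400, "Mall"),
    ("Johny", 43, "M", 2, 70, "Grocery"),
    ("Johny", 43, "M", 3, 200, "Warehouse"),
    ("Andy", 21, "F", 2, 100, "Mall"),
    ("Andy", 21, "F", 3, 100, "Grocery"),
    ("Post", 67, "M", 1, 30, "Mall"),
    ("Jenny", 35, "F", None, None, None),
]


def to_extended(rows, key_len=3):
    """Group rows by their single-valued attributes; collect the rest as bags."""
    grouped = {}
    for row in rows:
        key, rest = row[:key_len], row[key_len:]
        bags = grouped.setdefault(key, [[] for _ in rest])
        for bag, value in zip(bags, rest):
            if value is not None:  # a null contributes nothing to a bag
                bag.append(value)
    return grouped


extended = to_extended(joined)
for key, bags in extended.items():
    print(key, bags)
```

Lists are used for the bags so that duplicates such as the two amounts of 100 for Andy are preserved; a Python set would silently drop one of them.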

Approaches for Database Clustering

<Current approach>
Structured database -> Manual transformation -> Flat file -> Clustering algorithms

<Proposed approach>
Structured database -> Automated preprocessing -> Extended data set -> Generalized clustering algorithms

Related Work

LABYRINTH (Thompson et al.)

Ketterlin's extended COBWEB

KATE (Manago et al.)

SUBDUE (Holder et al.)

INLEN (Ribeiro et al.)

KBG (Bisson et al.), KLUSTER (Kietz et al.)

Research Objectives for Database Clustering

To alleviate the representational gap between databases on the one hand and the input formats of clustering algorithms on the other

To design and implement semi-automatic tools to facilitate database clustering

To generalize clustering algorithms

Generating Extended Data Sets From a Structured Database

Figure: Database (d1, d2, …, dn) + User's interests and objectives -> Extended data set generator -> Extended data set 1

A Unified Similarity Measure for Clustering Extended Data Sets

Group Similarity Measures

Mixed types: qualitative and quantitative types.

Qualitative type: Tversky's set-theoretical similarity models.

Contrast model
  $S(a,b) = \theta f(A \cap B) - \alpha f(A - B) - \beta f(B - A)$
  where a and b are two objects, A and B denote their sets of features, $\theta, \alpha, \beta \ge 0$, and f is the cardinality of the set.

Ratio model (e.g., normalized similarity)
  $S(a,b) = f(A \cap B) / [f(A \cap B) + \alpha f(A - B) + \beta f(B - A)]$, with $\alpha, \beta \ge 0$
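Tversky's two models can be written down directly, taking f to be set cardinality as stated on the slide. The sketch below is only an illustration: the parameter names `theta`, `alpha`, and `beta` follow Tversky's usual notation, the default weights of 1 are an assumption, and returning 0 when both sets are empty is a convention chosen here.

```python
def contrast(A, B, theta=1.0, alpha=1.0, beta=1.0):
    """Tversky contrast model with f = set cardinality."""
    A, B = set(A), set(B)
    return theta * len(A & B) - alpha * len(A - B) - beta * len(B - A)


def ratio(A, B, alpha=1.0, beta=1.0):
    """Tversky ratio model with f = set cardinality; the result lies in [0, 1]."""
    A, B = set(A), set(B)
    common = len(A & B)
    denom = common + alpha * len(A - B) + beta * len(B - A)
    return common / denom if denom else 0.0  # convention for two empty sets


# Purchase locations of Johny and Andy from the example table.
johny = {"Mall", "Grocery", "Warehouse"}
andy = {"Mall", "Grocery"}

print(contrast(johny, andy))  # 2 shared, 1 only in johny, 0 only in andy
print(ratio(johny, andy))
```

With alpha = beta = 1 the ratio model reduces to the Jaccard coefficient, which makes it a convenient default for set-valued attributes.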

Group Similarity Measures... continued

Quantitative type: group average

Group average between groups A and B
  $d(A,B) = \frac{1}{n} \sum_{i=1}^{n} d(a,b)_i$
  where n is the total number of object pairs and $d(a,b)_i$ is the dissimilarity measure for the i-th pair of objects a and b, with $a \in A$, $b \in B$.

The measure is obtained by taking the average of all the inter-object measures over those pairs of objects in which the two objects belong to different groups.
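A minimal sketch of the group average follows. The absolute difference is used as the pairwise dissimilarity d(a,b); that choice is an assumption for illustration, and any other dissimilarity function can be passed in.

```python
from itertools import product


def group_average(A, B, d=lambda a, b: abs(a - b)):
    """Average pairwise dissimilarity over all pairs (a, b), a in A, b in B."""
    pairs = list(product(A, B))  # n = |A| * |B| object pairs
    if not pairs:
        raise ValueError("both groups must be non-empty")
    return sum(d(a, b) for a, b in pairs) / len(pairs)


# Bags of purchase amounts from the converted example table.
johny_amounts = [400, 70, 200]
andy_amounts = [100, 100]

print(group_average(johny_amounts, andy_amounts))
```

Because every pair is counted, duplicate values in a bag influence the result, which is the behavior one wants for bags rather than sets.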

A Framework for Mixed Type Similarity Measures for Extended Data Sets

Gower's similarity measure for data sets with mixed types:
  $S(a,b) = \sum_{i=1}^{m} w_i \, s_i(a_i,b_i) \Big/ \sum_{i=1}^{m} w_i$

Extended similarity measure for multi-valued data sets with mixed types:
  $S(a,b) = \Big[ \sum_{i=1}^{l} w_i \, s_l(a_i,b_i) + \sum_{j=1}^{q} w_j \, s_q(a_j,b_j) \Big] \Big/ \Big( \sum_{i=1}^{l} w_i + \sum_{j=1}^{q} w_j \Big)$

where m = l + q. The functions $s_l(a,b)$ and $s_q(a,b)$ are similarity functions for qualitative attributes and quantitative attributes, respectively.
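The weighted combination can be sketched as below. Several choices here are assumptions, not taken from the slides: the qualitative similarity is the ratio model with unit weights, the quantitative similarity is one minus the range-normalized group average (in the spirit of Gower's range normalization), and nulls or empty bags are not handled.

```python
from itertools import product


def s_qualitative(A, B):
    """Ratio model with alpha = beta = 1 on the sets of values."""
    A, B = set(A), set(B)
    union = len(A | B)
    return len(A & B) / union if union else 0.0


def s_quantitative(A, B, value_range):
    """1 minus the range-normalized group-average distance (an assumption)."""
    pairs = list(product(A, B))
    avg = sum(abs(x - y) for x, y in pairs) / len(pairs)
    return 1.0 - avg / value_range


def extended_similarity(a, b, qual_attrs, quant_attrs, weights, ranges):
    """Weighted average of per-attribute similarities over m = l + q attributes."""
    num = sum(weights[k] * s_qualitative(a[k], b[k]) for k in qual_attrs)
    num += sum(weights[k] * s_quantitative(a[k], b[k], ranges[k])
               for k in quant_attrs)
    den = sum(weights[k] for k in qual_attrs) + sum(weights[k] for k in quant_attrs)
    return num / den


johny = {"location": ["Mall", "Grocery", "Warehouse"], "amount": [400, 70, 200]}
andy = {"location": ["Mall", "Grocery"], "amount": [100, 100]}

s = extended_similarity(
    johny, andy,
    qual_attrs=["location"], quant_attrs=["amount"],
    weights={"location": 1.0, "amount": 1.0},
    ranges={"amount": 370},  # 400 - 30, the spread of amounts in the example
)
print(s)
```

Keeping every per-attribute similarity in [0, 1] means the weights alone control how much each attribute contributes to the overall score.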

Clustering Algorithms for Extended Data Sets

Nearest-neighbor clustering
DBSCAN
Leader algorithm
Hierarchical clustering
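Similarity-based algorithms such as these only need a pairwise similarity function, so they carry over to set- or bag-valued objects. As an illustration, here is a minimal sketch of the leader algorithm in its common textbook form; the threshold value and the use of the Jaccard coefficient are assumptions.

```python
def jaccard(A, B):
    """Similarity between two sets of values."""
    union = len(A | B)
    return len(A & B) / union if union else 0.0


def leader_clustering(objects, similarity, threshold):
    """One pass: join the first cluster whose leader is similar enough,
    otherwise start a new cluster with this object as its leader."""
    leaders, clusters = [], []
    for obj in objects:
        for idx, leader in enumerate(leaders):
            if similarity(obj, leader) >= threshold:
                clusters[idx].append(obj)
                break
        else:  # no leader was similar enough
            leaders.append(obj)
            clusters.append([obj])
    return clusters


# Set-valued objects: purchase locations of three customers.
objects = [
    {"Mall", "Grocery", "Warehouse"},
    {"Mall", "Grocery"},
    {"Mall"},
]

clusters = leader_clustering(objects, jaccard, threshold=0.5)
print(clusters)
```

The result depends on the order in which objects are presented, a known property of the leader algorithm.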

Database Clustering Environment

Figure: database clustering environment.
Components: User Interface, DBMS, Data Extraction Tool, Similarity Measure Tool, Clustering Tool, Library of similarity measures, Library of clustering algorithms
Data and information: Extended data set, Similarity measure, Type and weight information, Default choice and domain information, A set of clusters

A More Detailed Tool Architecture

Figure: detailed tool architecture.
Components: Pre-processor, Form translator, DBMS, Our data mining tools, Other data mining tools
Inputs: User's interests and objectives; Join form (Database name, Relationship definitions, Data set of interest, Selected attributes, Other information)
Data: Query, Query result, Processed data, Extended data set, Flat file

A Join Template Form

Begin-spec
  Database-name: DB;
  Link-definitions: Link-list;
  Begin-join
    Dataset-of-interest: Dsetintrest;
    Selected-attributes: Attr-list;
    Objective-attributes: Obj-attr-list;
    Extended-data-set: E;
  End-join
End-spec

An Example of the Interface of the Extended Data Set Generation Tool

Begin-spec
  DB-name: Company
  Link-definitions:
    superv(Employee.ssn, Employee.superssn),
    husband(Employee.ssn, Marriage.hssn),
    wife(Employee.ssn, Marriage.wssn),
    ehusband(Marriage.hssn, Employee.ssn),
    ewife(Marriage.wssn, Employee.ssn),
    works_on(Employee.ssn, Works_on.essn),
    project(Works_on.pno, Project.pnum),
    works_for(Employee.dno, Department.dnum),
    works_loc(Department.dnum, Dept_loc.dnum)
  Begin-join
    Dataset-of-interest: Employee
    Selected-attributes: ssn, sex, salary, superv.salary, wife.ewife.salary, works_on.hours, works_on.project.pname, works_for.works_loc.dloc
    Objective-attributes: ssn
    Output-data-set: E1
  End-join
End-spec

Algorithm to Generate Extended Data Sets

1. Project the data set of interest by primary key and selected attributes.

2. Join the data set of interest and the related data sets to get all related attributes for each join path.

3. Group together the attributes that describe the same object.
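The three steps can be sketched on the Person/Purchase example. This is a simplified illustration under stated assumptions: relations are lists of dictionaries, there is a single one-hop join path on `ssn`, and the helper names (`project`, `join_path`, `extended_data_set`) are hypothetical rather than taken from the authors' tool.

```python
person = [
    {"ssn": "111111111", "name": "Johny", "age": 43, "sex": "M"},
    {"ssn": "222222222", "name": "Andy", "age": 21, "sex": "F"},
    {"ssn": "333333333", "name": "Post", "age": 67, "sex": "M"},
    {"ssn": "444444444", "name": "Jenny", "age": 35, "sex": "F"},
]
purchase = [
    {"ssn": "111111111", "location": "Warehouse", "ptype": 1, "amount": 400},
    {"ssn": "111111111", "location": "Grocery", "ptype": 2, "amount": 70},
    {"ssn": "111111111", "location": "Mall", "ptype": 3, "amount": 200},
    {"ssn": "222222222", "location": "Mall", "ptype": 2, "amount": 300},
    {"ssn": "222222222", "location": "Grocery", "ptype": 3, "amount": 100},
    {"ssn": "333333333", "location": "Mall", "ptype": 1, "amount": 30},
]


def project(rows, attrs):
    """Step 1: keep only the primary key and the selected attributes."""
    return [{a: r[a] for a in attrs} for r in rows]


def join_path(base, related, key, attrs):
    """Step 2: for each base object, collect related values along one join path."""
    out = {}
    for obj in base:
        matches = [r for r in related if r[key] == obj[key]]
        out[obj[key]] = {a: [m[a] for m in matches] for a in attrs}
    return out


def extended_data_set(base, related, key, base_attrs, related_attrs):
    projected = project(base, [key] + base_attrs)
    bags = join_path(projected, related, key, related_attrs)
    # Step 3: attach the bags to the object they describe.
    return [dict(obj, **bags[obj[key]]) for obj in projected]


E = extended_data_set(person, purchase, "ssn",
                      ["name", "age", "sex"], ["ptype", "amount", "location"])
for row in E:
    print(row)
```

Each object of interest appears exactly once in the output, so a 1:n relationship no longer duplicates the person's own attributes.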

Summary Representation

Our approach uses database queries as the summary representation language.

A query that computes the objects belonging to a cluster, and no other objects, is considered a perfect summary for that cluster.

An example query for a cluster:

    SELECT ssn, name, address
    FROM person, purchase
    WHERE (person.ssn = purchase.ssn) AND
          (amount_spent > 1000) AND
          (payment_type = 'cash') AND
          (store_name = 'flea-market')

"Typically, members of the cluster have spent more than $1,000 in cash shopping at a flea market."
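The notion of a perfect summary can be checked mechanically: run the query and compare its result with the cluster's members. The sketch below uses Python's built-in sqlite3 module; the table contents are made-up toy rows, the column names are written with underscores so that they are valid SQL identifiers, and the function name `is_perfect_summary` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (ssn TEXT PRIMARY KEY, name TEXT, address TEXT);
CREATE TABLE purchase (ssn TEXT, amount_spent INTEGER,
                       payment_type TEXT, store_name TEXT);
-- Toy rows invented for this illustration.
INSERT INTO person VALUES ('1', 'A', 'x'), ('2', 'B', 'y'), ('3', 'C', 'z');
INSERT INTO purchase VALUES
  ('1', 1500, 'cash', 'flea-market'),
  ('2', 1200, 'cash', 'flea-market'),
  ('3', 2000, 'credit', 'mall');
""")

query = """
SELECT DISTINCT person.ssn
FROM person JOIN purchase ON person.ssn = purchase.ssn
WHERE amount_spent > 1000
  AND payment_type = 'cash'
  AND store_name = 'flea-market'
"""


def is_perfect_summary(conn, query, cluster):
    """True if the query returns exactly the members of the cluster."""
    result = {row[0] for row in conn.execute(query)}
    return result == set(cluster)


print(is_perfect_summary(conn, query, {"1", "2"}))
print(is_perfect_summary(conn, query, {"1", "2", "3"}))
```

A graded score, for example the overlap between the query result and the cluster, would be needed to rank imperfect queries; the exact-match test above covers only the perfect case.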

Summary and Contributions

Discussed the data model discrepancy between the database storage format and the input data format of traditional clustering algorithms

Discussed the problems of dealing with relationship information in database clustering

Presented a different way of representing related information using extended data sets

Introduced the design and architecture of an automatic tool to generate extended data sets from databases

Generalized the traditional similarity measures and presented a framework to cope with extended data sets in similarity-based clustering

Architecture of MASSON

Figure: architecture of MASSON.
Components: user interface, clustering module, DBMS, DB, KB, GP-based discovery system (interface, GP engine)
Data: user input, system input, schema information, object set, domain knowledge, clusters g1, g2, ..., gk, query set, query result, discovered query set
Operations: generate, select, apply, return, evaluate

Evolution Process

Initial generation -> generation 2 -> ... -> generation n
Each evolve step applies selection, crossover, and mutation.

Initial population:  q11, q12, ..., q1m
Evolved population:  q21, q22, ..., q2m
Evolved population:  qn1, qn2, ..., qnm

Solution Q is obtained by selection from the final population.

n: the number of generations
m: the size of the population
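The generational loop can be sketched in a few lines. This is a deliberately simplified evolutionary search, not genetic programming over full query trees: a candidate query is just a pair (minimum amount, payment type), the toy objects and the overlap-based fitness are made up for illustration, and tournament selection and elitism are assumptions not stated on the slide.

```python
import random

# Hypothetical toy objects: (id, amount, payment type).
objects = [(1, 1500, "cash"), (2, 1200, "cash"), (3, 2000, "credit"),
           (4, 300, "cash"), (5, 900, "check")]
cluster = {1, 2}  # the cluster to be summarized
PTYPES = ["cash", "credit", "check"]


def result_of(query):
    min_amount, ptype = query
    return {oid for oid, amount, p in objects if amount > min_amount and p == ptype}


def fitness(query):
    """Overlap between query result and cluster; 1.0 means a perfect summary."""
    res = result_of(query)
    union = res | cluster
    return len(res & cluster) / len(union) if union else 0.0


def random_query(rng):
    return (rng.randrange(0, 2000, 100), rng.choice(PTYPES))


def crossover(q1, q2):
    return (q1[0], q2[1])  # threshold from one parent, payment type from the other


def mutate(query, rng, rate=0.2):
    amount, ptype = query
    if rng.random() < rate:
        amount = max(0, amount + rng.choice([-100, 100]))
    if rng.random() < rate:
        ptype = rng.choice(PTYPES)
    return (amount, ptype)


def evolve(n_generations=20, m=10, seed=0):
    rng = random.Random(seed)
    population = [random_query(rng) for _ in range(m)]
    history = [max(map(fitness, population))]
    for _ in range(n_generations):
        children = [max(population, key=fitness)]  # elitism: keep the best query
        while len(children) < m:
            p1 = max(rng.sample(population, 2), key=fitness)  # tournament selection
            p2 = max(rng.sample(population, 2), key=fitness)
            children.append(mutate(crossover(p1, p2), rng))
        population = children
        history.append(max(map(fitness, population)))
    return max(population, key=fitness), history


solution, history = evolve()
print(solution, fitness(solution))
```

Because the best query of each generation is copied unchanged into the next, the best fitness never decreases from one generation to the next.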
