Database Clustering and Summary Generation Tae-Wan Ryu and Christoph F. Eick


Page 1: Database Clustering and Summary Generation

Database Clustering and Summary Generation

Tae-Wan Ryu and Christoph F. Eick

Page 2: Database Clustering and Summary Generation

Similarity Measures for Multi-valued Attributes for Database Clustering

Tae-wan Ryu and Christoph F. Eick
Department of Computer Science
University of Houston

Talk Organization
  • Database Clustering
  • Problems of Database Clustering
  • Extended Data Sets
  • Similarity Measures for Sets and Bags
  • An Architecture for Database Clustering
  • Summary and Conclusion

Page 3: Database Clustering and Summary Generation

General KDD Steps

Data sources → [Select/preprocess] → Selected/preprocessed data → [Transform] → Transformed data → [Data mine] → Extracted information → [Interpret/Evaluate/Assimilate] → Knowledge

(Figure label: Data preparation)

Page 4: Database Clustering and Summary Generation

Research Goal

To develop methodologies, techniques, and tools to create summaries from databases using cluster analysis and genetic programming.

Our approach
  • Partition the database into groups of similar objects using cluster analysis.
  • Find the commonalities that the objects belonging to each group share using genetic programming.

Page 5: Database Clustering and Summary Generation

Database Summary Generation Steps and Example

< Steps >
Database → Database Clustering → Groups of similar objects → Summary Generation → Summaries describing the commonalities within each group

< Example >
Restaurant database → Clusters
Labels shown for the example clusters: Young, White collar, Retired; Midnight, Dinner, Lunch

Page 6: Database Clustering and Summary Generation

An Example Schema Diagram

Relations and attributes
  • Marriage (hssn, wssn, mdate, numkid)
  • Employee (name, ssn, address, sex, salary, superssn, dno)
  • Department (dnum, dname)
  • Dept_loc (dnum, dloc)
  • Project (pname, pnum, ploc, dnum)
  • Works_on (essn, pno, hours)

Relationships and cardinality ratios, as labeled in the diagram
  • husband, 1:n; wife, 1:n
  • ehusband, n:1; ewife, n:1
  • superv, n:1
  • works_for, n:1; works_loc, 1:n
  • control, 1:n
  • works_on, 1:n; project, n:1

Page 7: Database Clustering and Summary Generation

Preprocessing for Database Clustering

Preparing input data sets for clustering
  • Selecting and preparing appropriate data from a database is an important task.

Key Problems
  • How to support a user’s viewpoint, including attribute selection
  • The data model discrepancy between the storage format and the input format that clustering algorithms assume
  • How to cope with structural information, especially 1:n and n:m relationships

Page 8: Database Clustering and Summary Generation

Input Format for Data Mining Algorithms

Data Format for Input Data Sets
  • Single flat file format (basically, the data set has to be stored as a single(!) relation)
  • Complex and structured formats

Problem: Almost all existing data mining and clustering approaches assume that the input data set is in single flat file format.

Page 9: Database Clustering and Summary Generation

An Example Database to Illustrate the Problems with Relationship Information in Database Clustering

(a) An example personal relational database

Person
ssn        name   age  sex
111111111  Johny  43   M
222222222  Andy   21   F
333333333  Post   67   M
444444444  Jenny  35   F

Purchase
ssn        location   ptype  amount  date
111111111  Warehouse  1      400     02-10-96
111111111  Grocery    2      70      05-14-96
111111111  Mall       3      200     12-24-96
222222222  Mall       2      300     12-23-96
222222222  Grocery    3      100     06-22-96
333333333  Mall       1      30      11-05-96

(b) The joined table from the Person and Purchase relations

Joined result
name   age  sex  ptype  amount  location
Johny  43   M    1      400     Warehouse
Johny  43   M    2      70      Grocery
Johny  43   M    3      200     Mall
Andy   21   F    2      300     Mall
Andy   21   F    3      100     Grocery
Post   67   M    1      30      Mall
Jenny  35   F    null   null    null

ptype (payment type): 1 for cash, 2 for credit, and 3 for check. The cardinality ratio is 1:n.

Page 10: Database Clustering and Summary Generation

Existing Approaches

Apply aggregate functions or generalization operators to convert a multi-valued attribute into a single-valued attribute.

Problems
  • The user has to make a critical decision (e.g., which aggregate function to use?).
  • Valuable related information may be lost.
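To make the information-loss problem concrete, here is a minimal sketch that collapses each person's purchase amounts (taken from the Purchase relation in the talk's example) with three aggregate functions. The helper `aggregate` and the extra bag `[335, 335]` are hypothetical and only serve the illustration.

```python
from statistics import mean

# Purchase amounts per person, from the Purchase relation of the example database.
purchases = {
    "Johny": [400, 70, 200],
    "Andy": [300, 100],
    "Post": [30],
    "Jenny": [],  # no purchases: the multi-valued attribute is empty
}

def aggregate(values, func):
    """Collapse a bag of values into a single value; None stands in for null."""
    return func(values) if values else None

# Each choice of aggregate function yields a different single-valued attribute.
flat_sum = {name: aggregate(v, sum) for name, v in purchases.items()}
flat_avg = {name: aggregate(v, mean) for name, v in purchases.items()}
flat_max = {name: aggregate(v, max) for name, v in purchases.items()}

# Information loss: a hypothetical bag [335, 335] is indistinguishable from
# Johny's bag [400, 70, 200] once both are replaced by their sum.
same_after_sum = aggregate([335, 335], sum) == flat_sum["Johny"]
```

The three dictionaries disagree on how similar two people look, which is the "critical decision" the slide refers to, and `same_after_sum` shows two different purchase histories becoming identical after aggregation.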

Page 11: Database Clustering and Summary Generation

Extended Data Sets

The joined table
name   age  sex  ptype  amount  location
Johny  43   M    1      400     Warehouse
Johny  43   M    2      70      Grocery
Johny  43   M    3      200     Mall
Andy   21   F    2      100     Mall
Andy   21   F    3      100     Grocery
Post   67   M    1      30      Mall
Jenny  35   F    null   null    null

A converted table with a bag of values
name   age  sex  p.ptype  p.amount      p.location
Johny  43   M    {1,2,3}  {400,70,200}  {Warehouse, Grocery, Mall}
Andy   21   F    {2,3}    {100,100}     {Mall, Grocery}
Post   67   M    1        30            Mall
Jenny  35   F    null     null          null

Group similarity measures are needed: how should the similarity between bags of values be measured?

Page 12: Database Clustering and Summary Generation

Approaches for Database Clustering

<Current approach>
Structured database → Manual transformation → Flat file → Clustering algorithms

<Proposed approach>
Structured database → Automated preprocessing → Extended data set → Generalized clustering algorithms

Page 13: Database Clustering and Summary Generation

Related Work
  • LABYRINTH (Thompson et al.)
  • Ketterlin’s extended COBWEB
  • KATE (Manago et al.)
  • SUBDUE (Holder et al.)
  • INLEN (Ribeiro et al.)
  • KBG (Bisson et al.), KLUSTER (Kietz et al.)

Page 14: Database Clustering and Summary Generation

Research Objectives for Database Clustering
  • To narrow the representational gap between databases on the one hand and the input formats of clustering algorithms on the other
  • To design and implement semi-automatic tools to facilitate database clustering
  • To generalize clustering algorithms

Page 15: Database Clustering and Summary Generation

Generating Extended Data Sets From a Structured Database

Database (d1, d2, …, dn) + User’s interests and objectives → Extended data set generator → Extended data set 1

Page 16: Database Clustering and Summary Generation

A Unified Similarity Measure for Clustering Extended Data Sets

Group Similarity Measures
  • Mixed types: qualitative and quantitative types.
  • Qualitative type: Tversky’s set-theoretical similarity models.

Contrast model:
$$S(a,b) = \theta f(A \cap B) - \alpha f(A - B) - \beta f(B - A),$$
where a and b are two objects, A and B denote their sets of features, and $\theta, \alpha, \beta \ge 0$; f is the cardinality of the set.

Ratio model (e.g., normalized similarity):
$$S(a,b) = \frac{f(A \cap B)}{f(A \cap B) + \alpha f(A - B) + \beta f(B - A)}, \quad \alpha, \beta \ge 0$$
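The two Tversky models can be sketched in a few lines, with f taken as set cardinality as stated on the slide. The function names, the default weights of 1, and the convention that two empty feature sets have ratio similarity 1 are assumptions made for this illustration, not choices taken from the talk.

```python
def contrast_similarity(A, B, theta=1.0, alpha=1.0, beta=1.0):
    """Tversky contrast model: theta*f(A∩B) - alpha*f(A-B) - beta*f(B-A)."""
    A, B = set(A), set(B)
    return theta * len(A & B) - alpha * len(A - B) - beta * len(B - A)

def ratio_similarity(A, B, alpha=1.0, beta=1.0):
    """Tversky ratio model: f(A∩B) / [f(A∩B) + alpha*f(A-B) + beta*f(B-A)]."""
    A, B = set(A), set(B)
    common = len(A & B)
    denom = common + alpha * len(A - B) + beta * len(B - A)
    # Assumption: two empty feature sets are treated as identical.
    return common / denom if denom else 1.0

# Payment types of Johny and Andy from the converted table.
johny_ptype = {1, 2, 3}
andy_ptype = {2, 3}
contrast = contrast_similarity(johny_ptype, andy_ptype)
ratio = ratio_similarity(johny_ptype, andy_ptype)
```

With alpha = beta = 1 the ratio model reduces to the Jaccard coefficient, which keeps the value in [0, 1]. Note that this sketch works on sets, so duplicate values in a bag are ignored.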

Page 17: Database Clustering and Summary Generation

Group Similarity Measures... continued

Quantitative type: group average

The group average between groups A and B is obtained by averaging the inter-object measures over all pairs of objects in which the two objects of a pair come from different groups:
$$d(A,B) = \frac{1}{n} \sum_{i=1}^{n} d(a,b)_i,$$
where n is the total number of object pairs and $d(a,b)_i$ is the dissimilarity measure for the ith pair of objects a and b, $a \in A$, $b \in B$.
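A minimal sketch of the group average, assuming the absolute difference as the pairwise dissimilarity d(a,b); the talk does not fix a particular d, so that default and the function name are illustrative only.

```python
from itertools import product

def group_average(A, B, d=lambda a, b: abs(a - b)):
    """Average of d(a, b) over all pairs with a in A and b in B."""
    pairs = list(product(A, B))  # n = |A| * |B| object pairs
    if not pairs:
        raise ValueError("both groups must be non-empty")
    return sum(d(a, b) for a, b in pairs) / len(pairs)

# Purchase amounts of Johny and Andy from the Purchase relation.
johny_amounts = [400, 70, 200]
andy_amounts = [300, 100]
avg_distance = group_average(johny_amounts, andy_amounts)
```

Because every value in one bag is compared with every value in the other, duplicates in a bag contribute to the average, unlike in the set-based models for qualitative attributes.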

Page 18: Database Clustering and Summary Generation

A Framework for Mixed Type Similarity Measures for Extended Data Sets

Gower’s similarity measure for data sets with mixed types:
$$S(a,b) = \sum_{i=1}^{m} w_i\, s_i(a_i,b_i) \Big/ \sum_{i=1}^{m} w_i$$

Extended similarity measure for multi-valued data sets with mixed types:
$$S(a,b) = \Big[\sum_{i=1}^{l} w_i\, s_l(a_i,b_i) + \sum_{j=1}^{q} w_j\, s_q(a_j,b_j)\Big] \Big/ \Big(\sum_{i=1}^{l} w_i + \sum_{j=1}^{q} w_j\Big),$$
where m = l + q. The functions $s_l(a,b)$ and $s_q(a,b)$ are similarity functions for qualitative attributes and quantitative attributes, respectively.
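The extended measure is a weighted average over the qualitative and quantitative attributes, which the following sketch spells out. Several pieces are assumptions for illustration: the ratio model with alpha = beta = 1 as the qualitative similarity, a range-normalized group average turned into a similarity as 1 - d/range for quantitative attributes, the attribute range of 370 (max minus min amount in the example data), and all function names. Null (empty) attribute values are not handled here.

```python
from itertools import product

def s_qualitative(A, B):
    """Ratio-model similarity with alpha = beta = 1 (an assumed choice)."""
    A, B = set(A), set(B)
    union = A | B
    return len(A & B) / len(union) if union else 1.0

def s_quantitative(A, B, value_range):
    """Assumed conversion: 1 - (group-average distance / attribute range)."""
    pairs = list(product(A, B))
    avg = sum(abs(a - b) for a, b in pairs) / len(pairs)
    return 1.0 - avg / value_range

def extended_similarity(a, b, qualitative, quantitative, ranges):
    """qualitative and quantitative map attribute names to weights w_i and w_j."""
    numerator = 0.0
    denominator = 0.0
    for attr, w in qualitative.items():
        numerator += w * s_qualitative(a[attr], b[attr])
        denominator += w
    for attr, w in quantitative.items():
        numerator += w * s_quantitative(a[attr], b[attr], ranges[attr])
        denominator += w
    return numerator / denominator

johny = {"ptype": {1, 2, 3}, "amount": [400, 70, 200]}
andy = {"ptype": {2, 3}, "amount": [300, 100]}
sim = extended_similarity(
    johny, andy,
    qualitative={"ptype": 1.0},
    quantitative={"amount": 1.0},
    ranges={"amount": 370},  # assumed range: 400 - 30 in the example data
)
```

Keeping the per-attribute similarities in [0, 1] is what makes the weighted average meaningful across mixed types; other normalizations would work equally well in this framework.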

Page 19: Database Clustering and Summary Generation

Clustering Algorithms for Extended Data Sets
  • Nearest-neighbor clustering
  • DBSCAN
  • Leader algorithm
  • Hierarchical clustering
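Because these algorithms only need a pairwise similarity, they can operate on extended data sets once a group similarity measure is available. As one illustration, here is a sketch of a common formulation of the leader algorithm, in which each object joins the first leader it is sufficiently similar to and otherwise becomes a new leader. The Jaccard similarity, the threshold of 0.5, and the function names are assumptions for the example rather than settings from the talk.

```python
def jaccard(A, B):
    """Set similarity used here as a stand-in for a group similarity measure."""
    union = A | B
    return len(A & B) / len(union) if union else 1.0

def leader_clustering(objects, similarity, threshold):
    """objects: list of (name, features). Returns clusters as lists of names."""
    leaders = []   # features of each cluster's leader
    clusters = []  # member names, parallel to leaders
    for name, features in objects:
        for idx, leader in enumerate(leaders):
            if similarity(features, leader) >= threshold:
                clusters[idx].append(name)
                break
        else:
            # No existing leader is similar enough: start a new cluster.
            leaders.append(features)
            clusters.append([name])
    return clusters

# Purchase locations per person, as sets of values.
objects = [
    ("Johny", {"Warehouse", "Grocery", "Mall"}),
    ("Andy", {"Mall", "Grocery"}),
    ("Post", {"Mall"}),
]
clusters = leader_clustering(objects, jaccard, threshold=0.5)
```

The result depends on the order in which objects are presented, which is a known property of the leader algorithm.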

Page 20: Database Clustering and Summary Generation

Database Clustering Environment

Components shown in the architecture figure:
  • User Interface
  • DBMS
  • Data Extraction Tool
  • Similarity Measure Tool, with a library of similarity measures
  • Clustering Tool, with a library of clustering algorithms

Labeled data flows: extended data set; default choice and domain information; type and weight information; similarity measure; a set of clusters

Page 21: Database Clustering and Summary Generation

A More Detailed Tool Architecture

Components shown in the figure: Pre-processor, Form translator, DBMS, our data mining tools, other data mining tools

User's interests and objectives: database name, relationship definitions, data set of interest, selected attributes, other information

Labeled data flows: join form; query; query result; processed data; extended data set; flat file

Page 22: Database Clustering and Summary Generation

A Join Template Form

Begin-spec
  Database-name: DB;
  Link-definitions: Link-list;
  Begin-join
    Dataset-of-interest: Dsetintrest;
    Selected-attributes: Attr-list;
    Objective-attributes: Obj-attr-list;
    Extended-data-set: E;
  End-join
End-spec

Page 23: Database Clustering and Summary Generation

An Example of the Interface of the Extended Data Set Generation Tool

Begin-spec
  DB-name: Company
  Link-definitions:
    superv(Employee.ssn, Employee.superssn),
    husband(Employee.ssn, Marriage.hssn),
    wife(Employee.ssn, Marriage.wssn),
    ehusband(Marriage.hssn, Employee.ssn),
    ewife(Marriage.wssn, Employee.ssn),
    works_on(Employee.ssn, Works_on.essn),
    project(Works_on.pno, Project.pnum),
    works_for(Employee.dno, Department.dnum),
    works_loc(Department.dnum, Dept_loc.dnum)
  Begin-join
    Dataset-of-interest: Employee
    Selected-attributes: ssn, sex, salary, superv.salary, wife.ewife.salary, works_on.hours, works_on.project.pname, works_for.works_loc.dloc
    Objective-attributes: ssn
    Output-data-set: E1
  End-join
End-spec

Page 24: Database Clustering and Summary Generation

Algorithm to Generate Extended Data Sets
  1. Project the data set of interest onto its primary key and the selected attributes.
  2. Join the data set of interest with the related data sets to get all related attributes for each join path.
  3. Group together the attributes that describe the same object.
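The three steps can be sketched on the Person and Purchase relations from the talk's example. This is a simplified illustration under stated assumptions: it follows a single one-step join path, represents bags as Python lists and null as None, and uses a hypothetical function name and signature. Multi-step paths such as works_on.project.pname are not covered.

```python
from collections import defaultdict

# Person and Purchase relations from the example database (date column omitted).
PERSON = [
    {"ssn": "111111111", "name": "Johny", "age": 43, "sex": "M"},
    {"ssn": "222222222", "name": "Andy", "age": 21, "sex": "F"},
    {"ssn": "333333333", "name": "Post", "age": 67, "sex": "M"},
    {"ssn": "444444444", "name": "Jenny", "age": 35, "sex": "F"},
]
PURCHASE = [
    {"ssn": "111111111", "location": "Warehouse", "ptype": 1, "amount": 400},
    {"ssn": "111111111", "location": "Grocery", "ptype": 2, "amount": 70},
    {"ssn": "111111111", "location": "Mall", "ptype": 3, "amount": 200},
    {"ssn": "222222222", "location": "Mall", "ptype": 2, "amount": 300},
    {"ssn": "222222222", "location": "Grocery", "ptype": 3, "amount": 100},
    {"ssn": "333333333", "location": "Mall", "ptype": 1, "amount": 30},
]

def generate_extended_data_set(base, key, selected, related, foreign_key, related_attrs):
    # Step 1: project the data set of interest onto its key and selected attributes.
    projected = {row[key]: {a: row[a] for a in [key] + selected} for row in base}

    # Step 2: follow the join path and collect the related values for each key.
    bags = defaultdict(lambda: defaultdict(list))
    for row in related:
        if row[foreign_key] in projected:
            for attr in related_attrs:
                bags[row[foreign_key]][attr].append(row[attr])

    # Step 3: group the related values with the object they describe.
    extended = []
    for k, obj in projected.items():
        for attr in related_attrs:
            obj[attr] = list(bags[k][attr]) if k in bags else None
        extended.append(obj)
    return extended

E = generate_extended_data_set(
    PERSON, "ssn", ["name", "age", "sex"],
    PURCHASE, "ssn", ["ptype", "amount", "location"],
)
```

Each object of interest ends up as exactly one row, so a person with several purchases is no longer over-represented the way they are in the joined table.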

Page 25: Database Clustering and Summary Generation

Summary Representation

Our approach uses database queries as the summary representation language.

A query that computes exactly the objects belonging to a cluster, and no other objects, is considered a perfect summary for that cluster.

An example query for a cluster:
  (SELECT ssn, name, address
   FROM person, purchase
   WHERE (amount-spent > 1000) and
         (payment-type = 'cash') and
         (store-name = 'flea-market'))

“Typically, members of the cluster have spent more than $1,000 in cash shopping at a flea market.”
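The notion of a perfect summary can be checked by comparing the set of objects a query returns with the members of the cluster. In the sketch below, the `perfect` flag follows the definition above, while the precision and recall figures are an added assumption about how an imperfect query might be scored; the talk itself only defines the perfect case. The ssn values are hypothetical.

```python
def evaluate_summary(query_result, cluster):
    """Compare the objects returned by a query with the members of a cluster."""
    q, c = set(query_result), set(cluster)
    hits = len(q & c)
    return {
        # Fraction of returned objects that belong to the cluster.
        "precision": hits / len(q) if q else 0.0,
        # Fraction of cluster members that the query returns.
        "recall": hits / len(c) if c else 0.0,
        # Perfect summary: exactly the cluster members and no other objects.
        "perfect": q == c,
    }

# Hypothetical ssn values for a cluster and for two candidate queries.
cluster = {"111111111", "222222222", "333333333"}
exact = evaluate_summary({"111111111", "222222222", "333333333"}, cluster)
too_broad = evaluate_summary(
    {"111111111", "222222222", "333333333", "444444444"}, cluster
)
```

A query that is too broad keeps full recall but loses precision, while a query that is too narrow does the opposite; only a query with both equal to 1 is a perfect summary.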

Page 26: Database Clustering and Summary Generation

Summary and Contributions
  • Discussed the data model discrepancy between the database storage format and the input data format for traditional clustering algorithms
  • Discussed the problems of dealing with relationship information in database clustering
  • Presented a different way of representing related information using extended data sets
  • Introduced the design and architecture of automatic tools to generate extended data sets from databases
  • Generalized the traditional similarity measures and presented a framework to cope with extended data sets in similarity-based clustering

Page 27: Database Clustering and Summary Generation

Architecture of MASSON

Components shown in the figure: user interface, DBMS, DB, KB, Clustering module, and the GP based discovery system (Interface and GP engine)

Data labels: schema information; object set; domain knowledge; query set; query result; discovered query set; clusters g1, g2, ..., gk

Operation labels: generate, select, apply, return, evaluate

Legend: user input, system input

Page 28: Database Clustering and Summary Generation

Evolution Process

Initial generation → generation 2 → … → generation n → Solution Q

  • Initial population: q11, q12, .., q1m
  • Evolved population (generation 2): q21, q22, .., q2m
  • Evolved population (generation n): qn1, qn2, .., qnm

Each generation evolves into the next through selection, crossover, and mutation; the solution Q is obtained from the final generation by selection.

n: the number of generations; m: the size of the population
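The generational loop can be illustrated with a deliberately simplified sketch. It is a genetic-algorithm-style search over fixed-form queries of the shape "amount > threshold AND payment type = p", not genuine genetic programming over query trees as in the talk. The objects, the Jaccard-based fitness, tournament selection, elitism, and every parameter value are assumptions chosen for the example.

```python
import random

# Hypothetical objects (id, amount_spent, payment_type); ids 1-3 form the cluster.
OBJECTS = [(1, 1500, "cash"), (2, 1200, "cash"), (3, 2000, "cash"),
           (4, 300, "cash"), (5, 1800, "credit"), (6, 50, "check")]
CLUSTER = {1, 2, 3}
PTYPES = ["cash", "credit", "check"]

def run_query(q):
    """A query is (min_amount, ptype): amount > min_amount AND type = ptype."""
    min_amount, ptype = q
    return {oid for oid, amount, p in OBJECTS if amount > min_amount and p == ptype}

def fitness(q):
    """Jaccard overlap between the query result and the cluster (1.0 = perfect)."""
    result = run_query(q)
    return len(result & CLUSTER) / len(result | CLUSTER)

def evolve(generations=30, size=20, seed=0):
    rng = random.Random(seed)
    population = [(rng.randrange(0, 2500), rng.choice(PTYPES)) for _ in range(size)]
    for _ in range(generations):
        next_gen = [max(population, key=fitness)]  # elitism: keep the best query
        while len(next_gen) < size:
            # Selection: tournament of three, once for each parent.
            p1 = max(rng.sample(population, 3), key=fitness)
            p2 = max(rng.sample(population, 3), key=fitness)
            child = (p1[0], p2[1])  # crossover: threshold from p1, type from p2
            if rng.random() < 0.3:  # mutation: perturb threshold, redraw type
                child = (max(0, child[0] + rng.randint(-300, 300)), rng.choice(PTYPES))
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)  # final selection of the solution Q

solution = evolve()
```

Because the best query of each generation is carried over unchanged, the fitness of the returned solution can never be lower than the best fitness in the initial population.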
