Database Clustering and Summary Generation
Tae-Wan Ryu and Christoph F. Eick

Similarity Measures for Multi-valued Attributes for Database Clustering
Tae-Wan Ryu and Christoph F. Eick
Department of Computer Science
University of Houston

Talk Organization
- Database Clustering
- Problems of Database Clustering
- Extended Data Sets
- Similarity Measures for Sets and Bags
- An Architecture for Database Clustering
- Summary and Conclusion
General KDD Steps
[Figure labels]
Steps: Select/preprocess -> Transform -> Data mine -> Interpret/Evaluate/Assimilate
Data at each stage: Data sources -> Selected/Preprocessed data -> Transformed data -> Extracted information -> Knowledge
Stage label: Data preparation
Research Goal
To develop methodologies, techniques, and tools to create summaries from databases using cluster analysis and genetic programming.

Our approach
- Partition the database into groups of similar objects using cluster analysis
- Find commonalities that the objects belonging to each group share using genetic programming
Database Summary Generation Steps and Example
[Figure labels]
Steps: Database -> Database Clustering -> Groups of similar objects -> Summary Generation -> Summaries describing the commonalities within each group
Example: Restaurant database; clusters Midnight, Dinner, Lunch; descriptive labels Young, White collar, Retired
An Example Schema Diagram
[Figure labels]
Relations and attributes:
  Marriage(hssn, wssn, mdate, numkid)
  Employee(name, ssn, address, sex, salary, superssn, dno)
  Department(dnum, dname)
  Dept_loc(dnum, dloc)
  Project(pname, pnum, ploc, dnum)
  Works_on(essn, pno, hours)
Relationships with cardinality ratios:
  husband, 1:n; wife, 1:n; ehusband, n:1; ewife, n:1; superv, n:1;
  works_for, n:1; works_loc, 1:n; control, 1:n; works_on, 1:n; project, n:1
Preprocessing for Database Clustering

Preparing input data sets for clustering
- Appropriate data selection and preparation from a database is an important task

Key problems
- How to support a user’s viewpoint, including attribute selection
- Data model discrepancy between the storage format and the input format that clustering algorithms assume
- How to cope with structural information, especially 1:n and n:m relationships
Input Format for Data Mining Algorithms

Data formats for input data sets
- Single flat file format (basically, the data set has to be stored as a single(!) relation)
- Complex and structured formats

Problem: Almost all existing data mining and clustering approaches assume that the input data set is in single flat file format.
An Example Database to Illustrate the Problems with Relationship Information in Database Clustering

(a) An example Personal relational database

Person
ssn        name   age  sex
111111111  Johny  43   M
222222222  Andy   21   F
333333333  Post   67   M
444444444  Jenny  35   F

Purchase
ssn        location   ptype  amount  date
111111111  Warehouse  1      400     02-10-96
111111111  Grocery    2      70      05-14-96
111111111  Mall       3      200     12-24-96
222222222  Mall       2      300     12-23-96
222222222  Grocery    3      100     06-22-96
333333333  Mall       1      30      11-05-96

(b) A joined table from the Person and Purchase relations

Joined result
name   age  sex  ptype  amount  location
Johny  43   M    1      400     Warehouse
Johny  43   M    2      70      Grocery
Johny  43   M    3      200     Mall
Andy   21   F    2      300     Mall
Andy   21   F    3      100     Grocery
Post   67   M    1      30      Mall
Jenny  35   F    null   null    null

ptype (payment type): 1 for cash, 2 for credit, and 3 for check. The cardinality ratio between Person and Purchase is 1:n.
Existing Approaches

Applying aggregate functions or generalization operators to convert a multi-valued attribute into a single-valued attribute.

Problems
- The user has to make a critical decision (e.g., which aggregate function to use?)
- Valuable related information may be lost.
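
To make the information-loss problem concrete, here is a minimal Python sketch. The helper name `collapse` and the choice of aggregates (sum, max, mean) are illustrative assumptions, not something prescribed by the slides; the amounts come from the example Purchase relation.

```python
from statistics import mean

def collapse(values):
    """Reduce a multi-valued attribute to single values with common aggregates."""
    return {"sum": sum(values), "max": max(values), "mean": mean(values)}

# Purchase amounts from the example Purchase relation.
johny = [400, 70, 200]
andy = [300, 100]

print(collapse(johny))
print(collapse(andy))

# Information loss: a different (hypothetical) bag gives the same sum as Andy's,
# so the aggregate alone cannot tell the two purchase histories apart.
print(collapse(andy)["sum"] == collapse([200, 200])["sum"])  # True
```

Whichever aggregate the user picks, some distinct bags become indistinguishable, which is the critical decision the slide refers to.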
Extended Data Sets

A joined table from the Person and Purchase relations
name   age  sex  ptype  amount  location
Johny  43   M    1      400     Warehouse
Johny  43   M    2      70      Grocery
Johny  43   M    3      200     Mall
Andy   21   F    2      100     Mall
Andy   21   F    3      100     Grocery
Post   67   M    1      30      Mall
Jenny  35   F    null   null    null

A converted table with a bag of values
name   age  sex  p.ptype  p.amount      p.location
Johny  43   M    {1,2,3}  {400,70,200}  {Warehouse, Grocery, Mall}
Andy   21   F    {2,3}    {100,100}     {Mall, Grocery}
Post   67   M    1        30            Mall
Jenny  35   F    null     null          null

Group similarity measures are needed. How to measure similarity between bags of values?
Approaches for Database Clustering
[Figure labels]
<Current approach> Structured database -> Manual transformation -> Flat file -> Clustering algorithms
<Proposed approach> Structured database -> Automated preprocessing -> Extended data set -> Generalized clustering algorithms
Related Work
- LABYRINTH (Thompson et al.)
- Ketterlin’s extended COBWEB
- KATE (Manago et al.)
- SUBDUE (Holder et al.)
- INLEN (Ribeiro et al.)
- KBG (Bisson et al.), KLUSTER (Kietz et al.)
Research Objectives for Database Clustering
- To alleviate the representational gap between databases on the one hand and the input formats of clustering algorithms on the other hand
- To design and implement semi-automatic tools to facilitate database clustering
- To generalize clustering algorithms
Generating Extended Data Sets From a Structured Database
[Figure labels]
Inputs: Database (d1, d2, …, dn); User’s interests and objectives
Component: Extended data set generator
Output: Extended data set 1
A Unified Similarity Measure for Clustering Extended Data Sets

Group Similarity Measures
- Mixed types: qualitative and quantitative types.
- Qualitative type: Tversky’s set-theoretical similarity models.

Contrast model:
$$S(a,b) = \theta f(A \cap B) - \alpha f(A - B) - \beta f(B - A)$$
for some $\theta, \alpha, \beta \ge 0$, where $a$ and $b$ are two objects, $A$ and $B$ denote their sets of features, and $f$ is the cardinality of the set.

Ratio model (e.g., normalized similarity):
$$S(a,b) = f(A \cap B) \,/\, [f(A \cap B) + \alpha f(A - B) + \beta f(B - A)], \quad \alpha, \beta \ge 0$$
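
The two Tversky models can be sketched in a few lines of Python. This is a minimal illustration, assuming f is set cardinality as stated on the slide; the function names, the parameter names `theta`, `alpha`, `beta` for the non-negative weights, and the default weight of 1 are choices made here for the example.

```python
def contrast_similarity(A, B, theta=1.0, alpha=1.0, beta=1.0):
    """Tversky contrast model with f = set cardinality."""
    A, B = set(A), set(B)
    return theta * len(A & B) - alpha * len(A - B) - beta * len(B - A)

def ratio_similarity(A, B, alpha=1.0, beta=1.0):
    """Tversky ratio model with f = set cardinality; the result lies in [0, 1]."""
    A, B = set(A), set(B)
    common = len(A & B)
    denom = common + alpha * len(A - B) + beta * len(B - A)
    # Assumption: two empty feature sets are treated as identical.
    return common / denom if denom else 1.0

# Purchase locations of two persons from the example database.
johny = {"Mall", "Grocery", "Warehouse"}
andy = {"Mall", "Grocery"}

print(contrast_similarity(johny, andy))  # 2 common - 1 only in johny - 0 only in andy = 1.0
print(ratio_similarity(johny, andy))     # 2 / (2 + 1 + 0)
```

With alpha = beta = 1 the ratio model reduces to the Jaccard coefficient, which is a convenient default for set-valued attributes.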
Group Similarity Measures... continued

Quantitative type: group average

The group average between groups A and B is obtained by taking the average of all the inter-object measures over those pairs of objects in which each object of a pair is in a different group:
$$d(A,B) = \frac{1}{n} \sum_{i=1}^{n} d(a,b)_i$$
where $n$ is the total number of object pairs and $d(a,b)_i$ is the dissimilarity measure for the $i$th pair of objects $a$ and $b$, $a \in A$, $b \in B$.
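
As a worked illustration of the group average, the following Python sketch averages a pairwise dissimilarity over all cross-group pairs. The function name `group_average` and the use of absolute difference as the per-pair dissimilarity are assumptions made for the example; the slides leave the pairwise measure open.

```python
from itertools import product

def group_average(A, B, d=lambda a, b: abs(a - b)):
    """Average dissimilarity over all pairs (a, b) with a in A and b in B."""
    pairs = list(product(A, B))
    if not pairs:
        raise ValueError("both groups must be non-empty")
    return sum(d(a, b) for a, b in pairs) / len(pairs)

# Purchase amounts of Johny and Andy from the example Purchase relation.
johny_amounts = [400, 70, 200]
andy_amounts = [300, 100]

# n = 3 * 2 = 6 pairs; the pairwise differences are 100, 300, 230, 30, 100, 100.
print(group_average(johny_amounts, andy_amounts))
```

Because the inputs are lists rather than sets, duplicate values in a bag each contribute their own pairs to the average.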
A Framework for Mixed Type Similarity Measures for Extended Data Sets

Gower’s similarity measure for data sets with mixed types:
$$S(a,b) = \sum_{i=1}^{m} w_i\, s_i(a_i,b_i) \Big/ \sum_{i=1}^{m} w_i$$

Extended similarity measure for multi-valued data sets with mixed types:
$$S(a,b) = \Big[\sum_{i=1}^{l} w_i\, s_l(a_i,b_i) + \sum_{j=1}^{q} w_j\, s_q(a_j,b_j)\Big] \Big/ \Big(\sum_{i=1}^{l} w_i + \sum_{j=1}^{q} w_j\Big)$$
where $m = l + q$. The functions $s_l(a,b)$ and $s_q(a,b)$ are similarity functions for qualitative attributes and quantitative attributes, respectively.
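
A weighted combination of this kind might be sketched as follows. This is only an illustration under stated assumptions: the qualitative similarity is the Tversky ratio model with both weights set to 1, and the quantitative similarity is one minus the group-average distance scaled by a user-supplied value range. That scaling, the handling of empty bags, and all function names are assumptions made here, not part of the slides.

```python
from itertools import product

def s_l(A, B):
    """Qualitative similarity: Tversky ratio model with alpha = beta = 1."""
    A, B = set(A), set(B)
    union = A | B
    return len(A & B) / len(union) if union else 1.0

def make_s_q(value_range):
    """Build a quantitative similarity: 1 - group-average distance / value_range.
    Scaling by a value range is an assumption, not taken from the slides."""
    def s_q(A, B):
        pairs = list(product(A, B))
        if not pairs:
            return 0.0  # assumption: a missing bag is treated as dissimilar
        avg = sum(abs(x - y) for x, y in pairs) / len(pairs)
        return max(0.0, 1.0 - avg / value_range)
    return s_q

def mixed_similarity(a, b, qualitative, quantitative, weights):
    """Weighted average of per-attribute similarities.
    qualitative: list of attribute names; quantitative: dict name -> s_q function."""
    num = den = 0.0
    for attr in qualitative:
        num += weights[attr] * s_l(a[attr], b[attr])
        den += weights[attr]
    for attr, s_q in quantitative.items():
        num += weights[attr] * s_q(a[attr], b[attr])
        den += weights[attr]
    return num / den

johny = {"location": ["Mall", "Grocery", "Warehouse"], "amount": [400, 70, 200]}
andy = {"location": ["Mall", "Grocery"], "amount": [300, 100]}
weights = {"location": 1.0, "amount": 1.0}

score = mixed_similarity(johny, andy, ["location"], {"amount": make_s_q(400)}, weights)
print(round(score, 3))
```

Dividing by the sum of the weights keeps the result in [0, 1] as long as each per-attribute similarity is in [0, 1].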
Clustering Algorithms for Extended Data Sets
- Nearest-neighbor clustering
- DBSCAN
- Leader algorithm
- Hierarchical clustering
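
Similarity-based algorithms such as these only need a pairwise similarity function, so they can run on bags of values once such a function is available. The sketch below shows the textbook one-pass leader algorithm with a pluggable similarity; the slides do not say how the authors adapted it, so the threshold rule, the function names, and the Jaccard similarity used here are assumptions.

```python
def jaccard(A, B):
    """Set similarity used for the demo; any similarity in [0, 1] would do."""
    A, B = set(A), set(B)
    union = A | B
    return len(A & B) / len(union) if union else 1.0

def leader_clustering(objects, similarity, threshold):
    """One-pass leader algorithm: each object joins the first cluster whose
    leader is similar enough, otherwise it starts a new cluster."""
    leaders = []
    clusters = []
    for obj in objects:
        for idx, leader in enumerate(leaders):
            if similarity(obj, leader) >= threshold:
                clusters[idx].append(obj)
                break
        else:
            leaders.append(obj)
            clusters.append([obj])
    return clusters

# Hypothetical bags of purchase locations, one per person.
bags = [{"Mall", "Grocery", "Warehouse"}, {"Mall", "Grocery"}, {"Mall"}]
print(leader_clustering(bags, jaccard, threshold=0.5))
```

The result depends on the order in which objects are presented, which is a known property of the leader algorithm.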
Database Clustering Environment
[Figure labels]
Components: User Interface, DBMS, Data Extraction Tool, Similarity Measure Tool, Clustering Tool
Libraries: Library of similarity measures, Library of clustering algorithms
Data and information flows: Extended data set; Similarity measure; Type and weight information; Default choice and domain information; A set of clusters
A More Detailed Tool Architecture
[Figure labels]
Components: Pre-processor, Form translator, DBMS, Our data mining tools, Other data mining tools
Inputs: User's interests and objectives; Join form (Database name, Relationship definitions, Data set of interest, Selected attributes, Other information)
Flows: Query, Query result, Processed data, Extended data set, Flat file
A Join Template Form

Begin-spec
  Database-name: DB;
  Link-definitions: Link-list;
  Begin-join
    Dataset-of-interest: Dsetintrest;
    Selected-attributes: Attr-list;
    Objective-attributes: Obj-attr-list;
    Extended-data-set: E;
  End-join
End-spec
An Example of the Interface of the Extended Data Set Generation Tool

Begin-spec
  DB-name: Company
  Link-definitions:
    superv(Employee.ssn, Employee.superssn),
    husband(Employee.ssn, Marriage.hssn),
    wife(Employee.ssn, Marriage.wssn),
    ehusband(Marriage.hssn, Employee.ssn),
    ewife(Marriage.wssn, Employee.ssn),
    works_on(Employee.ssn, Works_on.essn),
    project(Works_on.pno, Project.pnum),
    works_for(Employee.dno, Department.dnum),
    works_loc(Department.dnum, Dept_loc.dnum)
  Begin-join
    Dataset-of-interest: Employee
    Selected-attributes: ssn, sex, salary, superv.salary, wife.ewife.salary, works_on.hours, works_on.project.pname, works_for.works_loc.dloc
    Objective-attributes: ssn
    Output-data-set: E1
  End-join
End-spec
Algorithm to Generate Extended Data Sets
- Project the data set of interest by primary key and selected attributes
- Join the data set of interest and the related data sets to get all related attributes for each join path
- Group together the attributes that describe the same object
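
The three steps can be sketched in Python on the Person and Purchase example. This is a simplified illustration for a single join path, using in-memory lists of dictionaries instead of a DBMS; the function name `extended_data_set` and its parameters are hypothetical, and the authors' tool is not shown in the slides.

```python
from collections import defaultdict

person = [
    {"ssn": "111111111", "name": "Johny", "age": 43, "sex": "M"},
    {"ssn": "222222222", "name": "Andy", "age": 21, "sex": "F"},
    {"ssn": "333333333", "name": "Post", "age": 67, "sex": "M"},
    {"ssn": "444444444", "name": "Jenny", "age": 35, "sex": "F"},
]
purchase = [
    {"ssn": "111111111", "location": "Warehouse", "ptype": 1, "amount": 400},
    {"ssn": "111111111", "location": "Grocery", "ptype": 2, "amount": 70},
    {"ssn": "111111111", "location": "Mall", "ptype": 3, "amount": 200},
    {"ssn": "222222222", "location": "Mall", "ptype": 2, "amount": 300},
    {"ssn": "222222222", "location": "Grocery", "ptype": 3, "amount": 100},
    {"ssn": "333333333", "location": "Mall", "ptype": 1, "amount": 30},
]

def extended_data_set(base, key, selected, related, link_key, related_attrs):
    # Step 1: project the data set of interest by primary key and selected attributes.
    result = {row[key]: {attr: row[attr] for attr in selected} for row in base}
    # Step 2: follow one join path and collect the related values per object.
    bags = defaultdict(lambda: {attr: [] for attr in related_attrs})
    for row in related:
        if row[link_key] in result:
            for attr in related_attrs:
                bags[row[link_key]][attr].append(row[attr])
    # Step 3: group the collected values into one bag per attribute and object.
    for k, obj in result.items():
        for attr in related_attrs:
            obj[attr] = bags[k][attr] or None  # None plays the role of null
    return list(result.values())

rows = extended_data_set(person, "ssn", ["name", "age", "sex"],
                         purchase, "ssn", ["ptype", "amount", "location"])
for row in rows:
    print(row)
```

Each person ends up as exactly one row, with the 1:n purchase information kept as bags rather than being duplicated across rows or aggregated away.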
Summary Representation
- Our approach uses database queries as the summary representation language.
- Queries that compute the objects belonging to a cluster and no other objects are considered to be perfect summaries for a cluster.

An example query for a cluster:
  SELECT person.ssn, name, address
  FROM person, purchase
  WHERE person.ssn = purchase.ssn
    AND "amount-spent" > 1000
    AND "payment-type" = 'cash'
    AND "store-name" = 'flea-market'

“Typically, members in the cluster have spent more than $1,000 cash for shopping in a flea-market”
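
One way to check whether a query is a perfect summary is to run it and compare the returned objects with the cluster members. The sketch below does this with Python's built-in `sqlite3` module on a small hypothetical database; the table contents, the helper names, and the use of precision and recall as a graded measure are assumptions made for illustration, since the slides only define the perfect case.

```python
import sqlite3

def query_result(conn, sql):
    """Return the set of object identifiers (first column) computed by a query."""
    return {row[0] for row in conn.execute(sql)}

def summary_quality(result, cluster):
    """Return (precision, recall); a perfect summary scores (1.0, 1.0)."""
    hits = len(result & cluster)
    precision = hits / len(result) if result else 0.0
    recall = hits / len(cluster) if cluster else 0.0
    return precision, recall

# Hypothetical data, not taken from the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (ssn TEXT PRIMARY KEY, name TEXT);
CREATE TABLE purchase (ssn TEXT, "amount-spent" INTEGER,
                       "payment-type" TEXT, "store-name" TEXT);
INSERT INTO person VALUES ('1', 'A'), ('2', 'B'), ('3', 'C');
INSERT INTO purchase VALUES
  ('1', 1500, 'cash', 'flea-market'),
  ('2', 1200, 'cash', 'flea-market'),
  ('3', 2000, 'credit', 'mall');
""")

sql = """
SELECT DISTINCT person.ssn
FROM person JOIN purchase ON person.ssn = purchase.ssn
WHERE "amount-spent" > 1000
  AND "payment-type" = 'cash'
  AND "store-name" = 'flea-market'
"""
cluster = {"1", "2"}  # hypothetical cluster membership
result = query_result(conn, sql)
print(summary_quality(result, cluster))  # (1.0, 1.0)
```

A query that returns extra objects lowers precision, and one that misses cluster members lowers recall, so both must be 1.0 for a perfect summary.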
Summary and Contributions
- Discussed the data model discrepancy between the database storage format and the input data format for traditional clustering algorithms
- Discussed the problems of dealing with relationship information in database clustering
- Presented a different way of representing related information using extended data sets
- Introduced the design and architecture of automatic tools to generate extended data sets from databases
- Generalized the traditional similarity measures and presented a framework to cope with extended data sets in similarity-based clustering
Architecture of MASSON
[Figure labels]
Components: User interface, Interface, GP engine, GP based discovery system, DBMS, DB, KB, Clustering module
Inputs: user input, system input; Schema information, Object set, Domain knowledge; clusters g1, g2, ..., gk
Flows: generate, select, apply, return, evaluate; Query set; Query result; Discovered query set
Evolution Process
[Figure labels]
Initial generation (initial population q11, q12, .., q1m) -> evolve -> generation 2 (evolved population q21, q22, .., q2m) -> evolve -> ... -> generation n (evolved population qn1, qn2, .., qnm) -> selection -> Solution Q
Each evolve step applies selection, crossover, and mutation.
n: number of generations; m: the size of the population
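
The generational loop can be illustrated with a deliberately simplified Python sketch. It is not genetic programming proper: instead of evolving query trees, it evolves fixed-form queries of the shape (minimum amount, payment type), which is closer to a plain genetic algorithm. The data, the Jaccard-based fitness, the tournament selection, the elitism step, and the mutation rates are all assumptions chosen for the example.

```python
import random

# Hypothetical objects: (id, amount, payment_type). Not from the slides.
OBJECTS = [(1, 1500, "cash"), (2, 1200, "cash"), (3, 2000, "credit"),
           (4, 300, "cash"), (5, 50, "check")]
CLUSTER = {1, 2}  # the cluster the query should describe
PAYMENTS = ["cash", "credit", "check"]

def run_query(q):
    """A query is (min_amount, payment_type); return ids of matching objects."""
    min_amount, payment = q
    return {oid for oid, amount, ptype in OBJECTS
            if amount > min_amount and ptype == payment}

def fitness(q):
    """Jaccard overlap between the query result and the target cluster."""
    result = run_query(q)
    return len(result & CLUSTER) / len(result | CLUSTER)

def evolve(n_generations=20, m=10, seed=0):
    """Evolve a population of m queries for n_generations and return the best."""
    rng = random.Random(seed)
    population = [(rng.randrange(0, 2000, 100), rng.choice(PAYMENTS))
                  for _ in range(m)]
    for _ in range(n_generations):
        next_population = [max(population, key=fitness)]  # elitism (assumption)
        while len(next_population) < m:
            # Selection: two binary tournaments pick the parents.
            p1 = max(rng.sample(population, 2), key=fitness)
            p2 = max(rng.sample(population, 2), key=fitness)
            # Crossover: threshold from one parent, payment type from the other.
            child = (p1[0], p2[1])
            # Mutation: occasionally nudge the threshold or redraw the payment type.
            if rng.random() < 0.2:
                child = (max(0, child[0] + rng.choice([-100, 100])), child[1])
            if rng.random() < 0.1:
                child = (child[0], rng.choice(PAYMENTS))
            next_population.append(child)
        population = next_population
    return max(population, key=fitness)

best = evolve()
print(best, fitness(best))
```

Because the best individual is always copied into the next generation, the best fitness never decreases from one generation to the next; a fitness of 1.0 corresponds to a perfect summary of the cluster.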