CACTUS-Clustering Categorical Data Using Summaries

Advisor: Dr. Hsu

Graduate: Min-Hung Lin

IDSL seminar 2001/10/30

Outline

Motivation, Objective, Related Work, Definitions, CACTUS, Performance Evaluation, Conclusions, Comments

Motivation

Clustering with categorical attributes has received attention.

Previous algorithms do not give a formal description of the clusters.

Some of them need to post-process the output of the algorithm to identify the final clusters.

Objective

Introduce a novel formalization of a cluster for categorical attributes.

Describe a fast summarization-based algorithm, CACTUS, that discovers clusters.

Evaluate the performance of CACTUS on synthetic and real datasets.

Related Work

EM algorithm [Dempster et al., 1977]: iterative clustering technique

STIRR algorithm [Gibson et al., 1998]: iterative algorithm based on non-linear dynamical systems

ROCK algorithm [Guha et al., 1999]: hierarchical clustering algorithm

DEF:Support

DEF:Strongly Connected

DEF:Strongly Connected(cont’d)
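The formulas on the definition slides did not survive in this text, so the following is only a minimal sketch of how support and strong connectivity are usually stated for CACTUS. It assumes that the support of a value pair is the number of tuples containing both values, and that two values are strongly connected when their support exceeds `alpha` times the support expected under attribute independence. The names `support`, `is_strongly_connected`, `domain_sizes`, and `alpha`, and the toy data, are hypothetical.

```python
def support(tuples, i, j, a_i, a_j):
    """Number of tuples whose attribute i equals a_i and attribute j equals a_j."""
    return sum(1 for t in tuples if t[i] == a_i and t[j] == a_j)


def is_strongly_connected(tuples, domain_sizes, i, j, a_i, a_j, alpha):
    """Assumed test: observed support exceeds alpha times the support
    expected if attributes i and j were independent and uniform."""
    expected = len(tuples) / (domain_sizes[i] * domain_sizes[j])
    return support(tuples, i, j, a_i, a_j) > alpha * expected


# Toy dataset with two categorical attributes (hypothetical values).
D = [("a", "x")] * 6 + [("a", "y"), ("b", "x"), ("b", "y"), ("c", "z")]
domain_sizes = [len({t[k] for t in D}) for k in range(2)]

strong_ax = is_strongly_connected(D, domain_sizes, 0, 1, "a", "x", alpha=2)
strong_ay = is_strongly_connected(D, domain_sizes, 0, 1, "a", "y", alpha=2)
```

With 10 tuples and 3 values per attribute, the expected support of a pair is 10/9, so only the frequent pair ("a", "x") clears the threshold here.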

Formal Definition of a Cluster

Formal Definition of a Cluster (cont’d)

C_i is the cluster-projection of C on A_i.

C is called a sub-cluster if it satisfies conditions (1) and (3).

A cluster C over a subset S of all attributes is called a subspace cluster on S; if |S| = k, then C is called a k-cluster.

DEF:Similarity

Inter-attribute Summaries

Intra-attribute Summaries

Experiments

Result

STIRR fails to discover:

clusters consisting of overlapping cluster-projections on any attribute

clusters where two or more clusters share the same cluster-projection

CACTUS correctly discovers all clusters.

CACTUS

Three-phase clustering algorithm:

Summarization Phase: compute the summary information

Clustering Phase: discover a set of candidate clusters

Validation Phase: determine the actual set of clusters

Summarization Phase

Inter-attribute Summaries

Intra-attribute Summaries
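The two kinds of summaries can be sketched as follows. This is an illustration under assumptions, not the presented implementation: it assumes the inter-attribute summary of an attribute pair keeps the strongly connected value pairs together with their supports, and that the intra-attribute summary records, for two values of one attribute, how many values of the other attribute are strongly connected to both. The function names, the `alpha` parameter, and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def inter_attribute_summaries(tuples, alpha):
    """One scan over the data: count value pairs for every attribute pair,
    then keep only the pairs that pass the assumed strong-connectivity test."""
    n_attrs = len(tuples[0])
    domains = [{t[k] for t in tuples} for k in range(n_attrs)]
    counts = Counter()
    for t in tuples:
        for i, j in combinations(range(n_attrs), 2):
            counts[(i, j, t[i], t[j])] += 1
    summaries = defaultdict(dict)
    for (i, j, a_i, a_j), c in counts.items():
        expected = len(tuples) / (len(domains[i]) * len(domains[j]))
        if c > alpha * expected:
            summaries[(i, j)][(a_i, a_j)] = c
    return summaries


def intra_attribute_summary(inter_ij):
    """Similarity of value pairs of attribute i with respect to attribute j,
    derived from the inter-attribute summary alone (no further data scan)."""
    neighbours = defaultdict(set)
    for a_i, a_j in inter_ij:
        neighbours[a_i].add(a_j)
    result = {}
    for a1, a2 in combinations(sorted(neighbours), 2):
        shared = len(neighbours[a1] & neighbours[a2])
        if shared > 0:
            result[(a1, a2)] = shared
    return result


# Toy dataset (hypothetical values).
D = ([("a", "x")] * 4 + [("a", "y")] * 4 + [("b", "x")] * 4
     + [("b", "y")] * 4 + [("c", "z")] * 4 + [("a", "z")])
inter = inter_attribute_summaries(D, alpha=1.5)
intra = intra_attribute_summary(inter[(0, 1)])
```

Because the intra-attribute summary is computed from the inter-attribute summary, only the first function touches the dataset.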

Clustering Phase

Computing cluster-projections on attributes

Level-wise synthesis of clusters

Computing Cluster-Projections on Attributes

Step 1: pairwise cluster-projection

Step 2: intersection
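Step 2 can be illustrated with a small sketch. It assumes that Step 1 has already produced, for one attribute, a collection of pairwise cluster-projections with respect to each other attribute, and that the intersection step keeps only intersections containing more than one value. Both the size rule and the names `intersection_join` and `cluster_projections` are assumptions made for illustration; maximality of the resulting sets is not checked.

```python
def intersection_join(projections_a, projections_b):
    """Intersect every set of one collection with every set of the other,
    keeping intersections with more than one value (assumed rule)."""
    joined = set()
    for s1 in projections_a:
        for s2 in projections_b:
            common = frozenset(s1) & frozenset(s2)
            if len(common) > 1:
                joined.add(common)
    return joined


def cluster_projections(pairwise):
    """pairwise: one collection of value sets per other attribute.
    Successively joins the collections to obtain candidate projections."""
    result = {frozenset(s) for s in pairwise[0]}
    for nxt in pairwise[1:]:
        result = intersection_join(result, nxt)
    return result


# Hypothetical pairwise cluster-projections of one attribute
# with respect to two other attributes.
pairwise = [
    [{"a", "b", "c"}, {"d", "e"}],
    [{"a", "b"}, {"c", "d", "e"}],
]
projections = cluster_projections(pairwise)
```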

Computing Cluster-Projections on Attributes (cont’d)

Cluster-projection

Level-wise synthesis of clusters

Level-wise synthesis of clusters (cont’d)

Generation procedure

Level-wise synthesis of clusters (cont’d)

Candidate cluster
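The generation procedure itself is not reproduced in this text. As a hedged sketch, one level of the synthesis could look like the following, assuming a candidate over attributes 0..k is formed by extending a candidate over attributes 0..k-1 with a cluster-projection on attribute k, and is kept only if every pair (c_i, c_k) is contained in some 2-cluster over attributes (i, k). The function name `extend_candidates` and the data layout are hypothetical.

```python
def extend_candidates(k_clusters, next_projections, two_clusters):
    """k_clusters: list of tuples of frozensets, one set per attribute 0..k-1.
    next_projections: frozensets, the cluster-projections on attribute k.
    two_clusters: dict mapping (i, k) to a list of (C_i, C_k) 2-clusters."""
    candidates = []
    for cluster in k_clusters:
        k = len(cluster)  # index of the attribute being added
        for c_next in next_projections:
            # Keep the extension only if each pair is a sub-cluster
            # of some 2-cluster on the corresponding attribute pair.
            if all(
                any(c_i <= big_i and c_next <= big_k
                    for big_i, big_k in two_clusters[(i, k)])
                for i, c_i in enumerate(cluster)
            ):
                candidates.append(cluster + (c_next,))
    return candidates


# Hypothetical 2-clusters and projections for three attributes.
fs = frozenset
two_clusters = {
    (0, 2): [(fs("ab"), fs("pq"))],
    (1, 2): [(fs("xy"), fs("pq"))],
}
k_clusters = [(fs("ab"), fs("xy"))]
next_projections = [fs("pq"), fs("rs")]
candidates = extend_candidates(k_clusters, next_projections, two_clusters)
```

Only the extension with {p, q} survives, since {r, s} is not part of any 2-cluster with the existing projections.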

Validation

Some of the candidate clusters may not have enough support, because some of the 2-clusters may be due to different sets of tuples.

Check whether the support of each candidate cluster is greater than the threshold: α times the expected support of the cluster.

Only clusters whose support on D passes the threshold are retained.

Validation Procedure

Set the supports of all candidate clusters to zero.

For each tuple t, increment the support of the candidate cluster to which t belongs.

At the end of the scan, delete all candidate clusters whose support is less than the threshold.
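The validation scan can be sketched as below. The sketch assumes that a tuple belongs to a candidate cluster when each of its attribute values lies in the corresponding cluster-projection, and that the expected support is computed under attribute independence with uniform value frequencies. It increments every candidate that contains the tuple. The names `validate`, `domain_sizes`, and `alpha`, and the toy data, are hypothetical.

```python
from math import prod


def validate(tuples, candidates, domain_sizes, alpha):
    """Single scan over the data; returns the candidates that pass the threshold."""
    supports = [0] * len(candidates)  # set all supports to zero
    for t in tuples:
        for idx, cluster in enumerate(candidates):
            # t belongs to the cluster if every value is in its projection
            if all(value in proj for value, proj in zip(t, cluster)):
                supports[idx] += 1
    kept = []
    for cluster, s in zip(candidates, supports):
        # Assumed expected support under independence and uniformity.
        expected = len(tuples) * prod(
            len(proj) / size for proj, size in zip(cluster, domain_sizes)
        )
        if s >= alpha * expected:  # delete clusters below the threshold
            kept.append(cluster)
    return kept


# Toy dataset and candidates (hypothetical values).
D = [("a", "x")] * 5 + [("b", "y")] * 5 + [("c", "z"), ("a", "z")]
candidates = [({"a", "b"}, {"x", "y"}), ({"c"}, {"x"})]
kept = validate(D, candidates, domain_sizes=[3, 3], alpha=1.5)
```

The first candidate covers 10 of the 12 tuples and is retained; the second covers none and is deleted.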

Extensions

Large Attribute Value Domains

Clusters in Subspaces

Performance Evaluation

Evaluation of CACTUS on synthetic and real datasets

Compared the performance of CACTUS with the performance of STIRR

Synthetic Datasets

The test datasets were generated using the data generator developed by Gibson et al. (1 million tuples, 10 attributes, 100 attribute values for each attribute).

Real Datasets

Two sets of bibliographic entries:

7766 entries are database-related

30919 entries are theory-related

Four attributes: the first author, the second author, the conference, and the year.

Attribute domain sizes are {3418, 3529, 1631, 44}, {8043, 8190, 690, 42}, and {10212, 10527, 2315, 52} for the database-related, theory-related, and mixture datasets, respectively.

Real Datasets (cont’d)

Database-related

Theory-related

Mixture

Results

CACTUS is very fast and scalable (only two scans of the dataset).

CACTUS outperforms STIRR by a factor between 3 and 10.

Conclusions

Formalized the definition of a cluster for categorical attributes.

Introduced a fast summarization-based algorithm, CACTUS, for discovering such clusters in categorical data.

Evaluated the algorithm against both synthetic and real datasets.

Future Work

Relax the cluster definition by allowing sets of attribute values that are “almost” strongly connected to each other.

Inter-attribute summaries can be incrementally maintained, which suggests deriving an incremental clustering algorithm.

Rank the clusters based on a measure of interestingness.

Comments

Computing pairwise cluster-projections is an NP-complete problem.

A large number of candidate clusters is still a problem.
