A k-mean clustering algorithm for mixed numeric and categorical data

Intelligent Database Systems Lab

National Yunlin University of Science and Technology

Presenter: Shao-Wei Cheng
Authors: Amir Ahmad, Lipika Dey

DKE 2007

N.Y.U.S.T.

Outline

Motivation
Objectives
Methodology
Experiments
Conclusion
Comments


Motivation

The traditional k-mean algorithm is limited to numeric data.

Huang’s cost algorithm tried to cluster mixed numeric and categorical data:

The cluster center is represented by the mode of the cluster.
It uses the binary distance between two categorical attribute values.
The significance (weight) of each numeric attribute is taken to be 1, and γj is a user-defined parameter.
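
The description of Huang's approach above can be sketched in a few lines of Python. This is only an illustration of the three points on the slide, not Huang's actual implementation: the function names, the single `gamma` weight, and the toy values are assumptions made for the example.

```python
from collections import Counter

def mode_center(cat_rows):
    """Categorical part of a center: the most frequent value of each attribute."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*cat_rows)]

def huang_style_distance(x_num, x_cat, center_num, center_cat, gamma):
    """Squared Euclidean distance on numeric attributes (each with weight 1)
    plus gamma times the number of mismatching categorical values."""
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, center_num))
    # Binary distance: 0 if the two values are equal, 1 otherwise.
    categorical_part = sum(1 for a, b in zip(x_cat, center_cat) if a != b)
    return numeric_part + gamma * categorical_part

# Hypothetical cluster with two categorical attributes.
cluster_cats = [("red", "small"), ("red", "large"), ("blue", "small")]
center_cat = mode_center(cluster_cats)

# gamma is user-defined; 0.5 is an arbitrary choice for this example.
d = huang_style_distance([1.0, 2.0], ("blue", "small"),
                         [0.0, 2.0], center_cat, gamma=0.5)
print(center_cat, d)
```

Any two different categorical values are equally far apart here, and the result depends on a hand-picked `gamma`, which is the limitation the slide points at.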


Objectives

This paper attempts to alleviate the shortcomings of Huang’s cost algorithm.

Propose a new representation for the cluster center.
Compute the distance between two categorical values from the overall distribution of the categorical attributes.
Derive the weight parameter from the contribution of each categorical attribute instead of leaving it to the user.


Methodology

Cost function

Huang’s cost function

The proposed cost function

What is the distance between the categorical values De Niro and Stewart?
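
The slide's formula for this distance did not survive, so the following is only a sketch of one way to derive a distance between two categorical values from the overall distribution of the other attributes. It assumes a co-occurrence definition: with respect to another attribute, the distance between values x and y is the sum over that attribute's values u of max(P(u|x), P(u|y)), minus 1, averaged over all other attributes. The table below is made up for illustration and is not the paper's data.

```python
from collections import Counter, defaultdict

def conditional_probs(rows, i, j):
    """Estimate P(A_j = u | A_i = x) from co-occurrence counts."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[i]][row[j]] += 1
    return {x: {u: c / sum(cnt.values()) for u, c in cnt.items()}
            for x, cnt in counts.items()}

def cooccurrence_distance(rows, i, x, y):
    """Distance between values x and y of attribute i, averaged over
    every other attribute j (assumed definition, see lead-in)."""
    n_attrs = len(rows[0])
    total = 0.0
    for j in range(n_attrs):
        if j == i:
            continue
        probs = conditional_probs(rows, i, j)
        px, py = probs[x], probs[y]
        values = set(px) | set(py)
        total += sum(max(px.get(u, 0.0), py.get(u, 0.0)) for u in values) - 1.0
    return total / (n_attrs - 1)

# Hypothetical rows: (actor, genre, director).
rows = [
    ("De Niro", "Crime", "Scorsese"),
    ("De Niro", "Crime", "Coppola"),
    ("Stewart", "Thriller", "Hitchcock"),
    ("Stewart", "Comedy", "Koster"),
    ("Grant", "Thriller", "Hitchcock"),
    ("Grant", "Comedy", "Koster"),
]

d_far = cooccurrence_distance(rows, 0, "De Niro", "Stewart")
d_near = cooccurrence_distance(rows, 0, "Stewart", "Grant")
print(d_far, d_near)
```

In this toy table De Niro and Stewart never share a genre or a director, so they come out maximally distant, while Stewart and Grant co-occur with exactly the same values and get distance 0. A binary distance would treat both pairs identically.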


Methodology

Significance of numeric attribute

The numeric attributes need to be discretized; equal-width discretization is used.
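
Equal-width discretization splits the range of an attribute into intervals of the same length and replaces each value by the index of its interval. A minimal sketch follows; the function name and the number of bins are assumptions, since the slide does not state how many intervals are used.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in 0..n_bins-1,
    where all bins span the same width between min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: everything falls into a single bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

bins = equal_width_bins([0.0, 2.5, 5.0, 7.5, 10.0], n_bins=4)
print(bins)
```

After this step a numeric attribute can be handled like a categorical one when its significance is computed.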


Methodology

Algorithm

① Initialization.

② Compute the cluster centers.

③ Assign each data element to the cluster whose center is closest to it.

④ Repeat ② and ③ until the clusters do not change or a fixed number of iterations is reached.
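
The four steps above can be sketched as a short loop. To keep the example small it uses a simple stand-in distance (squared difference on numeric attributes, binary mismatch on categorical ones, mean and mode as center), not the paper's proposed measure. The round-robin initialization, the `gamma` weight, and all names are assumptions, since the slide does not specify them.

```python
from collections import Counter

def distance(row, center, num_idx, cat_idx, gamma):
    """Stand-in mixed distance: numeric squared error plus weighted mismatches."""
    num = sum((row[i] - center[i]) ** 2 for i in num_idx)
    cat = sum(1 for i in cat_idx if row[i] != center[i])
    return num + gamma * cat

def compute_center(members, num_idx, cat_idx):
    """Mean for numeric attributes, mode for categorical attributes."""
    center = {}
    for i in num_idx:
        center[i] = sum(r[i] for r in members) / len(members)
    for i in cat_idx:
        center[i] = Counter(r[i] for r in members).most_common(1)[0][0]
    return center

def cluster_mixed(rows, k, num_idx, cat_idx, gamma=1.0, max_iter=100):
    # Step 1: initialization (assumed here: round-robin assignment).
    labels = [i % k for i in range(len(rows))]
    for _ in range(max_iter):
        # Step 2: compute the cluster centers.
        centers = []
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if not members:
                # Empty cluster: fall back to a single row as its center.
                members = [rows[c % len(rows)]]
            centers.append(compute_center(members, num_idx, cat_idx))
        # Step 3: assign each row to the cluster with the closest center.
        new_labels = [
            min(range(k),
                key=lambda c: distance(r, centers[c], num_idx, cat_idx, gamma))
            for r in rows
        ]
        # Step 4: stop when the clusters no longer change.
        if new_labels == labels:
            break
        labels = new_labels
    return labels

rows = [(1.0, "a"), (1.2, "a"), (0.9, "a"), (8.0, "b"), (8.3, "b"), (7.9, "b")]
labels = cluster_mixed(rows, k=2, num_idx=[0], cat_idx=[1])
print(labels)
```

Swapping in a different `distance` and `compute_center` changes the behaviour without touching the loop, which is the sense in which the proposed method is a modified k-mean algorithm.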


Experiments

Evaluation method

Data sets
Iris – all numeric attributes
Vote – all categorical attributes
Heart disease data – mixed data set
Australian credit data – mixed data set


Conclusion

This paper introduced a new distance measure for categorical attribute values and proposed a modified k-mean algorithm for clustering mixed data sets.

The results obtained with this algorithm over a number of real-world data sets are highly encouraging.

Future work
Other methods for discretizing numeric-valued attributes.
Other implementations of the k-mean algorithm.


Comments

Advantage: Taking the overall distribution of all attributes into account is a good idea.

Drawback: …

Application: Clustering of mixed data sets.
