Data Clustering 1 – An Introduction


The Data Explosion

“If you feel like you are drowning in information, it’s because you are.”

Advances in IT and the Internet

Massive increase in the ability to:

• Record: electronic records and forms

• Store: data warehouses (as we have seen)

• Analyse: data mining and visualisation (more later)

Risk of Information Overload

The Aims of Data Mining

Classification: categorising the risk-return of stocks

Association: identifying products that tend to sell together

Detection: identifying profiles of customers

Prediction: forecasting market performance

Database Technology Timeline

1960s: Data collection, database creation, IMS and network

1970s: Relational data model, relational DBMS implementation

1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s–2000s: Data mining and data warehousing, multimedia databases, and Web databases

From Data to Knowledge

It is common to break down the process of learning from data into the following:

Data, Information and Knowledge

Data: Raw numbers

Information: Data with context or meaning

Knowledge: Data Structures / Patterns (Knowledge must be useful)

Data Mining / Intelligent Data Analysis

“Data mining is applying Machine Learning techniques to historical data to improve future decisions” (Tom Mitchell, 1997)

Knowledge Discovery

Knowledge Discovery in Databases (KDD)

The Process (from Advances in KDD and Data mining):

Data → Target Data → Pre-processed Data → Transformed Data → Patterns → Knowledge

Data Mining - Tools

Typical tools:

• Statistical Analysis

• Summarisation

• Outlier Detection

• Correlation

• Regression

• Clustering

• Association Rules

• Time Series Models

• Decision Trees (classification)

Data Mining - Applications

Some successful examples of its use:

Pharmaceutical companies – Drug Discovery

Credit card companies – Fraud Detection

Transportation companies – Routing

Large consumer packaged goods companies – Improving the sales process to retailers

Hospital Organisation – Decision Analysis

Examples of Data Mining Tools

We will now look at some core techniques commonly used for analysing and mining business data warehouses:

Correlation, Visualisation, Clustering, Regression

Clustering

An example in biology…

Things that are brown and run away: animals

Things that are green and don’t run away: plants

Clustering

An example in biology… Kingdom – Phylum – Class – Order – Family – Genus – Species

Hierarchical clustering (more later)

Clustering

The process:

1. Extract features (colour, movement, sensory organs, etc.): more later

2. Cluster into categories

3. Consolidation

Clustering

Clustering: to partition a data set into subsets (clusters), so that the data in each subset share some common trait, often similarity or proximity under some defined distance measure.

• The process of organizing objects into groups whose members are similar in some way.

• A cluster is therefore a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters.

• Unsupervised: no need for ‘teacher’ signals, i.e. the desired output.

[Figure: Cluster 1 and Cluster 2]

Supervised and Unsupervised Learning

Unsupervised learning: learning without the desired output (‘teacher’ signals).

Supervised learning: learning with the desired output.

• Clustering is one of the most widely used unsupervised learning methods.

• Other unsupervised learning:

  • Dimensionality reduction (factor analysis, principal component analysis, independent component analysis, …)

  • Time series modelling

  • Source separation

• Supervised learning:

  • Classification

  • Regression

Patterns, Clusters and Features (1)

Patterns: physical objects

Clusters: categories of objects (e.g. animals, plants)

Features: attributes of objects (e.g. colour: brown, green, …)

Patterns, Clusters and Features (2)

[Figure: Creating vehicles’ clusters. Each pattern (a vehicle) is a point in the features’ space, with Feature-1 values and Feature-2 values on the axes (one axis is labelled “Top speed [ml/h]”). The clusters shown are sports cars, medium market cars and lorries.]

Social networks

• Marketing

• Terror networks

• Allocation of resources in a company / university

Gene networks

• Understanding gene interactions

• Identifying important genes linked to disease

How to do clustering?

[Figure: Cluster 1 and Cluster 2]

What we know: patterns represented by their feature vectors $x_i$.

General case: each feature vector $x_i$ has $d$ components, i.e. it lies in the $d$-dimensional domain of the feature vectors.

What we need to find out: the clusters

Pattern Similarity

• A key concept in clustering: similarity.

• Clusters are formed by similar patterns.

• In computer science, we need to define some metric to measure similarity. One of the commonly adopted similarity metrics is distance.

A general definition of distance (between pattern A and B), where $A_k$ and $B_k$ are the $k$-th feature values of the two patterns and $d$ is the number of features:

$$D(A, B) = \left( \sum_{k=1}^{d} |A_k - B_k|^{b} \right)^{1/b}$$

• $b = 2$: Euclidean distance

• $b = 1$: Manhattan distance

The shorter the distance, the more similar the two patterns.
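As a minimal sketch of the general distance with exponent b (usually called the Minkowski distance), the following Python function computes it for two patterns given as equal-length sequences of feature values. The function name `minkowski_distance` and the example patterns are illustrative assumptions, not part of the original slides.

```python
def minkowski_distance(pattern_a, pattern_b, b=2.0):
    """General distance between two patterns; b=2 is Euclidean, b=1 is Manhattan."""
    if len(pattern_a) != len(pattern_b):
        raise ValueError("patterns must have the same number of features")
    # Sum the b-th powers of the per-feature absolute differences, then take the b-th root.
    total = sum(abs(x - y) ** b for x, y in zip(pattern_a, pattern_b))
    return total ** (1.0 / b)


# Hypothetical example patterns with two features each.
A = [0.0, 0.0]
B = [3.0, 4.0]

euclidean = minkowski_distance(A, B, b=2)  # straight-line distance
manhattan = minkowski_distance(A, B, b=1)  # sum of per-feature differences
print(euclidean, manhattan)
```

For these two patterns the Manhattan distance is larger than the Euclidean one, so the choice of b changes how "similar" two patterns appear.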

Pattern Similarity & Distance Metrics

Many methods, e.g. K-Means, are designed to work on distance metrics.

They assume that the triangle inequality holds: “the sum of the lengths of any two sides must be greater than the length of the remaining side”. Formally, for a distance $D$ and any patterns $A$, $B$ and $C$: $D(A, C) \le D(A, B) + D(B, C)$.
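The triangle inequality can be checked numerically on a small set of patterns. The sketch below is illustrative only: the helper names and the three example points are assumptions, and the squared Euclidean distance is included purely as a counter-example of a dissimilarity measure that is not a distance metric.

```python
import math
from itertools import permutations


def satisfies_triangle_inequality(points, distance):
    """Check D(A, C) <= D(A, B) + D(B, C) for every ordered triple of points."""
    tolerance = 1e-12  # guards against floating-point rounding
    return all(
        distance(a, c) <= distance(a, b) + distance(b, c) + tolerance
        for a, b, c in permutations(points, 3)
    )


def squared_euclidean(p, q):
    """Squared Euclidean distance: a dissimilarity measure, but not a metric."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


# Three hypothetical patterns lying on a straight line.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]

euclidean_ok = satisfies_triangle_inequality(points, math.dist)
squared_ok = satisfies_triangle_inequality(points, squared_euclidean)
print(euclidean_ok, squared_ok)
```

The Euclidean distance passes the check, while the squared version fails it on these points because 4 > 1 + 1.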

Pattern Similarity & Distance Metrics

Distance metrics: Euclidean, Correlation, Minkowski, Manhattan, Mahalanobis

Relationship metrics: “How long is a piece of string?” Often application dependent.

K-Means Clustering

Algorithm 1: K-Means Clustering

1. Place K points into the feature space. These points represent initial cluster centroids.

2. Assign each pattern to the closest cluster centroid.

3. When all objects have been assigned, recalculate the positions of the K centroids.

4. Repeat Steps 2 and 3 until the assignments do not change.
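The four steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the lab's Java code: it uses Euclidean distance, takes the initial centroids as an argument, recalculates each centroid as the mean of its assigned patterns, and keeps a centroid unchanged if no pattern is assigned to it. The function name `k_means` and the toy data are hypothetical.

```python
import math


def k_means(patterns, initial_centroids, max_iterations=100):
    """Return (centroids, assignments) for the given patterns and starting centroids."""
    # Step 1: the K initial centroids are supplied by the caller.
    centroids = [list(c) for c in initial_centroids]
    k = len(centroids)
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign each pattern to the closest centroid (Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in patterns
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: recalculate each centroid as the mean of its assigned patterns.
        for j in range(k):
            members = [p for p, a in zip(patterns, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Hypothetical toy data: two visibly separate groups of 2-D patterns.
patterns = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = k_means(patterns, initial_centroids=[(0.0, 0.0), (10.0, 10.0)])
print(assignments)
print(centroids)
```

Passing the initial centroids explicitly keeps the sketch deterministic and makes it easy to experiment with different starting points.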

K-Means Clustering

Interactive Demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

Discussions (1)

1. How to determine k, the number of clusters?

Discussions (2)

2. Any alternative ways of choosing the initial cluster centroids?

Discussions (3)

3. Does the algorithm converge to the same results with different selections of initial cluster centroids? If not, what should we do in practice?
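As a starting point for this discussion, the sketch below runs K-Means from two different sets of initial centroids on four hypothetical patterns at the corners of a wide rectangle. It then compares the results by the within-cluster sum of squared distances, one common way (assumed here, not stated in the slides) of choosing between runs. The function names, the data and the selection criterion are all illustrative assumptions.

```python
import math


def k_means(patterns, initial_centroids, max_iterations=100):
    """Plain K-Means with Euclidean distance; returns (centroids, assignments)."""
    centroids = [list(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in patterns
        ]
        if new_assignments == assignments:
            break  # assignments are stable
        assignments = new_assignments
        for j in range(len(centroids)):
            members = [p for p, a in zip(patterns, assignments) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


def within_cluster_sse(patterns, centroids, assignments):
    """Sum of squared distances from each pattern to its assigned centroid."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(patterns, assignments))


# Four patterns at the corners of a 4-by-1 rectangle.
patterns = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
initialisations = [
    [(2.0, 0.0), (2.0, 1.0)],  # splits the rectangle into bottom and top
    [(0.0, 0.5), (4.0, 0.5)],  # splits the rectangle into left and right
]

results = []
for init in initialisations:
    centroids, assignments = k_means(patterns, init)
    results.append((within_cluster_sse(patterns, centroids, assignments), assignments))

# Keep the run with the smallest within-cluster sum of squared distances.
best_sse, best_assignments = min(results)
print(results)
print(best_sse, best_assignments)
```

The two runs end with different, stable assignments, which shows that the result can depend on the initial centroids. Running the algorithm several times and keeping the tightest clustering is one practical response.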

Reading

Chapter 9, Section 9.3: David Hand “Principles of Data Mining”, MIT Press

Chapter 8: Pang-Ning Tan “Introduction to Data Mining”

Anil Jain: “Data Clustering: 50 Years Beyond K-Means”, Pattern Recognition Letters

In the lab:

Examine a piece of Java code for K-Means clustering

Explore the use of K-Means on some toy datasets

Visualise the clusterings using an Excel macro
