Data Clustering 1 – An Introduction


Data Clustering – An Introduction Slide 1

The Data Explosion

“If you feel like you are drowning in information, it’s because you are.”

Advances in IT and the Internet have brought a massive increase in the ability to:

• Record: electronic records and forms
• Store: data warehouses (as we have seen)
• Analyse: data mining and visualisation (more later)

Risk of Information Overload


The Aims of Data Mining

• Classification: categorising the risk-return of stocks
• Association: identifying products that tend to sell together
• Detection: identifying profiles of customers
• Prediction: forecasting market performance


Database Technology Timeline

1960s: Data collection, database creation, IMS and network DBMS

1970s: Relational data model, relational DBMS implementation

1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s–2000s: Data mining and data warehousing, multimedia databases, and Web databases


From Data to Knowledge

It is common to break down the process of learning from data into three stages: data, information and knowledge.


From Data to Knowledge

Data: Raw numbers

Information: Data with context or meaning

Knowledge: Data Structures / Patterns (Knowledge must be useful)


Data Mining / Intelligent Data Analysis

“Data mining is applying Machine Learning techniques to historical data to improve future decisions” (Tom Mitchell, 1997)


Knowledge Discovery

Knowledge Discovery in Databases (KDD)

The Process (from Advances in KDD and Data mining):

Data → Target Data → Pre-processed Data → Transformed Data → Patterns → Knowledge


Data Mining - Tools

Typical tools:

• Statistical analysis
• Summarisation
• Outlier detection
• Correlation
• Regression
• Clustering
• Association rules
• Time series models
• Decision trees (classification)


Data Mining - Applications

Some successful examples of its use:

• Pharmaceutical companies – drug discovery
• Credit card companies – fraud detection
• Transportation companies – routing
• Large consumer packaged goods companies – improving the sales process to retailers
• Hospital organisation – decision analysis


Examples of Data Mining Tools

We will now look at some core techniques commonly used for analysing and mining business data warehouses:

• Correlation
• Visualisation
• Clustering
• Regression


Clustering

An example in biology…

• Things that are brown and run away: animals
• Things that are green and don’t run away: plants


Clustering

An example in biology…

– Kingdom
– Phylum
– Class
– Order
– Family
– Genus
– Species

Hierarchical clustering (more later)


Clustering

The process:

1. Extract features (colour, movement, sensory organs, etc.): more later
2. Cluster into categories
3. Consolidation


Clustering

Clustering: to partition a data set into subsets (clusters), so that the data in each subset share some common trait, often similarity or proximity for some defined distance measure.

• The process of organizing objects into groups whose members are similar in some way.

• A cluster is therefore a collection of objects which are “similar” to one another and “dissimilar” to the objects belonging to other clusters.

• Unsupervised: No need for the ‘teacher’ signals, i.e. the desired output.

[Figure: patterns plotted in the x1–x2 feature plane, forming Cluster 1 and Cluster 2]

Supervised and Unsupervised Learning

Unsupervised learning: learning without the desired output (‘teacher’ signals).

Supervised learning: learning with the desired output.

• Clustering is one of the most widely used unsupervised learning methods.
• Other unsupervised learning:
  • Dimensionality reduction (factor analysis, principal component analysis, independent component analysis, …)
  • Time series modelling
  • Source separation
• Supervised learning:
  • Classification
  • Regression


Patterns, Clusters and Features (1)

• Patterns: physical objects
• Clusters: categories of objects (e.g. animals, plants)
• Features: attributes of objects (e.g. colour: brown, green, …)

Patterns, Clusters and Features (2)

[Figure: Creating vehicles’ clusters. Each pattern (vehicle) is a point in the features’ space, with top speed and weight [kg] as the two feature axes; the points form three clusters: sports cars, medium market cars and lorries.]


Social networks

• Marketing
• Terror networks
• Allocation of resources in a company / university

How to do clustering?

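[Figure: patterns plotted in the x1–x2 feature plane, forming Cluster 1 and Cluster 2]

What we know: patterns represented by their feature vectors, e.g. $\mathbf{x} = (370, 851)^T$.

General case: $\mathbf{x} = (x_1, x_2, \ldots, x_d)^T$ is in the d-dimensional domain of the feature vectors.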

What we need to find out: the clusters

Pattern Similarity

• A key concept in clustering: similarity.
• Clusters are formed by similar patterns.
• In computer science, we need to define some metric to measure similarity. One of the commonly adopted similarity metrics is distance.

A general definition of distance (between patterns A and B), where the upper limit $d$ of the sum is the number of features:

$$d(A, B) = \left( \sum_{i=1}^{d} \left| x_i^A - x_i^B \right|^b \right)^{1/b}$$

• b=2: Euclidean distance
• b=1: Manhattan distance

The shorter the distance, the more similar the two patterns.
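The general distance definition can be tried out with a few lines of Python. This is only a minimal sketch: the function name `minkowski_distance`, the parameter name `order` (standing for the exponent b) and the two example patterns are illustrative assumptions, not part of the slides.

```python
def minkowski_distance(a, b, order=2):
    """Distance between feature vectors a and b.

    `order` plays the role of the exponent b in the slides:
    order=2 gives the Euclidean distance, order=1 the Manhattan distance.
    """
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    # Sum |x_i^A - x_i^B|^b over all features, then take the b-th root.
    return sum(abs(x - y) ** order for x, y in zip(a, b)) ** (1.0 / order)


# Two hypothetical patterns with two features each.
pattern_a = [0.0, 0.0]
pattern_b = [3.0, 4.0]

euclidean = minkowski_distance(pattern_a, pattern_b, order=2)
manhattan = minkowski_distance(pattern_a, pattern_b, order=1)
print(euclidean, manhattan)  # 5.0 7.0
```

The same pair of patterns is 5 apart under the Euclidean distance and 7 apart under the Manhattan distance, so the choice of b changes which patterns count as "close".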


Pattern Similarity & Distance Metrics

Many methods are designed to work on Distance Metrics, e.g. K-Means.

They assume that the Triangle Inequality holds: “the sum of the lengths of any two sides must be greater than the length of the remaining side”
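As a rough illustration of why the triangle inequality matters, the sketch below checks it by brute force on a handful of points. The helper names and the sample points are assumptions made for this example. For a distance metric the condition is usually written with "less than or equal to", because three points on a straight line give equality. The squared Euclidean distance is included only to show a dissimilarity measure that fails the check.

```python
import itertools
import math


def euclidean(a, b):
    # Straight-line distance between two points (requires Python 3.8+).
    return math.dist(a, b)


def squared_euclidean(a, b):
    # Squared straight-line distance: a dissimilarity, but not a metric.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def satisfies_triangle_inequality(points, distance, tolerance=1e-12):
    """Return True if d(a, c) <= d(a, b) + d(b, c) for every triple of points."""
    for a, b, c in itertools.permutations(points, 3):
        if distance(a, c) > distance(a, b) + distance(b, c) + tolerance:
            return False
    return True


# Hypothetical sample points; the first three lie on a straight line.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (1.0, 3.0)]

euclidean_ok = satisfies_triangle_inequality(points, euclidean)
squared_ok = satisfies_triangle_inequality(points, squared_euclidean)
print(euclidean_ok, squared_ok)  # True False
```

A check like this only tests the given points; it cannot prove that a measure is a metric, but a single failing triple is enough to show that it is not.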


Pattern Similarity & Distance Metrics

Distance metrics:

• Euclidean
• Correlation
• Minkowski
• Manhattan
• Mahalanobis

Relationship metrics:

• How long is a piece of string?
• Often application dependent


K-Means Clustering

Algorithm 1: K-Means Clustering

1. Place K points into the feature space. These points represent initial cluster centroids.
2. Assign each pattern to the closest cluster centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the assignments do not change.
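The four steps can be sketched in plain Python as follows. This is an illustrative sketch, not the lab code. It assumes the Euclidean distance, picks the initial centroids at random from the patterns themselves (the slides only say "place K points"), leaves the centroid of an empty cluster where it was, and caps the number of iterations. The function name `k_means` and the toy patterns are hypothetical.

```python
import math
import random


def k_means(patterns, k, max_iterations=100, seed=0):
    """Cluster `patterns` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 1: place K initial centroids (assumption: K distinct random patterns).
    centroids = [list(p) for p in rng.sample(patterns, k)]
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign each pattern to the closest centroid.
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in patterns
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its assigned patterns.
        for j in range(k):
            members = [p for p, a in zip(patterns, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(v) / len(members) for v in zip(*members)]
    return centroids, assignments


# Hypothetical toy data: two well-separated groups of three patterns each.
patterns = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
            (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = k_means(patterns, k=2)
print(centroids)
print(assignments)
```

On data this well separated the first three patterns end up in one cluster and the last three in the other; which of the two gets label 0 depends on the random initial centroids.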


K-Means Clustering

Interactive Demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html


Discussions (1)

1. How to determine k, the number of clusters?


Discussions (2)

2. Any alternative ways of choosing the initial cluster centroids?


Discussions (3)

3. Does the algorithm converge to the same results with different selections of initial cluster centroids? If not, what should we do in practice?
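One common practice, offered here as a possible answer rather than as the one intended by the slides, is to run K-Means from several random initialisations and keep the run with the smallest within-cluster sum of squared distances. The sketch below assumes the Euclidean distance and random patterns as initial centroids; all function names and the toy data are hypothetical.

```python
import math
import random


def k_means(patterns, k, rng, max_iterations=100):
    """One K-Means run; initial centroids are k random patterns (assumption)."""
    centroids = [list(p) for p in rng.sample(patterns, k)]
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in patterns
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        for j in range(k):
            members = [p for p, a in zip(patterns, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(v) / len(members) for v in zip(*members)]
    return centroids, assignments


def within_cluster_sse(patterns, centroids, assignments):
    # Sum of squared distances from each pattern to its own centroid.
    return sum(math.dist(p, centroids[a]) ** 2
               for p, a in zip(patterns, assignments))


def best_of_restarts(patterns, k, restarts=10, seed=0):
    """Run K-Means `restarts` times and keep the lowest-SSE result."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        centroids, assignments = k_means(patterns, k, rng)
        sse = within_cluster_sse(patterns, centroids, assignments)
        if best is None or sse < best[0]:
            best = (sse, centroids, assignments)
    return best


# Hypothetical toy data: two well-separated groups of three patterns each.
patterns = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
            (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
best_sse, best_centroids, best_assignments = best_of_restarts(patterns, k=2)
print(best_sse, best_assignments)
```

Comparing runs by their sum of squared distances only makes sense for a fixed k, since the value generally shrinks as k grows.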


Reading

Chapter 9, Section 9.3: David Hand “Principles of Data Mining”, MIT Press

Chapter 8: Pang-Ning Tan “Introduction to Data Mining”

Anil Jain: “Data Clustering: 50 Years Beyond K-Means”, Pattern Recognition Letters


Lab

In the lab:

• Examine a piece of Java code for K-Means clustering
• Explore the use of K-Means on some toy datasets
• Visualise the clusterings using an Excel macro
