
Page 1

Fakultät für Elektrotechnik und Informatik, Institut für Verteilte Systeme

AG Intelligente Systeme - Data Mining group

Data Mining I

Summer semester 2019

Lecture 11: Clustering – 2: Hierarchical clustering

Lectures: Prof. Dr. Eirini Ntoutsi

TAs: Tai Le Quy, Vasileios Iosifidis, Maximilian Idahl, Shaheer Asghar

Page 2

Clustering topics covered in DM1

1. Partitioning-based clustering

kMeans, kMedoids

2. Density-based clustering

DBSCAN

3. Grid-based clustering

4. Hierarchical clustering

DIANA, AGNES, BIRCH, ROCK, CHAMELEON

5. Clustering evaluation

Page 3

Outline

Hierarchical clustering

Bisecting k-Means

An overview of clustering

Homework/tutorial

Things you should know from this lecture

Page 5

Hierarchical-based clustering

Produces a set of nested clusters organized as a hierarchical tree

Can also be visualized as a dendrogram

A tree-like diagram that records the sequence of merges or splits and the cluster memberships

The height at which two clusters are merged in the dendrogram reflects their distance

An instance can belong to multiple clusters.

The assignment, though, is still hard

[Figure: nested clusters over points 1–6 (left) and the corresponding dendrogram (right); the dendrogram's vertical axis shows the distance at which clusters are merged.]

Page 6

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

A clustering can be obtained by ‘cutting’ the dendrogram at the proper level

Cutting based on distance (e.g., I want clusters that merge at distance ≤ 0.1)

Cutting based on the number of clusters (e.g., I want 2 clusters); both cuts are sketched in code below
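To make the two cuts concrete, here is a minimal sketch (not from the slides) using SciPy's hierarchical-clustering routines on a small made-up 2-D dataset; the coordinates and thresholds are illustrative only.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # one group of points
              [1.0, 1.0], [1.1, 1.0], [1.0, 1.1]])  # another group

Z = linkage(X, method='single', metric='euclidean')  # sequence of agglomerative merges

# Cut by distance: keep only merges below the threshold 0.5
labels_by_distance = fcluster(Z, t=0.5, criterion='distance')

# Cut by number of clusters: ask for exactly 2 clusters
labels_by_k = fcluster(Z, t=2, criterion='maxclust')

print(labels_by_distance)   # e.g., [1 1 1 2 2 2]
print(labels_by_k)          # e.g., [1 1 1 2 2 2]
# dendrogram(Z) would draw the tree itself (requires matplotlib)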

[Figure: dendrogram over points 3, 6, 4, 1, 2, 5; the distance axis (0–0.25) shows where each cut would be made.]

Page 7

Applications of hierarchical clustering 1/3

The dendrogram of clusters may correspond to meaningful taxonomies

Example in biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)

Source: http://currents.plos.org/treeoflife/article/the-tree-of-life-and-a-new-classification-of-bony-fishes/

Page 8

Applications of hierarchical clustering 2/3

The dendrogram of clusters may correspond to meaningful taxonomies

Dendrogram showing hierarchical clustering of tissue gene expression data with colours denoting tissues.

Source: http://genomicsclass.github.io/book/pages/clustering_and_heatmaps.html

Page 9

Applications of hierarchical clustering 3/3

The dendrogram of clusters may correspond to meaningful taxonomies

USArrests dataset: statistics in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973.

Source: https://uc-r.github.io/hc_clustering

Page 10

Hierarchical vs Partitioning

[Figure: points p1–p4 shown as a partitioning clustering and as a hierarchical clustering (nested clusters and dendrogram).]

Hierarchical clustering algorithms typically have local objectives

Partitioning algorithms typically have global objectives, e.g., k-Means

Page 11

Hierarchical clustering methods

Two main types of hierarchical clustering

Agglomerative or AGNES (Agglomerative Nesting):

Bottom-up approach

Start with the points as individual clusters

At each step, merge the closest pair of clusters

until only one cluster (or k clusters) is left

Divisive or DIANA (Divisive Analysis):

Top-down approach

Start with one, all-inclusive cluster

At each step, split a cluster until each cluster contains a single point (or there are k clusters)

Merge or split one cluster at a time

[Figure: the dendrogram over points 1–6, read bottom-up for the agglomerative (AGNES) direction and top-down for the divisive (DIANA) direction.]

Page 12

Hierarchical clustering methods

Traditional hierarchical algorithms use a similarity or distance matrix to decide on which cluster to split/merge next

Employed distance/similarity function depends on the application

[Figure: points p1–p12 and their proximity matrix.]

Page 13

Agglomerative clustering algorithm

Most popular hierarchical clustering technique

Basic algorithm is straightforward

1. Compute the proximity matrix

2. Let each data point be a cluster

3. Repeat

4. Merge the two closest clusters

5. Update the proximity matrix

6. Until only a single cluster remains

Key operation: the computation of the proximity of two clusters

Different approaches (single link, complete link, …) lead to different algorithms; a from-scratch sketch of the basic loop follows below
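A from-scratch sketch of the loop above, assuming single-link distance between clusters (an illustration, not the lecture's reference implementation). For clarity it recomputes cluster distances from the raw points in each iteration instead of maintaining a stored proximity matrix.

import numpy as np
from scipy.spatial.distance import cdist

def single_link(A, B):
    # minimum pairwise distance between the point sets A and B
    return cdist(A, B).min()

def agglomerative(X, cluster_dist=single_link):
    clusters = [[i] for i in range(len(X))]         # 1.-2. one cluster per point
    merges = []                                     # (members_a, members_b, distance)
    while len(clusters) > 1:                        # 3./6. repeat until one cluster remains
        best = None
        for a in range(len(clusters)):              # 4. find the two closest clusters
            for b in range(a + 1, len(clusters)):
                d = cluster_dist(X[clusters[a]], X[clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]     # 5. merge cluster b into cluster a ...
        del clusters[b]                             # ... and drop the old cluster
    return merges

X = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.2, 1.0]])   # toy data
for step in agglomerative(X):
    print(step)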

Page 14

Starting situation

Start with clusters of individual points and a proximity matrix

[Figure: points p1–p12, each forming its own cluster, and the initial proximity matrix.]

Page 15

Intermediate situation I

After some merging steps, we have some clusters

[Figure: five intermediate clusters C1–C5 and the corresponding C1–C5 proximity matrix.]

Page 16

Intermediate situation II

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

[Figure: clusters C1–C5 with C2 and C5 about to be merged, and the affected rows and columns of the proximity matrix.]

Page 17

Merging

Two major questions for merging

How do we identify the closest pair of clusters to be merged?

How do we update the proximity matrix? (a sketch of the update follows below)
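For the second question, a minimal sketch of the update, assuming the current cluster distances are stored in a symmetric NumPy matrix D: the row of the merged cluster can be derived directly from the old rows of i and j (shown here for single and complete link only).

import numpy as np

def merged_row(D, i, j, linkage='single'):
    # distances from the merged cluster (i U j) to every other cluster
    if linkage == 'single':            # MIN: the closer of the two old distances survives
        row = np.minimum(D[i], D[j])
    elif linkage == 'complete':        # MAX: the farther of the two survives
        row = np.maximum(D[i], D[j])
    else:
        raise ValueError('only single/complete link are sketched here')
    return row                         # rows/columns i and j are then replaced by this row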

[Figure: after merging C2 and C5, the proximity-matrix entries between C2 ∪ C5 and the remaining clusters are unknown ("?") and must be recomputed.]

Page 18

Distance between clusters

Each cluster is a set of points

How do we compare two sets of points/clusters?

A variety of different methods (sketched in code after this list):

Single link (or MIN)

Complete link (or MAX)

Group average

Distance between centroids

Distance between medoids

Other methods driven by an objective function

Ward’s Method uses squared error
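As a reference for the next slides, here are minimal sketches (not from the lecture) of the first four of these distances, for two clusters given as NumPy arrays of points A and B.

import numpy as np
from scipy.spatial.distance import cdist

def d_single(A, B):    # MIN: closest pair of points
    return cdist(A, B).min()

def d_complete(A, B):  # MAX: farthest pair of points
    return cdist(A, B).max()

def d_average(A, B):   # group average: mean over all pairs
    return cdist(A, B).mean()

def d_centroid(A, B):  # distance between the two centroids
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

A = np.array([[0.0, 0.0], [0.0, 1.0]])   # toy clusters
B = np.array([[3.0, 0.0], [4.0, 0.0]])
print(d_single(A, B), d_complete(A, B), d_average(A, B), d_centroid(A, B))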

Page 19

Distance between clusters: Single link distance or MIN

Single link (or MIN) distance between C_i and C_j is the minimum distance between any object in C_i and any object in C_j:

dis_{sl}(C_i, C_j) = \min_{x \in C_i, y \in C_j} d(x, y)

i.e., the distance is defined by the two closest objects (the shortest edge between the clusters)

Page 20

Distance between clusters: Complete link or MAX

Complete link (or MAX) distance between C_i and C_j is the maximum distance between any object in C_i and any object in C_j:

dis_{cl}(C_i, C_j) = \max_{x \in C_i, y \in C_j} d(x, y)

i.e., the distance is defined by the two most dissimilar objects (the longest edge between the clusters)

Page 21

Distance between clusters: Group average

Group average distance between C_i and C_j is the average distance between any object in C_i and any object in C_j:

dis_{avg}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i, y \in C_j} d(x, y)

Page 22

Distance between clusters: Centroid distance

Centroid distance between C_i and C_j is the distance between the centroid c_i of C_i and the centroid c_j of C_j:

dis_{centroids}(C_i, C_j) = d(c_i, c_j)

where the centroid of a cluster with points p_1, …, p_n is c = \frac{1}{n} \sum_{i=1}^{n} p_i

Page 23

Example

[Figure: an example dataset of 6 two-dimensional points and its Euclidean distance matrix.]
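The slide's six points are not reproduced in this text, so the following sketch builds the Euclidean distance matrix for six made-up 2-D points to illustrate the same computation.

import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[0.1, 0.5], [0.2, 0.4], [0.4, 0.3],    # hypothetical coordinates
                   [0.3, 0.2], [0.1, 0.4], [0.5, 0.3]])

D = squareform(pdist(points, metric='euclidean'))   # 6 x 6 symmetric distance matrix
print(np.round(D, 2))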

Page 24

Back to the pseudocode of the agglomerative clustering algorithm

Pseudocode of the algorithm

1. Compute the proximity matrix

2. Let each data point be a cluster

3. Repeat

4. Merge the two closest clusters

5. Update the proximity matrix

6. Until only a single cluster remains

Page 25

Single link distance or MIN agglomerative clustering algorithm

Similarity of two clusters is based on the most similar (closest) pair of objects

Determined by one pair of points

dis_{sl}(C_i, C_j) = \min_{x \in C_i, y \in C_j} d(x, y)

[Figure: single-link nested clusters over points 1–6 and the corresponding dendrogram.]

Page 26

Short break (5’)

Given the following 1-dimensional dataset, build a hierarchical agglomerative clustering using single-link distance
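The exercise's dataset is not shown in this text; the sketch below works through the same task on a made-up 1-D dataset, using SciPy's single-link agglomerative clustering so the sequence of merges can be inspected.

import numpy as np
from scipy.cluster.hierarchy import linkage

x = np.array([1.0, 2.0, 4.0, 8.0, 9.0, 16.0]).reshape(-1, 1)   # hypothetical 1-D data
Z = linkage(x, method='single')   # single-link agglomerative clustering

# Each row of Z is one merge: [cluster_a, cluster_b, merge distance, cluster size]
print(Z)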

Page 27

Single link distance (MIN): strengths

Can discover clusters of arbitrary shapes

[Figure: original points and the two arbitrarily shaped clusters discovered with single link.]

Page 28

Single link distance (MIN): limitations

[Figure: original points and the two clusters produced by single link in the presence of noise.]

Sensitive to noise and outliers

DBSCAN can be viewed as a robust variant of single link distance

It excludes noisy points between clusters to avoid undesirable chaining effects.

Page 29

Single link distance (MIN): limitations

Produces long, elongated clusters (chain-like clusters)

Page 30

Complete link distance or MAX agglomerative clustering algorithm

Similarity of two clusters is based on the least similar (most distant) pair of objects

Determined by one pair of points

dis_{cl}(C_i, C_j) = \max_{x \in C_i, y \in C_j} d(x, y)

[Figure: complete-link nested clusters over points 1–6 and the corresponding dendrogram.]

Page 31

Complete link distance (MAX): strengths

[Figure: original points and the two clusters found with complete link.]

Less susceptible to noise and outliers compared to MIN

Page 32

Complete link distance (MAX): limitations

Because it focuses on minimizing the diameter of the clusters, it tends to create clusters of similar diameter

If some natural clusters are larger than others, it tends to break the large ones

[Figure: original points and the two clusters found with complete link; the larger natural cluster is broken up.]

Page 33

Short break (5’)

Given the following 1-dimensional dataset, build a hierarchical agglomerative clustering using complete-link distance

Page 34

(Group) Average-link distance agglomerative clustering algorithm

Proximity of two clusters is the average of pairwise distances between objects in the two clusters.

Determined by all pairs of points in the two clusters

dis_{avg}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i, y \in C_j} d(x, y)

[Figure: group-average nested clusters over points 1–6 and the corresponding dendrogram.]

Page 35

(Group) Average-link distance: strengths and limitations

Compromise between Single and Complete Link

Strengths

Less susceptible to noise and outliers

Limitations

Biased towards spherical clusters

Page 36

Centroid-link distance agglomerative clustering algorithm

The distance between two clusters is the distance of their corresponding centroids: dis_{centroids}(C_i, C_j) = d(c_i, c_j)

Difference to the other measures (often considered a drawback): the possibility of inversions

Two clusters merged at step k might be closer (more similar) than the pair of clusters merged at step k−1

For the other methods, distance between clusters monotonically increases (or at worst does not increase)
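A small numeric illustration (not from the slides) of such an inversion: for three points forming an equilateral triangle, the second merge happens at a lower height than the first.

import numpy as np

a, b, c = np.array([0.0, 0.0]), np.array([2.0, 0.0]), np.array([1.0, np.sqrt(3)])

# Step 1: all pairwise distances are 2, so e.g. {a} and {b} are merged at height 2
step1 = np.linalg.norm(a - b)             # 2.0

# Step 2: the centroid of {a, b} is (1, 0); its distance to c is sqrt(3) < 2,
# i.e. the later merge happens at a lower height than the earlier one
step2 = np.linalg.norm((a + b) / 2 - c)   # ~1.73

print(step1, step2)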

Page 37

Ward’s method

Ward’s method or Ward's minimum variance method

Clusters are represented by centroids

The proximity between two clusters is measured in terms of the increase in SSE (sum of squared errors) that results from merging the two clusters

At each step, merge the pair of clusters that leads to the minimum increase in total within-cluster variance after merging (a closed-form expression for this increase is given below)
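The increase in SSE caused by merging C_i and C_j can be written in closed form (the standard Ward merging cost; not shown on the slide), where c_i and c_j are the cluster centroids:

\Delta SSE(C_i, C_j) = \frac{|C_i|\,|C_j|}{|C_i| + |C_j|} \, \lVert c_i - c_j \rVert^2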

[Figure: nested clusters over points 1–6 produced by Ward's method.]

Page 38

Ward's method (cont'd)

Ward's method is similar to k-Means: it tries to minimize the sum of squared distances of points from their cluster centroids, but not globally (merge decisions are local)

Less susceptible to noise and outliers

Biased towards spherical clusters

Page 39

Comparison of the different methods

[Figure: the clusterings of the same six points produced by single link (MIN), complete link (MAX), group average, and Ward's method, shown as nested clusters.]
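A small sketch (not from the slides) that runs the four methods on the same made-up dataset and reports the sizes of the two clusters each of them produces; the data and the random seed are arbitrary.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),      # a small, dense group
               rng.normal(3, 1.0, (40, 2))])     # a larger, looser group

for method in ['single', 'complete', 'average', 'ward']:
    labels = fcluster(linkage(X, method=method), t=2, criterion='maxclust')
    print(method, np.bincount(labels)[1:])       # sizes of the two clusters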

Page 40

Hierarchical methods: complexity

O(N²) space to store the proximity matrix

N is the number of points.

O(N³) time in most of the cases

There are N steps and, at each step, a proximity matrix of size N² must be updated and searched

Complexity can be reduced to O(N² log N) time for some approaches using appropriate data structures

Page 41

Hierarchical clustering: overview

No knowledge on the number of clusters

Produces a hierarchy of clusters, not a flat clustering

A single clustering can be obtained from the dendrogram

No backtracking: Merging decisions are final

Once a decision is made to combine two clusters, it cannot be undone

Lack of a global objective function

Decisions are local, at each step

No objective function is directly minimized

Different schemes have problems with one or more of the following:

Sensitivity to noise and outliers

Breaking large clusters

Difficulty handling different sized clusters and convex shapes

Inefficiency, especially for large datasets

Page 42

Outline

Hierarchical clustering

Bisecting k-Means

An overview of clustering

Homework/tutorial

Things you should know from this lecture

Page 43

Bisecting k-Means

Hybrid method, combines k-Means and hierarchical clustering

Idea: first split the set of points into two clusters, select one of these clusters for further splitting, and so on, until k clusters remain.

Pseudocode: see the sketch below

Which cluster to split?

The one with the largest SSE (the worst one)

Or based on both SSE and size
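A minimal sketch of the bisecting k-Means loop described above, assuming scikit-learn's KMeans for the 2-way splits and the largest SSE as the splitting criterion; it is an illustration, not the lecture's pseudocode.

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k):
    clusters = [np.arange(len(X))]                 # start with one all-inclusive cluster
    while len(clusters) < k:
        # pick the cluster with the largest SSE (the "worst" one)
        sse = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        idx = clusters.pop(int(np.argmax(sse)))
        # split it into two with plain k-Means (k = 2)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters

X = np.random.rand(100, 2)                         # toy data
for c in bisecting_kmeans(X, k=4):
    print(len(c))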

Page 44

Bisecting k-Means

An example

Page 45

Outline

Hierarchical clustering

Bisecting k-Means

An overview of clustering

Homework/tutorial

Things you should know from this lecture

Page 46

An overview of clustering

Intuitively, a cluster is a set of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters

Cluster analysis: find similarities between data according to the characteristics found in the data and group similar data objects into clusters

Key points in clustering

Similarity/ distance function

Learning algorithm

An unsupervised learning task

No clues on the number of clusters, nor on the characteristics of these clusters

Important DM task: as a stand-alone tool or as a preprocessing step

A large amount of algorithms

Partitioning methods

Hierarchical methods

Density-based methods

Model-based methods

….

Page 47

Outline

Hierarchical clustering

Bisecting k-Means

An overview of clustering

Homework/tutorial

Things you should know from this lecture

Page 48

Homework/ tutorial

Homework

Use the Elki data mining tool to experiment with clustering algorithms http://elki.dbs.ifi.lmu.de/

Or Python/ Weka (more limited w.r.t. clustering)

Readings:

Tan P.-N., Steinbach M., Kumar V., Introduction to Data Mining, Chapter 8.

Data Clustering: A Review, https://www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf

Nando de Freitas youtube video: https://www.youtube.com/watch?v=voN8omBe2r4

Page 49

Outline

Hierarchical clustering

Bisecting k-Means

An overview of clustering

Homework/tutorial

Things you should know from this lecture

Page 50

Things you should know from this lecture

Hierarchical clustering basics

Agglomerative approach

Similarity measures between clusters

Bisecting kMeans
