
Basic Machine Learning: Clustering

CS 315 – Web Search and Data Mining


Supervised vs. Unsupervised Learning

Two Fundamental Methods in Machine Learning

Supervised Learning ("learn from my example")
Goal: a program that performs a task as well as humans.
TASK – well defined (the target function)
EXPERIENCE – training data provided by a human
PERFORMANCE – error/accuracy on the task

Unsupervised Learning ("see what you can find")
Goal: to find some kind of structure in the data.
TASK – vaguely defined
No EXPERIENCE
No PERFORMANCE (but there are some evaluation metrics)


What is Clustering?

The most common form of Unsupervised Learning

Clustering is the process of grouping a set of physical or abstract objects into classes ("clusters") of similar objects.

It can be used in IR:
To improve recall in search
For better navigation of search results


Ex1: Cluster to Improve Recall

Cluster hypothesis: Documents with similar text are related

Thus, when a query matches a document D, also return other documents in the cluster containing D.
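As a toy sketch of how this works in code (the document IDs and cluster assignments below are hypothetical; a real system would plug in its own retrieval stage):

```python
# Hypothetical precomputed clustering: doc id -> cluster id, plus the
# inverse mapping from cluster id to member docs.
cluster_of = {"d1": 0, "d2": 0, "d3": 1}
members = {0: ["d1", "d2"], 1: ["d3"]}

def expand_with_clusters(matched_docs):
    """Return the matches plus every cluster-mate of each matched doc."""
    expanded = set(matched_docs)
    for d in matched_docs:
        expanded.update(members[cluster_of[d]])
    return expanded

# A query that matched only d1 now also returns d2 (same cluster).
print(expand_with_clusters(["d1"]))   # {'d1', 'd2'}
```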


Ex2: Cluster for Better Navigation

[Figure: a search interface presenting results grouped into clusters.]

Clustering Characteristics

Flat Clustering vs. Hierarchical Clustering
Flat: just dividing objects into groups (clusters)
Hierarchical: organize the clusters in a hierarchy

Evaluating Clustering
Internal criteria:
The intra-cluster similarity is high (tightness)
The inter-cluster similarity is low (separateness)
(a small sketch of these two measures follows this list)
External criteria:
Did we discover the hidden classes? (we need gold-standard data for this evaluation)
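A minimal sketch of the two internal criteria, assuming documents are unit-length rows of a NumPy matrix (the data layout and function name are illustrative, not from the slides):

```python
import numpy as np

def internal_criteria(X, labels):
    """Mean pairwise cosine within clusters (tightness, want high) and
    across clusters (separateness, want low). Rows of X are unit length,
    so X @ X.T holds all pairwise cosine similarities."""
    labels = np.asarray(labels)
    sims = X @ X.T
    same = labels[:, None] == labels[None, :]     # same-cluster pair mask
    off_diag = ~np.eye(len(labels), dtype=bool)   # exclude self-pairs
    return sims[same & off_diag].mean(), sims[~same].mean()
```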


Clustering for Web IR

Representation for clustering:
Document representation
Need a notion of similarity/distance

How many clusters?
Fixed a priori?
Completely data-driven?
Avoid "trivial" clusters – too large or too small


Recall: Documents as vectors

Each doc j is a vector of tf.idf values, one component for each term.

Can normalize to unit length.

Vector space: terms are axes (aka features); N docs live in this space; even with stemming, there may be 20,000+ dimensions.

What makes documents related?


Each component is the tf.idf weight

$$w_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_i$$

and each vector is normalized to unit length:

$$\vec{d}_j \leftarrow \frac{\vec{d}_j}{\|\vec{d}_j\|}, \quad \text{where } \|\vec{d}_j\| = \sqrt{\sum_{i=1}^{n} w_{i,j}^2}$$
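As a concrete illustration, a minimal sketch of this weighting and normalization over three made-up toy documents (the plain log(N/df) idf follows the formula above; all names are illustrative):

```python
import math
import numpy as np

docs = [["web", "search", "web"], ["data", "mining"], ["web", "data"]]
vocab = sorted({t for d in docs for t in d})
N = len(docs)
df = {t: sum(t in d for d in docs) for t in vocab}   # document frequency

def tfidf_unit_vector(doc):
    """w_ij = tf_ij * idf_i, then normalize d_j to unit length."""
    vec = np.array([doc.count(t) * math.log(N / df[t]) for t in vocab],
                   dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

vectors = np.array([tfidf_unit_vector(d) for d in docs])
```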


Intuition for relatedness


[Figure: documents D1–D4 plotted as vectors in a two-term space with axes t1 and t2.]

Documents that are “close together” in vector space talk about the same things.


What makes documents related?

Ideal: semantic similarity. Practical: statistical similarity.

We will use cosine similarity, and describe the algorithms in terms of it.


Cosine similarity of normalized vectors $\vec{d}_j$ and $\vec{d}_k$:

$$\mathrm{sim}(\vec{d}_j, \vec{d}_k) = \sum_{i=1}^{n} w_{i,j}\, w_{i,k}$$

This is known as the “normalized inner product”.
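Because the vectors are unit length, the cosine reduces to a plain dot product; a tiny illustration with made-up values:

```python
import numpy as np

d_j = np.array([0.8, 0.6, 0.0])   # unit-length toy vectors
d_k = np.array([0.6, 0.8, 0.0])
sim = float(d_j @ d_k)            # sum_i w_ij * w_ik
print(sim)                        # 0.96
```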


Clustering Algorithms

Hierarchical algorithms: bottom-up, agglomerative clustering

Partitioning "flat" algorithms:
Usually start with a random (partial) partitioning
Refine it iteratively

The famous k-means partitioning algorithm:
Given: a set of n documents and the number k
Compute: a partition into k clusters that optimizes the chosen partitioning criterion


K-means

Assumes documents are real-valued vectors.

Clusters are based on the centroids (= the center of gravity, or mean) of the points in a cluster c:

Reassignment of instances to clusters is based on distance to the current cluster centroids.


$$\vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$$


K-Means Algorithm


Let d be the distance measure between instances.
Select k random instances {s1, s2, …, sk} as seeds.
Until clustering converges (or another stopping criterion is met):
For each instance xi: assign xi to the cluster cj such that d(xi, sj) is minimal.
Update the seeds to the centroid of each cluster: for each cluster cj, sj = μ(cj).
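A compact sketch of this pseudocode, assuming Euclidean distance for d and documents as NumPy row vectors (the function name, iteration cap, and data are placeholders):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    seeds = X[rng.choice(len(X), size=k, replace=False)]  # k random instances
    for _ in range(iters):
        # assign each x_i to the cluster c_j with minimal d(x_i, s_j)
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update each seed to the centroid mu(c_j) of its cluster
        new_seeds = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else seeds[j]
                              for j in range(k)])
        if np.allclose(new_seeds, seeds):   # centroids stopped moving
            break
        seeds = new_seeds
    return labels, seeds
```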


K-means: Practical Issues

When to stop?
When a fixed number of iterations is reached
When centroid positions do not change

Seed Choice
Results can vary based on random seed selection.
Try out multiple starting points (see the restart sketch after the example below)


[Figure: six points A–F illustrating sensitivity to seeds. Starting with centroids B and E converges to one clustering; starting with centroids D and F converges to a different one.]
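One standard remedy is sketched below: rerun with several random seed sets and keep the result with the lowest within-cluster sum of squared distances (this reuses the kmeans() sketch from the previous page; the restart count is arbitrary):

```python
import numpy as np

def kmeans_restarts(X, k, n_restarts=10):
    best, best_cost = None, np.inf
    for seed in range(n_restarts):
        labels, cents = kmeans(X, k, seed=seed)   # sketch from previous page
        # within-cluster sum of squared distances to the centroids
        cost = sum(np.sum((X[labels == j] - cents[j]) ** 2) for j in range(k))
        if cost < best_cost:
            best, best_cost = (labels, cents), cost
    return best
```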


Hierarchical clustering

Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples.


animal
├─ vertebrate: fish, reptile, amphib., mammal
└─ invertebrate: worm, insect, crustacean


Hierarchical Agglomerative Clustering

We assume there is a similarity function that determines the similarity of two instances.


Algorithm:
Start with all instances in their own cluster.
Until there is only one cluster:
Among the current clusters, determine the two clusters, ci and cj, that are most similar.
Replace ci and cj with a single cluster ci ∪ cj.

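A minimal sketch of this loop, assuming unit-length NumPy row vectors and the single-link (maximum) similarity defined on the next page; it records the sequence of merges rather than building an explicit dendrogram:

```python
import numpy as np

def hac(X):
    """Agglomerative clustering over unit-length row vectors, using
    single-link (max) cosine similarity between clusters."""
    sims = X @ X.T                              # all pairwise cosines
    clusters = [[i] for i in range(len(X))]     # each instance starts alone
    merges = []
    while len(clusters) > 1:
        # find the most similar pair of current clusters
        a, b = max(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda p: sims[np.ix_(clusters[p[0]],
                                             clusters[p[1]])].max())
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]  # replace c_i, c_j with c_i ∪ c_j
        del clusters[b]
    return merges
```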


What is the most similar cluster?

Single-link: similarity of the most cosine-similar (closest) pair of members

Complete-link: similarity of the "furthest" pair, the least cosine-similar

Group-average agglomerative clustering: average cosine between pairs of elements

Centroid clustering: similarity of the clusters' centroids


Single link clustering


1) Use maximum similarity of pairs:

$$\mathrm{sim}(c_i, c_j) = \max_{x \in c_i,\, y \in c_j}\, \mathrm{sim}(x, y)$$

2) After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:

$$\mathrm{sim}((c_i \cup c_j), c_k) = \max\big(\mathrm{sim}(c_i, c_k),\, \mathrm{sim}(c_j, c_k)\big)$$


Complete link clustering


1) Use minimum similarity of pairs:

$$\mathrm{sim}(c_i, c_j) = \min_{x \in c_i,\, y \in c_j}\, \mathrm{sim}(x, y)$$

2) After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:

$$\mathrm{sim}((c_i \cup c_j), c_k) = \min\big(\mathrm{sim}(c_i, c_k),\, \mathrm{sim}(c_j, c_k)\big)$$
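The two merge criteria side by side, sketched as functions over a precomputed pairwise similarity matrix, with clusters given as lists of row indices (the names and layout are my assumptions):

```python
import numpy as np

def single_link(sims, ci, cj):
    """Similarity of the most similar cross-cluster pair (max)."""
    return sims[np.ix_(ci, cj)].max()

def complete_link(sims, ci, cj):
    """Similarity of the least similar cross-cluster pair (min)."""
    return sims[np.ix_(ci, cj)].min()
```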


Major issue - labeling

After the clustering algorithm finds clusters, how can they be made useful to the end user?

We need a concise label for each cluster:
In search results, say "Animal" or "Car" in the jaguar example.
In topic trees (Yahoo!), we need navigational cues.

Often done by hand, a posteriori.


How to Label Clusters

Show titles of typical documents
Titles are easy to scan
Authors create them for quick scanning!
But you can only show a few titles, which may not fully represent the cluster

Show words/phrases prominent in the cluster
More likely to fully represent the cluster
Use distinguishing words/phrases (a sketch of this follows below)
But harder to scan
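A sketch of the second strategy: score each term by how much its mean tf.idf inside the cluster exceeds the corpus-wide mean, and label the cluster with the top scorers (the matrix layout and this particular scoring rule are assumptions, one simple notion of "distinguishing" terms):

```python
import numpy as np

def label_terms(X, labels, vocab, cluster, n_terms=3):
    """Top n_terms whose mean tf.idf inside `cluster` most exceeds the
    corpus-wide mean. X: docs x terms tf.idf matrix; vocab: term list."""
    labels = np.asarray(labels)
    score = X[labels == cluster].mean(axis=0) - X.mean(axis=0)
    return [vocab[i] for i in np.argsort(score)[::-1][:n_terms]]
```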


Further issues

Complexity: clustering is computationally expensive; implementations need careful balancing of needs.

How to decide how many clusters are best?

Evaluating the "goodness" of clustering:
There are many techniques; some focus on implementation issues (complexity/time), some on the quality of the resulting clusters.
