Advanced Multimedia Text Clustering Tamara Berg


Reminder - Classification

• Given some labeled training documents
• Determine the best label for a test (query) document


What if we don’t have labeled data?

• We can’t do classification.
• What can we do?

– Clustering - the assignment of objects into groups (called clusters) so that objects from the same cluster are more similar to each other than objects from different clusters.

– Often similarity is assessed according to a distance measure.

– Clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics.


Any of the similarity metrics we talked about before (SSD, angle between vectors)
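A minimal sketch of the "angle between vectors" measure, assuming documents are already plain lists of term counts. The vectors `doc_x`, `doc_y`, `doc_z` and the function names are hypothetical, chosen only for illustration.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumed convention: an all-zero vector is similar to nothing
    return dot / (norm_x * norm_y)

def angle_degrees(x, y):
    """Angle between two vectors in degrees (0 means same direction)."""
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    c = max(-1.0, min(1.0, cosine_similarity(x, y)))
    return math.degrees(math.acos(c))

# Hypothetical term-count vectors, not taken from the slides.
doc_x = [3, 0, 4]
doc_y = [6, 0, 8]   # same direction as doc_x, twice as long
doc_z = [0, 5, 0]   # shares no terms with doc_x

print(angle_degrees(doc_x, doc_y))
print(angle_degrees(doc_x, doc_z))
```

A small angle means the two documents use terms in similar proportions, regardless of document length; SSD, by contrast, is sensitive to length.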


Document Clustering

Clustering is the process of grouping a set of documents into clusters of similar documents.

Documents within a cluster should be similar.

Documents from different clusters should be dissimilar.


Source: Hinrich Schutze

Google News, Flickr Clusters


How to cluster Documents


Reminder - Vector Space Model

Documents are represented as vectors in term space

A vector distance/similarity measure between two documents is used to compare documents

Slide from Mitch Marcus
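A minimal sketch of the vector space model, assuming documents are already tokenized into lists of words. The helper names `build_vocabulary` and `to_vector` and the two toy documents are hypothetical.

```python
from collections import Counter

def build_vocabulary(docs):
    """Sorted list of every distinct term; list position = vector dimension."""
    return sorted({term for doc in docs for term in doc})

def to_vector(doc, vocabulary):
    """Term-count vector for one tokenized document; absent terms get 0."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]

# Hypothetical tokenized documents.
docs = [
    ["nova", "galaxy", "nova"],
    ["film", "role"],
]
vocabulary = build_vocabulary(docs)
vectors = [to_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
print(vectors)
```

Once every document is a vector over the same vocabulary, any vector distance or similarity measure can be used to compare documents.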


Document Vectors: One location for each word.

Terms (columns): nova, galaxy, heat, h’wood, film, role, diet, fur
Documents (rows): A, B, C, D, E, F, G, H, I
Nonzero counts, in reading order: 10 5 3 5 10 10 8 7 9 10 5 10 10 9 10 5 7 9 6 10 2 8 7 5 1 3

• “Nova” occurs 10 times in text A
• “Galaxy” occurs 5 times in text A
• “Heat” occurs 3 times in text A
• (Blank means 0 occurrences.)

Slide from Mitch Marcus


TF x IDF Calculation

$w_{ik} = tf_{ik} \cdot \log(N / n_k)$

$T_k$ = term k in document $D_i$
$tf_{ik}$ = frequency of term $T_k$ in document $D_i$
$idf_k$ = inverse document frequency of term $T_k$ in C
N = total number of documents in the collection C
$n_k$ = the number of documents in C that contain $T_k$
$idf_k = \log(N / n_k)$

A = [W1 W2 W3 … Wn]

Slide from Mitch Marcus
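A minimal sketch of the TF x IDF weight, w_ik = tf_ik * log(N / n_k), on a tiny corpus. The slide does not state the base of the logarithm, so the natural log is an assumption here; the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, with w_ik = tf_ik * log(N / n_k)."""
    N = len(docs)                                   # total number of documents
    term_counts = [Counter(doc) for doc in docs]    # tf_ik for each document
    doc_freq = Counter()                            # n_k: documents containing term k
    for counts in term_counts:
        doc_freq.update(counts.keys())
    return [
        {term: tf * math.log(N / doc_freq[term]) for term, tf in counts.items()}
        for counts in term_counts
    ]

# Hypothetical tokenized documents.
docs = [
    ["nova", "galaxy", "nova"],
    ["film", "role", "film"],
    ["nova", "film"],
]
weights = tf_idf(docs)
print(weights[0])
```

A term that appears in every document gets weight 0, since log(N / N) = 0; rare terms are weighted up.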


Features

A = [F1 F2 F3 … Fn]

Define whatever features you like:
• Length of longest string of CAPs
• Number of $’s
• Useful words for the task…


Similarity between documents

A = [10 5 3 0 0 0 0 0];
G = [5 0 7 0 0 9 0 0];
E = [0 0 0 0 0 10 10 0];

Sum of Squared Distances (SSD) = $\sum_{i=1}^{n} (X_i - Y_i)^2$

SSD(A,G) = ?
SSD(A,E) = ?
SSD(G,E) = ?

Which pair of documents are the most similar?
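One way to check the exercise is to compute the SSD directly. This is a minimal sketch; the vectors A, G and E are the ones given on the slide, while the function name `ssd` and the `pairs` dictionary are hypothetical.

```python
def ssd(x, y):
    """Sum of squared differences between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

A = [10, 5, 3, 0, 0, 0, 0, 0]
G = [5, 0, 7, 0, 0, 9, 0, 0]
E = [0, 0, 0, 0, 0, 10, 10, 0]

pairs = {"A,G": ssd(A, G), "A,E": ssd(A, E), "G,E": ssd(G, E)}
closest = min(pairs, key=pairs.get)   # smallest SSD = most similar under this measure

print(pairs)
print(closest)
```

The pair with the smallest SSD is the most similar under this measure; a different measure, such as the angle between vectors, may rank the pairs differently.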

(Figure slides. Sources: Hinrich Schutze; Dan Klein)


K-means clustering

• Want to minimize sum of squared Euclidean distances between points $x_i$ and their nearest cluster centers $m_k$

$D(X, M) = \sum_{\text{cluster } k} \; \sum_{\text{point } i \text{ in cluster } k} (x_i - m_k)^2$

source: Svetlana Lazebnik
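A minimal sketch of evaluating the K-means objective D(X, M) for a fixed set of centers, assuming each point is assigned to its nearest center. The function names and the toy 2-D points are hypothetical.

```python
def sq_dist(x, m):
    """Squared Euclidean distance between a point x and a center m."""
    return sum((xi - mi) ** 2 for xi, mi in zip(x, m))

def nearest_center(x, centers):
    """Index of the center closest to x."""
    return min(range(len(centers)), key=lambda k: sq_dist(x, centers[k]))

def kmeans_objective(points, centers, assignment):
    """D(X, M): sum over all points of the squared distance to their assigned center."""
    return sum(sq_dist(x, centers[k]) for x, k in zip(points, assignment))

# Hypothetical 2-D points and two candidate centers.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
assignment = [nearest_center(x, centers) for x in points]

print(assignment)
print(kmeans_objective(points, centers, assignment))
```

K-means searches for the centers and assignment that make this value as small as it can find; the value it reaches depends on the starting centers.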



Convergence of K Means

• K-means converges to a fixed point in a finite number of iterations.

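Proof:
• The sum of squared distances (RSS) decreases (or stays the same) during reassignment.
  – (because each vector is moved to a closer centroid)
• RSS decreases (or stays the same) during recomputation.
  – (because the mean of a cluster is the point that minimizes the squared distances to its members)
• There are only finitely many ways to assign the documents to K clusters.
• Thus: We must reach a fixed point.
• But we don’t know how long convergence will take!
• If we don’t care about a few docs switching back and forth, then convergence is usually fast (< 10-20 iterations).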

Source: Hinrich Schutze
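The convergence argument can be observed in a small experiment. This is a minimal sketch, assuming random data points as initial centroids and ties broken by lowest cluster index; the function `kmeans_with_trace`, its parameters, and the toy points are hypothetical. It records RSS after every reassignment and every recomputation step.

```python
import random

def sq_dist(x, m):
    """Squared Euclidean distance between a point x and a centroid m."""
    return sum((xi - mi) ** 2 for xi, mi in zip(x, m))

def rss(points, centers, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(sq_dist(x, centers[k]) for x, k in zip(points, labels))

def kmeans_with_trace(points, k, seed=0, max_iter=100):
    """Run K-means; return (centers, labels, trace), trace = RSS after each step."""
    rng = random.Random(seed)
    centers = [tuple(p) for p in rng.sample(points, k)]  # assumed init: k random points
    labels = None
    trace = []
    for _ in range(max_iter):
        # Reassignment: move each point to its closest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(x, centers[j])) for x in points]
        if new_labels == labels:
            break                                        # fixed point: nothing moved
        labels = new_labels
        trace.append(rss(points, centers, labels))
        # Recomputation: each centroid becomes the mean of its members.
        for j in range(k):
            members = [x for x, lab in zip(points, labels) if lab == j]
            if members:                                  # keep old centroid if cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
        trace.append(rss(points, centers, labels))
    return centers, labels, trace

# Hypothetical 2-D points forming two loose groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels, trace = kmeans_with_trace(points, k=2)
print(trace)
```

Each value in `trace` should be no larger than the one before it, and the loop stops as soon as a reassignment changes nothing.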


Hierarchical clustering strategies

• Agglomerative clustering
  – Start with each point in a separate cluster
  – At each iteration, merge two of the “closest” clusters

• Divisive clustering
  – Start with all points grouped into a single cluster
  – At each iteration, split the “largest” cluster

source: Svetlana Lazebnik
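A minimal sketch of the agglomerative strategy. The slide does not say how "closest" is measured between clusters, so single-link distance (closest pair of members) is an assumption here; the function names and the 1-D toy points are hypothetical.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def single_link(c1, c2, points):
    """Cluster distance = distance between the closest pair of members."""
    return min(sq_dist(points[i], points[j]) for i in c1 for j in c2)

def agglomerative(points, num_clusters):
    """Merge the two closest clusters until only num_clusters remain."""
    clusters = [[i] for i in range(len(points))]   # start: every point alone
    merges = []                                    # merge history, in order
    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest single-link distance.
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]], points),
        )
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return clusters, merges

# Hypothetical 1-D points; clusters hold indices into this list.
points = [(0.0,), (1.0,), (5.0,), (6.5,), (20.0,)]
clusters, merges = agglomerative(points, 2)
print(clusters)
print(merges)
```

Running with `num_clusters=1` yields the full merge history, which is the hierarchy; stopping earlier gives a flat clustering without fixing K in advance.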


Divisive Clustering

• Top-down (instead of bottom-up as in Agglomerative Clustering)
• Start with all docs in one big cluster
• Then recursively split clusters
• Eventually each node forms a cluster on its own.

Source: Hinrich Schutze


Flat or hierarchical clustering?

• For high efficiency, use flat clustering (e.g. k-means)
• For deterministic results: hierarchical clustering
• When a hierarchical structure is desired: hierarchical algorithm
• Hierarchical clustering can also be applied if K cannot be predetermined (can start without knowing K)

Source: Hinrich Schutze


For Thurs

• Read Chapter 6 of textbook