Correlation Clustering Shuchi Chawla Carnegie Mellon University Joint work with Nikhil Bansal and Avrim Blum

Correlation Clustering


Page 1: Correlation Clustering

Correlation Clustering

Shuchi ChawlaCarnegie Mellon University

Joint work with

Nikhil Bansal and Avrim Blum

Page 2: Correlation Clustering

Shuchi Chawla, Carnegie Mellon University

Document Clustering

Given a bunch of documents, classify them into salient topics

Typical characteristics:
- No well-defined “similarity metric”
- Number of clusters is unknown
- No predefined topics: desirable to figure them out as part of the algorithm

Page 3: Correlation Clustering


Research Communities

Given data on research papers, divide researchers into communities by co-authorship

Typical characteristics:
- How to divide really depends on the given set of researchers
- Fuzzy boundaries

Page 4: Correlation Clustering


Traditional Approaches to Clustering

Approximation algorithms: k-means, k-median, k-min sum

Matrix methods: spectral clustering

AI techniques: EM, classification algorithms

Page 5: Correlation Clustering


Problems with traditional approaches

Dependence on underlying metric

Objective functions are meaningless without a metric, e.g., k-means

Some algorithms work only on specific metrics (such as Euclidean), e.g., spectral methods

Page 6: Correlation Clustering


Problems with traditional approaches

Fixed number of clusters

Many objectives are meaningless without a prespecified number of clusters

e.g., for k-means or k-median, if k is unspecified, it is best to put each point in its own cluster

Page 7: Correlation Clustering


Problems with traditional approaches

No clean notion of “quality” of clustering

Objective functions do not directly translate to how many items have been grouped wrongly

Heuristic approaches

Objective functions derived from generative models

Page 8: Correlation Clustering


Cohen, McCallum & Richman’s idea

“Learn” a similarity measure on documents (it may not be a metric!)

f(x,y) = amount of similarity between x and y. Use labeled data to train up this function.

Classify all pairs with the learned function

Find the “most consistent” clustering ← our task
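The pipeline above (learn f, then classify all pairs) can be sketched as follows. This is an illustrative assumption, not the Cohen–McCallum–Richman classifier: the function name, the [0, 1] score range, and the 0.5 threshold are all placeholders for whatever trained pairwise model is used.

```python
from itertools import combinations

def pairwise_labels(items, f, threshold=0.5):
    """Label every unordered pair '+' (same cluster) or '-' (different)
    by thresholding a learned similarity score f(x, y) in [0, 1].
    (Sketch: f and the threshold stand in for a trained classifier.)"""
    return {frozenset((x, y)): '+' if f(x, y) >= threshold else '-'
            for x, y in combinations(items, 2)}
```

Any trained pairwise classifier can play the role of f; thresholding is just one way to binarize its scores into the ‘+’/‘-’ labels the clustering step consumes.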

Page 9: Correlation Clustering


An example

Consistent clustering: ‘+’ edges inside clusters, ‘-’ edges between clusters

[Figure: example graph on documents Harry Bovik, H. Bovik, Harry B., and Tom X.; legend: ‘+’ = same, ‘-’ = different]

Page 10: Correlation Clustering


An example

[Figure: the same example graph, with one edge highlighted as a disagreement; legend: ‘+’ = same, ‘-’ = different]

Page 11: Correlation Clustering


An example

Task: find the most consistent clustering, i.e., one with the fewest possible disagreements (equivalently, the maximum possible agreements)

[Figure: the same example graph with the disagreement highlighted; legend: ‘+’ = same, ‘-’ = different]

Page 12: Correlation Clustering


Correlation clustering

Given a complete graph – Each edge labeled ‘+’ or ‘-’

Our measure of clustering – How many labels does it agree with?

Number of clusters depends on the edge labels

NP-hard, so we consider approximations
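The objective can be stated very concretely: given a ‘+’/‘-’ label for every pair, count how many labels a proposed clustering violates. A minimal sketch (the function name and data layout are assumptions):

```python
def disagreements(labels, cluster_of):
    """Count edges whose label disagrees with a clustering.

    labels: dict {(u, v): '+' or '-'} over all unordered pairs
    cluster_of: dict mapping each node to its cluster id
    """
    bad = 0
    for (u, v), sign in labels.items():
        same = cluster_of[u] == cluster_of[v]
        # a '+' edge across clusters, or a '-' edge inside one, is a mistake
        if (sign == '+') != same:
            bad += 1
    return bad
```

Minimizing this count (or maximizing its complement, the number of agreements) over all clusterings, with the number of clusters left free, is exactly the problem on this slide.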

Page 13: Correlation Clustering


Compared to traditional approaches…

Do not have to specify k

No condition on weights – can be arbitrary

Clean notion of quality of clustering – number of examples where the clustering differs from f

If a good (perfect) clustering exists, it is easy to find

Page 14: Correlation Clustering


Some machine learning justification

Noise removal: there is some true classification function f, but there are a few errors in the data; we want to find the true function

Agnostic learning: there is no inherent clustering; try to find the best representation using a hypothesis with limited expressivity

Page 15: Correlation Clustering


Our results

Constant factor approximation for minimizing disagreements

PTAS for maximizing agreements

Results for the random noise case

Page 16: Correlation Clustering


Minimizing Disagreements

Goal: constant approximation

Problem: even if we repeatedly find a cluster as good as one in OPT, we are headed towards a log n approximation (a set-cover-like bound)

Idea: lower bound D_OPT, the optimal number of disagreements

Page 17: Correlation Clustering


Lower Bounding Idea: Bad Triangles

Consider a triangle with two ‘+’ edges and one ‘-’ edge: a “bad triangle”

We know any clustering has to disagree with at least one of these edges.

Page 18: Correlation Clustering


Lower Bounding Idea: Bad Triangles

If several edge-disjoint bad triangles, then any clustering makes a mistake on each one

D_OPT ≥ #{edge-disjoint bad triangles}

[Figure: five nodes 1–5 with two edge-disjoint bad triangles, (1,2,3) and (1,4,5), each having two ‘+’ edges and one ‘-’ edge]
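One simple way to compute such a lower bound is a greedy pass over all triples, keeping each bad triangle whose edges are still unused. This is a sketch under assumed names and data layout, and the greedy choice is a heuristic: it certifies a valid lower bound but may not find the largest edge-disjoint family.

```python
from itertools import combinations

def bad_triangle_lower_bound(nodes, labels):
    """Greedily collect edge-disjoint bad triangles (two '+' edges, one '-').
    Every clustering errs on at least one edge of each such triangle, so
    the count is a lower bound on the optimal number of disagreements.
    labels: dict {frozenset({u, v}): '+' or '-'} over all pairs."""
    used, count = set(), 0
    for a, b, c in combinations(nodes, 3):
        edges = [frozenset(p) for p in ((a, b), (b, c), (a, c))]
        if any(e in used for e in edges):
            continue  # keep the collected triangles edge-disjoint
        if sorted(labels[e] for e in edges) == ['+', '+', '-']:
            used.update(edges)
            count += 1
    return count
```

On the five-node figure above, this pass picks up exactly the two triangles (1,2,3) and (1,4,5), so it certifies D_OPT ≥ 2.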

Page 19: Correlation Clustering


Using the lower bound

δ-clean cluster: a cluster C in which each node has fewer than δ|C| “bad” edges (‘-’ edges inside C and ‘+’ edges leaving C)

δ-clean clusters have few bad triangles => few mistakes

Possible solution: find a δ-clean clustering

Caveat: It may not exist
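The δ-clean condition can be checked directly from its definition: a node's “bad” edges are its ‘-’ edges inside the cluster and its ‘+’ edges leaving it, and every node must have fewer than δ|C| of them. This checker is an illustrative sketch; the names and the label layout are assumptions.

```python
def is_delta_clean(C, all_nodes, labels, delta):
    """Check that every node of cluster C has fewer than delta * |C|
    "bad" edges: a '-' edge inside C or a '+' edge leaving C.
    labels: dict {frozenset({u, v}): '+' or '-'} over all pairs."""
    C = set(C)
    for v in C:
        bad = 0
        for u in all_nodes:
            if u == v:
                continue
            inside = u in C
            sign = labels[frozenset((u, v))]
            if (inside and sign == '-') or (not inside and sign == '+'):
                bad += 1
        if bad >= delta * len(C):
            return False
    return True
```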

Page 20: Correlation Clustering


Using the lower bound

We show: there is a clustering whose clusters are all δ-clean or singletons

Further, it has few mistakes

Its nice structure helps us find it easily.

Caveat: a fully δ-clean clustering may not exist; allowing singletons sidesteps this

Page 21: Correlation Clustering


Extensions & Open Problems

Weighted edges or incomplete graphs: recent work by Bartal et al. gives a log-approximation based on multiway cut

Better constant for unweighted case

Can we use bad triangles (or polygons) more directly for a tighter bound?

Experimental performance

Page 22: Correlation Clustering


Other problems I have worked on

Game Theory and Mechanism Design

Approx for Orienteering & related problems

Online search algorithms based on Machine Learning approaches

Theoretical properties of Power Law graphs

Currently working on Privacy with Cynthia

Page 23: Correlation Clustering


Thanks!