Correlation Clustering
Shuchi Chawla
Carnegie Mellon University
Joint work with
Nikhil Bansal and Avrim Blum
Document Clustering
Given a bunch of documents, classify them into salient topics
Typical characteristics:
- No well-defined “similarity metric”
- Number of clusters is unknown
- No predefined topics; desirable to figure them out as part of the algorithm
Research Communities
Given data on research papers, divide researchers into communities by co-authorship
Typical characteristics:
- How to divide really depends on the given set of researchers
- Fuzzy boundaries
Traditional Approaches to Clustering
- Approximation algorithms: k-means, k-median, k-min-sum
- Matrix methods: spectral clustering
- AI techniques: EM, classification algorithms
Problems with traditional approaches
Dependence on an underlying metric:
- Objective functions are meaningless without a metric (e.g., k-means)
- Some algorithms work only on specific metrics, such as Euclidean distance (e.g., spectral methods)
Problems with traditional approaches
Fixed number of clusters:
- The objectives are meaningless without a prespecified number of clusters
- e.g., for k-means or k-median, if k is unspecified, the objective is trivially minimized by placing every item in its own cluster (each point is at distance zero from itself)
Problems with traditional approaches
No clean notion of the “quality” of a clustering:
- Objective functions do not directly translate to how many items have been grouped wrongly
- Heuristic approaches
- Objective functions derived from generative models
Cohen, McCallum & Richman’s idea
- “Learn” a similarity measure on documents (it may not be a metric!): f(x,y) = amount of similarity between x and y; use labeled data to train up this function
- Classify all pairs with the learned function
- Find the “most consistent” clustering: our task (a sketch of the first two steps follows)
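As a concrete illustration of those first two steps, here is a minimal Python sketch; the trained pairwise scorer `score`, the document list, and the 0.5 threshold are all hypothetical stand-ins, and the learned function need not satisfy the triangle inequality.

```python
from itertools import combinations

def label_pairs(docs, score, threshold=0.5):
    """Label every pair of documents '+' (same) or '-' (different)
    by thresholding a learned similarity f(x, y) = score(x, y).
    `score` is a hypothetical trained scorer returning values in [0, 1];
    the 0.5 threshold is an arbitrary illustration choice."""
    labels = {}
    for x, y in combinations(range(len(docs)), 2):
        labels[(x, y)] = '+' if score(docs[x], docs[y]) >= threshold else '-'
    return labels
```

The resulting '+'/'-' labels over all pairs are exactly the input that the correlation-clustering step below consumes.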
An example
Consistent clustering: ‘+’ edges inside clusters, ‘-’ edges between clusters (‘+’: same, ‘-’: different).

[Figure: an example graph on four records, Harry Bovik, H. Bovik, Harry B., and Tom X., with ‘+’ and ‘-’ edge labels.]
An example
Task: find the most consistent clustering, i.e., the one with the fewest possible disagreements, or equivalently the maximum possible agreements (every pair counts as either an agreement or a disagreement, so the two objectives share an optimum).

[Figure: the example graph (‘+’: same, ‘-’: different), with one edge whose label conflicts with the chosen clustering marked as a disagreement.]
Correlation clustering
- Given a complete graph, with each edge labeled ‘+’ or ‘-’
- Our measure of a clustering: how many labels does it agree with? (counted in the sketch below)
- The number of clusters depends on the edge labels
- NP-complete; we consider approximations
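To make the measure concrete, here is a minimal Python sketch of the disagreement count; storing labels as a dict over pairs (x, y) with x < y, and a clustering as a node-to-cluster-id map, are representation choices for illustration, not from the talk.

```python
def disagreements(labels, cluster_of):
    """Count edges whose label the clustering violates: a '+' edge
    whose endpoints land in different clusters, or a '-' edge whose
    endpoints land in the same cluster."""
    bad = 0
    for (x, y), sign in labels.items():
        same = cluster_of[x] == cluster_of[y]
        if (sign == '+' and not same) or (sign == '-' and same):
            bad += 1
    return bad
```

Agreements are then simply the total number of pairs minus this count, which is why minimizing disagreements and maximizing agreements have the same optimal clustering.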
Compared to traditional approaches…
- Do not have to specify k
- No condition on the weights; they can be arbitrary
- Clean notion of the quality of a clustering: the number of examples on which the clustering differs from f
- If a good (perfect) clustering exists, it is easy to find
Some machine learning justification
- Noise removal: there is some true classification function f, but the data contain a few errors; we want to find the true function
- Agnostic learning: there is no inherent clustering; try to find the best representation using a hypothesis with limited expressivity
Our results
- A constant-factor approximation for minimizing disagreements
- A PTAS for maximizing agreements
- Results for the random-noise case
Minimizing Disagreements
Goal: a constant-factor approximation

Problem: even if we find a cluster as good as one in OPT, we are headed towards a log n approximation (a set-cover-like bound).

Idea: lower-bound D_OPT, the number of disagreements made by the optimal clustering.
Lower Bounding Idea: Bad Triangles
Consider a triangle with two ‘+’ edges and one ‘-’ edge: a “bad triangle”. Any clustering has to disagree with at least one of these edges: placing all three nodes in one cluster violates the ‘-’ edge, while any split separates the endpoints of some ‘+’ edge.
If there are several edge-disjoint bad triangles, then any clustering makes a distinct mistake on each one, giving the lower bound (a greedy sketch follows)

    D_OPT ≥ #{edge-disjoint bad triangles}

[Figure: a five-node example with two edge-disjoint bad triangles, (1,2,3) and (1,4,5).]
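A small Python sketch of one way to extract such a lower bound, by greedily collecting edge-disjoint bad triangles; this greedy scan is an illustration, not the talk's algorithm, and it need not find the largest edge-disjoint family.

```python
from itertools import combinations

def bad_triangle_lower_bound(n, labels):
    """Greedily count edge-disjoint bad triangles; any such family
    gives a valid lower bound on D_OPT. `labels` maps each pair
    (x, y) with x < y to '+' or '-', as in the earlier sketches."""
    used = set()   # edges already covered by a chosen triangle
    count = 0
    for a, b, c in combinations(range(n), 3):
        edges = [(a, b), (a, c), (b, c)]
        signs = [labels[e] for e in edges]
        # a triangle is "bad" exactly when it has two '+' edges
        # and one '-' edge
        if signs.count('-') == 1 and not any(e in used for e in edges):
            used.update(edges)
            count += 1
    return count
```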
Using the lower bound
- δ-clean cluster: a cluster C in which every node has fewer than δ|C| “bad” edges (a ‘-’ edge to a node inside C, or a ‘+’ edge to a node outside C)
- δ-clean clusters have few bad triangles, and hence few mistakes
- Possible solution: find a δ-clean clustering
- Caveat: it may not exist
- We show: there is always a clustering whose clusters are each δ-clean or singletons
- Further, it makes few mistakes
- Its nice structure helps us find it easily (a δ-cleanness checker is sketched below)
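A minimal Python sketch of checking δ-cleanness, under the reading of “bad” edges given above (a ‘-’ edge inside the cluster or a ‘+’ edge leaving it); the data representation matches the earlier sketches and is an illustration choice.

```python
def is_delta_clean(cluster, nodes, labels, delta):
    """Return True if every node in `cluster` has fewer than
    delta * |cluster| bad edges. `cluster` is a set of nodes,
    `nodes` is the full vertex set, and `labels` maps each pair
    (x, y) with x < y to '+' or '-'."""
    size = len(cluster)
    for v in cluster:
        bad = 0
        for u in nodes:
            if u == v:
                continue
            sign = labels[(min(u, v), max(u, v))]
            inside = u in cluster
            # bad edge: '-' inside the cluster, or '+' leaving it
            if (sign == '-' and inside) or (sign == '+' and not inside):
                bad += 1
        if bad >= delta * size:
            return False
    return True
```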
Extensions & Open Problems
- Weighted edges or incomplete graphs: recent work by Bartal et al. gives a log-approximation based on multiway cut
- A better constant for the unweighted case
- Can we use bad triangles (or larger bad polygons) more directly for a tighter bound?
- Experimental performance
Other problems I have worked on
- Game theory and mechanism design
- Approximation algorithms for Orienteering and related problems
- Online search algorithms based on machine-learning approaches
- Theoretical properties of power-law graphs
- Currently working on privacy with Cynthia
Thanks!