Correlation Clustering
Shuchi Chawla
Carnegie Mellon University
Joint work with
Nikhil Bansal and Avrim Blum
Document Clustering
Given a bunch of documents, classify them into salient topics
Typical characteristics:
- No well-defined “similarity metric”
- Number of clusters is unknown
- No predefined topics; desirable to figure them out as part of the algorithm
Research Communities
Given data on research papers, divide researchers into communities by co-authorship
Typical characteristics:
- How to divide really depends on the given set of researchers
- Fuzzy boundaries
Traditional Approaches to Clustering
- Approximation algorithms: k-means, k-median, k-min-sum
- Matrix methods: spectral clustering
- AI techniques: EM, classification algorithms
Problems with traditional approaches
Dependence on an underlying metric:
- Objective functions are meaningless without a metric (e.g., k-means)
- Some algorithms work only on specific metrics, such as Euclidean distance (e.g., spectral methods)
Problems with traditional approaches
Fixed number of clusters:
- The objectives are meaningless without a prespecified number of clusters
- e.g., for k-means or k-median, if k is unspecified, the objective is trivially minimized by placing every item in its own cluster (each point is at distance zero from itself)
Problems with traditional approaches
No clean notion of the “quality” of a clustering:
- Objective functions do not directly translate to how many items have been grouped wrongly
- Heuristic approaches
- Objective functions derived from generative models
Cohen, McCallum & Richman’s idea
- “Learn” a similarity measure on documents (it may not be a metric!): f(x,y) = amount of similarity between x and y; use labeled data to train up this function
- Classify all pairs with the learned function
- Find the “most consistent” clustering: our task (a sketch of the first two steps follows)
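As a concrete illustration of those first two steps, here is a minimal Python sketch; the trained pairwise scorer `score`, the document list, and the 0.5 threshold are all hypothetical stand-ins, and the learned function need not satisfy the triangle inequality.

```python
from itertools import combinations

def label_pairs(docs, score, threshold=0.5):
    """Label every pair of documents '+' (same) or '-' (different)
    by thresholding a learned similarity f(x, y) = score(x, y).
    `score` is a hypothetical trained scorer returning values in [0, 1];
    the 0.5 threshold is an arbitrary illustration choice."""
    labels = {}
    for x, y in combinations(range(len(docs)), 2):
        labels[(x, y)] = '+' if score(docs[x], docs[y]) >= threshold else '-'
    return labels
```

The resulting '+'/'-' labels over all pairs are exactly the input that the correlation-clustering step below consumes.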
An example
Consistent clustering: ‘+’ edges inside clusters, ‘-’ edges between clusters (‘+’: same, ‘-’: different).

[Figure: an example graph on four records, Harry Bovik, H. Bovik, Harry B., and Tom X., with ‘+’ and ‘-’ edge labels.]
An example
Task: find the most consistent clustering, i.e., the one with the fewest possible disagreements, or equivalently the maximum possible agreements (every pair counts as either an agreement or a disagreement, so the two objectives share an optimum).

[Figure: the example graph (‘+’: same, ‘-’: different), with one edge whose label conflicts with the chosen clustering marked as a disagreement.]
Correlation clustering
- Given a complete graph, with each edge labeled ‘+’ or ‘-’
- Our measure of a clustering: how many labels does it agree with? (counted in the sketch below)
- The number of clusters depends on the edge labels
- NP-complete; we consider approximations
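To make the measure concrete, here is a minimal Python sketch of the disagreement count; storing labels as a dict over pairs (x, y) with x < y, and a clustering as a node-to-cluster-id map, are representation choices for illustration, not from the talk.

```python
def disagreements(labels, cluster_of):
    """Count edges whose label the clustering violates: a '+' edge
    whose endpoints land in different clusters, or a '-' edge whose
    endpoints land in the same cluster."""
    bad = 0
    for (x, y), sign in labels.items():
        same = cluster_of[x] == cluster_of[y]
        if (sign == '+' and not same) or (sign == '-' and same):
            bad += 1
    return bad
```

Agreements are then simply the total number of pairs minus this count, which is why minimizing disagreements and maximizing agreements have the same optimal clustering.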
Compared to traditional approaches…
- Do not have to specify k
- No condition on the weights; they can be arbitrary
- Clean notion of the quality of a clustering: the number of examples on which the clustering differs from f
- If a good (perfect) clustering exists, it is easy to find
Some machine learning justification
- Noise removal: there is some true classification function f, but the data contain a few errors; we want to find the true function
- Agnostic learning: there is no inherent clustering; try to find the best representation using a hypothesis with limited expressivity
Our results
- A constant-factor approximation for minimizing disagreements
- A PTAS for maximizing agreements
- Results for the random-noise case
Minimizing Disagreements
Goal: a constant-factor approximation

Problem: even if we find a cluster as good as one in OPT, we are headed towards a log n approximation (a set-cover-like bound).

Idea: lower-bound D_OPT, the number of disagreements made by the optimal clustering.
Lower Bounding Idea: Bad Triangles
Consider a triangle with two ‘+’ edges and one ‘-’ edge: a “bad triangle”. Any clustering has to disagree with at least one of these edges: placing all three nodes in one cluster violates the ‘-’ edge, while any split separates the endpoints of some ‘+’ edge.
If there are several edge-disjoint bad triangles, then any clustering makes a distinct mistake on each one, giving the lower bound (a greedy sketch follows)

    D_OPT ≥ #{edge-disjoint bad triangles}

[Figure: a five-node example with two edge-disjoint bad triangles, (1,2,3) and (1,4,5).]
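A small Python sketch of one way to extract such a lower bound, by greedily collecting edge-disjoint bad triangles; this greedy scan is an illustration, not the talk's algorithm, and it need not find the largest edge-disjoint family.

```python
from itertools import combinations

def bad_triangle_lower_bound(n, labels):
    """Greedily count edge-disjoint bad triangles; any such family
    gives a valid lower bound on D_OPT. `labels` maps each pair
    (x, y) with x < y to '+' or '-', as in the earlier sketches."""
    used = set()   # edges already covered by a chosen triangle
    count = 0
    for a, b, c in combinations(range(n), 3):
        edges = [(a, b), (a, c), (b, c)]
        signs = [labels[e] for e in edges]
        # a triangle is "bad" exactly when it has two '+' edges
        # and one '-' edge
        if signs.count('-') == 1 and not any(e in used for e in edges):
            used.update(edges)
            count += 1
    return count
```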
Using the lower bound
- δ-clean cluster: a cluster C in which every node has fewer than δ|C| “bad” edges (a ‘-’ edge to a node inside C, or a ‘+’ edge to a node outside C)
- δ-clean clusters have few bad triangles, and hence few mistakes
- Possible solution: find a δ-clean clustering
- Caveat: it may not exist
- We show: there is always a clustering whose clusters are each δ-clean or singletons
- Further, it makes few mistakes
- Its nice structure helps us find it easily (a δ-cleanness checker is sketched below)
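A minimal Python sketch of checking δ-cleanness, under the reading of “bad” edges given above (a ‘-’ edge inside the cluster or a ‘+’ edge leaving it); the data representation matches the earlier sketches and is an illustration choice.

```python
def is_delta_clean(cluster, nodes, labels, delta):
    """Return True if every node in `cluster` has fewer than
    delta * |cluster| bad edges. `cluster` is a set of nodes,
    `nodes` is the full vertex set, and `labels` maps each pair
    (x, y) with x < y to '+' or '-'."""
    size = len(cluster)
    for v in cluster:
        bad = 0
        for u in nodes:
            if u == v:
                continue
            sign = labels[(min(u, v), max(u, v))]
            inside = u in cluster
            # bad edge: '-' inside the cluster, or '+' leaving it
            if (sign == '-' and inside) or (sign == '+' and not inside):
                bad += 1
        if bad >= delta * size:
            return False
    return True
```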
Extensions & Open Problems
- Weighted edges or incomplete graphs: recent work by Bartal et al. gives a log-approximation based on multiway cut
- A better constant for the unweighted case
- Can we use bad triangles (or larger bad polygons) more directly for a tighter bound?
- Experimental performance
Other problems I have worked on
- Game theory and mechanism design
- Approximation algorithms for Orienteering and related problems
- Online search algorithms based on machine-learning approaches
- Theoretical properties of power-law graphs
- Currently working on privacy with Cynthia
Thanks!