Detecting Communities Via Simultaneous Clustering of Graphs and Folksonomies
Akshay Java, Anupam Joshi, Tim Finin
University of Maryland, Baltimore County
KDD 2008 Workshop on Web Mining and Web Usage Analysis
Outline
• Introduction
• Community Detection
– Clustering Approach
– Spectral Approach
– Co-Clustering
• Simultaneous Clustering
• Evaluation
• Future Work
• Conclusions
Social Media
"Describes the online technologies and practices that people use to share opinions, insights, experiences, and perspectives and engage with each other." ~Wikipedia
Social Media Graphs
G = (V, E): a graph describing the relationships between different entities (people, documents, etc.)
G' = <V, T, R>: a tripartite graph that expresses how entities 'tag' some resource
[Figure: tripartite graph linking Users, Tags, and URLs]
What is a Community
A community in the real world is identified in a graph as a set of nodes that have more links within the set than outside it.
[Figure: example networks: Political Blogs, Twitter Network, Facebook Network]
Community Detection: Clustering Approach
1. Agglomerative/Hierarchical
Topological Overlap: similarity between nodes i and j is measured in terms of the number of nodes that both i and j link to. (Ravasz et al.)
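A minimal sketch of the topological overlap measure, to make the counting concrete. The slide only states the "number of nodes that both i and j link to" part; the normalisation by the smaller degree and the +1 for a direct link are assumptions based on one common formulation, and the toy graph and function name are hypothetical.

```python
def topological_overlap(adj, i, j):
    """Topological overlap of nodes i and j in an undirected graph.

    adj maps each node to the set of its neighbours.
    Assumed formulation: (common neighbours + 1 if i and j are linked)
    divided by the smaller of the two degrees.
    """
    common = len(adj[i] & adj[j])          # nodes that both i and j link to
    direct = 1 if j in adj[i] else 0       # assumption: count a direct link too
    smaller_degree = min(len(adj[i]), len(adj[j]))
    return (common + direct) / smaller_degree if smaller_degree else 0.0


# Hypothetical toy graph: two triangles joined by the single edge c-d.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}

same_group = topological_overlap(adj, "a", "b")    # inside one triangle
across_bridge = topological_overlap(adj, "c", "d")  # the bridging pair
```

An agglomerative scheme would then repeatedly merge the pair (or clusters) with the highest overlap, so pairs inside a triangle are merged before the bridging pair.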
2. Divisive/Partition-based (Girvan-Newman Algorithm)
Remove the edges that have the highest edge betweenness centrality
[Figure: Political Books]
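The divisive step can be sketched as follows, assuming an unweighted, undirected graph without self-loops, stored as adjacency sets. Edge betweenness is computed with a Brandes-style breadth-first accumulation; the function names and the toy graph are hypothetical, and this is an illustration rather than the authors' implementation.

```python
from collections import deque


def edge_betweenness(adj):
    """Number of shortest paths passing through each undirected edge."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Push path counts back from the farthest nodes towards s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Every path was counted once from each of its two endpoints.
    return {edge: value / 2.0 for edge, value in score.items()}


def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def girvan_newman_split(adj):
    """Remove highest-betweenness edges until the graph gains a component."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        scores = edge_betweenness(adj)
        if not scores:
            break  # no edges left to remove
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Hypothetical toy graph: two triangles joined by the single edge c-d.
toy = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
communities = girvan_newman_split(toy)
```

Calling `girvan_newman_split` recursively on each returned component would give the full divisive hierarchy.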
Community Detection: Spectral Approach
• The graph can be partitioned using the eigenspectrum of the Laplacian. (Shi and Malik)
• The eigenvector corresponding to the second smallest eigenvalue of the graph Laplacian is the Fiedler vector.
• The graph can be recursively partitioned using the sign of the values in its Fiedler vector.
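Graph Laplacian (W is the weighted adjacency matrix, D the diagonal degree matrix):
$$L = D - W, \qquad L_{\mathrm{sym}} = D^{-1/2} (D - W) D^{-1/2} = I - D^{-1/2} W D^{-1/2}$$
Normalized Cuts:
$$\mathrm{NCut}(A, B) = \mathrm{Cut}(A, B) \left[ \frac{1}{\mathrm{Vol}(A)} + \frac{1}{\mathrm{Vol}(B)} \right]$$
Cut(A, B): cost of the edges deleted to disconnect the graph
Vol(B): total cost of all edges that start from B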
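The spectral bipartition and the NCut value can be sketched with NumPy as below. This assumes a symmetric weight matrix with no isolated nodes, uses the normalized Laplacian I − D^{-1/2} W D^{-1/2}, and rescales its second eigenvector by D^{-1/2} before taking signs (as in the generalized eigenproblem of Shi and Malik). The function names and the toy matrix are hypothetical.

```python
import numpy as np


def fiedler_partition(W):
    """Split nodes by the sign of the Fiedler vector of the normalized Laplacian."""
    W = np.asarray(W, dtype=float)
    degree = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)            # assumes every degree is > 0
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)            # eigenvalues in ascending order
    fiedler = d_inv_sqrt * eigvecs[:, 1]          # second smallest eigenvalue
    return fiedler >= 0                           # boolean mask: side A / side B


def ncut(W, mask):
    """NCut(A, B) = Cut(A, B) * (1 / Vol(A) + 1 / Vol(B))."""
    W = np.asarray(W, dtype=float)
    cut = W[mask][:, ~mask].sum()                 # weight of edges between A and B
    vol_a = W[mask].sum()                         # total weight of edges leaving A's nodes
    vol_b = W[~mask].sum()
    return cut * (1.0 / vol_a + 1.0 / vol_b)


# Hypothetical toy graph: two triangles (0,1,2) and (3,4,5) joined by edge 2-3.
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[u, v] = W[v, u] = 1.0

mask = fiedler_partition(W)
value = ncut(W, mask)
```

On this toy graph the cut removes only the bridging edge, so Cut = 1 and both volumes are 7. Applying `fiedler_partition` again to each side gives the recursive partitioning described in the bullets.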
Community Detection: Co-Clustering
• Spectral graph bipartitioning (Dhillon et al.)
• Compute the graph Laplacian using [equation missing in source], where $A \in \mathbb{R}^{n \times m}$ is the document-by-term matrix
Social Media Graphs
[Figure: panels Links Between Nodes, Links Between Nodes and Tags, and Simultaneous Cuts]
Communities in Social Media
A community in the real world is identified in a graph as a set of nodes that have more links within the set than outside it and share similar tags.
Clustering Tags and Graphs
[Figure: block matrix over nodes and tags with blocks Nodes × Nodes, Nodes × Tags, Tags × Nodes, and Tags × Tags; the individual 0/1 entries are not recoverable from the extraction]
Fiedler Vector Polarity: (1, 1, −1, −1, −1, 1, 1, −1, −1)
$$W' = \begin{pmatrix} I & C \\ C^{T} & \beta W \end{pmatrix}$$
• β = 0: like co-clustering
• β = 1: equal importance to blog-blog and blog-tag links
• β ≫ 1: NCut
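A minimal sketch of how the combined matrix W' might be built and bipartitioned. The block layout follows the slide as extracted, which implies that C is a tag-by-node matrix (m tags × n nodes) and that W is the node-node matrix. Using the normalized Laplacian and the sign of its second eigenvector is an assumption carried over from the spectral approach; the function name and toy data are hypothetical.

```python
import numpy as np


def simcut_bipartition(W, C, beta=1.0):
    """Bipartition tags and nodes together using W' = [[I, C], [C^T, beta * W]].

    W: (n, n) symmetric node-node weights.
    C: (m, n) tag-node incidence (assumed orientation, see the note above).
    Returns (tag_side, node_side) as boolean arrays.
    """
    W = np.asarray(W, dtype=float)
    C = np.asarray(C, dtype=float)
    m, n = C.shape
    W_prime = np.block([[np.eye(m), C], [C.T, beta * W]])
    degree = W_prime.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)     # assumes every row has positive weight
    L_sym = np.eye(m + n) - d_inv_sqrt[:, None] * W_prime * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)     # eigenvalues in ascending order
    side = (d_inv_sqrt * eigvecs[:, 1]) >= 0
    return side[:m], side[m:]


# Hypothetical toy data: two triangles of nodes joined by edge 2-3,
# with tags 0 and 1 used by nodes 0-2 and tags 2 and 3 used by nodes 3-5.
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[u, v] = W[v, u] = 1.0
C = np.zeros((4, 6))
C[0:2, 0:3] = 1.0
C[2:4, 3:6] = 1.0

tag_side, node_side = simcut_bipartition(W, C, beta=1.0)
```

Because tags and nodes share one Fiedler vector, each detected community comes with the tags that fall on its side of the cut.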
Clustering Tags and Graphs
[Figure: panels Clustering Only Links and Clustering Links + Tags]
Datasets
• Citeseer
– Agents, AI, DB, HCI, IR, ML
– Words used in place of tags
• Blog data
– Derived from the WWE/Buzzmetrics dataset
– Tags associated with blogs derived from del.icio.us
– For dimensionality reduction, 100 topics derived from blog homepages using LDA (Latent Dirichlet Allocation)
• Pairwise similarity computed
– RBF kernel for Citeseer
– Cosine for blogs
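The two pairwise similarity measures named above can be sketched as follows, assuming each row of X is a feature vector (for example, word counts or topic weights). The bandwidth `sigma` is a hypothetical parameter; the slides do not state the value used.

```python
import numpy as np


def rbf_similarity(X, sigma=1.0):
    """RBF (Gaussian) kernel: exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)   # clip tiny negatives from round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def cosine_similarity(X):
    """Cosine of the angle between every pair of rows."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.where(norms == 0, 1.0, norms)   # leave all-zero rows as zeros
    return unit @ unit.T


# Hypothetical feature vectors.
X = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
S_rbf = rbf_similarity(X, sigma=1.0)
S_cos = cosine_similarity(X)
```

Either matrix can serve as the weight matrix W in the spectral steps.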
Citeseer Data
[Figure: panels NCut, SimCut, and True]
• NCut: Accuracy = 36%; SimCut: Accuracy = 62%
• Higher accuracy by adding 'tag' information
• SimCut constrains cuts based on both
– Link structure
– Tags
• SimCut results in
– Higher intra-cluster similarity
– Lower inter-cluster similarity
Blog Data
[Figure: panels NCut and SimCut, 35 clusters]
• NCut: few, large clusters with low intra-cluster similarity
• SimCut: moderate-size clusters with higher intra-cluster similarity
Effect of Number of Tags, Clusters: Citeseer
• More tags help, to an extent
• Lower mutual information if only the graph is used
• Mutual Information compares clusters to ground truth
Effect of Number of Tags, Clusters: Blogs
• More tags help, to an extent
• Lower mutual information if only the graph is used
• Mutual Information compares clusters to content-based clusters (no tags/graph)
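A minimal sketch of the comparison behind these plots: the mutual information between two labelings of the same items. The slides do not say whether a normalized variant was used, so plain mutual information in nats is assumed here, and the function name and sample labels are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two cluster labelings."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))   # co-occurrence counts
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a, b) * log( p(a, b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * log(c * n / (count_a[a] * count_b[b]))
    return mi


# Hypothetical labelings of four items.
truth = [0, 0, 1, 1]
same_up_to_renaming = [1, 1, 0, 0]
unrelated = [0, 1, 0, 1]

mi_match = mutual_information(truth, same_up_to_renaming)
mi_unrelated = mutual_information(truth, unrelated)
```

The measure ignores how clusters are named, so a clustering that matches the reference up to relabeling scores the maximum, while an unrelated one scores zero.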
Future Work
• Evaluating the SimCut algorithm on derived feature types such as named entities, sentiments and opinions, and links to mainstream media
• For a dataset with ground truth, a comparison of graph-based, text-based, and graph+tag-based clustering
• Evaluating effect of varying β
Conclusions
• Many Social Media sites allow users to tag resources
• Incorporating folksonomies in community detection can yield better results
• SimCut can be easily implemented and relates to NCut with two simultaneous objectives
– Minimize number of node-node edges being cut
– Minimize number of node-tag edges being cut
• Detected communities can be associated with meaningful, descriptive tags
Thanks!
http://ebiquity.umbc.edu
http://socialmedia.typepad.com
[Figure: Only Graph vs. SimCut with more tags; panels Citeseer (Community Size, Similarity) and Blogs (Community Size, Similarity)]