SimRank: A Measure of Structural-Context Similarity
Glen Jeh and Jennifer Widom
Stanford University
ACM SIGKDD 2002
January 19, 2011
Taikyoung Kim
SNU IDB Lab.
Outline
  – Introduction
  – Basic Graph Model
  – SimRank
  – Random Surfer-Pairs Model
  – Conclusion
  – Future Work
Introduction
Many applications require a measure of “similarity” between objects
  – “Find-similar-document” queries in a search engine
  – Collaborative filtering in a recommender system
Introduction
Propose a general approach that exploits the object-to-object relationships found in many domains
  – An algorithm to compute similarity scores between nodes based on the structural context
Intuition behind the algorithm
  – Similar objects are related to similar objects
  – The base case is that objects are similar to themselves
“Two objects are similar if they are referenced by similar objects”
Basic Graph Model
G = (V, E) [vertex, edge]
  – Nodes in V: objects in the domain
  – Directed edges in E: relationships between objects
  – <p, q>: an edge from object p to object q
For a node v, denote:
  – I(v): the set of in-neighbors of v
  – O(v): the set of out-neighbors of v
  – I_i(v): an individual in-neighbor (1 ≤ i ≤ |I(v)|)
  – O_i(v): an individual out-neighbor (1 ≤ i ≤ |O(v)|)
[Figure: example graph highlighting O(Univ) and I(ProfB)]
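The notation above can be sketched in a few lines of Python. The edge list below is an assumption made for illustration (the slide's figure is not reproduced in the text), and `neighbor_sets` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical edge list of pairs <p, q>; assumed for illustration only.
EDGES = [
    ("Univ", "ProfA"), ("Univ", "ProfB"),
    ("ProfA", "StudentA"), ("ProfB", "StudentB"),
    ("StudentA", "Univ"), ("StudentB", "ProfB"),
]

def neighbor_sets(edges):
    """Return (I, O): dicts mapping each node to its sorted in-/out-neighbors."""
    ins, outs = defaultdict(set), defaultdict(set)
    for p, q in edges:
        outs[p].add(q)  # q is an out-neighbor of p
        ins[q].add(p)   # p is an in-neighbor of q
    nodes = {n for edge in edges for n in edge}
    I = {v: sorted(ins[v]) for v in nodes}
    O = {v: sorted(outs[v]) for v in nodes}
    return I, O

I, O = neighbor_sets(EDGES)
print(O["Univ"])   # out-neighbors of Univ
print(I["ProfB"])  # in-neighbors of ProfB
```

Indexing a sorted list gives the individual neighbors, e.g. `I["ProfB"][0]` plays the role of I_1(ProfB) (0-based in Python, 1-based in the notation).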
SimRank
Motivation
  – Two objects are similar if they are referenced by similar objects
  – An object is considered maximally similar to itself (similarity score of 1)
Similar nodes: {ProfA, ProfB}, {StudentA, StudentB}, {Univ, ProfB}, …
SimRank
Basic SimRank Equation
The similarity between objects a and b: s(a, b) ∈ [0, 1]
  – C is a constant between 0 and 1
    – Confidence level or decay factor
    – C gives the rate of decay as similarity flows across edges (since C < 1)
  – If a or b has no in-neighbors, s(a, b) = 0
  – SimRank scores are symmetric, i.e., s(a, b) = s(b, a)
For a ≠ b, the similarity between a and b is the average similarity between in-neighbors of a and in-neighbors of b, scaled by C
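$$
s(a,b) =
\begin{cases}
1 & \text{if } a = b \\[4pt]
\dfrac{C}{|I(a)|\,|I(b)|} \displaystyle\sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s\bigl(I_i(a), I_j(b)\bigr) & \text{if } a \neq b
\end{cases}
$$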
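A small worked instance may make the equation concrete. Assume, purely for illustration, two distinct nodes a and b whose only in-neighbor is the same node p, and assume C = 0.8. The double sum then has a single term, s(p, p) = 1:

```latex
\[
s(a,b) = \frac{C}{|I(a)|\,|I(b)|}\, s(p,p)
       = \frac{0.8}{1 \cdot 1} \cdot 1
       = 0.8
\]
```

In this assumed case the score equals C itself, which is the largest value two distinct nodes can receive.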
SimRank
Basic SimRank Equation
Similarity can be thought of as “propagating” from pair to pair
  – Consider the derived graph G² = (V², E²) where
    – V² = V × V; each node in V² represents a pair (a, b) of nodes in G
    – An edge from (a, b) to (c, d) exists in E² iff the edges <a, c> and <b, d> exist in G
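The construction of G² can be sketched as follows. The edge list is a hypothetical example graph and `derived_graph` is an illustrative name; neither comes from the slides.

```python
from itertools import product

# Hypothetical edge list for G; assumed for illustration only.
EDGES = [
    ("Univ", "ProfA"), ("Univ", "ProfB"),
    ("ProfA", "StudentA"), ("ProfB", "StudentB"),
    ("StudentA", "Univ"), ("StudentB", "ProfB"),
]

def derived_graph(edges):
    """Build G^2 = (V^2, E^2) from the edge list of G."""
    nodes = sorted({n for edge in edges for n in edge})
    v2 = list(product(nodes, repeat=2))  # V^2 = V x V
    # (a, b) -> (c, d) is in E^2 iff <a, c> and <b, d> are both edges of G.
    e2 = {((a, b), (c, d)) for (a, c), (b, d) in product(edges, repeat=2)}
    return v2, e2

V2, E2 = derived_graph(EDGES)
print(len(V2), len(E2))
```

Every ordered pair of edges of G yields exactly one edge of G², so |E²| = |E|² while |V²| = |V|².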
SimRank
Bipartite SimRank
Bipartite domains consist of two types of objects
Recommender system
  – People are similar if they purchase similar items
  – Items are similar if they are purchased by similar people
SimRank
Bipartite SimRank
Bipartite Equation
  – Directed edges go from people to items
  – s(A, B) denotes the similarity between persons A and B (A ≠ B)
  – s(c, d) denotes the similarity between items c and d (c ≠ d)
  – The similarity between persons A and B is the average similarity between the items they purchased
  – The similarity between items c and d is the average similarity between the people who purchased them
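$$
s(A,B) = \frac{C_1}{|O(A)|\,|O(B)|} \sum_{i=1}^{|O(A)|} \sum_{j=1}^{|O(B)|} s\bigl(O_i(A), O_j(B)\bigr) \qquad (A \neq B)
$$

$$
s(c,d) = \frac{C_2}{|I(c)|\,|I(d)|} \sum_{i=1}^{|I(c)|} \sum_{j=1}^{|I(d)|} s\bigl(I_i(c), I_j(d)\bigr) \qquad (c \neq d)
$$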
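The two mutually recursive equations can be evaluated by alternating updates. The sketch below is a minimal illustration, not the authors' implementation: the purchase data, the function names, the values C1 = C2 = 0.8, and the fixed iteration count are all assumptions.

```python
from itertools import product

# Hypothetical purchase data (person -> items bought); assumed for illustration.
PURCHASES = {
    "A": ["sugar", "frosting", "eggs"],
    "B": ["frosting", "eggs", "flour"],
}

def average_similarity(xs, ys, scores, c):
    """c times the mean of scores[x, y] over xs x ys; 0 if either side is empty."""
    if not xs or not ys:
        return 0.0
    total = sum(scores[x, y] for x in xs for y in ys)
    return c * total / (len(xs) * len(ys))

def bipartite_simrank(purchases, c1=0.8, c2=0.8, iterations=5):
    people = sorted(purchases)
    items = sorted({i for bought in purchases.values() for i in bought})
    # O(person) = items bought; I(item) = people who bought the item.
    buyers = {i: [p for p in people if i in purchases[p]] for i in items}
    sp = {(x, y): float(x == y) for x, y in product(people, repeat=2)}
    si = {(x, y): float(x == y) for x, y in product(items, repeat=2)}
    for _ in range(iterations):
        # Both updates read the scores from the previous iteration.
        new_sp = {(x, y): 1.0 if x == y else
                  average_similarity(purchases[x], purchases[y], si, c1)
                  for x, y in sp}
        new_si = {(x, y): 1.0 if x == y else
                  average_similarity(buyers[x], buyers[y], sp, c2)
                  for x, y in si}
        sp, si = new_sp, new_si
    return sp, si

people_sim, item_sim = bipartite_simrank(PURCHASES)
print(round(people_sim["A", "B"], 3))
```

People scores are computed from item scores and vice versa, which mirrors the out-neighbor form for persons and the in-neighbor form for items.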
SimRank
Computing SimRank - Naïve Method
R_k(a, b) gives the score between a and b on iteration k
The values R_k(*, *) are non-decreasing as k increases
In experiments, R_k converges rapidly, within about K = 5 iterations
Complexity (n is the number of nodes in G)
  – Space: O(n²) to store the results R_k
  – Time: O(Kn²d₂), where d₂ is the average of |I(a)||I(b)| over all node pairs (a, b)
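$$
R_0(a,b) =
\begin{cases}
0 & \text{if } a \neq b \\
1 & \text{if } a = b
\end{cases}
$$

$$
R_{k+1}(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} R_k\bigl(I_i(a), I_j(b)\bigr) \quad (a \neq b), \qquad R_{k+1}(a,a) = 1
$$

$$
\lim_{k \to \infty} R_k(a,b) = s(a,b)
$$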
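The iteration can be sketched directly from the recurrence. This is a minimal illustration under stated assumptions: the edge list is a hypothetical example graph, `simrank` is an illustrative name, and C = 0.8 and K = 5 are chosen for the example.

```python
from itertools import product

# Hypothetical edge list; assumed for illustration only.
EDGES = [
    ("Univ", "ProfA"), ("Univ", "ProfB"),
    ("ProfA", "StudentA"), ("ProfB", "StudentB"),
    ("StudentA", "Univ"), ("StudentB", "ProfB"),
]

def simrank(edges, C=0.8, K=5):
    """Return R_K as a dict keyed by node pairs (a, b)."""
    nodes = sorted({n for edge in edges for n in edge})
    in_nbrs = {v: [p for p, q in edges if q == v] for v in nodes}
    # R_0: 1 on the diagonal, 0 elsewhere.
    R = {(a, b): float(a == b) for a, b in product(nodes, repeat=2)}
    for _ in range(K):
        nxt = {}
        for a, b in R:
            ia, ib = in_nbrs[a], in_nbrs[b]
            if a == b:
                nxt[a, b] = 1.0
            elif not ia or not ib:
                nxt[a, b] = 0.0  # no in-neighbors: score is 0
            else:
                total = sum(R[x, y] for x in ia for y in ib)
                nxt[a, b] = C * total / (len(ia) * len(ib))
        R = nxt
    return R

scores = simrank(EDGES)
print(round(scores["ProfA", "ProfB"], 3))
```

The dictionary holds all n² pairs, which is the O(n²) space cost, and each of the K passes sums |I(a)||I(b)| terms per pair, which is where the d₂ factor in the time bound comes from.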
SimRank
Computing SimRank - Pruning
Pruning the logical graph G²
  – In the naïve method, all n² nodes of G² are considered
    – Similarity scores are computed for every node pair
  – Nodes far from a node v generally have lower similarity scores with v than nodes near v
Pruning
  – Set the similarity between two nodes that are far apart to 0
  – Consider only node pairs whose nodes lie within radius r of each other
  – Complexity
    – Space: O(n d_r), where d_r is the average number of nodes within radius r of a node
    – Time: O(K n d_r d₂)
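A rough sketch of the pruned computation follows. The slide does not say how the radius is measured, so hop distance with edge direction ignored is an assumption here, as are the function names, the example graph, and the parameter values.

```python
from collections import deque

# Hypothetical edge list; assumed for illustration only.
EDGES = [
    ("Univ", "ProfA"), ("Univ", "ProfB"),
    ("ProfA", "StudentA"), ("ProfB", "StudentB"),
    ("StudentA", "Univ"), ("StudentB", "ProfB"),
]

def within_radius(edges, r):
    """Map each node to the nodes at most r hops away (direction ignored)."""
    adj = {}
    for p, q in edges:
        adj.setdefault(p, set()).add(q)
        adj.setdefault(q, set()).add(p)
    near = {}
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if dist[u] == r:
                continue  # do not expand beyond the radius
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        near[src] = set(dist)
    return near

def pruned_simrank(edges, r=2, C=0.8, K=5):
    near = within_radius(edges, r)
    in_nbrs = {v: [p for p, q in edges if q == v] for v in near}
    pairs = [(a, b) for a in near for b in near[a]]  # only nearby pairs are stored
    R = {(a, b): float(a == b) for a, b in pairs}
    for _ in range(K):
        nxt = {}
        for a, b in pairs:
            ia, ib = in_nbrs[a], in_nbrs[b]
            if a == b:
                nxt[a, b] = 1.0
            elif not ia or not ib:
                nxt[a, b] = 0.0
            else:
                # Pairs outside the radius are not stored and count as 0.
                total = sum(R.get((x, y), 0.0) for x in ia for y in ib)
                nxt[a, b] = C * total / (len(ia) * len(ib))
        R = nxt
    return R

pruned = pruned_simrank(EDGES, r=2)
print(len(pruned), "pairs stored")
```

Only pairs within the radius get an entry, so the table grows with n·d_r rather than n²; the price is that scores of distant pairs are treated as exactly 0.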
Random Surfer-Pairs Model
An intuitive model for interpreting similarity scores
  – Based on “random surfers”
  – Shows that the SimRank score s(a, b) measures how soon two random surfers are expected to meet at the same node
Expected Distance (ED)
  – u and v are nodes in a strongly connected graph
  – The ED from u to v is exactly the expected number of steps a random surfer would take before first reaching v, starting from u
  – Tour t = <w_1, …, w_k>
  – l[t]: length of t
  – P[t]: probability of traveling t
$$
d(u,v) = \sum_{t:\, u \leadsto v} P[t]\, l[t]
$$
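The sum over tours can be approximated by pushing probability mass forward step by step. In this sketch the surfer is assumed to pick an out-edge uniformly at random, d(u, u) is taken to be 0, and the sum is truncated at `max_steps`; the function name and the example graphs are hypothetical.

```python
def expected_distance(edges, u, v, max_steps=200):
    """Approximate d(u, v): expected number of steps to first reach v from u."""
    if u == v:
        return 0.0  # assumed convention: the zero-length tour
    out = {}
    for p, q in edges:
        out.setdefault(p, []).append(q)
    mass = {u: 1.0}  # probability of being at each node without having reached v
    total = 0.0
    for step in range(1, max_steps + 1):
        nxt = {}
        for node, prob in mass.items():
            succ = out.get(node, [])
            for w in succ:
                share = prob / len(succ)  # uniform choice among out-edges
                if w == v:
                    total += share * step  # tours ending at v after `step` steps
                else:
                    nxt[w] = nxt.get(w, 0.0) + share
        mass = nxt
    return total

# Hypothetical strongly connected graphs, assumed for illustration.
cycle = [("a", "b"), ("b", "c"), ("c", "a")]
branch = [("a", "b"), ("a", "c"), ("b", "a"), ("c", "a")]
print(expected_distance(cycle, "a", "c"))
print(expected_distance(branch, "a", "b"))
```

Mass that reaches v is removed from circulation, so each tour is counted once, at the step where it first arrives at v.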
Random Surfer-Pairs Model
Expected Meeting Distance (EMD)
  – EMD is symmetric
  – The EMD m(a, b) is simply the expected distance in G² from (a, b) to any singleton node (x, x) ∈ V²
$$
m(a,b) = \sum_{t:\, (a,b) \leadsto (x,x)} P[t]\, l[t]
$$
[Figure: example graphs with their EMD values: m(v,w) = 1, m(u,v) = ∞, m(u,w) = ∞; m(*,*) = ∞; m(*,*) = 3]
Random Surfer-Pairs Model
Expected-f Meeting Distance
  – Our approach to circumvent the “infinite EMD” problem
    – Map all distances to a finite interval: instead of computing the expected length l(t) of a tour, compute the expected value of c^l(t)
Equivalence to SimRank
  – s'(*, *) exactly models our original definition of SimRank scores
$$
s'(a,b) = \sum_{t:\, (a,b) \leadsto (x,x)} P[t]\, c^{l(t)}
$$
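The sum defining s'(a, b) can be evaluated by tracking a pair of surfers jointly. The sketch below assumes each surfer steps to a uniformly chosen in-neighbor (that is, walks edges backwards), which is what makes the sum line up with the in-neighbor form of the SimRank equation; the slide does not state the direction. The function name, the example graph, c = 0.8, and the truncation at `max_steps` are likewise assumptions.

```python
# Hypothetical edge list; assumed for illustration only.
EDGES = [
    ("Univ", "ProfA"), ("Univ", "ProfB"),
    ("ProfA", "StudentA"), ("ProfB", "StudentB"),
    ("StudentA", "Univ"), ("StudentB", "ProfB"),
]

def expected_f_meeting(edges, a, b, c=0.8, max_steps=20):
    """Truncated sum of P[t] * c**l(t) over tours from (a, b) to a first meeting."""
    if a == b:
        return 1.0  # zero-length tour: c**0 == 1
    in_nbrs = {}
    for p, q in edges:
        in_nbrs.setdefault(q, []).append(p)
    mass = {(a, b): 1.0}  # probability of each pair, surfers not yet met
    total = 0.0
    for step in range(1, max_steps + 1):
        nxt = {}
        for (x, y), prob in mass.items():
            ix, iy = in_nbrs.get(x, []), in_nbrs.get(y, [])
            for x2 in ix:
                for y2 in iy:
                    share = prob / (len(ix) * len(iy))
                    if x2 == y2:
                        total += share * c ** step  # surfers meet at (x2, x2)
                    else:
                        nxt[x2, y2] = nxt.get((x2, y2), 0.0) + share
        mass = nxt
    return total

print(round(expected_f_meeting(EDGES, "ProfA", "ProfB"), 3))
```

Truncating at `max_steps` counts only tours of at most that length, so the result after k steps coincides with the k-th iterate R_k of the naïve method for the same graph and constant.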
Conclusion
Main contributions
  – A formal definition of SimRank similarity scoring over arbitrary graphs, several useful derivatives of SimRank, and an algorithm to compute SimRank
  – A graph-theoretic model for SimRank that gives intuitive mathematical insight into its use and computation
  – Experimental results using an in-memory implementation of SimRank over two real data sets show the effectiveness and feasibility of SimRank
Future Work
Address efficiency and scalability issues
  – Including additional pruning heuristics and disk-based algorithms
Consider ternary (or more) relationships in computing structural-context similarity
Explore the combination of SimRank with other domain-specific similarity measures