SimFusion+: Extending SimFusion Towards Efficient ...wyu1/ppt/oral/sigir12.pdf · Weiren 1Yu , Xuemin 1Lin1, Wenjie Zhang , Ying Zhang1 Jiajin Le2, SimFusion+: Extending SimFusion

Weiren Yu1, Xuemin Lin1, Wenjie Zhang1,

Ying Zhang1 Jiajin Le2,

SimFusion+: Extending SimFusion Towards Efficient

Estimation on Large and Dynamic Networks

1 University of New South Wales & NICTA, Australia

2 Donghua University, China

SIGIR 2012

2

2. Problem Definition

Contents

4. Experimental Results

1. Introduction

3. Optimization Techniques

3

1. Introduction

Many applications require a measure of “similarity” between objects.

similarity search

Citation of Scientific

Papers

(citeseer.com) (amazon.com)

Recommender System

Graph Clustering Web Search Engine

(google.com)

4

SimFusion: A New Link-based Similarity Measure

Structural Similarity Measure

PageRank [Page et. al, 99]

SimRank [Jeh and Widom, KDD 02]

SimFusion similarity

A new promising structural measure [Xi et. al, SIGIR 05]

Extension of Co-Citation and Coupling metrics

Basic Philosophy

Following the Reinforcement Assumption:

The similarity between objects is reinforced by the similarity of

their related objects.

5

SimFusion Overview

Features

Using a Unified Relationship Matrix (URM) to represent

relationships among heterogeneous data

Defined recursively and is computed iteratively

Applicable to any domain with object-to-object relationships

Challenges

URM may incur trivial solution or divergence issue of SimFusion.

Rather costly to compute SimFusion on large graphs

Naïve Iteration: matrix-matrix multiplication

Requiring O(Kn3) time, O(n2) space [Xi et. al. , SIGIR 05]

No incremental algorithms when edges update

6

Existing SimFusion: URM and USM

Data Space: a finite set of data objects (vertices)

Data Relation (edges) Given an entire space

Intra-type Relation carrying info. within one space

Inter-type Relation carrying info. between spaces

Unified Relationship Matrix (URM):

λi,j is the weighting factor between Di and Dj

Unified Similarity Matrix (USM):

1 2{ , , }o o

,i i i i

,i j i j

1

N

ii

1

1, ,

, if ;

, , if , ;

0, otherwise.

j

j

jn

i j i jx

x

x y x y

L

1,1 1,1 1,2 1,2 1, 1,

2,1 2,1 2,2 2,2 2, 2,

URM

,1 ,1 ,2 ,2 , ,

N N

N N

N N N N N N N N

L L L

L L LL

L L L

1,1 1,

,1 ,

. . .

n

n n T

n n n

s s

s t

s s

S S L S L

7

Example.

12

3

intra-typerelationship

inter-typerelationship

dataspace

data object

1v

2v 3v

4v5v

6v

1 2 3

1 1 11 4 4 2

51 12 8 4 8

31 13 5 5 5

Λ

1 1 1 1 18 8 4 4 4

1 1 14 4 2

5 5 51 1 116 16 4 24 24 24

URM 31 1 1 1 110 10 5 15 15 15

31 1 1 1 110 10 5 15 15 15

31 1 1 110 10 5 10 10

0

0 0 0

0

L

High complexity !!!

O(Kn3) time

O(n2) space

. . .n n Ts t S S L S L

1,2 1,

2,1 2,

USM

,1 ,2

1

1

1

n

n

n n

s s

s s

s s

S

SimFusion Similarity on Heterogeneous Domain

Trivial Solution !!!

S=[1]nxn

8

Contributions

Revising the existing SimFusion model, avoiding

non-semantic convergence

divergence issue

Optimizing the computation of SimFusion+

O(Km) pre-computation time, plus O(1) time and O(n) space

Better accuracy guarantee

Incremental computation on edge updates

O(δn) time and O(n) space for handling δ edge updates

9

Revised SimFusion

Motivation: Two issues of the existing SimFusion model

Trivial Solution on Heterogeneous Domain

Divergent Solution on Homogeneous Domain

Root cause: row normalization of URM !!!

10

From URM to UAM

Unified Adjacency Matrix (UAM)

Example

1

, ,

, if ;

, if , ;

0, otherwi e

1

s

,

.

j jn

i j i j

x

x y x y

A

11

Revised SimFusion+

Basic Intuition

replace URM with UAM to postpone “row normalization”

in a delayed fashion while preserving the reinforcement

assumption of the original SimFusion

Revised SimFusion+ Model Original SimFusion

squeeze similarity scores in S into [0, 1].

12

Optimizing SimFusion+ Computation

Conventional Iterative Paradigm

Matrix-matrix multiplication, requiring O(kn3) time and O(n2) space

Our approach: To convert SimFusion+ computation into

finding the dominant eigenvector of the UAM A.

Matrix-vector multiplication, requiring O(km) time and O(n) space

Pre-compute σmax(A) only once, and cache it for later reuse

13

Example

Conventional Iteration:

Our approach:

Assume with

14

Key Observation

Kroneckor product “⊗”:

e.g.

Vec operator:

e.g.

Two important Properties:

5 6 5 6 5 6 10 121 2

7 8 7 81 2 5 6 7 8 14 16, ,

3 4 7 8 15 18 20 245 6 5 63 4

21 24 28 327 8 7 8

X Y X Y

( ) [1 3 2 4]Tvec X

15

Key Observation

Two important Properties:

P1.

P2.

Our main idea:

(1)

(2)

Power Iteration

16

Accuracy Guarantee

Conventional Iterations: No accuracy guarantee !!!

Question: || S(k+1) – S || ≤ ?

Our Method: Utilize Arnoldi decomposition to build an

order-k orthogonal subspace for the UAM A.

Due to Tk small size and almost “upper-triangularity”, Computing σmax(Tk) is less costly than σmax(A).

17

Accuracy Guarantee

Arnoldi Decomposition:

k-th iterative similarity

Estimate Error:

18

Example

Arnoldi Decomposition:

Assume with

Given

(1)

(2)

(3)

19

Edge Update on Dynamic Graphs

Incremental UAM

Given old G =(D,R) and a new G’=(D,R’), the incremental UAM is

a list of edge updates, i.e.,

Main idea

To reuse and the eigen-pair (αp, ξp) of the old A to compute

is a sparse matrix when the number δ of edge updates is small.

Incrementally computing SimFusion+

O(δn) time

O(n) space

20

Example

Suppose edges (P1,P2) and (P2,P1) are removed.

21

Experimental Setting Datasets

Synthetic data (RAND 0.5M-3.5M)

Real data (DBLP, WEBKB)

Compared Algorithms

SimFusion+ and IncSimFusion+ ;

SF, a SimFusion algorithm via matrix iteration [Xi et. al, SIGIR 05];

CSF, a variant SF, using PageRank distribution [Cai et. al, SIGIR 10];

SR, a SimRank algorithm via partial sums [Lizorkin et. al, VLDBJ 10];

PR, a P-Rank encoding both in- and out-links [Zhao et. al, CIKM 09];

DBLP

WEBKB

22

Experiment (1): Accuracy

On DBLP and WEBKB

SF+ accuracy is consistently stable on different datasets.

SF seems hardly to get sensible similarities as all its similarities asymptotically approach the same value as K grows.

23

Experiment (2): CPU Time and Space

On DBLP

On WEBKB

SF+ outperforms the other approaches, due to the use of σmax(Tk)

24

Experiment (3): Edge Updates

IncSF+ outperformed SF+ when δ is small.

For larger δ, IncSF+ is not that good because the small value of δ preserves the sparseness of the incremental UAM.

Varying δ

25

Experiment (4) : Effects of

The small choice of imposes more iterations on computing Tk and vk, and hence increases the estimation costs.

26

Conclusions A revision of SimFusion+, for preventing the trivial solution

and the divergence issue of the original model.

Efficient techniques to improve the time and space of

SimFusion+ with accuracy guarantees.

An incremental algorithm to compute SimFusion+ on

dynamic graphs when edges are updated.

Devise vertex-updating methods for incrementally

computing SimFusion+.

Extend to parallelize SimFusion+ computing on GPU.

Future Work

27

Documents

SimFusion+: Extending SimFusion Towards Efficient ...wyu1/ppt/oral/sigir12.pdf · Weiren 1Yu , Xuemin 1Lin1, Wenjie Zhang , Ying Zhang1 Jiajin Le2, SimFusion+: Extending SimFusion