Measuring Large-Scale Dynamic Graph Similarity by RICom ...best.snu.ac.kr/pub/paper/2015.ICDM.jklee.pdf · Measuring Large-Scale Dynamic Graph Similarity by RICom: RWR with Intergraph

Measuring Large-Scale Dynamic Graph Similarity by RICom: RWR with Intergraph Compression

Jaekoo Lee†, Gunn Kim\, and Sungroh Yoon†∗

†Electrical Engineering and Computer Science, Seoul National University, Seoul 151-744, Korea

\Department of Physics, Sejong University, Seoul 143-747, Korea

∗Correspondence: [email protected]

Abstract—‘By how much is a large-scale graph transformed over time or by a significant event?’ and ‘How structurally similar are two large-scale graphs?’ are the two questions that this paper attempts to address. The proposed method calculates graph similarity efficiently and accurately. Our approach is based on the well-known random walk with restart (RWR) algorithm, which quantifies relevance between nodes to express the structural and connection characteristics of graphs. Intergraph compression, which is inspired by interframe compression, merges two input graphs and reorders their nodes, contributing to improved data storage efficiency and processing convenience. This is a boon to the RWR algorithm for large-scale graphs. The representation of a graph transformed via intergraph compression can be used to show similarity accurately because sub-matrix blocks are reordered to concentrate nonzero elements. When the RWR algorithm, which quantifies inter-node relevance, is performed, the graph representation transformed by intergraph compression requires less space and produces results more quickly and accurately than conventional graph transformation schemes. We demonstrate the validity of our method through experiments and apply it to the usage data of a public transportation SmartCard in a large metropolitan area to suggest the usefulness of the proposed algorithm.

I. INTRODUCTION

Graphs have been actively employed in attempts to analyze complex systems in a wide variety of applications such as traffic maps, the Internet, human brain maps, and protein-protein interaction (PPI) networks [1], [2]. We pay special attention to efforts that compare and contrast two graphs by capturing their connectivity features. Research on graph similarity measurements is useful for classification tools that exploit connectivity features extracted from static graphs, e.g., a PPI network. In applications that can be abstracted with dynamic graphs, similarity in graph connectivity is monitored to detect anomalies that manifest in the form of, for instance, accidents and intrusions [3]. Big data analysis has become an active area of research in recent years, and analyzing large-scale interconnected data such as social networking service (SNS) data is already an intensely studied topic. In particular, this paper focuses on analyzing and measuring large-scale graphs whose contents change dynamically over time or after an important event.

This paper proposes a method that efficiently measures the structural change of a dynamic large-scale graph as well as the similarity between two graphs in a large graph set. Our initial goal is to analyze a dynamic graph that changes continuously over time. In addition, we apply our method to static graph datasets from the real world to present an optimized analysis that can support similarity-based analysis tools.

Fig. 1: Categorizing node and graph similarity measures [8]. Diagram labels: similarity between nodes (roles, proximity); similarity between graphs (known node correspondence, unknown node correspondence).

The idea presented in this paper starts from a simple observation that some concepts in video processing could perhaps be borrowed for manipulating dynamic graphs. This is because image segmentation methods often abstract pixels into graphs and employ graph algorithms to produce meaningful results [4]. That is, there exists a conceptual similarity between a video (which consists of a sequence of still images) and a dynamic graph (which is a series of adjacency matrices of static graphs). As an image sequence processing method, interframe compression [5] offers improved data storage efficiency and ease of operations because it compresses common pixel data in adjacent frames. Adapting the core idea of interframe compression for graph processing, we thus propose an intergraph compression method, which compresses adjacent static graphs in a dynamic graph by exploiting the structural features shared between graphs. The proposed method can measure changes in dynamic graphs with the RWR algorithm [4] via the Schur complement [6], [7] for efficiency and accuracy.

In our experiments, the proposed method indeed produced accurate results, and its processing time and storage requirement were also improved over existing state-of-the-art methods. The validity and fitness of the proposed algorithm were experimentally tested using synthetic graph data [3] and real-world graph data [9]. The method was also tested with the dynamically changing subway SmartCard data extracted from a large metropolitan public transportation system.

The rest of the paper is organized as follows: Section 2 provides background materials. Section 3 presents the details of our approach and compares it with existing state-of-the-art methods. Section 4 shows our experimental results with both synthetic and real-world datasets and also analyzes the time and space complexity of the proposed method. Section 5 concludes the paper. The descriptions of the symbols used in this paper are listed in Table I.

II. PRELIMINARIES

A. Graph Similarity Measure

As smart devices have become widespread, social network services such as Facebook or Twitter generate large-scale

TABLE I: The symbols and descriptions used in this paper

symbol                           description
G or G(V, E)                     graph; graph with V: set of nodes, E: set of edges
GU or GU(VU, EU)                 union graph of graphs G1 and G2
n, m                             number of nodes, number of edges
c                                probability of restarting
k                                number of hubs removed in node reordering method
w                                number of total hubs, w ≪ n
sim(G1, G2)                      similarity between graphs G1 and G2
I [n × n]                        identity matrix
A [n × n]                        adjacency matrix of graph G
A′ [n × n]                       inter-graph compressed adjacency matrix of A
D [n × n]                        diagonal degree matrix of graph G, $d_{ii} = \sum_j a_{ij}$ ($d_{ij} = 0$ for $i \neq j$)
D′ [n × n]                       inter-graph compressed diagonal degree matrix of D
Bij [ni × nj]                    (i, j)th block matrix of B
P [n × n]                        matrix $P = [I - (1 - c) A D^{-1}]$
P′ [n × n]                       matrix $P' = [I - (1 - c) A' D'^{-1}]$
P′ = [P′11, P′12; P′21, P′22]    2-by-2 block partitioned matrix
$\vec{e}_i$∗ [n × 1]             starting indicator (seed) vector
$\vec{r}_i$ [n × 1]              relevance vector of node i
$r_{ij}$                         relevance value of node j w.r.t. node i
R [n × n]                        matrix with elements $r_{ij}$
R = [R11, R12; R21, R22]         2-by-2 block partitioned matrix
S′ [w × w]                       Schur complement of block P′11 of matrix P′

∗ unit vector with 1 in the ith element.

graph data, which has been the focus of active research. For example, the connectivity of networks can be expressed in a dynamically changing graph for analysis, or individuals can be grouped according to the brain connectivity graph [3]. In one study, data was collected from the functional parts of the human brain and converted into a graph for analysis to classify people. Such studies measure similarity between graphs in terms of structure and/or features and allow for intuitive comparisons [8]. This paper also attempts to capture and numerically represent graph similarity. We provide an overview of the general characteristics of conventional graph measures that pertain to our work. Fig. 1 shows that our proposed method belongs to the category of graph research that excludes unknown node correspondence, because it compares graphs based on approximated information about the structural roles of nodes and connectivity features. It therefore applies to graph research where node correspondence is assumed.

State-of-the-art graph similarity measures satisfy the following properties [3]. A similarity measure needs to reflect Edge Importance, which states that insertion and removal of edges between graphs need to be accounted for. Weight Awareness also needs to be satisfied, so that any change to the weight of an edge is reflected. A similarity measure also needs to be sensitive to Edge Sub-modularity, so that changes to higher-degree nodes have less impact than changes to lower-degree nodes. Assuming that changes in graphs due to accidents are more frequent than arbitrary changes in edges, the measure needs to identify edge changes surrounding specific nodes, satisfying Focus Awareness. Through experiments, we verify that our method satisfies the four properties above.

B. Random Walk with Restart (RWR)

The RWR algorithm attempts to measure the affinity or relevance of two nodes in a graph [4]. Conceptually, in RWR, a particle walks between nodes following randomly chosen edges. The particle starts walking at the initial node i, chooses the next node among those connected through edges, each of which is assigned a weighted probability, and then makes its move. Repeating moves in this fashion multiple times allows for estimation of the probability of each node having the particle on it. This is a quantified value shaped by the structure of graph connectivity, and hence is used as a relevance measure between two nodes [10]. The relevance score produced by the RWR algorithm accounts for the entire connectivity structure of the graph, and so is employed as an essential analysis tool.

Since its introduction, the RWR algorithm has developed to accommodate expanding graphs and to numerically measure the relevance of nodes in such graphs while improving on time complexity and accuracy [4], [10], [11]. Variations of the RWR algorithm have been employed in earlier applications such as automatic image captioning [4], and are expanding into various fields. Research on the RWR algorithm encompasses practical applications as well as theoretical work on optimizing computation and storage, notably for large-scale graphs. Some of the well-known variations of the RWR algorithm include PageRank, the electrical network analogy, and personalized RWR [10]. Related studies have been applied to discover user similarity or influential individuals in SNS [12].

The iterative form of a basic RWR algorithm [4], [10] is:

$$\vec{r}_i^{(t)} \leftarrow (1 - c) A D^{-1} \vec{r}_i^{(t-1)} + c\,\vec{e}_i \tag{1}$$

where c is the probability of restarting the random walk from the initial node, $\vec{e}_i$ is the starting indicator vector, $\vec{r}_i$ is the unknown relevance (affinity or influence) column vector, and (t) is the number of updates applied to $\vec{r}_i^{(t)}$. The vector $\vec{r}_i^{(t)}$ is obtained by repeatedly processing Eq. 1 until $|\vec{r}^{(t)} - \vec{r}^{(t-1)}| < \text{threshold}$ holds true. The component $r_{ij}$ in the produced vector $\vec{r}_i^{(t)}$ is the relevance of node j w.r.t. node i. In other words, $r_{ij}$ is the numerically translated value of the influence of node i on node j in the context of the connectivity structure. Intuitively, node i has more influence on node j if the paths linking node i to node j are numerous, short, or heavily weighted. The iterative form of the RWR algorithm performs a multiplication with an [n × n] matrix in each iteration to compute $\vec{r}_i^{(t)}$, and hence is computationally inefficient.
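The update in Eq. 1 can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the implementation used in the paper: it assumes a dense NumPy adjacency matrix of an undirected graph, an L1 stopping test, and a default c = 0.15; the function name `rwr_iterative` and the toy path graph are hypothetical.

```python
import numpy as np

def rwr_iterative(A, seed, c=0.15, tol=1e-10, max_iter=1000):
    """Iterate Eq. 1 for one seed node until successive iterates differ by less than tol."""
    n = A.shape[0]
    degrees = A.sum(axis=1)              # d_ii = sum_j a_ij
    degrees[degrees == 0] = 1.0          # assumption: leave isolated nodes unnormalized
    W = A / degrees                      # A D^{-1}: column j is divided by d_jj
    e = np.zeros(n)
    e[seed] = 1.0                        # starting indicator vector e_i
    r = e.copy()
    for _ in range(max_iter):
        r_next = (1.0 - c) * (W @ r) + c * e
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

# Toy undirected path graph 0 - 1 - 2 - 3 (illustrative data, not from the paper).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
r0 = rwr_iterative(A, seed=0)
```

Each pass costs one matrix-vector product, and the loop has to be repeated for every seed node, which is the inefficiency noted above.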

When 0 < c < 1, $\vec{r}^{(t)}$ generally approaches a unique solution [10]. Because this solution exists, the RWR algorithm can be defined as the following linear system:

$$[I - (1 - c) A D^{-1}]\,\vec{r}_i = c\,\vec{e}_i. \tag{2}$$

With this equation, $\vec{r}_i$ can be rewritten by means of an inverse matrix operation in linear algebra as follows:

$$\vec{r}_i = c\,[I - (1 - c) A D^{-1}]^{-1} \vec{e}_i = c P^{-1} \vec{e}_i. \tag{3}$$
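As a sanity check on Eq. 3, the closed form can be evaluated for all seed nodes at once by solving against the identity matrix. This is a small sketch assuming a dense undirected adjacency matrix; the name `rwr_closed_form` and the toy graph are illustrative only.

```python
import numpy as np

def rwr_closed_form(A, c=0.15):
    """Return R whose i-th column is r_i = c P^{-1} e_i, with P = I - (1 - c) A D^{-1}."""
    n = A.shape[0]
    degrees = A.sum(axis=1)
    degrees[degrees == 0] = 1.0          # assumption: leave isolated nodes unnormalized
    P = np.eye(n) - (1.0 - c) * (A / degrees)
    return c * np.linalg.solve(P, np.eye(n))   # c P^{-1}, one column per seed node

# Toy undirected path graph 0 - 1 - 2 - 3 (illustrative data, not from the paper).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
R = rwr_closed_form(A)
```

Every column of R sums to 1 here because A D^{-1} is column-stochastic, which gives a quick way to check an implementation.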

This RWR algorithm suffers from the fact that an inverse matrix operation has to be performed for an [n × n] matrix to obtain $\vec{r}_i$, and that its computational complexity, $O(n^3)$, grows rapidly as n becomes larger [6]. There are several studies that have attempted to reduce time complexity, some of which are listed in Table II. As charted in the table, variations of

TABLE II: Related work on the RWR approach

methods                        details
Basic Iteration [10]           $\vec{r}^{(t)} \leftarrow (1 - c) A D^{-1} \vec{r}^{(t-1)} + c\,\vec{e}$
Inversion [10]                 $\vec{r} = c\,[I - (1 - c) A D^{-1}]^{-1} \vec{e}$
QR Decomposition [11]          $\vec{r} = c\,R^{-1} Q^{T} \vec{e}$
LU Decomposition [3]           $\vec{r} = c\,U^{-1} L^{-1} \vec{e}$
B_LIN [4]                      $\vec{r} = c\,[I - (1 - c) A_1 - (1 - c) A_2]^{-1} \vec{e} \approx c\,[I - (1 - c) A_1 - (1 - c) U \Sigma V]^{-1} \vec{e}$
Fast Belief Propagation [3]    $\vec{r} = c\,[I - \epsilon^2 D - \epsilon A]^{-1} \vec{e}$
BEAR [7]                       $\vec{r} = [r_1; r_2]$, where $r_1 = H_{11}^{-1}(c q_1 - H_{12} r_2)$ and $r_2 = S^{-1}(c q_2 - H_{21} H_{11}^{-1}(c q_1))$
Proposed                       $\vec{r} = c\,[I - (1 - c) A' D'^{-1}]^{-1} \vec{e}$

$P = QR$, $Q^{-1} = Q^{T}$; $P = LU$; $A D^{-1}$ is divided into inner-community edges $A_1$ and cross-community edges $A_2$, with $A_2 \approx U \Sigma V$ (low-rank matrix); H is the partitioned matrix from SlashBurn, and $S = H_{22} - H_{21} H_{11}^{-1} H_{12}$; $\epsilon = 1/(1 + \max_i d_{ii})$ is a positive constant (< 1) encoding neighbor influence.

Fig. 2: Overview of proposed method. Input: applications (traffic, computer network, SNS, ...). 1st stage (preprocess): extraction of raw data, data cleansing, database, (A) representing graph dataset (dynamic or static). 2nd stage (inter-graph compression): (B) union graph, (C) reordering nodes, yielding compressed matrices A′ and D′. 3rd stage (graph similarity measure): (D) forming matrix P′ for RWR, (E) calculating RWR by Schur complement, (F) calculating matrix distance and similarity score. Output: Sim(G1, G2) ∈ [0, 1].

RWR include iterative methods, inversion methods [10], QR decomposition methods [11], RPPR/BRPPR methods [10], B_LIN/NB_LIN methods [4], fast belief propagation-based (DeltaCon) methods [3], and LU decomposition methods [7]. The RWR computation used in our method to measure the structural features of a graph shows improved time/space efficiency over state-of-the-art RWR variations, and it has been devised to fit large-scale graphs.

C. Schur Complement

The Schur complement [6] is applied as a part of the solution to a linear equation system, and is an important tool for numerical analysis and matrix operations. In graph mining, its use was pioneered by the block elimination approach for RWR (BEAR) [7]. In the proposed method, it is also an essential tool because it allows the RWR algorithm to be performed efficiently. If the size of a matrix M = [W, X; Y, Z] is [(p + q) × (p + q)], then its constituent block sub-matrices W, X, Y, and Z have the sizes [p × p], [p × q], [q × p], and [q × q], respectively, where p ≪ q in general. If Z is invertible, then the Schur complement for block Z in matrix M can be defined as $S = W - X Z^{-1} Y$, and can be used to calculate the inverse of matrix M as follows:

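$$M^{-1} = \begin{bmatrix} W & X \\ Y & Z \end{bmatrix}^{-1} = \begin{bmatrix} S^{-1} & -S^{-1} X Z^{-1} \\ -Z^{-1} Y S^{-1} & Z^{-1} + Z^{-1} Y S^{-1} X Z^{-1} \end{bmatrix}.$$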
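The block formula above can be checked numerically against a direct inverse. The sketch below assumes dense NumPy blocks and a randomly generated, diagonally dominant test matrix; the helper name `block_inverse` is illustrative and not taken from the paper.

```python
import numpy as np

def block_inverse(W, X, Y, Z):
    """Invert M = [[W, X], [Y, Z]] through the Schur complement S = W - X Z^{-1} Y."""
    Z_inv = np.linalg.inv(Z)
    S_inv = np.linalg.inv(W - X @ Z_inv @ Y)
    top_right = -S_inv @ X @ Z_inv
    bottom_left = -Z_inv @ Y @ S_inv
    bottom_right = Z_inv + Z_inv @ Y @ S_inv @ X @ Z_inv
    return np.block([[S_inv, top_right], [bottom_left, bottom_right]])

rng = np.random.default_rng(0)
p, q = 2, 5                                                  # small block W, larger block Z
M = rng.random((p + q, p + q)) + (p + q) * np.eye(p + q)     # diagonally dominant, so invertible
M_inv = block_inverse(M[:p, :p], M[:p, p:], M[p:, :p], M[p:, p:])
```

Only Z (size q) and S (size p) are inverted, so the approach pays off when Z has structure that makes its inverse cheap, as with the block diagonal case discussed later.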

D. Measurement of Matrix Distance

There have been various attempts to find similarity or distance between matrices, such as using matrix correlation or cosine similarity [6]. In this paper, we choose the Euclidean distance because it measures distance directly over the relevance values of each node in the graph as calculated with the RWR algorithm, so that the similarity between two graphs is represented intuitively.

III. PROPOSED METHOD

We present a method to efficiently measure (1) the change of a dynamic graph whose connectivity structure evolves over time and (2) the similarity of large-scale graph datasets, along with justification for our method. The definition of the problem we want to address can be stated briefly as follows [3]:

Given: two graphs G1(V1, E1) and G2(V2, E2), each consisting of a node set V and an edge set E; we assume that the node correspondence of sets V1 and V2 is known.

Find: sim(G1, G2) that lies between 0 and 1 and intuitively signifies the structural similarity between the two given graphs. Here, sim(G1, G2) = 0 indicates that the two graphs are structurally complementary, whereas sim(G1, G2) = 1 means the two graphs are completely identical.

Fig. 2 summarizes the overall architecture of our method. In the first stage of our method, the provided dataset is converted to a dynamic graph, and two adjacent static graphs G1(V1, E1) and G2(V2, E2) extracted from the given dynamic graph go through the following steps in sequence. Since intergraph compression borrows its core concept from interframe compression, the first task is creating the union graph GU from the two input graphs so that the commonly shared structural features can be relocated for compression. Afterwards, a state-of-the-art node reordering algorithm is applied to GU, which stores the common edges. In the adjacency matrix, nonzero elements indicate edges between nodes, and reordering concentrates these nonzero elements in a focused region. This not only has a compressing effect on the matrix, but also facilitates partitioning when applying the RWR algorithm and provides a basis for applying the Schur complement.

Following the node order computed in intergraph compression, each graph is also reordered with the same node index, and matrix P′ is calculated to apply the RWR algorithm so that the connectivity between nodes can be identified. The Schur complement is applied to P′, a 2-by-2 block partitioned matrix, and the relevance matrix R is calculated, which shows the affinity between nodes in the structural context of the entire graph. The relevance matrices R1 and R2 are efficiently calculated with a partitioned inverse matrix operation through the Schur complement. Both R1 and R2 share an identical node index, so, owing to the block partitioning, they are reduced to a single numeric value by applying matrix distance operations to corresponding block components. The value, which lies between 0 and 1, intuitively expresses the similarity between the two graphs, and is the final result of our method. The numeric results obtained this way can be used as a tool to measure changes in a dynamic graph or to numerically express similarity between graphs.

Fig. 3: Visualization of intergraph compression on real-world graphs. Panels: adjacency matrices of G1(V1, E1), GU(VU, EU), and G2(V2, E2) over the original node index, and the intergraph compressed G1, GU, and G2 over the reordered node index, linked by the steps ‘Union Graph’ and ‘Reordering Nodes’. The union graph is an example of the merged graph GU(V, E) = G1(V1, E1) ∪ G2(V2, E2) for intergraph compression. This is applicable when every single node in V1 of G1(V1, E1) has its corresponding node in V2 of G2(V2, E2) as well as when V1 ≠ V2, because the node set of GU can be defined as V = V1 ∪ V2. In reordering nodes, SlashBurn removes hubs with the highest degree and the edges connected to the removed hubs, and produces spokes with smaller degrees. The hub nodes are relocated to the front of the index and the spokes to the back, repeatedly, to reorder the nodes. As a result, the hubs with higher degrees are concentrated closely together in the adjacency matrix of the given graph, producing compression and partition effects. With intergraph compression, it is possible to verify shared structural data between two graphs by merging the target graphs. The node reordering algorithm is applied to the merged graph to obtain a compression effect like that in motion picture processing, hence improving space and computation efficiency.

A. Representing Graph Dataset

Several methods have been proposed to visually analyze the interactive relation of data components in a given dataset using graphs [1]. In the preprocessing stage of our method, each piece of data is converted to a node and each relation between data to an edge to represent the input data as a graph, which is one of the simplest approaches to graph conversion. This approach is valid for converting not only static datasets but also dynamic datasets, which change over time or after an event. Our method is then applied to the generated graph for clustering (static datasets) or for tracing changes (dynamic datasets).

B. Union Graph

The proposed method is inspired by the observation that data commonly shared between adjacent still frames of motion pictures can be processed in compressed form. The same approach may be applicable to two adjacent graphs in a dynamic graph setting. Note that for a dynamic graph in the real world [9], changes almost always happen in a continuum, and abrupt changes virtually never arise. With this in mind, the concept of compressing pixel data shared between adjacent frames, as in interframe compression for motion pictures, is applied to two dynamically changing graphs, capturing the structural features shared between the two input graphs. Our method is comparable to interframe compression in offering storage efficiency and ease of operations.
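The union step can be illustrated on adjacency matrices that already share one node index. The sketch below is an assumption-laden example rather than the paper's procedure: it takes the element-wise maximum as the edge weight of the union graph and the element-wise minimum as the shared structure; the name `union_adjacency` and the toy graphs are hypothetical.

```python
import numpy as np

def union_adjacency(A1, A2):
    """Adjacency matrix of G_U = G1 ∪ G2 for graphs that share one node index."""
    if A1.shape != A2.shape:
        raise ValueError("pad both matrices to the common node set V1 ∪ V2 first")
    return np.maximum(A1, A2)   # assumption: an edge of G_U keeps the larger of the two weights

# Two toy snapshots of a 3-node graph (illustrative data, not from the paper).
A1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)   # edges 0-1, 1-2
A2 = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)   # edges 0-1, 0-2
AU = union_adjacency(A1, A2)
shared = np.minimum(A1, A2)     # edges present in both snapshots (here only 0-1)
```

When consecutive snapshots change gradually, `shared` covers most of `AU`, which is the redundancy that intergraph compression exploits.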

C. Reordering Nodes

Several node reordering methods are available for use: degree sort, cross association, spectral clustering, and shingle, all of which bring compressing effects [13]. For our approach, we note that most real graphs possess the pattern of a scale-free network. Inspired by BEAR [7], we thus choose SlashBurn [13], a state-of-the-art node reordering algorithm that concentrates nonzero components in adjacency matrices as much as possible, producing an excellent compression effect. Fig. 3 clearly shows the compression effect of intergraph compression after applying this algorithm to GU, which is the union of the two given graphs.
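To make the idea of a shared node order concrete, the sketch below computes one permutation on the union graph and applies it to both input graphs. It deliberately uses a plain degree sort, one of the simpler alternatives listed above, as a stand-in; it is not SlashBurn, and the names `hub_first_order` and `reorder` are illustrative.

```python
import numpy as np

def hub_first_order(AU):
    """Permutation listing the nodes of the union graph by decreasing degree."""
    degrees = AU.sum(axis=1)
    return np.argsort(-degrees, kind="stable")   # ties keep their original order

def reorder(A, order):
    """Apply the same permutation to the rows and columns of A."""
    return A[np.ix_(order, order)]

# Two toy snapshots whose hub is node 3 (illustrative data, not from the paper).
A1 = np.zeros((5, 5))
A2 = np.zeros((5, 5))
for u, v in [(3, 0), (3, 1)]:
    A1[u, v] = A1[v, u] = 1.0
for u, v in [(3, 0), (3, 2), (3, 4)]:
    A2[u, v] = A2[v, u] = 1.0
order = hub_first_order(np.maximum(A1, A2))      # order computed once, on the union graph
A1_r, A2_r = reorder(A1, order), reorder(A2, order)
```

After reordering, the hub occupies index 0 in both matrices, so the nonzero entries sit in the first row and column and the remaining block is empty.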

D. Forming Matrix P′ for RWR

We show that block-form sub-matrices belonging to one large-scale matrix are created after compressing two input graphs with intergraph compression. Let matrix A′ be the output of intergraph compression. We apply the RWR algorithm [4] to matrix A′ to identify inter-node connections and find matrix R, which consists of $r_{ij}$ (inter-node relevance).

For an arbitrarily selected node i, $\vec{r}_i$ can be expressed as

$$\vec{r}_i = c\,[I - (1 - c) A' D'^{-1}]^{-1} \vec{e}_i = c P'^{-1} \vec{e}_i. \tag{4}$$

After placing the vectors $\vec{r}_i$ of all nodes i (1 ≤ i ≤ n) side by side, we can calculate matrix R for all of the nodes in the graph, which has each $\vec{r}_i$ as a column vector [3], as follows:

$$R = [\vec{r}_1 \;\cdots\; \vec{r}_n] = c P'^{-1} [\vec{e}_1 \;\cdots\; \vec{e}_n] = c P'^{-1} I = c P'^{-1}.$$

The matrix $P' = [I - (1 - c) A' D'^{-1}]$ is obtained from the reordered adjacency matrix A′. Thus, its block structure is similar to that of A′, and it can be expressed as a 2-by-2 block matrix in the same way.
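Forming P′ and splitting it into its four blocks can be sketched as follows. The example assumes a dense, already reordered adjacency matrix with w hub nodes at the front; `form_P` and `partition` are hypothetical helper names, and the star graph is toy data.

```python
import numpy as np

def form_P(A_prime, c=0.15):
    """P' = I - (1 - c) A' D'^{-1}, with D' the diagonal degree matrix of A'."""
    degrees = A_prime.sum(axis=1)
    degrees[degrees == 0] = 1.0                  # assumption: leave isolated nodes unnormalized
    return np.eye(A_prime.shape[0]) - (1.0 - c) * (A_prime / degrees)

def partition(P, w):
    """Split P' into P'11 [w x w], P'12, P'21, and P'22 [(n - w) x (n - w)]."""
    return P[:w, :w], P[:w, w:], P[w:, :w], P[w:, w:]

# Toy reordered graph: hub 0 linked to spokes 1..4 (illustrative data, not from the paper).
A_prime = np.zeros((5, 5))
A_prime[0, 1:] = 1.0
A_prime[1:, 0] = 1.0
P = form_P(A_prime)
P11, P12, P21, P22 = partition(P, w=1)
```

For this star graph the spokes are not linked to each other, so P′22 is the identity matrix, the extreme case of the sparse block diagonal structure described next.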

With the compression effect present in A′ due to intergraph compression, P′ is partitioned into the sub-matrices P′11, P′12, P′21, and P′22. P′11 is a small partition ([w × w]) with a very high concentration of nonzero components, whereas the two blocks P′12 ([w × (n − w)]) and P′21 ([(n − w) × w]) are symmetrical, like a pair of wings; their nonzero components are scattered depending on the connectivity of the graph. P′22, which takes up most of the P′ matrix with [(n − w) × (n − w)], has only a few nonzero components, which form a sparse nonzero block diagonal. Such a block diagonal sub-matrix is invertible because each sparse block is partitioned and approximated with that goal.

Because each small sparse block on the diagonal of P′22 is much smaller than the corresponding sub-matrix in the original matrix, each block is treated as an independent matrix and inverted on its own. The inverse of the sparse block diagonal matrix is then assembled, with approximation, by placing each inverted block at the location of its original block. Matrix P′ is a 2-by-2 block matrix. To find R, which expresses node relevance as a matrix, the inverse of P′ is obtained with the Schur complement [6], [7]. This offers the benefit of partitioning a large-scale graph matrix into blocks for improved storage efficiency.
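Inverting a block diagonal matrix block by block can be sketched as below. This is an illustration only: it assumes the block sizes are known (the parameter `block_sizes` is hypothetical) and that entries outside the diagonal blocks are exactly zero, whereas the text above allows an approximation.

```python
import numpy as np

def invert_block_diagonal(P22, block_sizes):
    """Invert a block diagonal matrix by inverting each diagonal block independently."""
    inv = np.zeros_like(P22)
    start = 0
    for size in block_sizes:
        stop = start + size
        inv[start:stop, start:stop] = np.linalg.inv(P22[start:stop, start:stop])
        start = stop
    return inv

# Random block diagonal test matrix with blocks of size 2, 1, and 3 (illustrative data).
rng = np.random.default_rng(1)
sizes = [2, 1, 3]
P22 = np.zeros((6, 6))
start = 0
for s in sizes:
    P22[start:start + s, start:start + s] = rng.random((s, s)) + s * np.eye(s)
    start += s
P22_inv = invert_block_diagonal(P22, sizes)
```

The cost is the sum of the cubes of the block sizes instead of the cube of the full dimension, which is the saving this step relies on.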

E. Calculating RWR Matrix by Schur Complement

Calculating the inverse of matrix P′ for each graph with the RWR algorithm allows for finding R, which numerically indicates inter-node relevance in the context of the overall graph structure. Exploiting the techniques used in BEAR [7], the

Fig. 4: Comparison of the proposed and existing approaches. Panels: (a) time (sec, log scale) versus number of nodes in the dataset (log scale) for Facebook, Oregon, Enron, and AS Skitter, with measured times and fitted lines for the proposed method and DeltaCon, and a ‘not completed’ mark at AS Skitter; (b) similarity score (0 to 1) versus level of corruption (1% to 90%) for Facebook, Oregon, and Enron, each annotated with the correlation of the proposed method and DeltaCon (p-value < $10^{-10}$); (c) time (sec) versus level of corruption (1% to 90%) for Facebook, Oregon, and Enron. Panel (a) gives the average processing time and shows that the proposed method is faster in most cases than the state-of-the-art DC [3], which uses approximations to calculate relevance with fast belief propagation for the RWR algorithm. This performance gap is increasingly significant for larger graphs. In the experiment to measure similarity with varying levels of corruption, shown in (b) and (c), fluctuation in the running time of DC under arbitrary insertion/removal of edges is increasingly noticeable for larger graphs, whereas the proposed method has a relatively stable running time. In the similarity measurement experiment, DC and our method both show exponentially decreasing similarity as the number of arbitrarily altered edges increases. In our experiment, the ratio of inserted/deleted edges varies from 1% to 90%. When edge connections are altered hundreds of times at each ratio, DC is insensitive to arbitrary changes to edges and thus produces values in a narrow range. In contrast, similarity values produced by our method are sensitive to random insertion/removal of edges, reflecting structural changes in the graph. Real datasets [9]: Facebook (n=4039, m=88234, social network), Oregon (n=11461, m=32730, autonomous systems graph), Enron (n=36692, m=183831, communication network), AS Skitter (n=1696415, m=11095298, Internet topology graph).

steps for obtaining an inverse matrix using the Schur complement are applied to P′ of each graph to find matrix $R = c P'^{-1}$, which is given by

$$R = c \begin{bmatrix} S'^{-1} & -S'^{-1} P'_{12} P'^{-1}_{22} \\ -P'^{-1}_{22} P'_{21} S'^{-1} & P'^{-1}_{22} + P'^{-1}_{22} P'_{21} S'^{-1} P'_{12} P'^{-1}_{22} \end{bmatrix}.$$

The resulting matrix R has more accurate inter-node relevance values than the existing approximation of RWR [3]. The computation also finishes more quickly while requiring less space.
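The block expression for R can be checked against a direct inverse on a small example. The sketch below assumes a dense P′ with w hub nodes at the front and inverts P′22 with a general routine rather than block by block; the name `rwr_by_schur` and the toy graph are illustrative, and S′ is taken to be P′11 − P′12 P′22⁻¹ P′21, consistent with the block inverse formula of Section II-C.

```python
import numpy as np

def rwr_by_schur(P, w, c=0.15):
    """R = c P'^{-1}, assembled from blocks around S' = P'11 - P'12 P'22^{-1} P'21."""
    P11, P12, P21, P22 = P[:w, :w], P[:w, w:], P[w:, :w], P[w:, w:]
    P22_inv = np.linalg.inv(P22)                 # stand-in for the cheaper block diagonal inverse
    S_inv = np.linalg.inv(P11 - P12 @ P22_inv @ P21)
    R12 = -S_inv @ P12 @ P22_inv
    R21 = -P22_inv @ P21 @ S_inv
    R22 = P22_inv + P22_inv @ P21 @ S_inv @ P12 @ P22_inv
    return c * np.block([[S_inv, R12], [R21, R22]])

# Toy reordered graph: hub 0 linked to nodes 1..4, plus the edge 1-2 (illustrative data).
A_prime = np.zeros((5, 5))
A_prime[0, 1:] = 1.0
A_prime[1:, 0] = 1.0
A_prime[1, 2] = A_prime[2, 1] = 1.0
c = 0.15
P = np.eye(5) - (1.0 - c) * (A_prime / A_prime.sum(axis=1))
R = rwr_by_schur(P, w=1, c=c)
```

Only S′ (size w) and P′22 are inverted; the four blocks of R are then obtained by matrix products.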

F. Calculating Matrix Distance and Similarity Score

In the previous subsection, matrix R, which numerically indicates inter-node relevance in the context of the overall graph structure, was obtained from a block-partitioned matrix P′ by means of the Schur complement. Using a distance-measuring method between matrices R1 and R2 [6], the similarity between two graphs is calculated and numerically represented:

$$\mathrm{sim}(G_1, G_2) = \frac{1}{\sum_{i=1}^{2} \sum_{j=1}^{2} \mathrm{ED}(R_{1,ij}, R_{2,ij}) + 1}. \tag{5}$$

Note that the graph similarity, which reflects the overall graph structure, lies between 0 and 1, as governed by Eq. 5: each Euclidean distance (ED) is nonnegative, so the denominator is at least 1.
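Eq. 5 can be sketched as follows. The example assumes that ED between two blocks is the Frobenius norm of their difference and that both relevance matrices are partitioned at the same index w; the function name `similarity` and the two toy matrices are hypothetical.

```python
import numpy as np

def similarity(R1, R2, w):
    """Eq. 5: 1 / (sum of Euclidean distances between the four block pairs + 1)."""
    cuts = [slice(0, w), slice(w, None)]
    total = 0.0
    for rows in cuts:
        for cols in cuts:
            # Frobenius norm of the block difference (assumed form of ED)
            total += np.linalg.norm(R1[rows, cols] - R2[rows, cols])
    return 1.0 / (total + 1.0)

# Toy relevance matrices (illustrative data, not from the paper).
R1 = np.eye(3)
R2 = np.zeros((3, 3))
same = similarity(R1, R1, w=1)
different = similarity(R1, R2, w=1)
```

Identical matrices give a total distance of 0 and therefore a score of exactly 1, and any difference pushes the score below 1.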

TABLE III: Details of real-world datasets used [9]

properties             RWR   ours   DC   VEO   SS   GED   λ-D A   λ-D L   λ-D NL
edge importance        O     O      O    X     X    X     X       X       X
weight awareness       O     O      O    X     O    X     O       O       O
edge sub-modularity    O     O      O    O     O    O     O       O       O
focus awareness        O     O      O    -     -    -     -       -       -

Similarity measures [3]: DeltaCon (DC); Vertex/Edge Overlap (VEO); Signature Similarity (SS); Graph Edit Distance (GED); λ-D Adjacency (λ-D A); λ-D Laplacian (λ-D L); λ-D Normalized Laplacian (λ-D NL).

Fig. 5: Analysis of “SmartCard big data” by our approach. Panels: (a) similarity score versus time (hour, 6 to 23) for the weekday pairs Mon. vs Tue., Mon. vs Wed., Mon. vs Thu., Tue. vs Thu., Tue. vs Wed., Wed. vs Thu., Mon. vs Fri., Tue. vs Fri., Wed. vs Fri., and Thu. vs Fri.; (b) similarity score versus consecutive days with hour (Wed through Tue, ticks at 9, 15, and 21), drawn as a control chart with its mid-line, ±σ, and ±2σ, and with time zones of high and low similarity between days marked. (a) The day-of-week similarity comparison is the result of the static graph analysis, and it depicts similarities between the flows for different days of the week. From Mon to Fri, morning rush hours between 07 and 09 have more similar passenger flows, but the passenger flow pattern on Fri evening between 18 and 20 was noticeably different from that on the other weekdays, signifying that passengers on Fri go to areas that are different from where they usually go on the other weekdays. Manually tracing the passenger flow pattern on Fri verified that a significant portion of passengers do not return to the subway station from which they first left in the morning and instead go to stations near the downtown area, probably to enjoy Fri evening. (b) Monitoring the pattern of passenger movement over consecutive time slots shows the similarity of passenger flow for each time slot in a dynamic graph. Changes in the graph are depicted in a control chart, with its mid-line and ±σ, ±2σ. The similarity of graphs increases during morning rush hours on weekdays until most commuters seem to have arrived at work, and the graphs start to deviate from this point. In the afternoon after 15, the graphs become similar until the peak at 18. As the flow of passengers increases around 18, when commuters start returning home, the hourly graphs exhibit the highest similarity at that time. Late at night after 22, the number of passengers drops and passenger flow is irregular, so that the transportation graphs differ very much from each other. Unlike the passenger flow patterns during weekdays, the passenger flow does not change much during the daytime on Sat. Throughout Sun, the passenger flow pattern shows less similarity to the other days while being mixed with some irregularity.

IV. RESULTS AND DISCUSSION

We consider the four principled properties of graph similarity used in evaluating state-of-the-art methods [3]: edge importance, weight awareness, edge sub-modularity, and focus awareness. Existing measures for graph similarity listed in Table III and our method are given the same simple synthetic graph as in previous work [3]. We chart similarity measure results for each method and show that our method fulfills the four properties. Table III compares the experimental results of existing methods with ours in terms of the principled properties of graph similarity. Fig. 4 presents the result from

TABLE IV: Complexity analysis of proposed approach

Input: G1 and G2 from dynamic dataset, c probability of restarting
Output: sim(G1, G2) ∈ [0, 1]

action                                          time                                                                                              space
(A) represent graph set from raw dataset
(B) union graph                                 $O(m_1 + m_2) \approx O(2m)$                                                                      $O(|m_1 \cup m_2|)$
(C) reordering nodes                            $O(m + n \log n\,|T|)$ [13]                                                                       $O(n)$ [13]
(D) forming matrix P′ for RWR                   $O(mn + n)$                                                                                       $O(m + n)$
(E) calculating RWR by Schur complement
  (E-1) compute $P'^{-1}_{22}$                  $O(\sum_{l=1}^{q} n_{(l)}^3)$                                                                     $O(\sum_{l=1}^{q} n_{(l)}^2)$
  (E-2) compute S′                              $O(w^2 + n^2 (\sum_{l=1}^{q} n_{(l)}^2))$                                                         $O(w^2)$
  (E-3) compute R11                             $O(w^3)$                                                                                          $O(w^2)$
        compute R12                             $O(w^2 n (\sum_{l=1}^{q} n_{(l)}^2))$                                                             $O(nw)$
        compute R21                             $O((\sum_{l=1}^{q} n_{(l)}^2)\, n w^2)$                                                           $O(nw)$
        compute R22                             $O(\sum_{l=1}^{q} n_{(l)}^2 + (\sum_{l=1}^{q} n_{(l)}^2)\, n^2 w^2 (\sum_{l=1}^{q} n_{(l)}^2))$   $O((n - w)^2)$
(F) calculating similarity score                $O(n^2 + 1)$                                                                                      $O(1)$

The time complexity of our method according to the above is $O(nm + n \log n\,|T| + w^2 + w^3 + \sum_{l=1}^{q} n_{(l)}^3 + n(\sum_{l=1}^{q} n_{(l)}^2 (w^2 + n) + \sum_{l=1}^{q} n_{(l)}^2 (1 + n^2 w^2 \sum_{l=1}^{q} n_{(l)}^2)))$ and the space complexity is $O(|m_1 \cup m_2| + \sum_{l=1}^{q} n_{(l)}^2 + w^2 + nw + (n - w)^2)$.

∗ $m_1 = |E_1|$ of G1, $m_2 = |E_2|$ of G2; T is the number of iterations of SlashBurn; sparse matrix multiplication (A′ × D′) is $O(mn)$; $n_{(l)}$ (≪ n) is |V| of the lth block diagonal matrix; inversion of an [n × n] matrix: $O(n^3)$; inversion of a q-block diagonal matrix: $O(\sum_{l=1}^{q} n_{(l)}^3)$.

applying the proposed method to processing real-world data and compares it with existing state-of-the-art methods. Refer to the caption for Fig. 4 for more details.

We showcase a few applications of our method to real-world datasets in Fig. 5 to suggest the possibility of applications in various fields. We collected the usage data of a public transport SmartCard for a large metropolitan area between July 10 and 16, 2013. For tens of millions of transit passengers each day, the dataset contains passenger identifiers, departure points, departure times, arrival points, arrival times, and accompanying passengers. For analysis, we approached the dataset with both dynamic and static graphing. The flows of passengers through subway stations in this area and its vicinity were put in slots of 10-minute periods. One static graph was created for each time slot covering the seven-day period, and we identified patterns of passenger flow for the same time slots on different days of the week. We also traced the dynamic change in the graphs over the course of a day to find out how passenger flow differs from day to day within a week. More details can be found in the caption for Fig. 5.
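Turning trip records into one graph per time slot can be sketched as below. The record layout (departure station index, arrival station index, departure minute of the day), the directed passenger-count weighting, and the name `slot_graphs` are all assumptions made for illustration; the actual SmartCard schema and graph construction are not specified here.

```python
import numpy as np

def slot_graphs(trips, n_stations, slot_minutes=10, minutes_per_day=24 * 60):
    """Build one weighted adjacency matrix per time slot from (origin, destination, minute) trips."""
    n_slots = minutes_per_day // slot_minutes
    graphs = np.zeros((n_slots, n_stations, n_stations))
    for origin, destination, minute in trips:
        # assumed edge weight: number of passengers moving from origin to destination in the slot
        graphs[minute // slot_minutes, origin, destination] += 1.0
    return graphs

# Hypothetical records: (departure station, arrival station, departure minute of the day).
trips = [(0, 1, 485), (0, 1, 489), (2, 1, 487), (1, 0, 1085)]
graphs = slot_graphs(trips, n_stations=3)
```

Graphs for the same slot on different days can then be compared as static graphs, and consecutive slots within a day as a dynamic graph.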

Table IV lists the complexity of the proposed approach. The analysis assumes that the input graph possesses the characteristics of real-world datasets as in scale-free networks.
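To make the block-wise inversion cost in Table IV concrete, the following sketch solves a small RWR system by block elimination with a Schur complement, inverting the block-diagonal part one block at a time (cost on the order of $\sum_{l} n_{(l)}^{3}$ rather than $n^{3}$). It is an illustrative sketch under assumptions, not the authors' implementation: the names `H11`, `H12`, `H21`, `H22`, the toy graph, and the restart probability `c = 0.15` are chosen here and do not necessarily correspond to the quantities $S'$ and $R_{11}, \ldots, R_{22}$ in the table.

```python
import numpy as np


def invert_block_diagonal(blocks):
    """Invert each diagonal block separately and reassemble the result."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    i = 0
    for b in blocks:
        k = b.shape[0]
        out[i:i + k, i:i + k] = np.linalg.inv(b)
        i += k
    return out


def solve_via_schur(h11_blocks, H12, H21, H22, b1, b2):
    """Solve [[H11, H12], [H21, H22]] x = [b1, b2] with H11 block diagonal."""
    H11_inv = invert_block_diagonal(h11_blocks)
    S = H22 - H21 @ H11_inv @ H12  # Schur complement of H11
    x2 = np.linalg.solve(S, b2 - H21 @ H11_inv @ b1)
    x1 = H11_inv @ (b1 - H12 @ x2)
    return np.concatenate([x1, x2])


# Toy undirected graph: two 2-node groups {0,1} and {2,3}, plus hub node 4
# connected to every other node. Ordering the hub last makes the leading
# 4x4 part of H block diagonal.
A = np.zeros((5, 5))
for u, v in [(0, 1), (2, 3), (0, 4), (1, 4), (2, 4), (3, 4)]:
    A[u, v] = A[v, u] = 1.0

c = 0.15                         # restart probability (assumed value)
A_norm = A / A.sum(axis=0)       # column-normalized adjacency matrix
H = np.eye(5) - (1.0 - c) * A_norm
q = np.zeros(5)
q[0] = 1.0                       # restart at node 0
b = c * q

r_direct = np.linalg.solve(H, b)  # reference: solve the full system
r_schur = solve_via_schur(
    [H[0:2, 0:2], H[2:4, 2:4]],   # diagonal blocks of H11
    H[:4, 4:], H[4:, :4], H[4:, 4:],
    b[:4], b[4:],
)
```

Both routes return the same relevance vector. The block-wise route only ever inverts the small diagonal blocks and the Schur complement, whose size equals the number of hub nodes.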

We believe that the proposed method provides a significant improvement in terms of time and space complexity over existing state-of-the-art methods, especially when the size of the graph is large. To accommodate very large graphs with hundreds of millions of nodes, our future work includes the consideration of approximation and parallelization [14] along with graph sampling methods.

V. CONCLUSION

The purpose of this work is to offer an accurate and efficient similarity measure for dynamically changing large-scale graphs and structurally comparable graphs. The proposed method relies on the idea of intergraph compression, with the RWR algorithm implemented via the Schur complement, and shows significant improvements in terms of time and space complexity for measuring graph similarity over existing state-of-the-art methods. Using the proposed method, medium-sized dynamic graphs can be scrutinized for anomalies in real time, and structural similarity for graphs with millions of nodes can be measured in a relatively short period of time for applications such as grouping nodes for in-depth analysis. Big data has been an emerging keyword in recent years, and the proposed method offers an adequate basis for efficiently examining large-scale real-world data in graph form. Our analysis of real-world usage data of the public transportation SmartCard in a large metropolitan area showcases a successful application of the method, which can further be employed in analyzing data from various fields.

ACKNOWLEDGMENT

This work was supported by the NRF grant funded by the Korea government (Ministry of Science, ICT and Future Planning, MSIP) [No. 2011-0009963], by the ICT R&D program of MSIP/ITP [14-824-09-014, Basic Software Research in Human-level Lifelong Machine Learning], by the Brain Korea 21 Plus Project in 2015, and by SAP Labs Korea.

REFERENCES

[1] D. J. Cook and L. B. Holder, Mining graph data. John Wiley & Sons, 2006.

[2] S. Yoon, J. C. Ebert, E.-Y. Chung, G. De Micheli, and R. B. Altman, “Clustering protein environments for function prediction: finding prosite motifs in 3d,” BMC bioinformatics, vol. 8, no. Suppl 4, p. S10, 2007.

[3] D. Koutra, J. T. Vogelstein, and C. Faloutsos, “Deltacon: A principled massive-graph similarity function,” arXiv:1304.4657, 2013.

[4] H. Tong, C. Faloutsos, and J.-Y. Pan, “Fast random walk with restart and its applications,” 2006.

[5] V. Bhaskaran and K. Konstantinides, Image and video compression standards: algorithms and architectures. Springer Science & Business Media, 1997.

[6] G. H. Golub and C. F. Van Loan, Matrix computations, vol. 3. JHU Press, 2012.

[7] K. Shin, J. Jung, S. Lee, and U. Kang, “Bear: Block elimination approach for random walk with restart on large graphs,” in Proceedings of ACM SIGMOD, pp. 1571–1585, 2015.

[8] S. Soundarajan, T. Eliassi-Rad, and B. Gallagher, “A guide to selecting a network similarity method,” SIAM, 2014.

[9] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection.” http://snap.stanford.edu/data, June 2014.

[10] A. N. Langville and C. D. Meyer, Google’s PageRank and beyond: the science of search engine rankings. Princeton University Press, 2011.

[11] Y. Fujiwara, M. Nakatsuji, T. Yamamuro, H. Shiokawa, and M. Onizuka, “Efficient personalized pagerank with accuracy assurance,” in Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 15–23, ACM, 2012.

[12] D. Koutra, T.-Y. Ke, U. Kang, D. H. P. Chau, H.-K. K. Pao, and C. Faloutsos, “Unifying guilt-by-association approaches: Theorems and fast algorithms,” in Machine Learning and Knowledge Discovery in Databases, pp. 245–260, Springer, 2011.

[13] Y. Lim, U. Kang, and C. Faloutsos, “Slashburn: Graph compression and mining beyond caveman communities,” 2013.

[14] Y. Jeon and S. Yoon, “Multi-threaded hierarchical clustering by parallel nearest-neighbor chaining,” IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 9, pp. 2534–2548, 2015.
