

Subgraph Matching: on Compression and Computation

Miao Qiao
Massey University, New Zealand
m.qiao@massey.ac.nz

Hao Zhang, Hong Cheng
The Chinese University of Hong Kong, Hong Kong
{hzhang,hcheng}@se.cuhk.edu.hk

ABSTRACT

Subgraph matching finds a set $I$ of all occurrences of a pattern graph in a target graph. It has a wide range of applications but suffers from expensive computation. This efficiency issue has been studied extensively. All existing approaches, however, turn a blind eye to the output crisis: when the system has to materialize $I$ as a preprocessing/intermediate/final result or an index, the cost of exporting $I$ dominates the overall cost, which could be prohibitive even for a small pattern graph.

This paper studies subgraph matching via two problems. 1) Is there an ideal compression of $I$? 2) Will the compression of $I$ in turn boost the computation of $I$? For problem 1), we propose a technique called VCBC to compress $I$ to $\mathrm{code}(I)$, which serves effectively the same as $I$. For problem 2), we propose a subgraph matching computation framework CBF, which computes $\mathrm{code}(I)$ instead of $I$ to bring down the output cost. CBF further reduces the overall cost by reducing the intermediate results. Extensive experiments show that the compression ratio of VCBC can be up to $10^5$, which also significantly lowers the output cost of CBF, and that CBF outperforms existing approaches.

PVLDB Reference Format:

Miao Qiao, Hao Zhang, Hong Cheng. Subgraph Matching: on Compression and Computation. PVLDB, 11(2): 176-188, 2017.
DOI: 10.14778/3149193.3149198

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 44th International Conference on Very Large Data Bases, August 2018, Rio de Janeiro, Brazil.
Proceedings of the VLDB Endowment, Vol. 11, No. 2
Copyright 2017 VLDB Endowment 2150-8097/17/10... $10.00.
DOI: 10.14778/3149193.3149198

1. INTRODUCTION

The subgraph matching of a pattern graph $p$ on a target graph $d$ reports the set $I_p$ of all the subgraphs of $d$ that are isomorphic to $p$. This problem underpins various analytical applications based on the significant role graphs play in modelling the interconnectivity of objects in areas such as biology, chemistry, communication, transportation and social science. For example, by letting pattern graphs have semantic/statistical meanings, subgraph matching is used to monitor terrorist cells in activity networks [10], identify properties of recommendation/social networks [18, 23], and decode functions of biological networks [5]. Subgraph matching naturally becomes a fundamental construct of the query language of graph databases such as Neo4j, AgensGraph and SAP HANA.

Unfortunately, the computation of subgraph matching is NP-complete [11]. The basic approach is a brute-force search over all the subgraphs of $d$. Ullmann's backtracking algorithm [30] has sparked studies on different searching orders, pruning rules and neighborhood indexes (see [22] as an entry point). However, these techniques assume that the target graph fits into the memory of a machine, which does not hold for many real graphs nowadays.¹ This fact has motivated research on two approaches: using external memory and using a cluster of machines. An issue common to both approaches is how to arrange the materialization caused by the memory limit.

The first approach [9, 16, 17, 25, 26] is investigated under the external memory (EM) model [3], where cost is defined as the total number of I/Os performed. An I/O transfers a block of $B$ words between the main memory and the disk. Subgraph matching has two settings in the EM model: subgraph listing [9] and subgraph enumeration [26]. Subgraph listing requires the system to materialize $I_p$ whereas subgraph enumeration does not. Such a distinction separates the output cost, that is, the $\Theta(|I_p|/B)$ I/Os of exporting $I_p$ to the disk, from the enumeration cost, that is, the cost of subgraph enumeration [16, 26].

The second approach is to study subgraph matching [1, 2, 19, 20, 21, 27, 29] on parallel computing platforms such as MapReduce. Brute-force search algorithms for subgraph matching are parallelized in two styles, BFS and DFS, which differ in whether intermediate results are materialized.

BFS-style algorithms [20, 21, 29] are iterative. In the final iteration, $I_p$ is computed from an intermediate result $I_{p'}$ of the previous iteration, namely the instance set of another pattern graph $p'$. $p'$ is normally smaller than $p$ by a node or an edge. Such a process applies unless $p$ has only one node/edge. The system must materialize and shuffle $I_{p'}$ to initiate the computation of $I_p$. This is a severe burden: shuffle is the most expensive operation in a parallel system such as MapReduce.

DFS-style solutions [1, 2, 19, 27] do not materialize intermediate results. The target graph is partitioned, replicated and shuffled before the one-round parallel computation takes place. DFS-style solutions have some theoretical analysis [2], but their practical performance on real target graphs may not be appealing [20] compared to BFS-style solutions.

¹ Consider Facebook as an example: with $10^9$ daily active users (http://newsroom.fb.com/company-info/) and an average of 190 friends per user (http://arxiv.org/abs/1111.4503), the graph requires 1.6 petabytes of storage.

Though the instance set $I_p$ of a subgraph matching may be massive in this big data era, its materialization could be demanded or even inevitable in practice. This is especially true when subgraph matching is the basic form of a query in a graph database system such as Neo4j. A traditional database materializes views for query optimization, which, in the context of a graph database, means materializing the instance set of a subgraph query. This practice avoids repetitive computations of frequent queries and common sub-queries, saves system resources, shortens query delay and enhances concurrency. Besides, BFS-style parallelisms inevitably materialize $I_p$. A persistent $I_p$ is also demanded when subgraph matching serves as a preprocessing/intermediate step of an application [10, 18, 23, 5]; otherwise any unexpected error will trigger a re-computation of $I_p$, which could be even more expensive than materializing $I_p$.

When the system has to materialize the instance set $I_p$ as a preprocessing result, intermediate result, index, or final result, etc., existing solutions turn a blind eye to the output crisis of subgraph matching: the $\Theta(|I_p|/B)$ I/Os of listing $I_p$ to the disk become a lower bound of the overall cost no matter how deftly one computes $I_p$. This observation has led us to investigate subgraph matching via two problems:

1. Is there an ideal compression on the instance set $I_p$?

2. Will the compression of $I_p$ in turn boost the computation of subgraph matching?

Our contributions. This is the first attempt in the literature to resolve the output crisis of subgraph matching using output compression. Output compression is orthogonal to input compression techniques [14], which focus on reducing the size of the target graph in subgraph matching.

This paper proposes the vertex-cover based compression (VCBC) technique to compress $I$ to $\mathrm{code}(I)$. VCBC features an impressive compression ratio, that is, the size of $\mathrm{code}(I)$ is significantly smaller than that of $I$. Moreover, $\mathrm{code}(I)$ serves effectively the same as a materialized $I$, that is, the decompression process of VCBC restores $I_p$ in a streamed manner from $\mathrm{code}(I)$ in $\Theta(|I_p|/B)$ I/Os. VCBC, together with general compression techniques, provides an effective storage solution for subgraph matching. Such a storage solution is desirable in three cases. 1) $I_p$ is so large that existing solutions cannot afford to materialize it. 2) The materialization of $I_p$ constitutes the performance bottleneck of an algorithm. 3) Access to $I_p$ is not efficient enough unless $I_p$ is placed on a faster yet more expensive medium, for example, SSD or the main memory.

A perhaps more interesting contribution is the Crystal-Based computation Framework (CBF). CBF reduces the overall cost of subgraph matching by materializing $\mathrm{code}(I_p)$ instead of $I_p$. Such a reduction is significant especially when the output cost is the bottleneck of the subgraph matching computation. Moreover, in terms of enumeration, that is, computing $I_p$ without materializing the result, CBF outperforms the existing approaches by up to orders of magnitude. In particular, CBF excels in matching complex pattern graphs against dense target graphs where all existing solutions fail, as will be shown in our empirical studies.

Table 1: Notations

Symbol                  Description
$p$, $d$                The pattern graph $p$ and target graph $d$.
$n_p$, $m_p$            $n_p = |V(p)|$, $m_p = |E(p)|$.
$g(V')$                 The induced subgraph of $g$ on vertex set $V'$.
$\mathrm{code}(\cdot)$  The compressed code of a piece of data.
()                      The compression ratio: Equation 1.
$I_p$                   The instance set of $p$: the set of subgraphs of $d$ that are isomorphic to $p$.
$f_g$                   The instance-bijection of instance $g \in I_p$.
$\mathrm{ord}_p$        The order on $V(p)$ for symmetry breaking.
$H_{V_c}(g)$            The helve of instance $g$: $f_g(u)$ for all $u \in V_c$.
$H(I_p)$                The set of helves of instances in $I_p$.
$\mathrm{Img}_p(u|h)$   $\{f_g(u) \mid g \in I_p|h\}$ of a node $u$.
$I_p|h$                 The set of instances in $I_p$ with helve $h$.
$\{V_c, , P\}$          A core-crystal decomposition of $p$.
$V_c$                   A vertex cover of $p$.
$\mathrm{core}(p)$      $p(V_c)$, the induced subgraph of $p$ on $V_c$.
$\overline{V_c}$        The complement of $V_c$, that is, $V(p) \setminus V_c$.
$P$                     $p_1, p_2, \ldots$, subgraphs of $p$, where each $p_i$ is a crystal $Q_{x_i,y_i}$.
$Q_{x,y}$               A graph with $y$ nodes fully connected to a $C_x$.
$C_x$                   A clique of size $x$.
$M$                     Size of the main memory.
$B$                     Size of a disk block.
,                       Two constants defined in the assumption.

Organization. Section 2 formally defines subgraph matching and the two problems to be addressed in this paper. Section 3 studies the compression problem, while Section 4 investigates the computation problem. Section 5 surveys related work. Section 6 evaluates our techniques via extensive experimentation. Section 7 concludes the paper.

2. PRELIMINARIES

We now formally introduce all the definitions. Table 1 aggregates all the notations used in the paper.

2.1 Subgraph Matching

This paper focuses on subgraph matching on unlabeled and undirected graphs. A graph $g$ consists of a set $V(g)$ of vertices and a set $E(g)$ of edges. A vertex is also called a node. An edge $e(u, v)$ connects two vertices $u$ and $v$ in $V(g)$; $e(u, v)$ is incident to both $u$ and $v$. The degree of a node $v$ is the total number of edges incident to $v$. A graph $g$ is a clique if for every pair $u, v$ of nodes in $V(g)$, edge $(u, v) \in E(g)$. A clique of size $k$ is denoted as $C_k$.

Let $g_1$ and $g_2$ be two graphs. The intersection $g_1 \cap g_2$ of $g_1$ and $g_2$ is a graph with vertex set $V(g_1) \cap V(g_2)$ and edge set $E(g_1) \cap E(g_2)$. If $g_1 \cap g_2 = g_1$, then $g_1$ is a subgraph of $g_2$. The induced subgraph $g(V')$ of a graph $g$ on a vertex set $V'$ is a graph with vertex set $V' \cap V(g)$ and edge set $E(g)|_{V'}$, where $E(g)|_{V'} = E(g) \cap (V' \times V')$.

Definition 1 (Graph Isomorphism [12]). Given two graphs $g_1$ and $g_2$, an isomorphism from $g_1$ to $g_2$ is a bijection $f : V(g_1) \to V(g_2)$ such that $(u, v) \in E(g_1)$ if and only if $(f(u), f(v)) \in E(g_2)$. If there is an isomorphism from $g_1$ to $g_2$, then we say $g_1$ is isomorphic to $g_2$.
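As a hedged illustration of Definition 1, the sketch below checks whether a given mapping is an isomorphism. The function name `is_isomorphism`, the dictionary encoding of $f$ and the two toy paths are assumptions for illustration only. For a bijection on vertices, the "if and only if" condition on edges is equivalent to the image of $E(g_1)$ being exactly $E(g_2)$, which is what the code tests.

```python
# Hypothetical representation: a graph is (vertex_set, edge_set) with
# undirected edges stored as frozensets; f is a dict from V(g1) to V(g2).

def edge(u, v):
    return frozenset((u, v))

def is_isomorphism(f, g1, g2):
    (v1, e1), (v2, e2) = g1, g2
    # f must be defined on all of V(g1), hit all of V(g2), and the two
    # vertex sets must have the same size: together this makes f a bijection.
    is_bijection = set(f) == v1 and set(f.values()) == v2 and len(v1) == len(v2)
    if not is_bijection:
        return False
    # (u, v) in E(g1) iff (f(u), f(v)) in E(g2): map every edge and compare.
    return {frozenset(f[x] for x in e) for e in e1} == e2

# Two paths on three nodes (toy data, an assumption).
path_abc = ({"a", "b", "c"}, {edge("a", "b"), edge("b", "c")})
path_123 = ({1, 2, 3}, {edge(1, 2), edge(2, 3)})

good = is_isomorphism({"a": 1, "b": 2, "c": 3}, path_abc, path_123)
bad = is_isomorphism({"a": 2, "b": 1, "c": 3}, path_abc, path_123)
```

The first mapping preserves both edges, so `good` is true; the second maps edge (b, c) to the non-edge (1, 3), so `bad` is false.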

Figure 1: Target graph $d$ and pattern graph $p$. [Figure not reproduced; $d$ has nodes $v_1, \ldots, v_9$ and $p$ has nodes $u_1, \ldots, u_6$.]

Definition 2 (Graph Matching). For a given target graph $d$ and a given pattern graph $p$, subgraph matching reports the set $I_p$ of all the subgraphs of $d$ that are isomorphic to $p$. Denote $|V(p)|$ as $n_p$ and $|E(p)|$ as $m_p$.

A subgraph $g$ of $d$ is an instance of $p$ if it is isomorphic to $p$. In other words, $g \in I_p$ if and only if $g$ is an instance of $p$. We thus call $I_p$ the instance set of $p$.
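To make the notion of an instance set concrete, here is a minimal brute-force sketch in Python. It is only an illustration of the definition, not the paper's algorithm: the function name `instance_set` and the toy graphs are assumptions, and the search tries every injective mapping, so it is exponential in the pattern size. Each instance is recorded as a subgraph (vertex set plus edge set), so several isomorphisms onto the same subgraph count once.

```python
from itertools import combinations, permutations

def edge(u, v):
    return frozenset((u, v))

def instance_set(p, d):
    """All subgraphs of d isomorphic to p, as (vertex set, edge set) pairs."""
    (pv, pe), (dv, de) = p, d
    pattern_nodes = sorted(pv)
    found = set()
    # Try every injective mapping from V(p) into V(d).
    for image in permutations(sorted(dv), len(pattern_nodes)):
        f = dict(zip(pattern_nodes, image))
        mapped_edges = frozenset(frozenset(f[x] for x in e) for e in pe)
        # The mapped pattern is a subgraph of d iff all mapped edges exist in d.
        if mapped_edges <= de:
            found.add((frozenset(image), mapped_edges))
    return found

# Toy inputs (assumptions): a triangle pattern and a 4-clique target.
triangle = ({"a", "b", "c"}, {edge("a", "b"), edge("b", "c"), edge("a", "c")})
k4 = ({1, 2, 3, 4}, {edge(u, v) for u, v in combinations(range(1, 5), 2)})

triangles_in_k4 = instance_set(triangle, k4)
```

Each of the four 3-node subsets of the 4-clique yields one triangle, although six mappings reach each of them; this redundancy is what symmetry breaking removes.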

Example 1. We use a running example of subgraph matching on target graph $d$ and pattern graph $p$ in Figure 1.

Let $V' = \{v_1, v_2, \ldots, v_5\}$. $d(V')$ is the induced subgraph of $d$ on set $V'$. Subgraph $g$ with vertex set $V(g) = V' \cup \{v_6\}$ and edge set $E(g) = E(d(V')) \cup \{(v_2, v_6)\}$ is an instance of $p$ with an isomorphism $f$ that maps $v_i$ to $u_i$, for $i \in [1, 6]$.

One instance $g$ may have multiple isomorphisms to $p$. The standard technique of symmetry breaking (SimB) [15] validates exactly one isomorphism $f_g : V(p) \to V(g)$ for each instance $g$. $f_g$ is called the instance-bijection of $g$.

Specifically, SimB selects a set $\mathrm{ord}_p \subseteq V(p) \times V(p)$ of node pairs in the pattern graph. For each pair $\langle u, v \rangle$ in $\mathrm{ord}_p$, a partial order is imposed such that $u \prec v$. Besides, SimB defines an arbitrary total order on target graph nodes $V(d)$. By default, for $u, v \in V(d)$, $u < v$ if the identifier of $u$ is smaller than that of $v$. Given an instance $g \in I_p$, an isomorphism $f$ from $p$ to $g$ is valid if $f(u) < f(v)$ for any $u \prec v$. Each instance $g$ has exactly one valid isomorphism $f_g$ under $\mathrm{ord}_p$. $f_g$ is called the instance-bijection of $g$.
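The effect of symmetry breaking can be sketched in a few lines of Python. The pattern below (a center `c` with two interchangeable leaves `x` and `y`), the instance and the helper names `isomorphisms` and `instance_bijection` are assumptions chosen for illustration; they are not the graphs of Figure 1, and the exhaustive search is not how SimB is implemented in practice.

```python
from itertools import permutations

def edge(u, v):
    return frozenset((u, v))

def isomorphisms(p, g):
    """Yield every isomorphism from p to g as a dict V(p) -> V(g)."""
    (pv, pe), (gv, ge) = p, g
    if len(pv) != len(gv):
        return
    nodes = sorted(pv)
    for image in permutations(sorted(gv)):
        f = dict(zip(nodes, image))
        if {frozenset(f[x] for x in e) for e in pe} == ge:
            yield f

def instance_bijection(p, g, ord_p):
    """Return the single isomorphism with f(u) < f(v) for every <u, v> in ord_p."""
    valid = [f for f in isomorphisms(p, g)
             if all(f[u] < f[v] for u, v in ord_p)]
    if len(valid) != 1:
        raise ValueError("ord_p should leave exactly one valid isomorphism")
    return valid[0]

# Toy pattern (an assumption): leaves x and y are symmetric around center c.
p = ({"c", "x", "y"}, {edge("c", "x"), edge("c", "y")})
# Toy instance: path 1 - 2 - 3, with node identifiers as the total order.
g = ({1, 2, 3}, {edge(1, 2), edge(2, 3)})

all_isos = list(isomorphisms(p, g))
f_g = instance_bijection(p, g, {("x", "y")})
```

Without the order, two isomorphisms exist (swapping the images of `x` and `y`); the pair $\langle x, y \rangle$ keeps only the one that maps `x` to the smaller identifier.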

Example 2. In Figure 1, pattern graph $p$ uses $\mathrm{ord}_p = \{\langle u_4, u_5 \rangle\}$ for symmetry breaking. In Example 1, instance $g$ has an isomorphism $f$. $g$ has another isomorphism $f'$ which is the same as $f$ except for $f'(v_4) = u_5$ and $f'(v_5) = u_4$. $\mathrm{ord}_p$ invalidates $f'^{-1}$ since $f'^{-1}(u_4) > f'^{-1}(u_5)$ violates $u_4 \prec u_5$. The instance-bijection $f_g$ of $g$ under $\mathrm{ord}_p$ is $f_g(u_i) = v_i$, for all $i \in [1, 6]$.

A mapping function maps a source to its image. For an instance $g$ and its instance-bijection $f_g$, we call $f_g(u)$ the image of $u$ under $g$. We call $\mathrm{Img}_p(u) = \{f_g(u) \mid g \in I_p\}$ the image set of $u$ under $I_p$, where $I_p$ is the instance set of $p$.

Example 3. Example 2 shows the instance-bijection $f_g$ of $g$. $f_g(u_1) = v_1$, so the image of $u_1$ is $v_1$, and thus $v_1 \in \mathrm{Img}_p(u_1)$.
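The image set of a pattern node is simply the collection of its images over all instance-bijections. A minimal sketch, assuming each instance-bijection is given as a dictionary from pattern nodes to target nodes (the function name `image_sets` and the two sample bijections are hypothetical, not taken from Figure 1):

```python
from collections import defaultdict

def image_sets(instance_bijections):
    """Map each pattern node u to Img_p(u) = { f_g(u) : g in I_p }."""
    img = defaultdict(set)
    for f_g in instance_bijections:
        for u, v in f_g.items():
            img[u].add(v)
    return dict(img)

# Two hypothetical instance-bijections over pattern nodes u1 and u2.
bijections = [
    {"u1": "v1", "u2": "v2"},
    {"u1": "v3", "u2": "v2"},
]
img = image_sets(bijections)
```

In this toy input, `u1` has two distinct images while `u2` always maps to the same target node, so its image set has a single element.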

2.2 Assumptions

This paper discusses subgraph matching in the external memory (EM) model with two assumptions. In the EM model, an I/O transfers a block of $B$ words between the disk and the memory of a machine. The memory size is $M$ words. The cost is defined as the total number of I/Os performed. We assume that the pattern graph has $O(1)$ nodes and the target graph has $O(M)$ nodes. Specifically, we assume:

A1 $n_p = |V(p)| = O(1)$, that is, $n_p$ is smaller than some constant.
A2 $|V(d)| = O(M)$, that is, $|V(d)|$ is bounded by a constant multiple of $M$, with the constant smaller than 1, such that $V(d)$ fits in the main memory.
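The EM cost measure can be made tangible with a small calculation. The sketch below counts the block transfers needed to write a given number of words, and then the output cost of listing an instance set under the assumption, made here only for illustration, that each instance occupies $n_p$ words (one word per matched node). The function names and the sample numbers are hypothetical.

```python
def io_cost(num_words, block_words):
    """Number of I/Os to transfer num_words when one I/O moves block_words words."""
    # Integer ceiling division avoids floating-point rounding on large inputs.
    return (num_words + block_words - 1) // block_words

def listing_cost(num_instances, n_p, block_words):
    """I/Os to write out an instance set, assuming n_p words per instance."""
    return io_cost(num_instances * n_p, block_words)

# Hypothetical numbers: 10^9 instances of a 6-node pattern, blocks of 1024 words.
cost = listing_cost(10**9, 6, 1024)
```

With these sample numbers the listing alone takes several million I/Os, regardless of how the instances were computed, which is the output cost discussed in the introduction.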

2.3 D-Optimal Compression

A compression approach includes a compression algorithm and a decompression algorithm. Let $D$ be a piece of data. The code of $D$, denoted as $\mathrm{code}(D)$, is the compressed form of $D$. $D$ can be restored from $\mathrm{code}(D)$ if the compression is lossless. The compression ratio on $D$ is defined as

$$\frac{|\mathrm{code}(D)|}{|D|}. \qquad (1)$$
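As a hedged illustration of Equation 1 and of lossless compression, the sketch below uses Python's general-purpose `zlib` module as a stand-in compressor. This is not VCBC; `zlib` is swapped in only because it is readily available, and the sample data (a repeated line of node identifiers) is an assumption.

```python
import zlib

def compression_ratio(data: bytes, code: bytes) -> float:
    """Equation 1: |code(D)| / |D|, with sizes measured in bytes here."""
    return len(code) / len(data)

# Hypothetical data D: the same instance line repeated many times.
data = b"v1 v2 v3 v4 v5 v6\n" * 1000

code = zlib.compress(data)        # code(D) under the stand-in compressor
restored = zlib.decompress(code)  # lossless: D is recovered exactly
ratio = compression_ratio(data, code)
```

Because the data is highly repetitive, the code is much shorter than the data and the ratio falls well below 1; decompression returns the original bytes unchanged.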

In the EM model, any algorithm that lists $D$ needs $\Theta(|D|/B)$ I/Os. We thus define the notion of an optimal compression.

Definition 3 (D-Optimal Compression). A compression approach is d-optimal if the decompression is output-sensitive, that is, $D$ can be restored from $\mathrm{code}(D)$ in $\Theta(|D|/B)$ I/Os.

In other words, a d-optimal compression guarantees that $\mathrm{code}(D)$ serves effectively the same as a materialized $D$.

2...