- Home
- Documents
*A Method of Subgraphs Extraction in a Large Graph Database ... both graph topology and set...*

prev

next

out of 5

View

216Category

## Documents

Embed Size (px)

International Journal of Computer Applications (0975 8887)

Volume 175 No.5, October 2017

1

A Method of Subgraphs Extraction in a Large Graph

Database in a Distributed System

Ritu Yadav National Institute of Technology, Warangal

Bangalore, India

Samarth Varshney National Institute of Technology, Warangal

Bangalore, India

ABSTRACT Since many real applications such as web connectivity, social

networks, and so on, are emerging now-a-days, thus graph

databases have been commonly used as significant tools to

exemplify and query complex graph data wherein each vertex

in a graph usually contains information, which can be

modeled by a set of tokens or elements. The method for

subgraphs extraction by considering set similarity query over

a large graph database has already been proposed, which

retrieves subgraphs that are structurally isomorphic to the

query graph, and meanwhile satisfy the condition of vertex

pair matching with the (dynamic/fixed) weighted set

similarity in a centralized system. This paper explains the

efficient implementation of subgraphs extraction in a large

graph database in a distributed environment by considering

both vertex set similarity and graph topology which offers a

better price/performance ratio and increases availability using

redundancy when parts of a system fail than centralized

systems in case of a large dataset (i.e., a graph with

millions/billions of nodes wherein each node contains some

information) by performing parallel processing.

General Terms centralized systems, graph databases, parallel processing,

subgraphs extraction

Keywords apache spark, distributed systems, graph topology, set

similarity

1. INTRODUCTION With the increasing growth in the generation of information

now-a-days from a wide range of sources, graphs have been

the simplest and the compact way to represent the complex as

well as simple information. Thus, graph databases have been

commonly used as significant tools to exemplify and query

complex graph data wherein each vertex in a graph usually

contains information, which can be casted by a set of tokens

or elements. The method for subgraphs extraction by

considering both graph topology and set similarity over a

large graph database retrieves subgraphs of the same topology

(structurally isomorphic) as that of the query graph, and

simultaneously satisfy the condition of vertex pair matching

with the (dynamic/fixed) weighted set similarity. In order to

achieve better query performance, all the techniques are

designed for the distributed systems using Apache Spark since

distributed systems possess many advantages over the

centralized ones:

distributed systems offer a better price/performance ratio

than centralized systems

redundancy increases availability when parts of a system fail

applications that can easily be run simultaneously also offer

benefits in terms of faster performance vis--vis centralized

solutions

distributed systems can be extended through the addition of

components, thereby providing better scalability compared to

centralized systems.

A number of solutions were formulated and proposed for

subgraphs extraction over a large graph database [1], [2], [3],

[4],[6]. Suppose a query graph Q and a large graph G is given,

then a typical subgraph matching query retrieves those

subgraphs in G that exactly match with Q in terms of both

graph structure and vertex labels[4]. However, in some real

graph applications, each vertex often contains a rich set of

tokens or elements representing features of the vertex, and the

exact matching of vertex labels is sometimes not feasible or

practical.

The motivation examples show that subgraphs extraction by

considering both graph topology and set similarity queries are

very useful in many real-world applications. To the best of

our knowledge, no prior work studied the subgraphs

extraction problem under the semantic of structural

isomorphism and set similarity with dynamic element weights

(called dynamic weighted set similarity). However a

traditional weighted set similarity [11] that focuses on fixed

element weight has already been proposed which is actually a

special case of dynamic weighted set similarity.

For the methods considering exact subgraph matching query

for subgraphs extraction requires that all the vertices and

edges are matched exactly. The Ullmanns subgraph

isomorphism method [8] and VF2 [7] algorithm are costly for

large graphs since they do not utilize any index structure. A

distance-based multi-way join algorithm has also been

proposed by Zou et al. [3] for answering pattern match queries

over a large graph.

Each of the various methods formulated for subgraphs

extraction[9],[10] are inefficient in one or the other way when

the query is performed over a large graph database. Thus, we

proposed the method for subgraphs extraction in a large graph

database in a distributed system by taking into account both

one-to-one structural isomorphism and dynamic set similarity

of matching vertices.

Section 2 states the problem description in brief followed by

section 3 in which proposed system design is explained,

including the basic framework and all design modules and

their constraints including the flow diagram in distributed

environment. Section 4 presents the implementation details of

all modules, results, analysis and comparison of the results

with the existing method. Section 5 concludes the paper.

2. PROBLEM DESCRIPTION Given a large graph G represented by V(G),E(G), where V(G) is a set of vertices, E(G) is a set of edges is considered.

Each vertex v V(G) is associated with a set, S(v), of elements. Query graph Q is represented by V(Q),E(Q). The set of element domain is denoted by U, in which each element

a has a weight W(a) to indicate the importance of a. Weights

International Journal of Computer Applications (0975 8887)

Volume 175 No.5, October 2017

2

can change dynamically in different queries due to varying

requirements or evolving data in real applications.

Given: - A Query Graph, Q, with n vertices (u1 ,......,un)

- A Data Graph G

- A user-specified similarity threshold

To Find: A Subgraph match X of Q in G in a distributed

system

A subgraph match of Q is a subgraph X of G containing n

vertices of V(G), iff the following conditions hold:

1) There is a mapping function f, for each ui in V(Q) and vj V(G) (1 i n, 1jn) it holds that f(ui)=vj;

2) sim(S(ui),S(vj)) where S(ui) and S(vj) are the sets associated with ui and vj, respectively, and sim(S(ui),S(vj))

outputs a set similarity score between S(ui) and S(vj) ;

3) For any edge (ui , uk) E(Q), there is (f(ui),f(uk)) E(G) (1 k n).

The query retrieves all subgraph matches of Q in graph G

under the semantic of the set similarity. Any similarity

method can be used, we have used weighted Jaccard

Similarity [1].

3. PROPOSED MODEL

3.1 Framework All the phases of the framework are designed for the

distributed systems using Apache Spark. In the filtering phase,

we find frequent patterns of element sets of vertices in data

graph G. Then, data vertices are encoded into signatures, and

organized into Distributed Hash Tables. In order to reduce the

search space, an efficient two-phase pruning strategy is

proposed based on the frequent patterns found in the filtering

phase and signature. In the refinement phase, a dominating set

(DS)-based subgraph matching algorithm to find subgraph

matches with set similarity. A dominating set selection

method is proposed to select a cost-efficient dominating set of

the query graph.

In brief, the following contributions are made:

Frequent patterns of element sets of vertices and a structural

signature-based Distributed Hash Table (DHT) are first

constructed. A set of pruning techniques (vertical and

horizontal pruning) are performed and integrated together to

greatly reduce the search space of subgraphs extraction

queries.

Set similarity pruning techniques (vertical, horizontal, anti-

monotone) are proposed that utilize the frequent patterns over

the element sets of data vertices to evaluate dynamic weighted

set similarity. Anti-monotone principle is also applied to

achieve high pruning power.

Structure-based pruning techniques are proposed based on

the signature buckets.

Subgraphs extraction is done based on the dominating set of

query graph. When filling up the non-dominating vertices of

the graph, a distance preservation principle is devised to prune

candidate vertices that do not preserve the distance to

dominating vertices.

Thus, our approach can effectively and efficiently extract the

subgraphs along with set similarity and structural similarity in

a large graph database. Steps in Flow Chart:

1. Frequent patterns of element sets of vertices in data graph

G. (Filtering Phase)

2. Data vertices are encoded into signatures.

3. After encoding da