Parallel Subgraph Listing in a Large-Scale Graph
Yingxia Shao, Bin Cui, Lei Chen, Lin Ma, Junjie Yao, Ning Xu
School of EECS, Peking University; Hong Kong University of Science and Technology



Page 2: Parallel  Subgraph  Listing in a Large-Scale Graph

Outline

• Subgraph listing operation
• Related work
• PSgL framework
• Evaluation
• Conclusion

Page 3: Parallel  Subgraph  Listing in a Large-Scale Graph

Motivation

• Motif Detection in Bioinformatics
• Cascades Counting in RN
• Triangle Counting in SN

Introduction

Page 4: Parallel  Subgraph  Listing in a Large-Scale Graph

Problem Definition

Subgraph Listing Operation
o Input: pattern graph, data graph [both are undirected]
o Output: all the occurrences of the pattern graph in the data graph.

Goal of our work
o Efficiently list subgraphs in a large-scale graph.

[Figure: an example pattern graph with vertices 1 to 4 and an example data graph with vertices 1 to 6.]
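To make the operation concrete, the following is a minimal brute-force sketch in Python. It is not the PSgL algorithm, only a reference for what "all the occurrences" means. Both edge lists are assumptions, since the slide shows the graphs only as pictures: the pattern is taken to be a 4-cycle, and the data graph edges are picked to be consistent with the example instances {2,3,4,5} and {1,5,6,?} used on later slides.

```python
from itertools import permutations

# Hypothetical edge lists: the slide shows both graphs only as pictures.
PATTERN_EDGES = [(1, 2), (2, 3), (3, 4), (4, 1)]          # assumed 4-cycle
DATA_EDGES = [(1, 2), (1, 5), (1, 6), (2, 3), (2, 5),
              (3, 4), (4, 5), (5, 6)]                      # assumed data graph


def list_subgraphs(pattern_edges, data_edges):
    """Return every occurrence of the pattern as a frozenset of data edges."""
    pattern_vertices = sorted({v for e in pattern_edges for v in e})
    data_vertices = sorted({v for e in data_edges for v in e})
    data_edge_set = {frozenset(e) for e in data_edges}

    occurrences = set()
    # Try every injective mapping of pattern vertices to data vertices.
    for image in permutations(data_vertices, len(pattern_vertices)):
        mapping = dict(zip(pattern_vertices, image))
        mapped = {frozenset((mapping[u], mapping[v])) for u, v in pattern_edges}
        if mapped <= data_edge_set:
            # Automorphic mappings yield the same edge set, so the set
            # keeps each occurrence once.
            occurrences.add(frozenset(mapped))
    return occurrences


occurrences = list_subgraphs(PATTERN_EDGES, DATA_EDGES)
```

The cost of trying every injective mapping grows very quickly with the size of the data graph, which is the reason the rest of the deck looks at traversal-based, parallel listing.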

Page 5: Parallel  Subgraph  Listing in a Large-Scale Graph

Related Work

• Centralized algorithms
  o Enumerate one by one [Chiba ’85, Wernicke ’06, Grochow ’07]
• Streaming algorithms
  o Only counting, and results are inaccurate [Buriol ’06, Bordino ’08, Zhao ’10]
• MapReduce-based parallel algorithms
  o Decompose pattern graph + explicit join operation [Afrati ’13]
  o Fixed exploration plan + implicit join operation [Plantenga ’13]
• Other efficient algorithms for specific pattern graphs
  o Triangle [Suri ’11, Chu ’11, Hu ’13]

Page 6: Parallel  Subgraph  Listing in a Large-Scale Graph

Drawbacks in existing parallel solutions

• MapReduce is not well suited to processing graphs.
• The join operation is expensive.
• They do not take care of the balance of data distribution.
  o Data graph
  o Intermediate results

The novel PSgL framework lists subgraphs via graph traversal on a native graph stored in memory.

Page 7: Parallel  Subgraph  Listing in a Large-Scale Graph

Contributions

• We propose an efficient parallel subgraph listing framework, PSgL.
• We introduce a cost model for the subgraph listing in PSgL.
• We propose a simple but effective workload-aware distribution strategy, which helps PSgL achieve good workload balance.
• We design three independent mechanisms to reduce the size of intermediate results.

Page 8: Parallel  Subgraph  Listing in a Large-Scale Graph

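Partial subgraph instance

• A data structure that records the mapping between the pattern graph and the data graph.
• Denoted by Gpsi.
• Assume the vertices of Gp are numbered from 1 to |Vp|; we simply state Gpsi as {map(1), map(2), ..., map(|Vp|)}.
• A ? marks a pattern vertex that has not been mapped to a data vertex yet.

[Figure: the pattern graph (vertices 1 to 4) and three copies of the data graph (vertices 1 to 6), labeled with the Gpsis {?,?,?,?}, {2,3,4,5} and {1,5,6,?}.]

Preliminaries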
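A Gpsi can be held in a very small structure. The sketch below is one possible Python encoding, not the representation used inside PSgL; the function names are made up for illustration, and None stands in for the "?" placeholder.

```python
from typing import Optional, Tuple

# A Gpsi is stored as a tuple indexed by pattern vertex (1-based on the
# slide, 0-based here); None plays the role of the "?" placeholder.
Gpsi = Tuple[Optional[int], ...]


def empty_gpsi(num_pattern_vertices: int) -> Gpsi:
    """Return the Gpsi in which no pattern vertex is mapped."""
    return (None,) * num_pattern_vertices


def map_vertex(gpsi: Gpsi, pattern_vertex: int, data_vertex: int) -> Gpsi:
    """Return a new Gpsi with map(pattern_vertex) = data_vertex."""
    index = pattern_vertex - 1
    if gpsi[index] is not None:
        raise ValueError(f"pattern vertex {pattern_vertex} is already mapped")
    if data_vertex in gpsi:
        raise ValueError(f"data vertex {data_vertex} is already used")
    return gpsi[:index] + (data_vertex,) + gpsi[index + 1:]


def is_complete(gpsi: Gpsi) -> bool:
    """A Gpsi is complete once every pattern vertex has a data vertex."""
    return None not in gpsi


def show(gpsi: Gpsi) -> str:
    """Render a Gpsi in the {map(1), map(2), ...} notation of the slide."""
    return "{" + ",".join("?" if v is None else str(v) for v in gpsi) + "}"


partial = empty_gpsi(4)
for pattern_vertex, data_vertex in [(1, 1), (2, 5), (3, 6)]:
    partial = map_vertex(partial, pattern_vertex, data_vertex)
```

Here `show(partial)` renders the partial instance {1,5,6,?} from the slide, and `is_complete((2, 3, 4, 5))` holds for the fully mapped one.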

Page 9: Parallel  Subgraph  Listing in a Large-Scale Graph

Independence Property

• Gpsi tree
  o A node is a Gpsi.
  o The children of a node are derived from expanding one mapped data vertex in the node.
• Characteristics
  o A Gpsi encodes a set of results.
  o Gpsis are independent from each other, except the ones in the same generation path.

[Figure: Gpsi tree. Root: {?,?,?,?}. Level 1: {1,?,?,?}, {2,?,?,?}, {4,?,?,?}, {6,?,?,?}. Level 2: {2,1,?,5}, {2,1,?,3}, {2,3,?,5}, {6,1,?,5}, {6,5,?,1}.]

Page 10: Parallel  Subgraph  Listing in a Large-Scale Graph

PSgL: Parallel Subgraph Listing Framework

• PSgL follows the popular graph processing paradigm
  o vertex-centric model
  o BSP model
• PSgL iteratively generates Gpsis in parallel.
• Each Gpsi is expanded by a data vertex.

[Figure: in each iteration, Gpsis are expanded on partitions P1, P2 and P3, held by Worker-1, Worker-2 and Worker-3.]

PSgL

Page 11: Parallel  Subgraph  Listing in a Large-Scale Graph

Algorithm of Expanding a Gpsi - I

• Partial Pattern Graph (Gpp) encodes
  o pattern graph (Gp),
  o Gpsi,
  o progress state.
• Three types of vertices
  o A BLACK vertex is one which has been expanded.
  o A GRAY vertex has a mapped data vertex, but it has not been expanded.
  o A WHITE vertex is one which hasn’t been mapped to any data vertex.

[Figure: Gp + Gpsi = { 3, 5, ?, 2 } gives Gpp, whose vertices are labeled <1,3>, <2,5>, <3,?> and <4,2>.]

PSgL Vertex program

Page 12: Parallel  Subgraph  Listing in a Large-Scale Graph

Algorithm of Expanding a Gpsi - II

• Main logic
  o Changes one GRAY vertex into BLACK;
  o Validates the expanding vertex’s GRAY neighbors;
  o Makes the expanding vertex’s WHITE neighbors become GRAY.
• Two observations
  o In each expansion, at least one pattern vertex is processed.
  o All GRAYs are valid candidates for the next expansion.

[Figure: Gp + Gpsi = { 3, 5, ?, 2 } gives Gpp with vertices <1,3>, <2,5>, <3,?> and <4,2>. Example: expanding vertex <4, 2>.]
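The three steps of the main logic can be sketched in Python as follows. This is a single-process illustration under stated assumptions, not the PSgL implementation: the pattern (a 4-cycle) and the data graph are hypothetical, the rule for choosing which GRAY vertex to expand (the smallest-numbered useful one) is arbitrary, the pattern is assumed to be connected, and no automorphism breaking is applied, so each occurrence is reported once per automorphism.

```python
from itertools import permutations

# Hypothetical inputs (adjacency sets); the slides show the graphs as pictures.
PATTERN = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}   # assumed 4-cycle
DATA = {1: {2, 5, 6}, 2: {1, 3, 5}, 3: {2, 4},
        4: {3, 5}, 5: {1, 2, 4, 6}, 6: {1, 5}}           # assumed data graph


def expand(mapping, black, p):
    """Expand GRAY pattern vertex p; return the list of child Gpsis.

    mapping: pattern vertex -> data vertex (GRAY or BLACK); unmapped = WHITE.
    black:   frozenset of pattern vertices that were already expanded.
    """
    d = mapping[p]
    white = sorted(q for q in PATTERN[p] if q not in mapping)
    # Validate GRAY neighbors: the pattern edge must exist in the data graph.
    for q in PATTERN[p]:
        if q in mapping and q not in black and mapping[q] not in DATA[d]:
            return []                       # invalid Gpsi, prune it
    new_black = black | {p}                 # p turns from GRAY to BLACK
    free = sorted(DATA[d] - set(mapping.values()))
    children = []
    # WHITE neighbors become GRAY: map them to distinct unused neighbors of d.
    for image in permutations(free, len(white)):
        child = dict(mapping)
        child.update(zip(white, image))
        children.append((child, new_black))
    return children


def is_result(mapping, black):
    """Complete when all vertices are mapped and every edge was validated."""
    return len(mapping) == len(PATTERN) and all(
        p in black or q in black for p in PATTERN for q in PATTERN[p])


def list_matches(initial_pattern_vertex=1):
    """Run the expansion iteration by iteration and collect full mappings."""
    frontier = [({initial_pattern_vertex: d}, frozenset()) for d in DATA]
    results = []
    while frontier:                         # one loop pass = one iteration
        next_frontier = []
        for mapping, black in frontier:
            if is_result(mapping, black):
                results.append(tuple(mapping[p] for p in sorted(PATTERN)))
                continue
            # Any GRAY vertex may be expanded; skip those with nothing to do.
            gray = [p for p in mapping
                    if p not in black and PATTERN[p] - black]
            next_frontier.extend(expand(mapping, black, min(gray)))
        frontier = next_frontier
    return results


matches = list_matches()
```

In a distributed setting each child Gpsi would be sent to the worker that owns the data vertex chosen for the next expansion; here the frontier list simply stands in for that message exchange.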

Page 13: Parallel  Subgraph  Listing in a Large-Scale Graph

Efficiency of PSgL

Three metrics:
• The number of iterations.
  o S is bounded by |MVC| ≤ S ≤ |Vp| - 1.
• Workload balance.
  o Required by the max function.
• The number of Gpsis.

[Equation: total cost, annotated with the number of iterations, the number of workers, the number of Gpsis processed by worker k, and the cost of processing a Gpsi.]

*Refer to the paper for the details of estimating load().

PSgL Analysis
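One way to write the total cost that is consistent with the annotations on this slide (a sum over iterations, a max over workers, and a per-Gpsi load) is sketched below. It is a reconstruction, not a quotation from the paper: the symbols $T$, $K$ and $L_{s,k}$ and the indexing of the Gpsis are assumptions, while $S$ and load() come from the slide.

```latex
T \;=\; \sum_{s=1}^{S} \; \max_{1 \le k \le K} \; \sum_{i=1}^{L_{s,k}} \mathrm{load}\bigl(G_{psi}^{(s,k,i)}\bigr)
```

Here $S$ is the number of iterations, $K$ the number of workers, $L_{s,k}$ the number of Gpsis processed by worker $k$ in iteration $s$, and $\mathrm{load}(\cdot)$ the cost of processing one Gpsi. Under the BSP model each iteration lasts as long as its slowest worker, which is why the max makes workload balance matter.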

Page 14: Parallel  Subgraph  Listing in a Large-Scale Graph

Workload balance - I

• Partial subgraph instance distribution problem
  o There are N Gpsis to be processed by K workers; the goal is to find a distribution strategy to achieve [equation: the distribution objective].
  o NP-hard problem!
• Naive solutions
  o Random distribution strategy
  o Roulette wheel distribution strategy: a Gpsi has a higher probability of being expanded by a data vertex with a smaller degree.

Optimization
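The two naive strategies can be sketched in a few lines of Python. The 1/degree weighting used for the roulette wheel is an assumption; the slide only states that a smaller degree should mean a higher probability. The vertex names and degrees are hypothetical.

```python
import random


def roulette_pick(candidates, degree, rng):
    """Pick one candidate data vertex, favoring small degrees.

    The 1/degree weighting is an assumption; the slide only says that
    smaller-degree vertices should be more likely.
    """
    weights = [1.0 / max(degree[v], 1) for v in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


def random_pick(candidates, rng):
    """Baseline: every candidate is equally likely."""
    return rng.choice(candidates)


degree = {"a": 1, "b": 10, "c": 100}        # hypothetical degrees
rng = random.Random(0)
picks = [roulette_pick(["a", "b", "c"], degree, rng) for _ in range(10_000)]
counts = {v: picks.count(v) for v in degree}
```

Over many draws `counts` is dominated by the low-degree vertex "a", while `random_pick` would spread the draws evenly. Neither strategy looks at how loaded each worker already is.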

Page 15: Parallel  Subgraph  Listing in a Large-Scale Graph

Workload balance - II

• Workload-aware distribution strategy
  o A general greedy-based heuristic rule.

α | Description | Drawbacks
1 | Selecting the worker for the Gpsi which has minimal overall workload | local optimal
0 | Selecting the worker where the Gpsi incurs the least increased workload | imbalance
0.5 (*) | Making a trade-off between local optimal and imbalance | -

All three strategies have the same worst-case bound, which is K*|OPT|. But in practice, α = 0.5 performs best.
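The role of α can be illustrated with a small greedy selector. The scoring function below is an assumption, chosen only so that α = 1 and α = 0 reduce to the two extreme rows of the table; the slide does not give the actual formula, and the paper should be consulted for it. All loads are hypothetical numbers.

```python
def pick_worker(workload, increase, alpha):
    """Greedy choice of a worker for one Gpsi.

    workload[k]: current total load of worker k.
    increase[k]: extra load if the Gpsi is expanded on worker k.
    The score below is an assumed form, chosen only so that alpha = 1 and
    alpha = 0 reduce to the two extreme rows of the table.
    """
    def score(k):
        return (workload[k] ** alpha) * (increase[k] ** (1.0 - alpha))
    return min(range(len(workload)), key=score)


workload = [10.0, 2.0, 5.0]     # hypothetical loads
increase = [1.0, 9.0, 1.5]      # hypothetical per-worker cost of this Gpsi
choice = {a: pick_worker(workload, increase, a) for a in (1.0, 0.0, 0.5)}
```

With these numbers α = 1 picks worker 1 (least loaded overall), α = 0 picks worker 0 (cheapest increase), and α = 0.5 picks worker 2, which is reasonable on both counts.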

Page 16: Parallel  Subgraph  Listing in a Large-Scale Graph

Comparison among various approaches

[Figure: comparison of the Random, Roulette and workload-aware (different α) distribution strategies.]

Page 17: Parallel  Subgraph  Listing in a Large-Scale Graph

Partial subgraph instance reduction - I

• Pattern graph automorphism breaking
  o Using DFS to find the equivalent vertex groups
  o Assign a partial order to each equivalent vertex group
• Initial pattern vertex selection
  o Introduce a cost model
  o General pattern graph: enumerate all possible selections based on the cost model
  o Cycle and clique: the vertex with the lowest rank is the best one.

[Figure: Automorphism Breaking (a triangle on vertices 1, 2, 3 annotated with "<" order constraints), then Cost Model, then Best Initial Pattern Vertex. Caption: Initial Pattern Vertex Selection based on cost model.]
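The effect of automorphism breaking is easy to see on a triangle pattern. The Python sketch below lists triangle mappings with and without a partial order on the mapped vertices. The data graph is hypothetical, and using the vertex id as the rank is an assumption made for the illustration.

```python
from itertools import permutations

# Hypothetical data graph as adjacency sets (two triangles sharing edge 2-3).
DATA = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}


def triangle_mappings(break_automorphism):
    """Return all mappings (map(1), map(2), map(3)) of a triangle pattern."""
    found = []
    for a, b, c in permutations(DATA, 3):
        if b in DATA[a] and c in DATA[b] and a in DATA[c]:
            # Partial order map(1) < map(2) < map(3); using the vertex id
            # as the rank is an assumption made for this sketch.
            if break_automorphism and not (a < b < c):
                continue
            found.append((a, b, c))
    return found


with_duplicates = triangle_mappings(break_automorphism=False)
unique = triangle_mappings(break_automorphism=True)
```

Without the order each of the two triangles is reported six times (once per automorphism of the pattern); with it, each is reported exactly once, so the number of Gpsis shrinks by the same factor.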

Page 18: Parallel  Subgraph  Listing in a Large-Scale Graph

Partial subgraph instance reduction - II

• Online pruning of invalid Gpsis
  o Filter by the partial order and degree restriction
  o Prune with the help of a light-weight global edge index: a Bloom filter that indexes the endpoints of an edge

Data Graph | PG | Gpsi # w/ index | Gpsi # w/o index | Pruning Ratio
LiveJournal | PG1(v1) | 2.86 x 10^8 | 6.81 x 10^8 | 58.01%
 | PG4(v1) | 9.93 x 10^9 | OOM | unknown
UsPatent | PG5(v1) | 2.26 x 10^7 | 3.17 x 10^8 | 92.87%
 | PG5(v3; v4) | 7.38 x 10^9 | 2.04 x 10^10 | 63.89%

[Figure: pattern graphs PG1 (3 vertices), PG4 (4 vertices) and PG5 (5 vertices).]
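A global edge index of this kind can be sketched with a small Bloom filter. The Python code below is an illustration only: the bit-array size, the number of hash functions and the SHA-256 based hashing are assumptions, not the parameters used by PSgL, and the class name is made up.

```python
import hashlib


class EdgeBloomFilter:
    """Tiny Bloom filter over undirected edges.

    The size, the number of hash functions and the hashing scheme are
    illustrative assumptions, not the parameters used by PSgL.
    """

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, u, v):
        lo, hi = (u, v) if u <= v else (v, u)     # undirected: order-free key
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{lo}:{hi}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, u, v):
        """Record the edge (u, v) in the index."""
        for pos in self._positions(u, v):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, u, v):
        """False means the edge is certainly absent; True may be a false positive."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(u, v))


index = EdgeBloomFilter()
for u, v in [(2, 3), (3, 4), (4, 5), (2, 5)]:     # hypothetical edges
    index.add(u, v)
```

Because a Bloom filter never reports a stored edge as missing, a Gpsi that needs an edge for which `might_contain` returns False can be dropped immediately, without waiting for the iteration in which that edge would be validated. False positives only mean that some invalid Gpsis survive a little longer.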

Page 19: Parallel  Subgraph  Listing in a Large-Scale Graph

Evaluation - Comparing to MR solutions

[Figure: comparison with the MapReduce solutions on a 3-vertex and a 4-vertex pattern graph. Annotation: PSgL: 4302s, Afrati: 7291s.]

Afrati and SGIA-MR are the state-of-the-art MapReduce solutions. Ratios exceeding 100 times are not visualized.

Evaluation

Page 20: Parallel  Subgraph  Listing in a Large-Scale Graph

Evaluation - Comparing to GraphLab

Data Graph | Pattern Graph | Afrati | PowerGraph | PSgL
Twitter | | 432min | 2min | 12.5min
Wikipedia | | 871s | 36s | 125s
WikiTalk | | 4402s | 48s | 318s
WikiTalk | | 13743s | 100s | 494s
WikiTalk | | 13743s | OOM* | 494s
WikiTalk | | 1785s | 127s | 38s
LiveJournal | | 2749s | OOM | 1330s

* using a different traversal order.

[Figure: pattern graphs PG1 (3 vertices), PG2, PG3 and PG4 (4 vertices each).]

Page 21: Parallel  Subgraph  Listing in a Large-Scale Graph

Conclusion

• Subgraph listing is a fundamental operation for massive graph analysis.
• We propose an efficient parallel subgraph listing framework, PSgL.
  o Various distribution strategies
  o Cost model
  o Light-weight global edge index
• The workload-aware distribution strategy can be extended to other balance problems.
• A new execution engine is required for larger pattern graphs.

Page 22: Parallel  Subgraph  Listing in a Large-Scale Graph


Thanks!

Page 23: Parallel  Subgraph  Listing in a Large-Scale Graph

Backup Experiment: Scalability of PSgL

[Figure: Performance vs. Worker Number]

Page 24: Parallel  Subgraph  Listing in a Large-Scale Graph

Backup Experiment: Initial pattern vertex selection

[Figure: Influences of the Initial Pattern Vertex on Various Data Graphs. Panels: LiveJournal, Random graph.]