34
A Framework for Processing Large Graphs in Shared Memory Julian Shun Based on joint work with Guy Blelloch and Laxman Dhulipala (Work done at Carnegie Mellon University)

A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

A Framework for Processing Large Graphs in Shared Memory

Julian Shun

Based on joint work with Guy Blelloch and Laxman Dhulipala(Work done at Carnegie Mellon University)

Page 2: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

What are graphs?

• Can contain up to billions of vertices and edges• Need simple, efficient, and scalable ways to analyze them

2

EdgeVertex Vertex

Graph Data is Everywhere!

Page 3: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

Efficient Graph Processing• Use parallelism

• Design efficient algorithms

• Write/optimize code for each application• Build a general framework

3

Breadth-first searchBetweenness centralityConnected components…

Single-source shortest pathsEccentricity estimation(Personalized) PageRank…

Page 4: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

Ligra Graph Processing Framework4

EdgeMap VertexMap

Breadth-first searchBetweenness centralityConnected componentsTriangle countingK-core decompositionMaximal independent setSet cover

Single-source shortest pathsEccentricity estimation(Personalized) PageRankLocal graph clusteringBiconnected componentsCollaborative filtering…

Simplicity, Performance, Scalability

Page 5: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

Graph Processing Systems

• Existing: Pregel/Giraph/GPS, GraphLab, Pegasus, Knowledge Discovery Toolbox, GraphChi, etc.

• Our system: Ligra - Lightweight graph processing system for shared memory

5

Takes advantage of “frontier-based” nature of many algorithms

(active set is dynamic and often small)

Page 6: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

Breadth-first Search (BFS)• Compute a BFS tree rooted at source r containing all vertices reachable from r

6

• Can process each frontier in parallel• Race conditions, load balancing

ApplicationsBetweenness centralityEccentricity estimationMaximum flowWeb crawlersNetwork broadcastingCycle detection…

Page 7: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

Steps for Graph Traversal• Operate on a subset of vertices• Map computation over subset of edges in parallel• Return new subset of vertices• Map computation over subset of vertices in parallel

7

}

Graph

VertexSubset

EdgeMap

VertexMap

Think with flat data-parallel operators

We built the Ligra abstraction for these kinds of computations

Abstraction enables optimizations(hybrid traversal and graph compression)

Many graph traversal algorithms do this!

Page 8: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

Breadth-first Search in Ligraparents = {-1, …, -1}; //-1 indicates “unexplored”

procedure UPDATE(s, d):return compare_and_swap(parents[d], -1, s);

procedure COND(v):return parents[v] == -1; //checks if “unexplored”

procedure BFS(G, r):parents[r] = r;

frontier = {r}; //VertexSubsetwhile (size(frontier) > 0):

frontier = EDGEMAP(G, frontier, UPDATE, COND);

8

frontier

TT TTFfrontier

Page 9: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

Actual BFS code in Ligra9

Page 10: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

Sparse or Dense EdgeMap?10

11

10

9

12

13

15

14

1

4

3

2

5

8

7

6

Frontier• Dense method better when

frontier is large and many vertices have been visited

• Sparse (traditional) method better for small frontiers

• Switch between the two methods based on frontier size [Beamer et al. SC ’12]

11

10

9

12

Limited to BFS?

Page 11: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

EdgeMap11

Loop through outgoing edges of frontier vertices in parallel

procedure EDGEMAP(G, frontier, Update, Cond):if (size(frontier) + sum of out-degrees > threshold) then:

return EDGEMAP_DENSE(G, frontier, Update, Cond);else:

return EDGEMAP_SPARSE(G, frontier, Update, Cond);

Loop through incoming edges of “unexplored” vertices (in parallel), breaking early if possible

• More general than just BFS!• Generalized to many other problems

• For example, betweenness centrality, connected components, sparse PageRank, shortest paths, eccentricity estimation, graph clustering, k-core decomposition, set cover, etc.

• Users need not worry about this

Page 12: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

Frontier-based approach enables hybrid traversal

12

0

2

4

6

8

10

12

14

BFS BetweennessCentrality

ConnectedComponents

EccentricityEstimation

40-c

ore

runn

ing

time

(sec

onds

)

Twitter graph (41M vertices, 1.5B edges)

Dense

Sparse

Hybrid

30.7 20.7

(switching between sparse and dense using default threshold of |E|/20)

Page 13: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

PageRank13

Page 14: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

VertexMap14

0 4 68VertexSubset

4

7

52

1

0

6

8

3

68VertexSubset

bool f(v){data[v] = data[v] + 1;return (data[v] == 1);

}

4

0

6

8

VertexMap

T F F T

Page 15: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

PageRank in Ligrap_curr = {1/|V|, …, 1/|V|}; p_next = {0, …, 0}; diff = {}; error =∞;

procedure UPDATE(s, d):

atomic_increment(p_next[d], p_curr[s] / degree(s));

return 1;

procedure COMPUTE(i):

p_next[i] = α ∙ p_next[i] + (1- α) ∙ (1/|V|);

diff[i] = abs(p_next[i] – p_curr[i]);

p_curr[i] = 0;

return 1;

procedure PageRank(G, α, ε):

frontier = {0, …, |V|-1};

while (error > ε):

frontier = EDGEMAP(G, frontier, UPDATE, CONDtrue);frontier = VERTEXMAP(frontier, COMPUTE);

error = sum of diff entries;

swap(p_curr, p_next)

return p_curr;

15

Page 16: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

PageRank• Sparse version?

• PageRank-Delta: Only update vertices whose PageRank value has changed by more than some Δ-fraction (discussed in PowerGraph and McSherry WWW ‘05)

16

Page 17: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

PageRank-Delta in Ligra

PR[i] = {1/|V|, …, 1/|V|}; nghSum = {0, …, 0};Change = {}; //store changes in PageRank values

procedure UPDATE(s, d): //passed to EdgeMapatomic_increment(nghSum[d], Change[s] / degree(s));return 1;

procedure COMPUTE(i): //passed to VertexMapChange[i] = α ∙ nghSum[i]; PR[i] = PR[i] + Change[i];return (abs(Change[i]) > Δ); //check if absolute value of change is big enough

17

Page 18: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

Performance of Ligra

18

Page 19: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

Ligra BFS Performance19

0

0.05

0.1

0.15

0.2

0.25

0.3

BFS

Run

ning

tim

e (s

econ

ds)

Twitter graph (41M vertices, 1.5B edges)

Ligra (40-coremachine)

Hand-writtenCilk/OpenMP (40-coremachine)

020406080

100120140160180

BFS

Line

s of

cod

e• Comparing against hybrid traversal BFS code by Beamer et al.

Page 20: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

Ligra PageRank Performance20

0

2

4

6

8

10

12

14

Page Rank (1 iteration)

Run

ning

tim

e (s

econ

ds)

Twitter graph (41M vertices, 1.5B edges)

PowerGraph (64 x 8-cores)

PowerGraph (40-coremachine)

Ligra (40-coremachine)

Hand-writtenCilk/OpenMP (40-coremachine) 0

10

20

30

40

50

60

70

80

Page Rank

Line

s of

cod

e

• Easy to implement “sparse” version of PageRank in Ligra

Page 21: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

Connected Components Performance21

0

50

100

150

200

250

Connected Components

Run

ning

tim

e (s

econ

ds)

Twitter graph (41M vertices, 1.5B edges)

PowerGraph (16 x 8-cores)[Gonzalez et al.]PowerGraph (40-coremachine)

Ligra (40-coremachine)

Hand-writtenCilk/OpenMP (40-core machine) 0

50100150200250300350400

Connected Components

Line

s of

cod

e

• Ligra’s performance is close to hand-written code• Faster than best existing system • Subsequent systems have used Ligra’s abstraction and hybrid

traversal idea, e.g., Galois [SOSP ‘13], Polymer [PPoPP ’15], Gunrock [PPoPP ’16], Gemini [OSDI ’16], GraphGrind [ICS ‘17], Grazelle [PPoPP ‘18]

22.5

0

0.5

1

1.5

2

2.5

3

Connected Components

Run

ning

tim

e (s

econ

ds)

3.5 billion vertices128 billion edges

(540 GB)

Largest publicly available graph

72-core machine with 1TB RAM Ligra Running timeBFS 12sConnected components 42s1 iteration PageRank 28s

Page 22: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

Large Graphs22

6.6 billion edges

128 billion edges

~1 trillion edges [VLDB 2015]

• Most can fit on commodity shared memory machine

Amazon EC2

ExampleDell PowerEdge R930:Up to 96 cores and 6 TB of RAM

Page 23: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

What if you don’t have or can’t afford that much memory?

23R

unni

ng T

ime

Memory Required

Graph Compression

Page 24: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

Ligra+: Adding Graph Compression to Ligra

24

Page 25: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

• Same interface as Ligra• All changes hidden from the user!

25

Ligra+: Adding Graph Compression to Ligra

Graph

VertexSubset

EdgeMap

VertexMap

Interface

Use compressed representation

Decode edges on-the-fly

Same as before

Same as before

Page 26: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

Graph representation26

0 4 5 11

2 7 9 16 0 1 6 9 12

...

...

Offsets

Edges

2 5 2 7 -1 -1 5 3 3 ... Compressed

Edges

• Graph reordering to improve locality

Vertex IDs 0 1 2 3

Sort edges and encode differences

2 - 0 = 2 7 - 2 = 5 1 - 2 = -1

Page 27: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

0 0 0 0 0 1 10 0 1 0 0 0 1

• k-bit codes• Encode value in chunks of k bits• Use k-1 bits for data, and 1 bit as the “continue” bit

• Example: encode “401” using 8-bit (byte) code• In binary:

27

1 1 0 0 1 0 0 0 1

1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1

“continue” bit

7 bits for data

Variable-length codes

Page 28: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

• Another idea: get rid of “continue” bits

28

0 1 0 1 1 0 0 1

Number of bytes per integer

Size of group(max 64)

Header

……Integers in group

encoded in byte chunks

• Increases space, but makes decoding cheaper (no branch misprediction from checking “continue” bit)

x1 x2 x3 x4 x5 x6 x7 x8 ……Number of bytes required to encode each integer

1 2 2 2 2 2 2 2

Use run-length encoding

……

Encoding optimization

Page 29: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

• Same interface as Ligra• All changes hidden from the user!

29

Ligra+: Adding Graph Compression to Ligra

Graph

VertexSubset

EdgeMap

VertexMap

Interface

Use compressed representation

Decode edges on-the-fly

Same as before

Same as before

Page 30: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

• Processes outgoing edges of a subset of vertices

30

Modifying EdgeMap

2 5 2 7 9 2 1 3 3VertexSubset

-4 6 3 1 3 5 6 2

5 10 2

30 5

-16 2 19 1 4 2 5 3

All vertices processed in parallel

16

25

44

0

7

What about high-degree vertices?

Page 31: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

31

Handling high-degree vertices

-1 2 4 3 16 2 1 5 8 19 4 1 23 14 12 1 9 10 3 5

High-degree vertex

Chunks of size T

-1 2 4 3 16 2 27 5 8 19 4 1 87 14 12 1 9 10

Encode first entry relative to source vertex

All chunks can be decoded in parallel!

• We chose T=1000• Similar performance

and space usage for a wide range of T

Page 32: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

Ligra+ Space Savings

32

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

soc-L

J

cit-Patents

com-LJ

com-Orkut

nlpkkt240

Twitter

uk-u

nion

Yahoo

Space relative to Ligra using byte codes with run-length encoding

Ligra

Ligra+

• Space savings of about 1.3—3x

• Could use more sophisticated schemes to further

reduce space, but more expensive to decode

• Cost of decoding on-the-fly?

Page 33: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

Ligra+ Performance33

0.6

0.7

0.8

0.9

1

1.1

1.2

1.3

1.4

BFS

Betwee

nnes

s

Eccen

tricity

Com

pone

nts

PageR

ank

Bellm

an-F

ord

40-core time relative to Ligra

• Cost of decoding on-the-fly?

• Memory subsystem is a scalability bottleneck in parallel as these graph algorithms are memory-bound

• Ligra+ decoding gets better parallel speed up

0.6

0.7

0.8

0.9

1

1.1

1.2

1.3

1.4

BFS

Betwee

nnes

s…

Eccen

tricity

Com

pone

nts

PageR

ank

Bellm

an-F

ord

Single-thread time relative to Ligra

Ligra

Ligra+

0

5

10

15

20

25

30

35

40

BFS

Betwee

nnes

s

Eccen

tricity

Com

pone

nts

PageR

ank

Bellm

an-F

ord

Self-relative 40-core Speedup

Ligra

Ligra+

Page 34: A Framework for Processing Large Graphs in Shared Memory · A Framework for Processing Large Graphs in Shared Memory Julian Shun ... Large Graphs 22 6.6 billion edges 128 billion

Ligra Summary35

EdgeMapVertexMap

Breadth-first searchBetweenness centralityConnected componentsTriangle countingK-core decompositionMaximal independent set…

Single-source shortest pathsEccentricity estimation(Personalized) PageRankLocal graph clusteringBiconnected componentsCollaborative filtering…

Simplicity, Performance, Scalability

VertexSubset

Optimizations: Hybrid traversal and graph compression