72
GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold Xin, Ankur Dave, Daniel Crankshaw, Michael Franklin, and Ion Stoica OSDI 2014 UC BERKELEY

GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Embed Size (px)

Citation preview

Page 1: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

GraphX:Graph Processing in a Distributed Dataflow Framework Joseph GonzalezPostdoc, UC-Berkeley AMPLabCo-founder, GraphLab Inc.

Joint work with Reynold Xin, Ankur Dave, Daniel Crankshaw, Michael Franklin, and Ion Stoica

OSDI 2014 UC BERKELEY

Page 2: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Modern Analytics

Raw Wikipedia

< / >< / >< / >XML

Hyperlinks PageRank Top 20 PagesTitle PR

Link Table

Title Link

Editor GraphCommunityDetection

User Community

UserCom

.

DiscussionTableUser Disc.

Top Communities

Com.

PR..

Page 3: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Tables

Raw Wikipedia

< / >< / >< / >XML

Hyperlinks PageRank Top 20 PagesTitle PR

Link Table

Title Link

Editor GraphCommunityDetection

User Community

UserCom

.

DiscussionTableUser Disc.

Top Communities

Com.

PR..

Page 4: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Graphs

Raw Wikipedia

< / >< / >< / >XML

Hyperlinks PageRank Top 20 PagesTitle PR

Link Table

Title Link

Editor GraphCommunityDetection

User Community

UserCom

.

DiscussionTableUser Disc.

Top Communities

Com.

PR..

Page 5: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Separate Systems

Tables Graphs

Page 6: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Separate Systems

GraphsDataflow Systems

Table

Result

Row

Row

Row

Row

Page 7: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Separate SystemsDataflow Systems

Graph Systems

Dependency Graph

Pregel

Table

Result

Row

Row

Row

Row

Page 8: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Difficult to Use

Users must Learn, Deploy, and Manage multiple systems

Leads to brittle and often complex interfaces

8

Page 9: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Inefficient

9

Extensive data movement and duplication across

the network and file system

< / >< / >< / >XML

HDFS

HDFS

HDFS

HDFS

Limited reuse internal data-structures across stages

Page 10: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Spark DataflowFramework

GraphX

GraphX Unifies Computation on Tables and Graphs

Table View Graph View

Enabling a single system to easily and efficiently support the entire

pipeline

Page 11: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Separate SystemsDataflow Systems

Graph Systems

Dependency Graph

Pregel

Table

Result

Row

Row

Row

Row

?

Page 12: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Separate SystemsDataflow Systems

Dependency Graph

Graph Systems

Pregel

Table

Result

Row

Row

Row

Row

Page 13: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

GraphLab

Spark

Hadoop

0 200 400 600 800 1000 1200 1400 1600

22

354

1340

Runtime (in seconds, PageRank for 10 iter-ations)

PageRank on the Live-Journal Graph

Hadoop is 60x slower than GraphLabSpark is 16x slower than GraphLab

?

Page 14: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Key Question

How can we naturally express and efficiently execute graph computation in a general purpose dataflow framework?

Distill the lessons learnedfrom specialized graph systems

Page 15: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Key Question

How can we naturally express and efficiently execute graph computation in a general purpose dataflow framework?

Representation Optimizations

Page 16: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Raw Wikipedia

< / >< / >< / >XML

Hyperlinks PageRank Top 20 PagesTitle PR

Link Table

Title Link

Editor GraphCommunityDetection

User Community

UserCom

.

DiscussionTableUser Disc.

Top Communities

Com.

PR..

Page 17: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Express computation locally:

Iterate until convergence

Rank of Page i Weighted sum of

neighbors’ ranks

17

Example Computation: PageRank

RandomReset Prob.

Page 18: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

“Think like a Vertex.”- Malewicz et al., SIGMOD’10

18

Page 19: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

19

Gather information fromneighboring vertices

Graph-Parallel Pattern

Gonzalez et al. [OSDI’12]

Page 20: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

20

Apply an update the vertex property

Graph-Parallel Pattern

Gonzalez et al. [OSDI’12]

Page 21: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

21

Scatter information toneighboring vertices

Graph-Parallel Pattern

Gonzalez et al. [OSDI’12]

Page 22: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Many Graph-Parallel Algorithms

Collaborative Filtering»Alternating Least Squares»Stochastic Gradient

Descent»Tensor Factorization

Structured Prediction»Loopy Belief Propagation»Max-Product Linear

Programs»Gibbs Sampling

Semi-supervised ML»Graph SSL

»CoEM

Community Detection»Triangle-Counting»K-core Decomposition»K-Truss

Graph Analytics»PageRank»Personalized PageRank»Shortest Path»Graph Coloring

MACHINE LEARNING

NETWORK

ANALYSIS

22

Page 23: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Specialized Computational

Pattern

Specialized Graph

Optimizations

Page 24: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Graph System Optimizations

24

SpecializedData-Structures

Vertex-CutsPartitioning

RemoteCaching / Mirroring

Message Combiners

Active Set Tracking

Page 25: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Representation

Optimizations

DistributedGraphs

HorizontallyPartitioned Tables

Join

VertexPrograms

DataflowOperators

Advances in Graph Processing Systems

Distributed Join Optimization

Materialized View Maintenance

Page 26: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Property Graph Data Model

B C

A D

F E

A DD

Property Graph

B C

D

E

AA

F

Vertex Property:• User Profile• Current PageRank Value

Edge Property:• Weights• Timestamps

Page 27: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Part. 2

Part. 1

Vertex Table(RDD)

B C

A D

F E

A D

Encoding Property Graphs as Tables

D

Property Graph

B C

D

E

AA

F

Mach

ine 1

Mach

ine 2

Edge Table

(RDD) A B

A C

C D

B C

A E

A F

E F

E D

B

C

D

E

A

F

Routing

Table(RDD)

B

C

D

E

A

F

1

2

1 2

1 2

1

2

Vertex Cut

Page 28: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Separate Properties and Structure

Reuse structural information across multiple graphs

Input Graph

Transform Vertex Properties

Transformed GraphVertex

TableRouting

TableEdgeTable

VertexTable

Routing Table

EdgeTable

Page 29: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

class Graph [ V, E ] { def Graph(vertices: Table[ (Id, V) ], edges: Table[ (Id, Id, E) ])

// Table Views -----------------def vertices: Table[ (Id, V) ]def edges: Table[ (Id, Id, E) ]def triplets: Table [ ((Id, V), (Id, V), E) ]// Transformations ------------------------------def reverse: Graph[V, E]def subgraph(pV: (Id, V) => Boolean,

pE: Edge[V,E] => Boolean): Graph[V,E]def mapV(m: (Id, V) => T ): Graph[T,E] def mapE(m: Edge[V,E] => T ): Graph[V,T]// Joins ----------------------------------------def joinV(tbl: Table [(Id, T)]): Graph[(V, T), E ]def joinE(tbl: Table [(Id, Id, T)]): Graph[V, (E, T)]// Computation ----------------------------------def mrTriplets(mapF: (Edge[V,E]) => List[(Id, T)],

reduceF: (T, T) => T): Graph[T, E]}

Graph Operators (Scala)

30

Page 30: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

class Graph [ V, E ] { def Graph(vertices: Table[ (Id, V) ], edges: Table[ (Id, Id, E) ])

// Table Views -----------------def vertices: Table[ (Id, V) ]def edges: Table[ (Id, Id, E) ]def triplets: Table [ ((Id, V), (Id, V), E) ]// Transformations ------------------------------def reverse: Graph[V, E]def subgraph(pV: (Id, V) => Boolean,

pE: Edge[V,E] => Boolean): Graph[V,E]def mapV(m: (Id, V) => T ): Graph[T,E] def mapE(m: Edge[V,E] => T ): Graph[V,T]// Joins ----------------------------------------def joinV(tbl: Table [(Id, T)]): Graph[(V, T), E ]def joinE(tbl: Table [(Id, Id, T)]): Graph[V, (E, T)]// Computation ----------------------------------def mrTriplets(mapF: (Edge[V,E]) => List[(Id, T)],

reduceF: (T, T) => T): Graph[T, E]}

Graph Operators (Scala)

31

def mrTriplets(mapF: (Edge[V,E]) => List[(Id, T)], reduceF: (T, T) => T): Graph[T, E]

capture the Gather-Scatter pattern fromspecialized graph-processing systems

Page 31: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Triplets Join Vertices and Edges

The triplets operator joins vertices and edges:

TripletsVertices

B

A

C

D

Edges

A B

A C

B C

C D

A BA

B A C

B C

C D

Page 32: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Map-Reduce TripletsMap-Reduce triplets collects information about the neighborhood of each vertex:

C D

A C

B C

A B

Src. or Dst.

MapFunction( ) (B, )

MapFunction( ) (C, )

MapFunction( ) (C, )

MapFunction( ) (D, )

Reduce

(B, )

(C, + )

(D, )Message

Combiners

Page 33: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Using these basic GraphX operators

we implemented Pregel and GraphLab

in under 50 lines of code!

34

Page 34: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

The GraphX Stack(Lines of Code)

GraphX (2,500)

Spark (30,000)

Pregel API (34)

PageRank (20)

Connected Comp. (20)

K-core(60)

Triangle

Count(50)

LDA(220)

SVD++

(110)

Some algorithms are more naturally expressed using the GraphX primitive operators

Page 35: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Representation

Optimizations

DistributedGraphs

HorizontallyPartitioned Tables

Join

VertexPrograms

DataflowOperators

Advances in Graph Processing Systems

Distributed Join Optimization

Materialized View Maintenance

Page 36: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Vertex Table (RDD)

Join Site Selection using Routing Tables

Edge Table (RDD)

A B

A C

C D

B C

A E

A F

E F

E D

B

C

D

E

A

F

Routing

Table(RDD)

B

C

D

E

A

F

1

2

1 2

1 2

1

2

Never ShuffleEdges!

Page 37: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Vertex Table (RDD)

Caching for Iterative mrTripletsEdge Table

(RDD)

A B

A C

C D

B C

A E

A F

E F

E D

MirrorCache

B

C

D

A

MirrorCache

D

E

F

A

B

C

D

E

A

F

B

C

D

E

A

F

A

D

Sca

nSca

n

ReusableHash Index

ReusableHash Index

Page 38: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Vertex Table (RDD)

Edge Table (RDD)

A B

A C

C D

B C

A E

A F

E F

E D

MirrorCache

B

C

D

A

MirrorCache

D

E

F

A

Incremental Updates for Triplets View

B

C

D

E

A

F

Change AA

Change E

Sca

n

Page 39: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Vertex Table (RDD)

Edge Table (RDD)

A B

A C

C D

B C

A E

A F

E F

E D

MirrorCache

B

C

D

A

MirrorCache

D

E

F

A

Aggregation for Iterative mrTriplets

B

C

D

E

A

F

Change

Change

Sca

n

Change

Change

Change

Change

LocalAggregate

LocalAggregate

B

C

D

F

Page 40: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Reduction in Communication Due to

Cached Updates

0 2 4 6 8 10 12 14 160.1

1

10

100

1000

10000

Connected Components on Twitter Graph

Iteration

Netw

ork

Com

m.

(MB

)

Most vertices are within 8 hopsof all vertices in their comp.

Page 41: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Benefit of Indexing Active Vertices

0 2 4 6 8 10 12 14 160

5

10

15

20

25

30

Connected Components on Twitter Graph

Iteration

Ru

nti

me (

Secon

ds)

Without Active Tracking

Active Vertex Tracking

Page 42: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Join EliminationIdentify and bypass joins for unused triplet fields

» Java bytecode inspection

43

0 2 4 6 8 10 12 14 16 18 200

2000400060008000

100001200014000

PageRank on Twitter

Iteration

Com

mu

nic

ati

on

(M

B)

Factor of 2 reduction in communication

Bette

r

Join Elimination

Without Join Elimination

Page 43: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Additional OptimizationsIndexing and Bitmaps:

»To accelerate joins across graphs»To efficiently construct sub-graphs

Lineage based fault-tolerance»Exploits Spark lineage to recover in parallel»Eliminates need for costly check-points

Substantial Index and Data Reuse:»Reuse routing tables across graphs and

sub-graphs»Reuse edge adjacency information and

indices

44

Page 44: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

System ComparisonGoal:

Demonstrate that GraphX achieves performance parity with specialized graph-processing systems.

Setup:

16 node EC2 Cluster (m2.4xLarge) + 1GigE

Compare against GraphLab/PowerGraph (C++), Giraph (Java), & Spark (Java/Scala)

Page 45: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Graph

X

Graph

Lab

Giraph

Naïve

Spa

rk0

500

1000

1500

2000

2500

3000

3500

Graph

X

Graph

Lab

Giraph

Naïve

Spa

rk0

100020003000400050006000700080009000

Twitter Graph (42M Vertices,1.5B Edges)UK-Graph (106M Vertices, 3.7B Edges)

PageRank Benchmark

GraphX performs comparably to state-of-the-art graph processing systems.

Runti

me (

Seco

nd

s)

EC2 Cluster of 16 x m2.4xLarge (8 cores) + 1GigE

7x 18x

Page 46: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Connected Comp. Benchmark

Graph

X

Graph

Lab

Giraph

Naïve

Spa

rk0

500

1000

1500

2000

2500

Graph

X

Graph

Lab

Giraph

Naïve

Spa

rk0

100020003000400050006000700080009000

Twitter Graph (42M Vertices,1.5B Edges)UK-Graph (106M Vertices, 3.7B Edges)

GraphX performs comparably to state-of-the-art graph processing systems.

Out-

of-

Mem

ory

EC2 Cluster of 16 x m2.4xLarge (8 cores) + 1GigE

Runti

me (

Seco

nd

s)

8x 10x

Page 47: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Graphs are just one stage….

What about a pipeline?

Page 48: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

HDFSHDFS

ComputeSpark PreprocessSpark Post.

A Small Pipeline in GraphX

Timed end-to-end GraphX is the fastest

Raw Wikipedia

< / >< / >< / >XML

Hyperlinks PageRank Top 20 Pages

GraphX

Giraph + Spark

0 200 400 600 800 1000 1200 1400 1600

Total Runtime (in Seconds)

605

375

1492

342

Page 49: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Adoption and Impact

GraphX is now part of Apache Spark

• Part of Cloudera Hadoop Distribution

In production at Alibaba Taobao

• Order of magnitude gains over Spark

Inspired GraphLab Inc. SFrame technology

• Unifies Tables & Graphs on Disk

Page 50: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

GraphX Unified Tables and Graphs

Enabling users to easily and efficiently express the entire

analytics pipeline

New APIBlurs the distinction between Tables and

Graphs

New SystemUnifies Data-Parallel

Graph-Parallel Systems

Page 51: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Graph Systems GraphX

Specialized Systems

Integrated Frameworks

What did we Learn?

Page 52: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Graph Systems GraphX

Specialized Systems

Integrated Frameworks

Parameter Server

?

Future Work

Page 53: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Graph Systems GraphX

Specialized Systems

Integrated Frameworks

Parameter Server

Future Work

AsynchronyNon-deterministic

Shared-State

Page 54: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Thank You

[email protected]

http://amplab.cs.berkeley.edu/projects/graphx/

ReynoldXin

AnkurDave

DanielCrankshaw

MichaelFranklin

IonStoica

Page 55: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Related WorkSpecialized Graph-Processing Systems:

GraphLab [UAI’10], Pregel [SIGMOD’10], Signal-Collect [ISWC’10], Combinatorial BLAS [IJHPCA’11], GraphChi [OSDI’12], PowerGraph [OSDI’12], Ligra [PPoPP’13], X-Stream [SOSP’13]

Alternative to Dataflow framework:Naiad [SOSP’13]: GraphLINQHyracks: Pregelix [VLDB’15]

Distributed Join Optimization:Multicast Join [Afrati et al., EDBT’10]Semi-Join in MapReduce [Blanas et al.,

SIGMOD’10]

Page 56: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Edge Files Have Locality

GraphLab GraphX + Shuffle

GraphX0

200

400

600

800

1000

1200

GraphLab rebalances theedge-files on-load.GraphX preserves the on-disk layout through Spark. Better Vertex-Cut

UK-Graph (106M Vertices, 3.7B Edges)

Runtim

e (S

eco

nd

s)

Page 57: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Scalability

8 16 24 32 40 48 56 640

50

100

150

200

250

300

350

400

450

500

GraphX

Linear Scaling

Twitter Graph (42M Vertices,1.5B Edges)

Scales slightly better than

PowerGraph/GraphLab

Ru

nti

me

EC2-Nodes

Page 58: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Apache Spark Dataflow Platform

Resilient Distributed Datasets (RDD): Zaharia et al., NSDI’12

HDFS

HDFS

Map

Map

RDD RDD

Reduce

RDD

Load

Load

Page 59: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Apache Spark Dataflow Platform

Resilient Distributed Datasets (RDD): Zaharia et al., NSDI’12

HDFS

HDFS

Map

Map

RDD RDD

Reduce

Optimized for iterative access to data.

RDD

Load

Load

.cache()

Persist in Memory

Page 60: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Shared Memory Advantage

Spark Shared Nothing Model

Core

Core

Core

Core

ShuffleFiles

GraphLab Shared Memory

Core

Core

Core

Core

SharedDe-serializedIn-MemoryGraph

Page 61: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Shared Memory Advantage

GraphLab GraphLab NoSHM

GraphX0

50100150200250300350400450500

Twitter Graph (42M Vertices,1.5B Edges)Spark Shared Nothing Model

GraphLab No SHM.

Core

Core

Core

Core

ShuffleFiles

Core

Core

Core

Core

TCP/IP

Runtim

e (S

eco

nd

s)

Page 62: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

PageRank Benchmark

Graph

Lab

Graph

X

Giraph

Opt. S

park

Naïve

Spa

rk0

500

1000

1500

2000

2500

3000

3500

Graph

Lab

Graph

X

Giraph

Opt. S

park

Naïve

Spa

rk0

100020003000400050006000700080009000

Twitter Graph (42M Vertices,1.5B Edges)UK-Graph (106M Vertices, 3.7B Edges)

GraphX performs comparably to state-of-the-art graph processing systems.

Page 63: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Connected Comp. Benchmark

Graph

Lab

Graph

X

Giraph

Opt. S

park

Naïve

Spa

rk0

500

1000

1500

2000

2500

Graph

Lab

Graph

X

Giraph

Opt. S

park

Naïve

Spa

rk0

100020003000400050006000700080009000

Twitter Graph (42M Vertices,1.5B Edges)UK-Graph (106M Vertices, 3.7B Edges)

GraphX performs comparably to state-of-the-art graph processing systems.

Out-

of-

Mem

ory

EC2 Cluster of 16 x m2.4xLarge Nodes + 1GigE

Runti

me (

Seco

nd

s)

8x

10x

Out-

of-

Mem

ory

Page 64: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Fault-ToleranceLeverage Spark Fault-Tolerance Mechanism

No Failure Lineage Restart0

100200300400500600700800900

1000

Ru

nti

me (

Secon

ds)

Page 65: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

67

Graph-Processing Systems

Pregeloogle

Expose specialized API to simplify graph programming.

CombBLAS

Kineograph

X-Stream

GraphChi

Ligra

GPS

Representation

Page 66: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

68

Vertex-Program Abstraction

iPregel_PageRank(i, messages) : // Receive all the messages total = 0 foreach( msg in messages) : total = total + msg

// Update the rank of this vertex R[i] = 0.15 + total

// Send new messages to neighbors foreach(j in out_neighbors[i]) : Send msg(R[i]) to vertex j

Page 67: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

The Vertex-Program Abstraction

GraphLab_PageRank(i) // Compute sum over neighbors total = 0 foreach( j in neighbors(i)): total += R[j] * wji

// Update the PageRank R[i] = 0.15 + total

69

R[4] * w41

R[3]

* w

31

R[2] * w21

++

4 1

3 2

Low, Gonzalez, et al. [UAI’10]

Page 68: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

F

E

Example: Oldest Follower

D

B

A

CCalculate the number of older followers for each user?val olderFollowerAge = graph .mrTriplets( e => // Map if(e.src.age > e.dst.age) { (e.srcId, 1) else { Empty } , (a,b) => a + b // Reduce ) .vertices

23 42

30

19 75

1670

Page 69: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

71

Enhanced Pregel in GraphX

pregelPR(i, messageList ):

// Receive all the messagestotal = 0foreach( msg in messageList) : total = total + msg

// Update the rank of this vertexR[i] = 0.15 + total

// Send new messages to neighborsforeach(j in out_neighbors[i]) : Send msg(R[i]/E[i,j]) to vertex

Require Message Combiners

messageSum

messageSum

Remove Message

Computationfrom the

Vertex Program

sendMsg(ij, R[i], R[j], E[i,j]): // Compute single message return msg(R[i]/E[i,j])

combineMsg(a, b): // Compute sum of two messages return a + b

Page 70: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

72

PageRank in GraphX

// Load and initialize the graph

val graph = GraphBuilder.text(“hdfs://web.txt”)

val prGraph = graph.joinVertices(graph.outDegrees)

// Implement and Run PageRank

val pageRank =

prGraph.pregel(initialMessage = 0.0, iter = 10)(

(oldV, msgSum) => 0.15 + 0.85 * msgSum,

triplet => triplet.src.pr / triplet.src.deg,

(msgA, msgB) => msgA + msgB)

Page 71: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Example Analytics Pipeline

// Load raw data tables

val articles = sc.textFile(“hdfs://wiki.xml”).map(xmlParser)

val links = articles.flatMap(article => article.outLinks)

// Build the graph from tables

val graph = new Graph(articles, links)

// Run PageRank Algorithm

val pr = graph.PageRank(tol = 1.0e-5)

// Extract and print the top 20 articles

val topArticles = articles.join(pr).top(20).collect

for ((article, pageRank) <- topArticles) { println(article.title + ‘\t’ + pageRank)}

Page 72: GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley AMPLab Co-founder, GraphLab Inc. Joint work with Reynold

Apache Spark Dataflow Platform

Resilient Distributed Datasets (RDD): Zaharia et al., NSDI’12

HDFS

HDFS

Map

Map

RDD RDD

Reduce

Lineage:

RDD

Load

Load

HDFS

RDD RDDReduceMap

RDDLoad

.cache()

Persist in Memory