43
GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and Ion Stoica IPDPS 2014 *These slides are best viewed in PowerPoint with an

GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

Embed Size (px)

Citation preview

Page 1: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

GraphX: Unifying Table and Graph Analytics

Presented by Joseph Gonzalez

Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and Ion Stoica

IPDPS 2014

*These slides are best viewed in PowerPoint with animation.

Page 2: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

Graphs are Central to Analytics

Raw Wikipedia

< / >< / >< / >XML

Hyperlinks PageRank Top 20 PagesTitle PR

TextTableTitle Body

Topic Model(LDA) Word Topics

Word

Topic

Editor GraphCommunityDetection

User Community

UserCom

.

Term-DocGraph

DiscussionTableUser Disc.

CommunityTopic

TopicCom

.

Page 3: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

Update ranks in parallel

Iterate until convergence

Rank of user i Weighted sum of

neighbors’ ranks

3

PageRank: Identifying Leaders

Page 4: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

Low-Rank Matrix Factorization:

5

r13

r14

r24

r25

f(1)

f(2)

f(3)

f(4)

f(5)

Use

r Fa

ctors

(U

)

Movie

Facto

rs (M)

Use

rs

MoviesNetflix

Use

rs≈x

Movies

f(i)

f(j)

Iterate:

Recommending Products

Page 5: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

The Graph-Parallel Pattern

6

Model / Alg. State

Computation depends only on the

neighbors

Page 6: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

Many Graph-Parallel Algorithms

• Collaborative Filtering– Alternating Least

Squares– Stochastic Gradient

Descent– Tensor Factorization

• Structured Prediction– Loopy Belief

Propagation– Max-Product Linear

Programs– Gibbs Sampling

• Semi-supervised ML

– Graph SSL – CoEM

• Community Detection– Triangle-Counting– K-core Decomposition– K-Truss

• Graph Analytics– PageRank– Personalized PageRank– Shortest Path– Graph Coloring

• Classification– Neural Networks

7

MACHINE LEARNING

SOCIAL NETWORK ANALYSIS

GRAPH ALGORITHMS

Page 7: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

Graph-Parallel Systems

8

Pregeloogle

Expose specialized APIs to simplify graph programming.

Page 8: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

9

“Think like a Vertex.”

- Pregel [SIGMOD’10]

Page 9: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

10

The Pregel (Push) Abstraction

Vertex-Programs interact by sending messages.

iPregel_PageRank(i, messages) : // Receive all the messages total = 0 foreach( msg in messages) : total = total + msg

// Update the rank of this vertex R[i] = 0.15 + total

// Send new messages to neighbors foreach(j in out_neighbors[i]) : Send msg(R[i]) to vertex j

Malewicz et al. [PODC’09, SIGMOD’10]

Page 10: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

The GraphLab (Pull) Abstraction

Vertex Programs directly access adjacent vertices and edgesGraphLab_PageRank(i) // Compute sum over neighbors total = 0 foreach( j in neighbors(i)): total = total + R[j] * wji

// Update the PageRank R[i] = 0.15 + total

11

R[4] * w41

R[3]

* w

31

R[2] * w21

++

4 1

3 2

Data movement is managed by the system and not the user.

Page 11: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

Barrie

rIterative Bulk Synchronous

ExecutionCompute Communicate

Page 12: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

Graph-Parallel Systems

13

Pregeloogle

Expose specialized APIs to simplify graph programming.

Exploit graph structure to achieve orders-of-magnitude performance

gains over more general data-parallel systems.

Page 13: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

PageRank on the Live-Journal Graph

GraphLab

Naïve Spark

Mahout/Hadoop

0 200 400 600 800 1000 1200 1400 1600

22

354

1340

Runtime (in seconds, PageRank for 10 iter-ations)

Spark is 4x faster than HadoopGraphLab is 16x faster than Spark

Page 14: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

Counted: 34.8 Billion Triangles

15

Triangle Counting on Twitter

64 Machines15 SecondsGraphLab

1536 Machines423 Minutes

Hadoop[WWW’11]

S. Suri and S. Vassilvitskii, “Counting triangles and the curse of the last reducer,” WWW’11

1000 x Faster

40M Users, 1.4 Billion Links

Page 15: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

Graph Analytics Pipeline

Raw Wikipedia

< / >< / >< / >XML

Hyperlinks PageRank Top 20 PagesTitle PR

TextTableTitle Body

Topic Model(LDA) Word Topics

Word

Topic

Editor GraphCommunityDetection

User Community

UserCom

.

Term-DocGraph

DiscussionTableUser Disc.

CommunityTopic

TopicCom

.

Page 16: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

Tables

Raw Wikipedia

< / >< / >< / >XML

Hyperlinks PageRank Top 20 PagesTitle PR

TextTableTitle Body

Topic Model(LDA) Word Topics

Word

Topic

Editor GraphCommunityDetection

User Community

UserCom

.

Term-DocGraph

DiscussionTableUser Disc.

CommunityTopic

TopicCom

.

Page 17: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

Graphs

Raw Wikipedia

< / >< / >< / >XML

Hyperlinks PageRank Top 20 PagesTitle PR

TextTableTitle Body

Topic Model(LDA) Word Topics

Word

Topic

Editor GraphCommunityDetection

User Community

UserCom

.

Term-DocGraph

DiscussionTableUser Disc.

CommunityTopic

TopicCom

.

Page 18: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

Separate Systems to Support Each View

Table View Graph View

Dependency Graph

Pregel

Table

Result

Row

Row

Row

Row

Page 19: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

Having separate systems for each view is

difficult to use and inefficient

20

Page 20: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

Difficult to Program and Use

Users must Learn, Deploy, and Manage multiple systems

Leads to brittle and often complex interfaces

21

Page 21: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

Inefficient

22

Extensive data movement and duplication across

the network and file system

< / >< / >< / >XML

HDFS HDFS HDFS HDFS

Limited reuse internal data-structures across stages

Page 22: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

GraphX Solution: Tables and Graphs are

views of the same physical data

GraphX Unified

Representation

Graph ViewTable View

Each view has its own operators that

exploit the semantics of the view to achieve efficient execution

Page 23: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

Graphs Relational Algebra

1. Encode graphs as distributed tables

2. Express graph computation in relational algebra

3. Recast graph systems optimizations as:1. Distributed join optimization2. Incremental materialized maintenanceIntegrate Graph and Table data

processing systems.

Achieve performance parity

with specialized systems.

Page 24: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

Part. 2

Part. 1

Vertex Table

B C

A D

F E

A D

Distributed Graphs as Distributed Tables

D

Property Graph

B C

D

E

AA

F

Edge Table

A B

A C

C D

B C

A E

A F

E F

E D

B

C

D

E

A

F

Routing

Table

B

C

D

E

A

F

1

2

1 2

1 2

1

2

2D Vertex Cut Heuristic

Page 25: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

Table OperatorsTable operators are inherited from Spark:

26

map

filter

groupBy

sort

union

join

leftOuterJoin

rightOuterJoin

reduce

count

fold

reduceByKey

groupByKey

cogroup

cross

zip

sample

take

first

partitionBy

mapWith

pipe

save

...

Page 26: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

class Graph [ V, E ] { def Graph(vertices: Table[ (Id, V) ], edges: Table[ (Id, Id, E) ])

// Table Views -----------------def vertices: Table[ (Id, V) ]def edges: Table[ (Id, Id, E) ]def triplets: Table [ ((Id, V), (Id, V), E) ]// Transformations ------------------------------def reverse: Graph[V, E]def subgraph(pV: (Id, V) => Boolean,

pE: Edge[V,E] => Boolean): Graph[V,E]def mapV(m: (Id, V) => T ): Graph[T,E] def mapE(m: Edge[V,E] => T ): Graph[V,T]// Joins ----------------------------------------def joinV(tbl: Table [(Id, T)]): Graph[(V, T), E ]def joinE(tbl: Table [(Id, Id, T)]): Graph[V, (E, T)]// Computation ----------------------------------def mrTriplets(mapF: (Edge[V,E]) => List[(Id, T)],

reduceF: (T, T) => T): Graph[T, E]}

Graph Operators

27

Page 27: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

Triplets Join Vertices and Edges

The triplets operator joins vertices and edges:

The mrTriplets operator sums adjacent triplets.SELECT t.dstId, reduce( map(t) ) AS sum FROM triplets AS t GROUPBY t.dstId

TripletsVertices

B

A

C

D

Edges

A B

A C

B C

C D

A BA

B A C

B C

C D

SELECT s.Id, d.Id, s.P, e.P, d.PFROM edges AS eJOIN vertices AS s, vertices AS dON e.srcId = s.Id AND e.dstId = d.Id

Page 28: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

F

E

Example: Oldest Follower

D

B

A

CCalculate the number of older followers for each user?val olderFollowerAge = graph .mrTriplets( e => // Map if(e.src.age < e.dst.age) { (e.srcId, 1) else { Empty } , (a,b) => a + b // Reduce ) .vertices

23 42

30

19 75

1629

Page 29: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

We express enhanced Pregel and GraphLab

abstractions using the GraphX operators

in less than 50 lines of code!

30

Page 30: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

31

Enhanced Pregel in GraphX

Malewicz et al. [PODC’09, SIGMOD’10]

pregelPR(i, messageList ):

// Receive all the messagestotal = 0foreach( msg in messageList) : total = total + msg

// Update the rank of this vertexR[i] = 0.15 + total

// Send new messages to neighborsforeach(j in out_neighbors[i]) : Send msg(R[i]/E[i,j]) to vertex

Require Message Combiners

messageSum

messageSum

Remove Message

Computationfrom the

Vertex Program

sendMsg(ij, R[i], R[j], E[i,j]): // Compute single message return msg(R[i]/E[i,j])

combineMsg(a, b): // Compute sum of two messages return a + b

Page 31: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

32

PageRank in GraphX

// Load and initialize the graph

val graph = GraphBuilder.text(“hdfs://web.txt”)

val prGraph = graph.joinVertices(graph.outDegrees)

// Implement and Run PageRank

val pageRank =

prGraph.pregel(initialMessage = 0.0, iter = 10)(

(oldV, msgSum) => 0.15 + 0.85 * msgSum,

triplet => triplet.src.pr / triplet.src.deg,

(msgA, msgB) => msgA + msgB)

Page 32: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

Join EliminationIdentify and bypass joins for unused triplet fields

33

0 2 4 6 8 10 12 14 16 18 200

5000

10000

15000

PageRank on TwitterThree Way Join

Join Elimination

IterationCom

mu

nic

ati

on

(M

B)

Factor of 2 reduction in communication

sendMsg(ij, R[i], R[j], E[i,j]): // Compute single message return msg(R[i]/E[i,j])

Page 33: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

We express the Pregel and GraphLab like

abstractions using the GraphX operators

in less than 50 lines of code!

34

By composing these operators we canconstruct entire graph-analytics

pipelines.

Page 34: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

Example Analytics Pipeline

// Load raw data tables

val verts = sc.textFile(“hdfs://users.txt”).map(parserV)

val edges = sc.textFile(“hdfs://follow.txt”).map(parserE)

// Build the graph from tables and restrict to recent links

val graph = new Graph(verts, edges)

val recent = graph.subgraph(edge => edge.date > LAST_MONTH)

// Run PageRank Algorithm

val pr = graph.PageRank(tol = 1.0e-5)

// Extract and print the top 25 users

val topUsers = verts.join(pr).top(25).collect

topUsers.foreach(u => println(u.name + ‘\t’ + u.pr))

Page 35: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

The GraphX Stack(Lines of Code)

GraphX (3575)

Spark

Pregel (28) + GraphLab (50)

PageRank (5)

Connected Comp. (10)

Shortest Path (10)

ALS(40) LDA

(120)

K-core(51)

Triangle

Count(45)

SVD(40)

Page 36: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

Performance Comparisons

GraphLab

Giraph

Mahout/Hadoop

0 200 400 600 800 1000 1200 1400 1600

22

68

207

354

1340

Runtime (in seconds, PageRank for 10 iter-ations)

GraphX is roughly 3x slower than GraphLab

Live-Journal: 69 Million Edges

Page 37: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

GraphX scales to larger graphs

GraphLab

GraphX

Giraph

0 200 400 600 800

203

451

749

Runtime (in seconds, PageRank for 10 iter-ations)

GraphX is roughly 2x slower than GraphLab»Scala + Java overhead: Lambdas, GC time, …»No shared memory parallelism: 2x increase in comm.

Twitter Graph: 1.5 Billion Edges

Page 38: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

PageRank is just one stage….

What about a pipeline?

Page 39: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

HDFSHDFS

ComputeSpark PreprocessSpark Post.

A Small Pipeline in GraphX

Timed end-to-end GraphX is faster than GraphLab

Raw Wikipedia

< / >< / >< / >XML

Hyperlinks PageRank Top 20 Pages

GraphLab + Spark

Giraph + Spark

0 200 400 600 800 1000 1200 1400 1600

342

1492

Total Runtime (in Seconds)

605

375

Page 40: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

StatusPart of Apache Spark

In production at several large technology companies

Page 41: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

GraphX: Unified Analytics

Enabling users to easily and efficiently express the entire graph

analytics pipeline

New APIBlurs the distinction between Tables and

Graphs

New SystemCombines Data-Parallel Graph-

Parallel Systems

Page 42: GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and

A Case for Algebra in Graphs

A standard algebra is essential for graph systems:

• e.g.: SQL proliferation of relational system

By embedding graphs in relational algebra:

• Integration with tables and preprocessing

• Leverage advances in relational systems

• Graph opt. recast to relational systems opt.