GraphX: Graph Analytics on Spark Joseph Gonzalez, Reynold Xin, Ion Stoica, Michael Franklin Developed at the UC Berkeley AMPLab AMPCamp: August 29, 2013


Page 1: GraphX : Graph Analytics on Spark

GraphX: Graph Analytics on Spark
Joseph Gonzalez, Reynold Xin, Ion Stoica, Michael Franklin
Developed at the UC Berkeley AMPLab

AMPCamp: August 29, 2013

Page 2: GraphX : Graph Analytics on Spark

Graphs are Essential to Data Mining and Machine Learning

• Identify influential people and information
• Find communities
• Understand people’s shared interests
• Model complex data dependencies

Page 3: GraphX : Graph Analytics on Spark

Predicting Political Bias

[Figure: graph of Post nodes connected to nodes labeled Liberal, Conservative, or unknown (?)]

Conditional Random Field
Belief Propagation

Page 4: GraphX : Graph Analytics on Spark

Triangle Counting
Count the triangles passing through each vertex:

[Figure: example graph with vertices 1, 2, 3, 4]

Measures “cohesiveness” of local community

• More Triangles: Stronger Community
• Fewer Triangles: Weaker Community
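A minimal sketch of per-vertex triangle counting on plain Scala collections (no Spark). The object and method names and the sample edge list in the usage note are hypothetical; the slide's figure does not give its edges.

```scala
object TriangleCount {
  /** Count, for every vertex, the triangles that pass through it.
    * Edges are treated as undirected; self-loops and duplicate edges are ignored. */
  def trianglesPerVertex(edges: Seq[(Int, Int)]): Map[Int, Int] = {
    // Build undirected adjacency sets.
    val pairs = edges
      .filter { case (a, b) => a != b }
      .flatMap { case (a, b) => Seq(a -> b, b -> a) }
    val nbrs: Map[Int, Set[Int]] =
      pairs.groupBy(_._1).map { case (v, ps) => v -> ps.map(_._2).toSet }

    nbrs.map { case (v, ns) =>
      // A triangle through v is a pair of neighbors (a, b), a < b, that are adjacent.
      val count = ns.toSeq.map { a =>
        ns.count(b => a < b && nbrs(a).contains(b))
      }.sum
      v -> count
    }
  }
}
```

For example, `TriangleCount.trianglesPerVertex(Seq((1, 2), (2, 3), (1, 3), (3, 4)))` gives one triangle each for vertices 1, 2, and 3, and none for vertex 4.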

Page 5: GraphX : Graph Analytics on Spark

Collaborative Filtering

[Figure: Ratings connecting Users and Items]

Page 6: GraphX : Graph Analytics on Spark

Many More Graph Algorithms

• Collaborative Filtering
  – Alternating Least Squares
  – Stochastic Gradient Descent
  – Tensor Factorization
  – SVD
• Structured Prediction
  – Loopy Belief Propagation
  – Max-Product Linear Programs
  – Gibbs Sampling
• Semi-supervised ML
  – Graph SSL
  – CoEM
• Graph Analytics
  – PageRank
  – Single Source Shortest Path
  – Triangle-Counting
  – Graph Coloring
  – K-core Decomposition
  – Personalized PageRank
• Classification
  – Neural Networks
  – Lasso
  …

Page 7: GraphX : Graph Analytics on Spark

Structure of Computation

Data-Parallel: Table (Row, Row, Row, Row) → Result
Graph-Parallel: Dependency Graph (Pregel)

Page 8: GraphX : Graph Analytics on Spark

The Graph-Parallel Abstraction
A user-defined Vertex-Program runs on each vertex
Graph constrains interaction along edges

»Using messages (e.g., Pregel [PODC’09, SIGMOD’10])
»Through shared state (e.g., GraphLab [UAI’10, VLDB’12])

Parallelism: run multiple vertex programs simultaneously
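To make the message-passing flavor concrete, here is a small sketch of a synchronous vertex program for single-source shortest paths, written on plain Scala collections. It illustrates the abstraction only and is not how Pregel or GraphLab are implemented; all names are hypothetical, and it assumes non-negative edge weights and that every edge endpoint is in `vertices`.

```scala
object VertexProgramSketch {
  type Id = Int
  final case class Edge(src: Id, dst: Id, weight: Double)

  /** Synchronous, message-passing single-source shortest paths. */
  def shortestPaths(vertices: Set[Id], edges: Seq[Edge], source: Id): Map[Id, Double] = {
    // Vertex state: best known distance from the source.
    var dist: Map[Id, Double] =
      vertices.map(v => v -> (if (v == source) 0.0 else Double.PositiveInfinity)).toMap
    // Vertices whose state changed in the previous superstep.
    var active: Set[Id] = Set(source)

    while (active.nonEmpty) {
      // Send: active vertices message their out-neighbors; interaction is along edges only.
      val messages: Map[Id, Double] = edges
        .filter(e => active.contains(e.src))
        .map(e => e.dst -> (dist(e.src) + e.weight))
        .groupBy(_._1)
        .map { case (v, ms) => v -> ms.map(_._2).reduce((a, b) => math.min(a, b)) }

      // Vertex program: keep the message only if it improves the current state.
      val improved = messages.filter { case (v, d) => d < dist(v) }
      dist = dist ++ improved
      active = improved.keySet
    }
    dist
  }
}
```

In each superstep the per-vertex updates are independent of one another, which is what lets a graph-parallel system run many vertex programs simultaneously.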

Page 9: GraphX : Graph Analytics on Spark

By exploiting graph structure, graph-parallel systems can be orders of magnitude faster.

Page 10: GraphX : Graph Analytics on Spark

Triangle Counting on Twitter

40M Users, 1.4 Billion Links
Counted: 34.8 Billion Triangles

Hadoop [WWW’11]: 1536 Machines, 423 Minutes
GraphLab: 64 Machines, 15 Seconds

1000x Faster

S. Suri and S. Vassilvitskii, “Counting triangles and the curse of the last reducer,” WWW’11

Page 11: GraphX : Graph Analytics on Spark

Specialized Graph Systems

[Figure: specialized graph systems, including Pregel]

Page 12: GraphX : Graph Analytics on Spark

Specialized Graph Systems

1. APIs to capture complex graph dependencies

2. Exploit graph structure to reduce communication and computation

Page 13: GraphX : Graph Analytics on Spark

Why GraphX?


Page 14: GraphX : Graph Analytics on Spark

The Bigger Picture

Time Spent in Data Pipeline

[Figure: bar chart, 0% to 100%, of time spent in Graph Creation, Graph Algorithms, and PostProc. for GraphLab and Hadoop]

Page 15: GraphX : Graph Analytics on Spark
Page 16: GraphX : Graph Analytics on Spark

Vertices

Page 17: GraphX : Graph Analytics on Spark

Edges

Edges

Page 18: GraphX : Graph Analytics on Spark

Limitations of Specialized Graph-Parallel Systems

• No support for Construction & Post Processing
• Not interactive
• Requires maintaining multiple platforms

Spark excels at these!

Page 19: GraphX : Graph Analytics on Spark

GraphX Unifies Data-Parallel and Graph-Parallel Systems

Spark Table API: RDDs, fault-tolerance, and task scheduling
GraphLab Graph API: graph representation and execution

[Figure: pipeline bar, 0% to 100%, spanning Graph Construction, Computation, and Post-Processing]

One system for the entire graph pipeline

Page 20: GraphX : Graph Analytics on Spark

Enable Joining Tables and Graphs

[Figure: pipeline linking Tables and Graphs. Labels: User Data, Product Ratings, Friend Graph, ETL, Product Rec. Graph, Join, Inf., Prod. Rec.]

Page 21: GraphX : Graph Analytics on Spark

The GraphX Resilient Distributed Graph

Vertex table:
Id        Attribute (V)
rxin      (Stu., Berk.)
jegonzal  (PstDoc, Berk.)
franklin  (Prof., Berk)
istoica   (Prof., Berk)

Edge table:
SrcId     DstId     Attribute (E)
rxin      jegonzal  Friend
franklin  rxin      Advisor
istoica   franklin  Coworker
franklin  jegonzal  PI

[Figure: graph over vertices R, J, F, I]
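The two-table representation on this slide can be sketched with plain Scala collections standing in for RDDs. The pairing of rows with attributes below is a best-effort reading of the slide, and the object name and `triplets` helper are hypothetical.

```scala
object PropertyGraphTables {
  type Id = String
  type VAttr = (String, String) // (role, affiliation)

  // Vertex table: (Id, Attribute V).
  val vertices: Seq[(Id, VAttr)] = Seq(
    ("rxin", ("Stu.", "Berk.")),
    ("jegonzal", ("PstDoc", "Berk.")),
    ("franklin", ("Prof.", "Berk")),
    ("istoica", ("Prof.", "Berk"))
  )

  // Edge table: (SrcId, DstId, Attribute E).
  val edges: Seq[(Id, Id, String)] = Seq(
    ("rxin", "jegonzal", "Friend"),
    ("franklin", "rxin", "Advisor"),
    ("istoica", "franklin", "Coworker"),
    ("franklin", "jegonzal", "PI")
  )

  /** Triplets view: each edge joined with the attributes of both endpoints. */
  def triplets: Seq[((Id, VAttr), (Id, VAttr), String)] = {
    val attr = vertices.toMap
    edges.map { case (src, dst, e) => ((src, attr(src)), (dst, attr(dst)), e) }
  }
}
```

Keeping vertices and edges as separate tables means either one can be filtered, mapped, or joined with other tables without touching the other.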

Page 22: GraphX : Graph Analytics on Spark

GraphX API

    class Graph [ V, E ] {
      // Table Views -----------------
      def vertices: RDD[ (Id, V) ]
      def edges: RDD[ (Id, Id, E) ]
      def triplets: RDD[ ((Id, V), (Id, V), E) ]

      // Transformations ------------------------------
      def reverse: Graph[V, E]
      def filterV(p: (Id, V) => Boolean): Graph[V,E]
      def filterE(p: Edge[V,E] => Boolean): Graph[V,E]
      def mapV[T](m: (Id, V) => T ): Graph[T,E]
      def mapE[T](m: Edge[V,E] => T ): Graph[V,T]

      // Joins ----------------------------------------
      def joinV[T](tbl: RDD[(Id, T)]): Graph[(V, Opt[T]), E ]
      def joinE[T](tbl: RDD[(Id, Id, T)]): Graph[V, (E, Opt[T])]

      // Computation ----------------------------------
      def aggregateNeighbors[T](mapF: (Edge[V,E]) => T,
                                reduceF: (T, T) => T,
                                direction: EdgeDir): Graph[T, E]
    }
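A rough in-memory analogue of part of this API may help show what the operators mean. This is a sketch, not GraphX itself: it uses `Map` and `Seq` in place of RDDs, `Option` in place of `Opt`, and a simplified `Edge[E]` that carries only the edge attribute. All of these are assumptions made for illustration.

```scala
object MiniGraph {
  type Id = Long
  final case class Edge[E](src: Id, dst: Id, attr: E)

  final case class Graph[V, E](vertices: Map[Id, V], edges: Seq[Edge[E]]) {
    // Table view: edges joined with both endpoint attributes.
    def triplets: Seq[((Id, V), (Id, V), E)] =
      edges.map(e => ((e.src, vertices(e.src)), (e.dst, vertices(e.dst)), e.attr))

    // Transformations return new graphs; the original is left unchanged.
    def reverse: Graph[V, E] =
      copy(edges = edges.map(e => Edge(e.dst, e.src, e.attr)))

    def mapV[T](m: (Id, V) => T): Graph[T, E] =
      Graph(vertices.map { case (id, v) => id -> m(id, v) }, edges)

    def filterV(p: (Id, V) => Boolean): Graph[V, E] = {
      val kept = vertices.filter { case (id, v) => p(id, v) }
      // Drop edges that lost an endpoint.
      Graph(kept, edges.filter(e => kept.contains(e.src) && kept.contains(e.dst)))
    }

    // Left outer join of the vertex table with an external table.
    def joinV[T](tbl: Map[Id, T]): Graph[(V, Option[T]), E] =
      Graph(vertices.map { case (id, v) => id -> ((v, tbl.get(id))) }, edges)
  }
}
```

With `val g = MiniGraph.Graph(Map(1L -> "a", 2L -> "b"), Seq(MiniGraph.Edge(1L, 2L, "follows")))`, a call such as `g.joinV(Map(1L -> 30))` pairs vertex 1 with `Some(30)` and vertex 2 with `None`.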

Page 23: GraphX : Graph Analytics on Spark

Aggregate Neighbors
Map-Reduce for each vertex

[Figure: graph over vertices A to F. mapF is applied to the edges (A, B) and (A, C), producing a1 and a2; reduceF(a1, a2) yields the value for A]

Page 24: GraphX : Graph Analytics on Spark

Example: Oldest Follower
What is the age of the oldest follower for each user?

    val followerAge = graph.aggNbrs(
      e => e.src.age,  // MapF
      max(_, _),       // ReduceF
      InEdges).vertices

[Figure: follower graph over vertices A to F with ages 23, 42, 30, 19, 75, 16]
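The oldest-follower query can be tried end to end with a small stand-in for aggregateNeighbors on plain Scala collections. The `Triplet` shape, the follower edges, and the assignment of the six ages to vertices A to F are assumptions; the slide does not spell them out.

```scala
object OldestFollower {
  type Id = String
  final case class User(age: Int)
  // An edge together with both endpoint attributes (hypothetical shape).
  final case class Triplet[V, E](srcId: Id, src: V, dstId: Id, dst: V, attr: E)

  sealed trait EdgeDir
  case object InEdges extends EdgeDir
  case object OutEdges extends EdgeDir

  /** Map over the edges adjacent to each vertex, then reduce the results per vertex. */
  def aggregateNeighbors[V, E, T](
      vertices: Map[Id, V],
      edges: Seq[(Id, Id, E)],
      mapF: Triplet[V, E] => T,
      reduceF: (T, T) => T,
      direction: EdgeDir): Map[Id, T] = {
    val mapped: Seq[(Id, T)] = edges.map { case (s, d, e) =>
      val t = Triplet(s, vertices(s), d, vertices(d), e)
      // InEdges delivers the mapped value to the destination, OutEdges to the source.
      val target = direction match {
        case InEdges  => d
        case OutEdges => s
      }
      target -> mapF(t)
    }
    mapped.groupBy(_._1).map { case (v, xs) => v -> xs.map(_._2).reduce(reduceF) }
  }

  // Hypothetical data: an edge (a, b) means "a follows b".
  val users: Map[Id, User] = Map(
    "A" -> User(23), "B" -> User(42), "C" -> User(30),
    "D" -> User(19), "E" -> User(75), "F" -> User(16))
  val follows: Seq[(Id, Id, Unit)] =
    Seq(("B", "A", ()), ("C", "A", ()), ("D", "A", ()), ("E", "F", ()), ("A", "F", ()))

  val followerAge: Map[Id, Int] =
    aggregateNeighbors[User, Unit, Int](
      users, follows, t => t.src.age, (a, b) => math.max(a, b), InEdges)
}
```

With this data, `OldestFollower.followerAge` maps A to 42 and F to 75. Vertices with no followers have no entry, since there is nothing to reduce for them.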

Page 25: GraphX : Graph Analytics on Spark

We can express both Pregel and GraphLab using aggregateNeighbors in 40 lines of code!
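As a rough illustration of the idea, not the actual 40-line implementation, a Pregel-style loop can be layered on a simplified aggregateNeighbors: each superstep gathers messages with one map-reduce over in-edges and then runs the vertex program. Here `mapF` sees only the source vertex state, and all names and signatures are hypothetical.

```scala
object PregelSketch {
  type Id = Int

  /** One map-reduce over in-edges: map each source state, merge results per destination. */
  def aggregateNeighbors[V, M](
      state: Map[Id, V],
      edges: Seq[(Id, Id)],
      mapF: V => M,
      reduceF: (M, M) => M): Map[Id, M] =
    edges
      .map { case (src, dst) => dst -> mapF(state(src)) }
      .groupBy(_._1)
      .map { case (v, ms) => v -> ms.map(_._2).reduce(reduceF) }

  /** Pregel-style loop: gather messages, run the vertex program, stop when nothing changes. */
  def pregel[V, M](initial: Map[Id, V], edges: Seq[(Id, Id)], maxIters: Int)(
      sendMsg: V => M,
      mergeMsg: (M, M) => M,
      vprog: (V, M) => V): Map[Id, V] = {
    var state = initial
    var changed = true
    var iter = 0
    while (changed && iter < maxIters) {
      val msgs = aggregateNeighbors(state, edges, sendMsg, mergeMsg)
      val next = state.map { case (id, v) =>
        id -> msgs.get(id).map(m => vprog(v, m)).getOrElse(v)
      }
      changed = next != state
      state = next
      iter += 1
    }
    state
  }

  /** Connected components by min-label propagation over an undirected edge list. */
  def connectedComponents(vertices: Set[Id], undirected: Seq[(Id, Id)]): Map[Id, Id] = {
    val both = undirected ++ undirected.map(_.swap)
    pregel[Id, Id](vertices.map(v => v -> v).toMap, both, maxIters = vertices.size)(
      sendMsg = label => label,
      mergeMsg = (a, b) => math.min(a, b),
      vprog = (own, msg) => math.min(own, msg))
  }
}
```

For instance, `PregelSketch.connectedComponents(Set(1, 2, 3, 4, 5), Seq((1, 2), (2, 3), (4, 5)))` labels vertices 1, 2, 3 with 1 and vertices 4, 5 with 4.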

Page 26: GraphX : Graph Analytics on Spark

Performance Optimizations

Replicate & co-partition vertices with edges

»GraphLab (PowerGraph) style vertex-cut partitioning

»Minimize communication by avoiding edge data movement in JOINs

In-memory hash index for fast joins
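A small sketch of the vertex-cut idea: edges are assigned to partitions, and each vertex is replicated only to the partitions that hold one of its edges. Hash placement here is a simple stand-in chosen for illustration; the slides do not say which placement strategy GraphX or PowerGraph uses, and all names are hypothetical.

```scala
object VertexCutSketch {
  type Id = Long

  /** Assign each edge to exactly one partition by hashing its endpoints. */
  def partitionEdges(edges: Seq[(Id, Id)], numParts: Int): Map[Int, Seq[(Id, Id)]] =
    edges.groupBy { case (src, dst) => Math.floorMod((src, dst).hashCode, numParts) }

  /** For each vertex, the partitions that hold at least one of its edges.
    * Vertex attributes only need to be shipped to these partitions. */
  def replicas(parts: Map[Int, Seq[(Id, Id)]]): Map[Id, Set[Int]] =
    parts.toSeq
      .flatMap { case (p, es) => es.flatMap { case (s, d) => Seq(s -> p, d -> p) } }
      .groupBy(_._1)
      .map { case (v, ps) => v -> ps.map(_._2).toSet }
}
```

Edges never move once placed; a join with vertex data ships each (usually smaller) vertex attribute to its replica partitions, which is the sense in which edge data movement is avoided.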

Page 27: GraphX : Graph Analytics on Spark

Early Performance

Runtime (in seconds, PageRank for 10 iterations)

System    Runtime (s)
GraphLab  22
GraphX    165
Hadoop    1340

Page 28: GraphX : Graph Analytics on Spark

In Progress Optimizations

Byte-code inspection of user functions
»E.g. if mapF does not need edge data, we can rewrite the query to delay the join

Execution strategies optimizer
»Scan edges randomly accessing vertices
»Scan vertices randomly accessing edges

Page 29: GraphX : Graph Analytics on Spark

Current Implementation

[Figure: layered stack with labels Spark (relational operators), GraphX, Pregel (20), GraphLab (20), PageRank (5), Connected Comp. (10), Shortest Path (10), ALS (40)]

Page 30: GraphX : Graph Analytics on Spark

Demo
Reynold Xin

Page 31: GraphX : Graph Analytics on Spark

Summary

1. Graph-parallel primitives on Spark.
2. Currently slower than GraphLab, but
»No need for specialized systems
»Easier ETL, and easier consumption of output
»Interactive graph data mining
3. Future work will bring performance closer to specialized engines.

Page 32: GraphX : Graph Analytics on Spark

Status

Currently finalizing the APIs
»Feedback wanted: http://bit.ly/graph-api

Also working on improving system performance
Will be part of Spark 0.9

Page 34: GraphX : Graph Analytics on Spark

Backup slides

Page 35: GraphX : Graph Analytics on Spark

Vertex Cut Partitioning

Page 36: GraphX : Graph Analytics on Spark

Vertex Cut Partitioning

Page 37: GraphX : Graph Analytics on Spark

aggregateNeighbors

Page 38: GraphX : Graph Analytics on Spark

aggregateNeighbors

Page 39: GraphX : Graph Analytics on Spark

aggregateNeighbors

Page 40: GraphX : Graph Analytics on Spark

aggregateNeighbors

Page 41: GraphX : Graph Analytics on Spark

Example: Vertex Degree

Page 42: GraphX : Graph Analytics on Spark

Example: Vertex Degree

Page 43: GraphX : Graph Analytics on Spark

Example: Vertex Degree
A: 5, B: 0, C: 0, D: 0, E: 0, F: 0


Page 45: GraphX : Graph Analytics on Spark

Specialized Graph Systems

Pregel: Messaging [PODC’09, SIGMOD’10]
GraphLab: Shared State [UAI’10, VLDB’12]
Many Others: Giraph, Stanford GPS, Signal-Collect, Combinatorial BLAS, BoostPGL, …
