24
GraphX : Graph Processing in a Distributed Dataflow Framework OSDI 2014 Bidyut Hota

GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (

  • Upload
    others

  • View
    13

  • Download
    0

Embed Size (px)

Citation preview

Page 1: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (

GraphX : Graph Processing in a Distributed Dataflow Framework

OSDI 2014

BidyutHota

Page 2: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (

Agenda

•  Analytics space background

•  Motivation

•  Goal

•  Approach

•  Optimizations

•  Results

•  Flaws/Limitations

•  Questions

Page 3: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (

Real life Analytics Pipeline

Link Table Page Rank Desired results Raw data

Eg. Google Knowledge graph :570MVertices, 18B Edges ( as in Mid 2017)

Page 4: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (

Real life Analytics Pipeline

Link Table Page Rank Desired results Raw data

Tables

Page 5: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (

Real life Analytics Pipeline

Link Table Page Rank Desired results Raw data

Graphs

Page 6: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (

Systems landscape

Page 7: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (

Motivation

• Currently separate systems exist to compute on these data representation. • Ability to combine data

sources. • Enhance dataflow

frameworks to leverage inherent positives.

Page 8: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (

Current drawbacks of dataflow frameworks

•  Implementing iterative algorithms -> requires multiple stages of complex joins. • Do not cover common patterns in

graph algorithms -> Room for optimization. • Unlike Spark, no fine grained control

of data partitioning.

Page 9: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (

Current drawbacks of specialized systems

• Lacking ability for combining graphs with unstructured or tabular data • Systems favoring snapshot recovery

rather than fault tolerance like in Spark

Page 10: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (

What can we leverage?

•  Immutability of RDD’s • Reusing indices across graph and

collection views over iterations. •  Increase in performance

Page 11: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (

Goal

• General purpose distributed frameworks for graph computations • Comparable performances to

specialized graph processing systems

Page 12: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (

Approach

•  Unifies Tabular view and Graph view

•  Imbibe the best of specialized systems

•  Graph representation on dataflow frameworks

•  Optimizations •  Develop GraphX API on top of Spark

Page 13: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (

Graph approach: Page Rank example

•  Eg. Page Rank algorithm • Graph parallel abstraction • Define a vertex program •  Terminate when vertex programs vote to halt

Figure : PageRank in Pregel

Page 14: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (

Approach

• GAS (Gather Apply Scatter)

How to apply this in dataflow frameworks? • Map, group-by, join dataflow operators

Page 15: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (

Representing Property graphs as Tables

Never transfer edges

Page 16: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (

GraphX API

Page 17: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (

Using the dataflow operators

Logical representation Join of vertices table on edges table

Page 18: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (

Using the dataflow operators on vertex program

Userdefined

Page 19: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (

Optimizations

SpecializedDataStructure Vertex-cutPartitioning Remotecaching

ActiveSetTracking

Page 20: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (

Implementing Optimizations

• Reusable Hash index •  Sequential scan or clustered scan based on active set (Dynamic) •  Incremental updates • Automatic Join elimination

Additional optimizations: • Memory based shuffle • Batching and columnar structure • Variable Integer encoding

Page 21: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (

Results

Page 22: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (

Results

Scaling for PageRank on Twitter dataset

Effect of partitioning on communication

Page 23: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (

Current Flaws •  Is not optimized for dynamic graphs. • Requires incremental updates to

routing table. •  Is not designed for streaming

applications.

• Asynchronous graph computation not available. This is where Naiad will outperform.

Page 24: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (

Questions