This paper is included in the Proceedings of the 11th USENIX Symposium on
Operating Systems Design and Implementation.
October 6–8, 2014, Broomfield, CO
Open access to the Proceedings of the 11th USENIX Symposium on Operating Systems
Design and Implementation is sponsored by USENIX.
GraphX: Graph Processing in a Distributed Dataflow Framework
Joseph E. Gonzalez, University of California, Berkeley; Reynold S. Xin, University of California, Berkeley, and Databricks;
Ankur Dave, Daniel Crankshaw, and Michael J. Franklin, University of California, Berkeley; Ion Stoica, University of California, Berkeley, and Databricks
GraphX: Graph Processing in a Distributed Dataflow Framework
Joseph E. Gonzalez*, Reynold S. Xin*†, Ankur Dave*, Daniel Crankshaw*, Michael J. Franklin*, Ion Stoica*†
*UC Berkeley AMPLab †Databricks
Abstract
In pursuit of graph processing performance, the systems community has largely abandoned general-purpose distributed dataflow frameworks in favor of specialized graph processing systems that provide tailored programming abstractions and accelerate the execution of iterative graph algorithms. In this paper we argue that many of the advantages of specialized graph processing systems can be recovered in a modern general-purpose distributed dataflow system. We introduce GraphX, an embedded graph processing framework built on top of Apache Spark, a widely used distributed dataflow system. GraphX presents a familiar composable graph abstraction that is sufficient to express existing graph APIs, yet can be implemented using only a few basic dataflow operators (e.g., join, map, group-by). To achieve performance parity with specialized graph systems, GraphX recasts graph-specific optimizations as distributed join optimizations and materialized view maintenance. By leveraging advances in distributed dataflow frameworks, GraphX brings low-cost fault tolerance to graph processing. We evaluate GraphX on real workloads and demonstrate that GraphX achieves an order of magnitude performance gain over the base dataflow framework and matches the performance of specialized graph processing systems while enabling a wider range of computation.
The growing scale and importance of graph data has driven the development of numerous specialized graph processing systems including Pregel, PowerGraph, and many others [7, 9, 37]. By exposing specialized abstractions backed by graph-specific optimizations, these systems can naturally express and efficiently execute iterative graph algorithms like PageRank and community detection on graphs with billions of vertices and edges. As a consequence, graph processing systems typically outperform general-purpose distributed dataflow frameworks like Hadoop MapReduce by orders of magnitude [13, 20].
[Figure 1 labels: Spark (30,000); GraphX (2,500); GAS Pregel API (34); Connected Comp. (20)]
Figure 1: GraphX is a thin layer on top of the Spark general-purpose dataflow framework (lines of code).
While the restricted focus of these systems enables a wide range of system optimizations, it also comes at a cost. Graphs are only part of the larger analytics process which often combines graphs with unstructured and tabular data. Consequently, analytics pipelines (e.g., Figure 11) are forced to compose multiple systems which increases complexity and leads to unnecessary data movement and duplication. Furthermore, in pursuit of performance, graph processing systems often abandon fault tolerance in favor of snapshot recovery. Finally, as specialized systems, graph processing frameworks do not generally enjoy the broad support of distributed dataflow frameworks.
In contrast, general-purpose distributed dataflow frameworks (e.g., MapReduce, Spark, Dryad) expose rich dataflow operators (e.g., map, reduce, group-by, join), are well suited for analyzing unstructured and tabular data, and are widely adopted. However, directly implementing iterative graph algorithms using dataflow operators can be challenging, often requiring multiple stages of complex joins. Furthermore, the general-purpose join and aggregation strategies defined in distributed dataflow frameworks do not leverage the common patterns and structure in iterative graph algorithms and therefore miss important optimization opportunities.
Historically, graph processing systems evolved separately from distributed dataflow frameworks for several reasons. First, the early emphasis on single stage computation and on-disk processing in distributed dataflow frameworks (e.g., MapReduce) limited their applicability to iterative graph algorithms which repeatedly and randomly access subsets of the graph. Second, early distributed dataflow frameworks did not expose fine-grained control over the data partitioning, hindering the application of graph partitioning techniques. However, new in-memory distributed dataflow frameworks (e.g., Spark and Naiad) expose control over data partitioning and in-memory representation, addressing some of these limitations.
Given these developments, we believe there is an opportunity to unify advances in graph processing systems with advances in dataflow systems enabling a single system to address the entire analytics pipeline. In this paper we explore the design of graph processing systems on top of general purpose distributed dataflow systems. We argue that by identifying the essential dataflow patterns in graph computation and recasting optimizations in graph processing systems as dataflow optimizations we can recover the advantages of specialized graph processing systems within a general-purpose distributed dataflow framework. To support this argument we introduce GraphX, an efficient graph processing framework embedded within the Spark distributed dataflow system.
GraphX presents a familiar, expressive graph API (Section 3). Using the GraphX API we implement a variant of the popular Pregel abstraction as well as a range of common graph operations. Unlike existing graph processing systems, the GraphX API enables the composition of graphs with unstructured and tabular data and permits the same physical data to be viewed both as a graph and as collections without data movement or duplication. For example, using GraphX it is easy to join a social graph with user comments, apply graph algorithms, and expose the results as either collections or graphs to other procedures (e.g., visualization or rollup). Consequently, GraphX enables users to adopt the computational pattern (graph or collection) that is best suited for the current task without sacrificing performance or flexibility.
We built GraphX as a library on top of Spark (Figure 1) by encoding graphs as collections and then expressing the GraphX API on top of standard dataflow operators. GraphX requires no modifications to Spark, revealing a general method to embed graph computation within distributed dataflow frameworks and distill graph computation to a specific join–map–group-by dataflow pattern. By reducing graph computation to a specific pattern we identify the critical path for system optimization.
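The join–map–group-by pattern can be illustrated with a minimal sketch in plain Python, with no Spark involved. The toy graph, the function names, and the PageRank-style message (rank divided by out-degree) are assumptions made for illustration and are not taken from the GraphX API.

```python
from collections import defaultdict

# Hypothetical toy graph encoded as two collections.
vertices = {1: 1.0, 2: 1.0, 3: 1.0}        # vertex id -> property (a rank)
edges = [(1, 2), (1, 3), (2, 3), (3, 1)]   # (source id, destination id)


def out_degrees(edges):
    """Group-by on the edge collection: count out-edges per source vertex."""
    deg = defaultdict(int)
    for src, _dst in edges:
        deg[src] += 1
    return dict(deg)


def join_map_group_by(vertices, edges):
    """One round of message passing expressed as join -> map -> group-by."""
    deg = out_degrees(edges)
    # Join: attach the source vertex property to each edge.
    triplets = [(src, dst, vertices[src]) for src, dst in edges]
    # Map: each joined edge emits a message keyed by its destination.
    messages = [(dst, rank / deg[src]) for src, dst, rank in triplets]
    # Group-by: sum the messages addressed to each destination vertex.
    sums = defaultdict(float)
    for dst, msg in messages:
        sums[dst] += msg
    return dict(sums)


if __name__ == "__main__":
    print(join_map_group_by(vertices, edges))
```

Each of the three steps here scans or rebuilds a whole collection, which is the kind of cost that a direct encoding pays on every iteration.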
However, naively encoding graphs as collections and executing iterative graph computation using general-purpose dataflow operators can be slow and inefficient. To achieve performance parity with specialized graph processing systems, GraphX introduces a range of optimizations (Section 4) both in how graphs are encoded as collections and in how the common dataflow operators are executed. Flexible vertex-cut partitioning is used to encode graphs as horizontally partitioned collections and match the state of the art in distributed graph partitioning. GraphX recasts system optimizations developed in the context of graph processing systems as join optimizations (e.g., CSR indexing, join elimination, and join-site specification) and materialized view maintenance (e.g., vertex mirroring and delta updates) and applies these techniques to the Spark dataflow operators. By leveraging logical partitioning and lineage, GraphX achieves low-cost fault tolerance. Finally, by exploiting immutability GraphX reuses indices across graph and collection views and over multiple iterations, reducing memory overhead and improving system performance.
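As a rough illustration of vertex-cut partitioning and vertex mirroring, the sketch below assigns every edge to exactly one partition and then records which partitions need a copy of each vertex. The placement rule, the function names, and the replication-factor metric are hypothetical choices for this example; they are not the partitioners or data structures GraphX itself uses.

```python
from collections import defaultdict


def vertex_cut(edges, num_partitions):
    """Assign each edge to exactly one partition; a vertex may span several."""
    partitions = defaultdict(list)
    for src, dst in edges:
        # Hypothetical placement rule: a deterministic function of the edge.
        p = (src * 31 + dst) % num_partitions
        partitions[p].append((src, dst))
    return dict(partitions)


def mirror_sites(partitions):
    """For each vertex, the set of edge partitions that reference it."""
    sites = defaultdict(set)
    for p, part_edges in partitions.items():
        for src, dst in part_edges:
            sites[src].add(p)
            sites[dst].add(p)
    return dict(sites)


def replication_factor(sites):
    """Average number of partitions holding a copy of each vertex."""
    return sum(len(ps) for ps in sites.values()) / len(sites)


if __name__ == "__main__":
    parts = vertex_cut([(1, 2), (1, 3), (2, 3), (3, 1)], num_partitions=2)
    sites = mirror_sites(parts)
    print(parts, sites, replication_factor(sites))
```

Because edges are never split, the only data that must cross partitions is vertex properties, and a lower replication factor means fewer mirrored copies to ship and keep up to date.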
We evaluate GraphX on real-world graphs and compare against direct implementations of graph algorithms using the Spark dataflow operators as well as implementations using specialized graph processing systems. We demonstrate that GraphX can achieve performance parity with specialized graph processing systems while preserving the advantages of a general-purpose dataflow framework.
In summary, the contributions of this paper are:
1. an integrated graph and collections API which is sufficient to express existing graph abstractions and enable a much wider range of computation.
2. an embedding of vertex-cut partitioned graphs in horizontally partitioned collections and the GraphX API in a small set of general-purpose dataflow operators.
3. distributed join and materialized view optimizations that enable general-purpose distributed dataflow frameworks to execute graph computation at performance parity with specialized graph systems.
4. a large-scale evaluation on real graphs and common benchmarking algorithms comparing GraphX against widely used graph processing systems.
In this section we review the design trade-offs