Comparing open source implementations of Pregel and related systems. Covers installation of Hadoop and the Pregel-related systems, and experiments with datasets of varying sizes, from very small to very large (large datasets with around 30 million vertices and 50 million edges), on 1-, 4-, and 8-node Amazon EC2 clusters. Four algorithms: PageRank, Shortest Path, k-means, Collaborative Filtering.
Comparison and Evaluation of Open Source Implementations of Pregel and Related Systems
December 2, 2013
Joshua Woo, Prashant Raghav, Vishnu Prathish
David R. Cheriton School of Computer Science, University of Waterloo
Outline
● Motivation
● Our Project
● Setup
● Preliminary Results
● Preliminary Analysis
● In-Progress
● References
Motivation
Recall: Pregel
● Large-scale graph processing system
● Fault-tolerant framework for graph algorithms
● MapReduce for graph operations?
● Vertex-centric model (“think like a vertex”)
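To make the vertex-centric model concrete, here is a minimal single-machine sketch of the superstep loop for SSSP (one of the algorithms benchmarked later). This is illustrative only: real systems such as Giraph or Hama partition vertices across workers and exchange messages over the network; the function and graph encoding here are our own simplification.

```python
def bsp_sssp(edges, source):
    """Single-source shortest paths in a Pregel-like BSP style.

    edges: {vertex: [(neighbor, weight), ...]} for every vertex.
    Each superstep, an active vertex takes the minimum of its incoming
    messages, updates its distance if that improves it, and "sends"
    new candidate distances along its out-edges. Vertices with no
    messages stay halted; the run ends when no messages are in flight.
    """
    dist = {v: float("inf") for v in edges}
    inbox = {source: [0.0]}  # the source starts with distance 0
    while inbox:
        outbox = {}
        for v, msgs in inbox.items():      # one "compute" per active vertex
            best = min(msgs)
            if best < dist[v]:             # improvement: update and message
                dist[v] = best
                for w, weight in edges[v]:
                    outbox.setdefault(w, []).append(best + weight)
        inbox = outbox  # superstep barrier: next round sees only new messages
    return dist
```

The number of supersteps this loop performs corresponds to the SSSP superstep counts reported in the results tables: it grows with the (weighted) path structure of the graph rather than the vertex count.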
Motivation
● Pregel is proprietary
● Many open source graph processing systems
  ○ Pregel clones
  ○ Pregel-inspired
  ○ BSP
Motivation
● Apache Hama
● Signal/Collect
● Apache Giraph
● GPS
● GraphLab
● Phoebus
● GoldenOrb
● HipG
● Mizan
Motivation

System         | Impl. Language | Type
Apache Hama    | Java           | Pure BSP framework
Signal/Collect | Scala          | Pregel-inspired
Apache Giraph  | Java           | Pregel clone
GPS            | Java           | Advanced Pregel clone
GraphLab       | C++            | Pregel-inspired
Phoebus        | Erlang         | Pregel clone
GoldenOrb      | Java           | Pregel clone
HipG           | Java           | Advanced Pregel clone
Mizan          | C++            | Advanced Pregel clone
Motivation
● How do these systems compare?
  ○ In terms of performance (runtime)?
  ○ In terms of memory footprint?
  ○ In terms of network utilization (num. messages)?
  ○ Variables:
    ■ Algorithm
    ■ Graph size (number of vertices)
    ■ Cluster size
Our Project
● Compare at least 3 systems
  ○ Apache Hama - general BSP framework
  ○ Apache Giraph - Hadoop Map-only job, Facebook
  ○ GPS - +dynamic repartitioning, +multi vertex-centric
  ○ Signal/Collect - +edges, +async computations
  ○ GraphLab
  ○ Mizan
Our Project
● Measure the runtime of at least two algorithms on each system
  ○ PageRank
    ■ Fixed number of supersteps = 30
  ○ Single Source Shortest Path (SSSP)
  ○ k-means clustering
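As a sketch of the PageRank variant used in the experiments (a fixed 30 supersteps rather than a convergence test), the update each superstep can be written as follows. The damping factor 0.85 is the conventional default; the slides do not state which value was used.

```python
def pagerank(adj, supersteps=30, d=0.85):
    """Vertex-centric PageRank for a fixed number of supersteps.

    adj: {vertex: [out-neighbors]} for every vertex.
    Each superstep, every vertex distributes rank/outdegree to its
    out-neighbors (the "messages"), then recomputes its rank from
    the messages it received.
    """
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(supersteps):
        incoming = {v: 0.0 for v in adj}
        for v, outs in adj.items():
            if outs:
                share = rank[v] / len(outs)
                for w in outs:          # send rank share along out-edges
                    incoming[w] += share
        # Superstep barrier, then every vertex applies the PageRank formula.
        rank = {v: (1 - d) / n + d * incoming[v] for v in adj}
    return rank
```

Because the superstep count is fixed, every run performs the same amount of work per vertex, which makes runtimes comparable across systems and graph sizes. (Note this simple sketch lets dangling vertices leak rank mass; production systems handle that case.)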
Setup
● Experiments on AWS
  ○ Ubuntu 12.04 m1.medium EC2 instances
    ■ 2 ECUs, 1 vCPU, 3.7 GiB memory, moderate network performance
    ■ 8 GiB EBS volume per instance
  ○ Cluster sizes:
    ■ Single-node cluster
    ■ 4-node cluster
    ■ 8-node cluster
Setup
● Experiments on AWS
  ○ 5 runs per dataset per algorithm per cluster
    ■ 35 runs per algorithm per cluster
    ■ 70 runs per cluster
    ■ 140 runs in total (single-node, 4-node)
● TODO: another 70 runs (8-node)
Setup
● Dataset
  ○ 7 datasets
    ■ tinyEWD: 8 vertices, 15 edges
    ■ mediumEWD: 250 vertices, 2,546 edges
    ■ 1000EWD: 1,000 vertices, 16,866 edges
    ■ rome99: 3,353 vertices, 8,870 edges
    ■ 10000EWD: 10,000 vertices, 16,866 edges
    ■ NYC: 264,346 vertices, 733,846 edges
    ■ largeEWD: 1,000,000 vertices, 15,172,126 edges
  ○ Source: http://algs4.cs.princeton.edu/44sp/
Setup
● Systems
  ○ Hama
    ■ Hadoop 1.0.3
    ■ Hama 0.6.3
  ○ Giraph
    ■ Hadoop 0.20.203rc1
    ■ Giraph (trunk@37bc2c80564b45d7e4ce95db76f5411a6b8bdb3a)
  ○ GPS
    ■ Hadoop 0.20.203rc1
    ■ GPS (trunk@Revision 112)
Setup
● Input Graph
  ○ Source files converted into a format suitable for each system
    ■ Time for this conversion excluded from results:
      ● Conversion done before algorithms are run (pre-processing?)
      ● Negligible for largeEWD (1,000,000 vertices, 15,172,126 edges)
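The conversion step above can be sketched as follows. The algs4 edge-weighted-digraph files begin with a vertex count and an edge count, followed by one "src dst weight" triple per line; the per-vertex output layout below is hypothetical, since each system (Hama, Giraph, GPS) defines its own input format.

```python
def convert_ewd(lines):
    """Convert algs4 EWD edge-list lines to one adjacency line per vertex.

    Input : ["<V>", "<E>", "src dst weight", ...]
    Output: ["<id>\t<dst>:<w>,<dst>:<w>,...", ...], one line per vertex
            (an illustrative target layout, not any system's exact format).
    """
    it = iter(lines)
    num_vertices = int(next(it))
    num_edges = int(next(it))
    adj = {v: [] for v in range(num_vertices)}
    for _ in range(num_edges):
        src, dst, weight = next(it).split()
        adj[int(src)].append((int(dst), float(weight)))
    return [
        "%d\t%s" % (v, ",".join("%d:%g" % (d, w) for d, w in nbrs))
        for v, nbrs in sorted(adj.items())
    ]
```

Emitting a line for every vertex, including those with no out-edges, matters here: vertex-centric systems generally need every vertex declared in the input, not just those that appear as edge sources.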
Preliminary Results
Dataset   | Hama     | Giraph | GPS
tinyEWD   | 14.17    | 41.60  | 14.40
mediumEWD | 16.36    | 44.00  | 36.00
1000EWD   | 18.06    | 48.80  | 46.60
rome99    | 22.95    | 66.00  | 50.00
10000EWD  | 25.32    | 67.40  | 55.00
NYC       | 165.01   | 267.00 | 310.00
largeEWD  | 6,109.20 | 602.80 | 618.70
Average SSSP runtime on 4-node cluster (in seconds)
Preliminary Results

[Chart: SSSP runtime vs. graph size (num. vertices)]
Preliminary Results
Dataset   | Hama     | Giraph   | GPS
tinyEWD   | 29.36    | 49.40    | 58.57
mediumEWD | 30.26    | 53.40    | 60.42
1000EWD   | 37.86    | 54.60    | 61.03
rome99    | 29.35    | 56.20    | 61.80
10000EWD  | 302.33   | 61.80    | 64.80
NYC       | 1,001.24 | 134.40   | 68.69
largeEWD  | Failed   | 2,100.00 | 1,213.56
Average PageRank (30 supersteps) runtime on 4-node cluster (in seconds)
Preliminary Results

[Chart: PageRank runtime vs. graph size (num. vertices)]
Preliminary Analysis
● A point of resource crunch
  ○ No significant change in performance until a point
● Hama does not scale well (vertices ~10^4)
● Giraph and GPS scale better
● In general, PageRank runtime > SSSP runtime
● GPS input reader does not guarantee true partitioning for large datasets
● Which ‘knobs’ to keep constant? - Optimization vs. Comparability
In-Progress
● Output validation
● Memory footprint
● Network utilization (num. messages)
● GraphLab and Signal/Collect
● Green-Marl?
  ○ (DSL) → [Compiler] → (Giraph, GPS)
Questions?
Extras
Preliminary Results
Dataset   | Hama | Giraph | GPS
tinyEWD   | 10   | 7      | 7
mediumEWD | 16   | 13     | 18
1000EWD   | 27   | 25     | 23
rome99    | 105  | 102    | 18
10000EWD  | 85   | 80     | 64
NYC       | 671  | 905    | 438
largeEWD  | 806  | 670    | 730
Number of supersteps for SSSP
Preliminary Results

[Chart: Number of supersteps for SSSP]
Really, really Preliminary

PageRank runtime (in seconds) on GPS: native vs. Green-Marl generated
Dataset   | Native   | Green-Marl generated
tinyEWD   | 58.57    | 60.20
mediumEWD | 60.42    | 60.11
1000EWD   | 61.03    | 62.30
rome99    | 61.80    | 62.32
10000EWD  | 64.80    | 65.78
NYC       | 68.69    | 71.34
largeEWD  | 1,213.56 | -
Really, really Preliminary

[Chart: PageRank runtime (in seconds) on GPS: native vs. Green-Marl generated]
References
● Our Project Proposal
● http://algs4.cs.princeton.edu/44sp/
● https://github.com/apache/hadoop-common
● https://github.com/apache/giraph
● https://subversion.assembla.com/svn/phd-projects/gps/trunk/
● http://ppl.stanford.edu/main/green_marl.html