Description
(Abstract from the Strata talk: http://strataconf.com/strata2014/public/schedule/detail/32137) Graph analytics has applications beyond large web-scale organizations. Many computing problems can be efficiently expressed and processed as a graph, leading to useful insights that drive product and business decisions. While you can express graph algorithms as SQL queries in Hive or as Hadoop MapReduce programs, an API designed specifically for graph processing makes many iterative graph computations (such as PageRank, connected components, label propagation, and graph-based clustering) simpler to express and easier to understand. Apache Giraph provides such a native graph processing API, runs on existing Hadoop infrastructure, and can directly access HDFS and/or Hive tables. This talk describes our efforts at Facebook to scale Apache Giraph to very large graphs of up to one trillion edges and how we run Apache Giraph in production. We will also cover several algorithms that we have implemented and their use cases.
Graph Analysis with One Trillion Edges on Apache Giraph
Avery Ching, Facebook · Strata, 2/13/2014
Motivation
Apache Giraph
• Inspired by Google’s Pregel, but runs on Hadoop
• “Think like a vertex”
• Maximum value vertex example
[Diagram: two processors run "think like a vertex" supersteps over time; vertices repeatedly exchange values and keep the maximum, so every vertex converges to the global maximum, 5]
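For illustration, here is a minimal sketch of the maximum-value example as vertex-centric code. It assumes the Giraph 1.1-style BasicComputation API; the class name and Writable type choices are ours, not from the talk.

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class MaxValueComputation extends
    BasicComputation<LongWritable, LongWritable, NullWritable, LongWritable> {
  @Override
  public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
      Iterable<LongWritable> messages) {
    // Every vertex announces its own value in the first superstep.
    boolean changed = getSuperstep() == 0;
    // Keep the largest value seen so far.
    for (LongWritable message : messages) {
      if (message.get() > vertex.getValue().get()) {
        vertex.setValue(new LongWritable(message.get()));
        changed = true;
      }
    }
    // Propagate only when the value improved, then go to sleep; a new
    // message wakes the vertex up again.
    if (changed) {
      sendMessageToAllEdges(vertex, vertex.getValue());
    }
    vertex.voteToHalt();
  }
}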
Giraph on Hadoop / YARN
[Diagram: Giraph runs on both MapReduce and YARN, across Hadoop 0.20.x, 0.20.203, 1.x, and 2.0.x]
Apache Giraph data flow
Loading the graph
[Diagram: an input format produces splits 0-3; the master coordinates while workers 0 and 1 load the splits and send vertices to the workers that own them]
Storing the graph
[Diagram: workers 0 and 1 write their partitions through an output format as parts 0-3]
Compute / Iterate
[Diagram: in each superstep, workers 0 and 1 compute on their in-memory partitions (parts 0-3) and send messages, then report stats to the master, which decides whether to iterate again]
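To make the flow concrete, this is roughly how a Giraph job is launched on an existing Hadoop cluster via GiraphRunner, using the standard -vif/-vip/-vof/-op/-w options; the jar name, paths, and example classes are quickstart-style placeholders, not this talk's production setup.

hadoop jar giraph-examples-jar-with-dependencies.jar \
  org.apache.giraph.GiraphRunner \
  org.apache.giraph.examples.SimpleShortestPathsComputation \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip /user/alice/input/tiny_graph.txt \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /user/alice/output/shortestpaths \
  -w 2

The vertex input format turns HDFS splits into vertices for the workers to load, and the output format writes each worker's partitions back as part files, matching the loading and storing diagrams above.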
Beyond Pregel
• Sharded aggregators
• Master computation
• Composable computation
Use case: k-means clustering
Cluster input vectors into k clusters
• Assign each input vector to the closest centroid
• Update centroid locations based on assignments
[Diagram: three panels with centroids c0, c1, c2: random centroid locations, assignment of each vector to the closest centroid, and updated centroid locations]
k-means in Giraph
Input vectors → vertices
• Partitioned across machines
Centroids → aggregators
• Shared data across all machines (a sketch of the vertex side follows)
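A deliberately simplified sketch of the vertex side, using one-dimensional input vectors so the types stay small; K and the aggregator names are our hypothetical stand-ins, not Facebook's code.

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class KMeansComputation extends
    BasicComputation<LongWritable, DoubleWritable, NullWritable, NullWritable> {
  private static final int K = 3;  // hypothetical cluster count

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
      Iterable<NullWritable> messages) {
    // Find the closest centroid; centroids are shared with every
    // worker through aggregators.
    int closest = 0;
    double closestDistance = Double.MAX_VALUE;
    for (int i = 0; i < K; i++) {
      DoubleWritable centroid = getAggregatedValue("centroid/" + i);
      double d = Math.abs(vertex.getValue().get() - centroid.get());
      if (d < closestDistance) {
        closestDistance = d;
        closest = i;
      }
    }
    // Contribute to the chosen centroid's running sum and count so
    // the master can recompute centroid locations between supersteps.
    aggregate("sum/" + closest, vertex.getValue());
    aggregate("count/" + closest, new LongWritable(1));
  }
}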
Problem solved... right?
[Diagram: partitioning the problem; the input vectors are split across worker 0 and worker 1, but every worker still needs all of the centroids c0, c1, c2]
Problem 1: Massive dimensions
Cluster Facebook members by friendships?
• 1 billion members (dimensions)
• k clusters
Each worker sends the master at most
• 1B dimensions * 2 bytes (enough to count a max of 5K friends) * k clusters = 2k GB
The master receives up to 2k GB * the number of workers
• Saturated network link
• Out of memory (OOM)
Sharded aggregators
[Diagram: master handles all aggregators; workers 0, 1, and 2 each send partial aggregators 0-2 to the master, which computes every final aggregator and sends all of them back to every worker]
[Diagram: aggregators sharded to workers; each worker owns one aggregator, collects its partials, computes the final value, and distributes it to the master and the other workers]
• Share aggregator load across workers
• Future work - tree-based optimizations (not yet a problem)
Problem 2: Edge-cut metric
Good clusters should reduce the number of cut edges
Two phases (a sketch follows below)
• Send your cluster id along all of your out-edges
• Aggregate the count of incoming edges with a different cluster id
Calculate no more than once an hour?
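A hedged sketch of those two phases; the class name and the "edges-cut" aggregator are our stand-ins, and we assume the vertex value holds the cluster id and the master registered a LongSumAggregator under that name.

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class EdgeCutComputation extends
    BasicComputation<LongWritable, LongWritable, NullWritable, LongWritable> {
  @Override
  public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
      Iterable<LongWritable> messages) {
    if (getSuperstep() % 2 == 0) {
      // Phase 1: send our cluster id along every out-edge.
      sendMessageToAllEdges(vertex, vertex.getValue());
    } else {
      // Phase 2: count incoming ids from a different cluster and
      // aggregate, so the master sees the global number of cut edges.
      long cut = 0;
      for (LongWritable neighborCluster : messages) {
        if (neighborCluster.get() != vertex.getValue().get()) {
          cut++;
        }
      }
      aggregate("edges-cut", new LongWritable(cut));
    }
  }
}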
Master computation
Serial computation on the master
• Communicates with workers via aggregators
• Added to Giraph by the Stanford GPS team
[Diagram: timeline of workers 0 and 1 and the master; the workers run k-means supersteps, and the master periodically injects start-cut and end-cut supersteps before resuming k-means]
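A minimal sketch of the master side, assuming the DefaultMasterCompute hooks; the aggregator names match the k-means sketch above, and the recompute logic is illustrative, not the production code.

import org.apache.giraph.aggregators.DoubleSumAggregator;
import org.apache.giraph.aggregators.LongSumAggregator;
import org.apache.giraph.master.DefaultMasterCompute;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;

public class KMeansMasterCompute extends DefaultMasterCompute {
  private static final int K = 3;  // hypothetical cluster count

  @Override
  public void initialize() throws InstantiationException,
      IllegalAccessException {
    for (int i = 0; i < K; i++) {
      registerAggregator("centroid/" + i, DoubleSumAggregator.class);
      registerAggregator("sum/" + i, DoubleSumAggregator.class);
      registerAggregator("count/" + i, LongSumAggregator.class);
    }
  }

  @Override
  public void compute() {
    // Serial logic: recompute each centroid from the per-cluster
    // sum/count the workers aggregated last superstep, then push the
    // new location back to every worker through its aggregator.
    for (int i = 0; i < K; i++) {
      LongWritable count = getAggregatedValue("count/" + i);
      DoubleWritable sum = getAggregatedValue("sum/" + i);
      if (count.get() > 0) {
        setAggregatedValue("centroid/" + i,
            new DoubleWritable(sum.get() / count.get()));
      }
    }
  }
}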
Problem 3: More phases, more problems
Add a stage to initialize the centroids
Add random input vectors to centroids
• Add a few random friends
Two phases
• Randomly sample input vertices to add
• Send messages to a few random neighbors
Cannot easily support different message types and combiners per phase
Vertex compute code is getting messy
Problem 3: (continued)
if (phase == INITIALIZE_SELF) {
  // Randomly add my vector to a centroid
} else if (phase == INITIALIZE_FRIEND) {
  // Add my vector to a centroid if a friend selected me
} else if (phase == K_MEANS) {
  // Do k-means
} else if (phase == START_EDGE_CUT) {
  // ...
}
Composable computation
Decouple the vertex from the computation
The master sets the computation and combiner classes
Reusable and composable
Computation    Add random centroid /    Add to centroid     K-means   Start edge cut     End edge cut
               random friends
In message     Null                     Centroid message    Null      Null               Cluster
Out message    Centroid message         Null                Null      Cluster            Null
Combiner       N/A                      N/A                 N/A       Cluster combiner   N/A
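A hedged sketch of how the master swaps the active computation per phase; the schedule and the computation class names are hypothetical, but they mirror the columns of the table above.

import org.apache.giraph.master.DefaultMasterCompute;

public class PhasedMasterCompute extends DefaultMasterCompute {
  @Override
  public void compute() {
    long step = getSuperstep();
    if (step == 0) {
      // Randomly sample input vertices to seed centroids.
      setComputation(AddRandomCentroidComputation.class);
    } else if (step == 1) {
      // Friends selected in the previous phase add their vectors.
      setComputation(AddToCentroidComputation.class);
    } else if (step % 20 == 0) {
      // Periodically measure the edge cut (two supersteps).
      setComputation(StartEdgeCutComputation.class);
    } else if (step % 20 == 1) {
      setComputation(EndEdgeCutComputation.class);
    } else {
      setComputation(KMeansComputation.class);
    }
  }
}

Because each phase is its own Computation class, every phase declares its own in/out message types and, where useful, its own combiner.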
Composable computation (cont.)
Balanced Label Propagation
• Compute candidates to move to partitions
• Probabilistically move vertices
• Continue if the halting condition is not met (e.g., < n vertices moved?)
Composable computation (cont.)
Affinity Propagation
• Calculate and send responsibilities
• Calculate and send availabilities
• Update exemplars
• Continue until the halting condition is met (e.g., < n vertices changed exemplars?)
Faster than Hive?
Application                     Graph Size    CPU Time Speedup   Elapsed Time Speedup
Page rank (single iteration)    400B+ edges   26x                120x
Friends-of-friends score        71B+ edges    12.5x              48x
Apache Giraph scalability
[Chart: scalability of workers (200B edges); seconds (0-500) vs. number of workers (50-300), Giraph tracking the ideal curve]
[Chart: scalability of edges (50 workers); seconds (0-500) vs. number of edges (1E+09 to 2E+11), Giraph tracking the ideal curve]
A billion edges isn’t cool. You know what’s cool?
A TRILLION edges.
Page rank on 200 machines with 1 trillion (1,000,000,000,000) edges: <4 minutes per iteration!
* Results from 6/30/2013 with one-to-all messaging + request processing improvements
Why balanced partitioning
Random partitioning == good balance, BUT it ignores entity affinity
[Diagram: twelve vertices (0-11) assigned to machines at random, cutting many edges between related entities]
Balanced partitioning application
Results from one service:
cache hit rate grew from 70% to 85%, and bandwidth was cut in half
[Diagram: the same twelve vertices partitioned by affinity, roughly {0, 2, 3, 5, 6, 9, 11} and {1, 4, 7, 8, 10}, so most edges stay within a machine]
Balanced label propagation results
* Loosely based on Ugander and Backstrom. Balanced label propagation for partitioning massive graphs, WSDM '13
Avoiding out-of-core
Example: mutual friends calculation between neighbors
1. Send your friends a list of your friends
2. Intersect with your own friend list
1.23B members (as of 1/2014) * 200+ average friends (2011 S1) * 200+ ids per list * 8 bytes per id (longs)
≈ 394 TB of messages; with 100 GB of memory per machine, that is
~3,940 machines just to hold the messages (not including the graph)
[Diagram: a worked example on a five-vertex graph (A-E); each vertex sends its friend list to its neighbors, then intersects each received list with its own to count mutual friends]
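A hedged sketch of the two supersteps; LongListWritable (an iterable Writable holding a list of ids) and the omitted bookkeeping are hypothetical stand-ins.

import java.util.HashSet;
import java.util.Set;
import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class MutualFriendsComputation extends
    BasicComputation<LongWritable, LongWritable, NullWritable, LongListWritable> {
  @Override
  public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
      Iterable<LongListWritable> messages) {
    if (getSuperstep() == 0) {
      // 1. Send your friends a list of your friends.
      LongListWritable myFriends = new LongListWritable();
      for (Edge<LongWritable, NullWritable> edge : vertex.getEdges()) {
        myFriends.add(edge.getTargetVertexId().get());
      }
      sendMessageToAllEdges(vertex, myFriends);
    } else {
      // 2. Intersect each received list with your own friend list;
      // the intersection size is that pair's mutual friends count.
      Set<Long> myFriends = new HashSet<Long>();
      for (Edge<LongWritable, NullWritable> edge : vertex.getEdges()) {
        myFriends.add(edge.getTargetVertexId().get());
      }
      for (LongListWritable theirFriends : messages) {
        long mutual = 0;
        for (long id : theirFriends) {
          if (myFriends.contains(id)) {
            mutual++;
          }
        }
        // ... record mutual for the sending friend (in practice the
        // message would also carry the sender's id) ...
      }
      vertex.voteToHalt();
    }
  }
}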
Superstep splitting
Process a subset of the source/destination edges per superstep (a sketch follows below)
* Currently manual - future work: automatic!
[Diagram: four sub-supersteps covering every (source half, destination half) pair: sources A on with destinations A on, A with B, B with A, and B with B; in each sub-superstep only the "on" sources send and only the "on" destinations receive]
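A hedged sketch of the manual version: one logical superstep is split into four sub-supersteps by hashing source and destination ids into halves (here just parity; the class name and scheme are ours).

import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class SplitSuperstepComputation extends
    BasicComputation<LongWritable, LongWritable, NullWritable, LongWritable> {
  @Override
  public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
      Iterable<LongWritable> messages) {
    // ... fold in the messages that arrived this sub-superstep ...
    long sub = getSuperstep() % 4;  // which of the 4 sub-supersteps
    long srcHalf = sub / 2;         // which source half is "on"
    long dstHalf = sub % 2;         // which destination half is "on"
    if (vertex.getId().get() % 2 == srcHalf) {
      for (Edge<LongWritable, NullWritable> edge : vertex.getEdges()) {
        if (edge.getTargetVertexId().get() % 2 == dstHalf) {
          // Only about a quarter of the messages are in flight at
          // once, so they fit in memory without going out-of-core.
          sendMessage(edge.getTargetVertexId(), vertex.getValue());
        }
      }
    }
  }
}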
Debugging
Debugging with GiraphicJam
Giraph in Production
Over 1.5 years in production
Over 100 jobs processed a week
30+ applications in our internal application repository
Sample production job - 700B+ edges
Very stable
• Checkpointing disabled (highly loaded HDFS adds instability)
• Retries handle intermittent failures
Giraph roadmap
2/12 - 0.1    5/13 - 1.0    Spring 2014 - 1.1
Relaxing BSP - 1.2?
• Giraph++ (IBM research)
• Giraphx (University at Buffalo, SUNY)
Future work
Evaluate alternative computing models
Performance
Lower the barrier to entry
Applications
Our team
Maja Kabiljo
Greg Malewicz
Pavan Athivarapu
Avery Ching
Sambavi Muthukrishnan