Description
(Abstract from the Strata talk: http://strataconf.com/strata2014/public/schedule/detail/32137) Graph analytics has applications beyond large web-scale organizations. Many computing problems can be efficiently expressed and processed as a graph, leading to useful insights that drive product and business decisions. While you can express graph algorithms as SQL queries in Hive or as Hadoop MapReduce programs, an API designed specifically for graph processing makes many iterative graph computations (such as PageRank, connected components, label propagation, and graph-based clustering) simpler to express and easier to understand. Apache Giraph provides such a native graph processing API, runs on existing Hadoop infrastructure, and can directly access HDFS and/or Hive tables. This talk describes our efforts at Facebook to scale Apache Giraph to very large graphs of up to one trillion edges and how we run Apache Giraph in production. We will also cover several algorithms that we have implemented and their use cases.
Graph Analysis with One Trillion Edges on Apache Giraph
Avery Ching, Facebook · Strata, 2/13/2014
Motivation
Apache Giraph
• Inspired by Google’s Pregel, but runs on Hadoop
• “Think like a vertex”
• Maximum value vertex example
[Diagram: two processors run "think like a vertex" supersteps over time; vertices repeatedly exchange values and keep the maximum, so every vertex converges to the global maximum, 5]
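For illustration, here is a minimal sketch of the maximum-value example as vertex-centric code. It assumes the Giraph 1.1-style BasicComputation API; the class name and Writable type choices are ours, not from the talk.

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class MaxValueComputation extends
    BasicComputation<LongWritable, LongWritable, NullWritable, LongWritable> {
  @Override
  public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
      Iterable<LongWritable> messages) {
    // Every vertex announces its own value in the first superstep.
    boolean changed = getSuperstep() == 0;
    // Keep the largest value seen so far.
    for (LongWritable message : messages) {
      if (message.get() > vertex.getValue().get()) {
        vertex.setValue(new LongWritable(message.get()));
        changed = true;
      }
    }
    // Propagate only when the value improved, then go to sleep; a new
    // message wakes the vertex up again.
    if (changed) {
      sendMessageToAllEdges(vertex, vertex.getValue());
    }
    vertex.voteToHalt();
  }
}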
Giraph on Hadoop / YARN
[Diagram: Giraph runs on both MapReduce and YARN, across Hadoop 0.20.x, 0.20.203, 1.x, and 2.0.x]
Apache Giraph data flow
Loading the graph
[Diagram: an input format produces splits 0-3; the master coordinates while workers 0 and 1 load the splits and send vertices to the workers that own them]
Storing the graph
[Diagram: workers 0 and 1 write their partitions through an output format as parts 0-3]
Compute / Iterate
[Diagram: in each superstep, workers 0 and 1 compute on their in-memory partitions (parts 0-3) and send messages, then report stats to the master, which decides whether to iterate again]
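To make the flow concrete, this is roughly how a Giraph job is launched on an existing Hadoop cluster via GiraphRunner, using the standard -vif/-vip/-vof/-op/-w options; the jar name, paths, and example classes are quickstart-style placeholders, not this talk's production setup.

hadoop jar giraph-examples-jar-with-dependencies.jar \
  org.apache.giraph.GiraphRunner \
  org.apache.giraph.examples.SimpleShortestPathsComputation \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip /user/alice/input/tiny_graph.txt \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /user/alice/output/shortestpaths \
  -w 2

The vertex input format turns HDFS splits into vertices for the workers to load, and the output format writes each worker's partitions back as part files, matching the loading and storing diagrams above.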
Beyond Pregel
• Sharded aggregators
• Master computation
• Composable computation
Use case: k-means clustering
Cluster input vectors into k clusters
• Assign each input vector to the closest centroid
• Update centroid locations based on assignments
[Diagram: three panels with centroids c0, c1, c2: random centroid locations, assignment of each vector to the closest centroid, and updated centroid locations]
k-means in Giraph
Input vectors → vertices
• Partitioned across machines
Centroids → aggregators
• Shared data across all machines (a sketch of the vertex side follows)
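A deliberately simplified sketch of the vertex side, using one-dimensional input vectors so the types stay small; K and the aggregator names are our hypothetical stand-ins, not Facebook's code.

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class KMeansComputation extends
    BasicComputation<LongWritable, DoubleWritable, NullWritable, NullWritable> {
  private static final int K = 3;  // hypothetical cluster count

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
      Iterable<NullWritable> messages) {
    // Find the closest centroid; centroids are shared with every
    // worker through aggregators.
    int closest = 0;
    double closestDistance = Double.MAX_VALUE;
    for (int i = 0; i < K; i++) {
      DoubleWritable centroid = getAggregatedValue("centroid/" + i);
      double d = Math.abs(vertex.getValue().get() - centroid.get());
      if (d < closestDistance) {
        closestDistance = d;
        closest = i;
      }
    }
    // Contribute to the chosen centroid's running sum and count so
    // the master can recompute centroid locations between supersteps.
    aggregate("sum/" + closest, vertex.getValue());
    aggregate("count/" + closest, new LongWritable(1));
  }
}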
Problem solved... right?
[Diagram: partitioning the problem; the input vectors are split across worker 0 and worker 1, but every worker still needs all of the centroids c0, c1, c2]
Problem 1: Massive dimensions
Cluster Facebook members by friendships?
• 1 billion members (dimensions)
• k clusters
Each worker sends the master at most
• 1B dimensions * 2 bytes (enough to count a max of 5K friends) * k clusters = 2k GB
The master receives up to 2k GB * the number of workers
• Saturated network link
• Out of memory (OOM)
Sharded aggregators
[Diagram: master handles all aggregators; workers 0, 1, and 2 each send partial aggregators 0-2 to the master, which computes every final aggregator and sends all of them back to every worker]
[Diagram: aggregators sharded to workers; each worker owns one aggregator, collects its partials, computes the final value, and distributes it to the master and the other workers]
• Share aggregator load across workers
• Future work - tree-based optimizations (not yet a problem)
Problem 2: Edge-cut metric
Good clusters should reduce the number of cut edges
Two phases (a sketch follows below)
• Send your cluster id along all of your out-edges
• Aggregate the count of incoming edges with a different cluster id
Calculate no more than once an hour?
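A hedged sketch of those two phases; the class name and the "edges-cut" aggregator are our stand-ins, and we assume the vertex value holds the cluster id and the master registered a LongSumAggregator under that name.

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class EdgeCutComputation extends
    BasicComputation<LongWritable, LongWritable, NullWritable, LongWritable> {
  @Override
  public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
      Iterable<LongWritable> messages) {
    if (getSuperstep() % 2 == 0) {
      // Phase 1: send our cluster id along every out-edge.
      sendMessageToAllEdges(vertex, vertex.getValue());
    } else {
      // Phase 2: count incoming ids from a different cluster and
      // aggregate, so the master sees the global number of cut edges.
      long cut = 0;
      for (LongWritable neighborCluster : messages) {
        if (neighborCluster.get() != vertex.getValue().get()) {
          cut++;
        }
      }
      aggregate("edges-cut", new LongWritable(cut));
    }
  }
}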
Master computation
Serial computation on the master
• Communicates with workers via aggregators
• Added to Giraph by the Stanford GPS team
[Diagram: timeline of workers 0 and 1 and the master; the workers run k-means supersteps, and the master periodically injects start-cut and end-cut supersteps before resuming k-means]
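A minimal sketch of the master side, assuming the DefaultMasterCompute hooks; the aggregator names match the k-means sketch above, and the recompute logic is illustrative, not the production code.

import org.apache.giraph.aggregators.DoubleSumAggregator;
import org.apache.giraph.aggregators.LongSumAggregator;
import org.apache.giraph.master.DefaultMasterCompute;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;

public class KMeansMasterCompute extends DefaultMasterCompute {
  private static final int K = 3;  // hypothetical cluster count

  @Override
  public void initialize() throws InstantiationException,
      IllegalAccessException {
    for (int i = 0; i < K; i++) {
      registerAggregator("centroid/" + i, DoubleSumAggregator.class);
      registerAggregator("sum/" + i, DoubleSumAggregator.class);
      registerAggregator("count/" + i, LongSumAggregator.class);
    }
  }

  @Override
  public void compute() {
    // Serial logic: recompute each centroid from the per-cluster
    // sum/count the workers aggregated last superstep, then push the
    // new location back to every worker through its aggregator.
    for (int i = 0; i < K; i++) {
      LongWritable count = getAggregatedValue("count/" + i);
      DoubleWritable sum = getAggregatedValue("sum/" + i);
      if (count.get() > 0) {
        setAggregatedValue("centroid/" + i,
            new DoubleWritable(sum.get() / count.get()));
      }
    }
  }
}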
Problem 3: More phases, more problems
Add a stage to initialize the centroids
Add random input vectors to centroids
• Add a few random friends
Two phases
• Randomly sample input vertices to add
• Send messages to a few random neighbors
Cannot easily support different message types and combiners per phase
Vertex compute code is getting messy
Problem 3: (continued)
if (phase == INITIALIZE_SELF) {
  // Randomly add my vector to a centroid
} else if (phase == INITIALIZE_FRIEND) {
  // Add my vector to a centroid if a friend selected me
} else if (phase == K_MEANS) {
  // Do k-means
} else if (phase == START_EDGE_CUT) {
  // ...
}
Composable computation
Decouple the vertex from the computation
The master sets the computation and combiner classes
Reusable and composable
Computation    Add random centroid /    Add to centroid     K-means   Start edge cut     End edge cut
               random friends
In message     Null                     Centroid message    Null      Null               Cluster
Out message    Centroid message         Null                Null      Cluster            Null
Combiner       N/A                      N/A                 N/A       Cluster combiner   N/A
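A hedged sketch of how the master swaps the active computation per phase; the schedule and the computation class names are hypothetical, but they mirror the columns of the table above.

import org.apache.giraph.master.DefaultMasterCompute;

public class PhasedMasterCompute extends DefaultMasterCompute {
  @Override
  public void compute() {
    long step = getSuperstep();
    if (step == 0) {
      // Randomly sample input vertices to seed centroids.
      setComputation(AddRandomCentroidComputation.class);
    } else if (step == 1) {
      // Friends selected in the previous phase add their vectors.
      setComputation(AddToCentroidComputation.class);
    } else if (step % 20 == 0) {
      // Periodically measure the edge cut (two supersteps).
      setComputation(StartEdgeCutComputation.class);
    } else if (step % 20 == 1) {
      setComputation(EndEdgeCutComputation.class);
    } else {
      setComputation(KMeansComputation.class);
    }
  }
}

Because each phase is its own Computation class, every phase declares its own in/out message types and, where useful, its own combiner.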
Composable computation (cont.)
Balanced Label Propagation
• Compute candidates to move to partitions
• Probabilistically move vertices
• Continue if the halting condition is not met (e.g., < n vertices moved?)
Composable computation (cont.)
Affinity Propagation
• Calculate and send responsibilities
• Calculate and send availabilities
• Update exemplars
• Continue until the halting condition is met (e.g., < n vertices changed exemplars?)
Faster than Hive?
Application                     Graph Size    CPU Time Speedup   Elapsed Time Speedup
Page rank (single iteration)    400B+ edges   26x                120x
Friends-of-friends score        71B+ edges    12.5x              48x
Apache Giraph scalability
[Chart: scalability of workers (200B edges); seconds (0-500) vs. number of workers (50-300), Giraph tracking the ideal curve]
[Chart: scalability of edges (50 workers); seconds (0-500) vs. number of edges (1E+09 to 2E+11), Giraph tracking the ideal curve]
A billion edges isn’t cool. You know what’s cool?
A TRILLION edges.
Page rank on 200 machines with 1 trillion (1,000,000,000,000) edges: <4 minutes per iteration!
* Results from 6/30/2013 with one-to-all messaging + request processing improvements
Why balanced partitioning
Random partitioning == good balance, BUT it ignores entity affinity
[Diagram: twelve vertices (0-11) assigned to machines at random, cutting many edges between related entities]
Balanced partitioning application
Results from one service:
cache hit rate grew from 70% to 85%, and bandwidth was cut in half
[Diagram: the same twelve vertices partitioned by affinity, roughly {0, 2, 3, 5, 6, 9, 11} and {1, 4, 7, 8, 10}, so most edges stay within a machine]
Balanced label propagation results
* Loosely based on Ugander and Backstrom. Balanced label propagation for partitioning massive graphs, WSDM '13
Avoiding out-of-core
Example: mutual friends calculation between neighbors
1. Send your friends a list of your friends
2. Intersect with your own friend list
1.23B members (as of 1/2014) * 200+ average friends (2011 S1) * 200+ ids per list * 8 bytes per id (longs)
≈ 394 TB of messages; with 100 GB of memory per machine, that is
~3,940 machines just to hold the messages (not including the graph)
[Diagram: a worked example on a five-vertex graph (A-E); each vertex sends its friend list to its neighbors, then intersects each received list with its own to count mutual friends]
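A hedged sketch of the two supersteps; LongListWritable (an iterable Writable holding a list of ids) and the omitted bookkeeping are hypothetical stand-ins.

import java.util.HashSet;
import java.util.Set;
import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class MutualFriendsComputation extends
    BasicComputation<LongWritable, LongWritable, NullWritable, LongListWritable> {
  @Override
  public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
      Iterable<LongListWritable> messages) {
    if (getSuperstep() == 0) {
      // 1. Send your friends a list of your friends.
      LongListWritable myFriends = new LongListWritable();
      for (Edge<LongWritable, NullWritable> edge : vertex.getEdges()) {
        myFriends.add(edge.getTargetVertexId().get());
      }
      sendMessageToAllEdges(vertex, myFriends);
    } else {
      // 2. Intersect each received list with your own friend list;
      // the intersection size is that pair's mutual friends count.
      Set<Long> myFriends = new HashSet<Long>();
      for (Edge<LongWritable, NullWritable> edge : vertex.getEdges()) {
        myFriends.add(edge.getTargetVertexId().get());
      }
      for (LongListWritable theirFriends : messages) {
        long mutual = 0;
        for (long id : theirFriends) {
          if (myFriends.contains(id)) {
            mutual++;
          }
        }
        // ... record mutual for the sending friend (in practice the
        // message would also carry the sender's id) ...
      }
      vertex.voteToHalt();
    }
  }
}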
Superstep splitting
Process a subset of the source/destination edges per superstep (a sketch follows below)
* Currently manual - future work: automatic!
[Diagram: four sub-supersteps covering every (source half, destination half) pair: sources A on with destinations A on, A with B, B with A, and B with B; in each sub-superstep only the "on" sources send and only the "on" destinations receive]
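A hedged sketch of the manual version: one logical superstep is split into four sub-supersteps by hashing source and destination ids into halves (here just parity; the class name and scheme are ours).

import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class SplitSuperstepComputation extends
    BasicComputation<LongWritable, LongWritable, NullWritable, LongWritable> {
  @Override
  public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
      Iterable<LongWritable> messages) {
    // ... fold in the messages that arrived this sub-superstep ...
    long sub = getSuperstep() % 4;  // which of the 4 sub-supersteps
    long srcHalf = sub / 2;         // which source half is "on"
    long dstHalf = sub % 2;         // which destination half is "on"
    if (vertex.getId().get() % 2 == srcHalf) {
      for (Edge<LongWritable, NullWritable> edge : vertex.getEdges()) {
        if (edge.getTargetVertexId().get() % 2 == dstHalf) {
          // Only about a quarter of the messages are in flight at
          // once, so they fit in memory without going out-of-core.
          sendMessage(edge.getTargetVertexId(), vertex.getValue());
        }
      }
    }
  }
}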
Debugging
Debugging with GiraphicJam
Giraph in Production
Over 1.5 years in production
Over 100 jobs processed a week
30+ applications in our internal application repository
Sample production job - 700B+ edges
Very stable
• Checkpointing disabled (highly loaded HDFS adds instability)
• Retries handle intermittent failures
Giraph roadmap
2/12 - 0.1    5/13 - 1.0    Spring 2014 - 1.1
Relaxing BSP - 1.2?
• Giraph++ (IBM research)
• Giraphx (University at Buffalo, SUNY)
Future work
Evaluate alternative computing models
Performance
Lower the barrier to entry
Applications
Our team
Maja Kabiljo
Greg Malewicz
Pavan Athivarapu
Avery Ching
Sambavi Muthukrishnan