
GraphLab Tutorial


Page 1: GraphLab  Tutorial

Carnegie Mellon University

GraphLab Tutorial

Yucheng Low


Page 2: GraphLab  Tutorial

GraphLab Team

Yucheng Low

Aapo Kyrola

Jay Gu

Joseph Gonzalez

Danny Bickson

Carlos Guestrin

Page 3: GraphLab  Tutorial

Development History

GraphLab 0.5 (2010)
• Internal Experimental Code
• Insanely Templatized

GraphLab 1 (2011)
• Nearly Everything is Templatized
• First Open Source Release (LGPL before June 2011; APL from June 2011 onward)

GraphLab 2 (2012)
• Many Things are Templatized
• Shared Memory: Jan 2012; Distributed: May 2012

Page 4: GraphLab  Tutorial

GraphLab 2 Technical Design Goals

• Improved usability
• Decreased compile time
• As good or better performance than GraphLab 1
• Improved distributed scalability

… other abstraction changes … (come to the talk!)

Page 5: GraphLab  Tutorial

Development History

Ever since GraphLab 1.0, all active development has been open source (APL):

code.google.com/p/graphlabapi/

(Even current experimental code, activated with the --experimental flag on ./configure.)

Page 6: GraphLab  Tutorial

Guaranteed Target Platforms

• Any x86 Linux system with gcc >= 4.2
• Any x86 Mac system with gcc 4.2.1 (OS X 10.5??)

• Other platforms?

… We welcome contributors.

Page 7: GraphLab  Tutorial

Tutorial Outline

• GraphLab in a few slides + PageRank
• Checking out GraphLab v2
• Implementing PageRank in GraphLab v2
• Overview of different GraphLab schedulers
• Preview of Distributed GraphLab v2 (may not work in your checkout!)
• Ongoing work… (as much as time allows)

Page 8: GraphLab  Tutorial

Warning

A preview of code still in intensive development!

Things may or may not work for you!

The interface may still change!

GraphLab 2 still has a number of performance regressions relative to GraphLab 1 that we are ironing out.

Page 9: GraphLab  Tutorial

PageRank Example

Iterate:

$R[i] = \alpha + (1 - \alpha) \sum_{j \in N[i]} R[j] / L[j]$

Where:
• α is the random reset probability
• L[j] is the number of links on page j

(Figure: example link graph over six pages)
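To make the iteration concrete, here is a minimal standalone sketch in plain C++. This is not GraphLab code, and the three-page link graph is made up for illustration:

#include <cstdio>
#include <vector>

// Standalone sketch of the PageRank iteration above.
// links[j] lists the pages that page j links to, so L[j] = links[j].size().
int main() {
  const double ALPHA = 0.15;                     // random reset probability
  std::vector<std::vector<int> > links(3);
  links[0].push_back(1); links[0].push_back(2);  // page 0 links to pages 1, 2
  links[1].push_back(2);                         // page 1 links to page 2
  links[2].push_back(0);                         // page 2 links to page 0

  std::vector<double> R(3, 1.0), next(3);
  for (int iter = 0; iter < 50; ++iter) {
    for (size_t i = 0; i < next.size(); ++i) next[i] = ALPHA;
    for (size_t j = 0; j < links.size(); ++j)
      for (size_t k = 0; k < links[j].size(); ++k)
        next[links[j][k]] += (1 - ALPHA) * R[j] / links[j].size();  // R[j]/L[j]
    R.swap(next);
  }
  for (size_t i = 0; i < R.size(); ++i)
    printf("R[%d] = %.4f\n", (int)i, R[i]);
  return 0;
}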

Page 10: GraphLab  Tutorial


The GraphLab Framework

Graph-Based Data Representation

Update Functions (User Computation)

Scheduler

Consistency Model

Page 11: GraphLab  Tutorial


Data Graph

A graph with arbitrary data (C++ objects) associated with each vertex and edge.

Vertex Data:
• Webpage
• Webpage Features

Edge Data:
• Link weight

Graph:
• Link graph
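For the PageRank example used throughout this tutorial, the data graph types might be declared as follows. A minimal sketch: the field names rank and weight match the functor code later in these slides, but the exact typedef is illustrative rather than taken from the tutorial source:

#include <graphlab.hpp>

// Vertex data: a C++ object attached to each vertex (here, a page's rank).
struct vertex_data {
  double rank;
  vertex_data() : rank(0) { }
};

// Edge data: a C++ object attached to each edge (here, the link weight).
struct edge_data {
  double weight;
  edge_data(double w = 0) : weight(w) { }
};

// The data graph: arbitrary user types parameterize the graph container.
typedef graphlab::graph<vertex_data, edge_data> graph;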

Page 12: GraphLab  Tutorial


The GraphLab Framework

Graph-Based Data Representation

Update Functions (User Computation)

Scheduler

Consistency Model

Page 13: GraphLab  Tutorial

Update Functions

An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex.

pagerank(i, scope) {
  // Get Neighborhood data
  (R[i], W_ij, R[j]) ← scope;

  // Update the vertex data
  R[i] ← α + (1 - α) Σ_{j ∈ N[i]} W_ji R[j];

  // Reschedule Neighbors if needed
  if R[i] changes then
    reschedule_neighbors_of(i);
}

Page 14: GraphLab  Tutorial

14

Dynamic Schedule

(Figure: a shared scheduler queue of vertex tasks a..k; CPU 1 and CPU 2 pull vertices to update in parallel, and updates may push neighbors back into the queue)

The process repeats until the scheduler is empty.
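As a toy illustration of this execution model, here is a single-threaded sketch. This is not GraphLab's actual scheduler (which runs this loop on many CPUs at once); the 4-vertex graph and the updates_left counter standing in for "the data is still changing" are made up:

#include <cstdio>
#include <queue>
#include <set>

// Toy single-threaded rendition of the dynamic schedule: vertices are
// pulled from a queue, updated, and may re-enqueue their neighbors.
struct scheduler {
  std::queue<int> q;
  std::set<int> queued;                  // suppress duplicate entries
  void schedule(int v) { if (queued.insert(v).second) q.push(v); }
  bool empty() const { return q.empty(); }
  int next() { int v = q.front(); q.pop(); queued.erase(v); return v; }
};

int main() {
  const int nbrs[4][2] = { {1,2}, {2,3}, {3,0}, {0,1} };  // toy 4-vertex graph
  int updates_left[4] = { 2, 2, 2, 2 };  // stand-in for "data still changing"
  scheduler sched;
  sched.schedule(0);
  while (!sched.empty()) {               // process repeats until empty
    int v = sched.next();
    printf("update(%d)\n", v);
    if (--updates_left[v] > 0)           // if the update changed the data,
      for (int k = 0; k < 2; ++k)        // reschedule the neighbors
        sched.schedule(nbrs[v][k]);
  }
  return 0;
}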

Page 15: GraphLab  Tutorial

Source Code Interjection 1

Graph, update functions, and schedulers

Page 16: GraphLab  Tutorial

--scope=vertex
--scope=edge

Page 17: GraphLab  Tutorial

Consistency

The apparent trade-off: Consistency vs. “Throughput” (# “iterations” per second)

But the goal of an ML algorithm is to converge: raw updates per second are worthless if races keep the algorithm from converging.

A False Trade-off

Page 18: GraphLab  Tutorial


Ensuring Race-Free Code

How much can computation overlap?
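For example, if two updates whose scopes overlap both touch the same vertex data without synchronization, the result is a data race. A toy C++11 illustration, not GraphLab code:

#include <cstdio>
#include <thread>

// Two "updates" whose scopes overlap on the same vertex: the unsynchronized
// read-modify-write below is a data race, so the final value is unpredictable.
double shared_rank = 0;

void update() {
  for (int i = 0; i < 1000000; ++i)
    shared_rank += 1e-6;     // racy: load, add, store
}

int main() {
  std::thread a(update), b(update);
  a.join(); b.join();
  printf("rank = %f (expected 2.0 if updates never overlapped)\n", shared_rank);
  return 0;
}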

Page 19: GraphLab  Tutorial


The GraphLab Framework

Graph-Based Data Representation

Update Functions (User Computation)

Scheduler

Consistency Model

Page 20: GraphLab  Tutorial

Importance of Consistency

Fast ML algorithm development cycle:

Build → Test → Debug → Tweak Model → (repeat)

Consistency is necessary for the framework to behave predictably and to avoid problems caused by non-determinism. Otherwise, when results are bad: is the execution wrong, or is the model wrong?

Page 21: GraphLab  Tutorial

Full Consistency

Guaranteed safety for all update functions

Page 22: GraphLab  Tutorial

Full Consistency

Parallel updates are only allowed on vertices at least two apart, which reduces opportunities for parallelism.

Page 23: GraphLab  Tutorial

Obtaining More Parallelism

Not all update functions will modify the entire scope!

• Belief Propagation: only uses edge data
• Gibbs Sampling: only needs to read adjacent vertices

Page 24: GraphLab  Tutorial

Edge Consistency

Page 25: GraphLab  Tutorial

Obtaining More Parallelism

“Map” operations, e.g. feature extraction on vertex data

Page 26: GraphLab  Tutorial

Vertex Consistency

Page 27: GraphLab  Tutorial

The GraphLab Framework

Graph-Based Data Representation

Update Functions (User Computation)

Scheduler

Consistency Model

Page 28: GraphLab  Tutorial

Shared Variables

Global aggregation through the Sync operation: a global parallel reduction over the graph data. Synced variables are recomputed at defined intervals while update functions are running. Examples:

Sync: HighestPageRank

Sync: Loglikelihood
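For instance, what a "Sync: HighestPageRank" reduction computes is just a fold over all vertex data. A sequential sketch of that fold, not the GraphLab sync API itself (GraphLab performs the reduction in parallel and re-runs it at the configured interval while updates execute):

#include <algorithm>
#include <vector>

// The value "Sync: HighestPageRank" maintains, written as a plain
// sequential fold over the vertex data.
struct vertex_data { double rank; };

double highest_pagerank(const std::vector<vertex_data>& vertices) {
  double best = 0;
  for (size_t i = 0; i < vertices.size(); ++i)
    best = std::max(best, vertices[i].rank);
  return best;
}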


Page 29: GraphLab  Tutorial

Source Code Interjection 2

Shared variables

Page 30: GraphLab  Tutorial

What can we do with these primitives?

…many many things…

Page 31: GraphLab  Tutorial

Matrix Factorization

Netflix Collaborative Filtering

Alternating Least Squares Matrix Factorization

Model: 0.5 million nodes, 99 million edges

(Figure: bipartite graph of Netflix Users and Movies, factorized with latent dimension d)

Page 32: GraphLab  Tutorial

Netflix Speedup: increasing size of the matrix factorization

Page 33: GraphLab  Tutorial

Video Co-Segmentation

Discover “coherent” segment types across a video (extends Batra et al. ’10)

1. Form super-voxels from the video
2. EM & inference in a Markov random field

Large model: 23 million nodes, 390 million edges

(Figure: speedup plot, GraphLab vs. ideal)

Page 34: GraphLab  Tutorial

Many More

• Tensor Factorization
• Bayesian Matrix Factorization
• Graphical Model Inference/Learning
• Linear SVM
• EM Clustering
• Linear Solvers using GaBP
• SVD
• Etc.

Page 35: GraphLab  Tutorial

Distributed Preview

Page 36: GraphLab  Tutorial

GraphLab 2 Abstraction

Changes (an overview of a couple of them)

(Come to the talk for the rest!)

Page 37: GraphLab  Tutorial

Exploiting Update Functors

(for the greater good)

Page 38: GraphLab  Tutorial

Exploiting Update Functors (for the greater good)

1. Update functors store state.
2. The scheduler schedules update functor instances.
3. Therefore, we can use update functors as a controlled form of asynchronous message passing between vertices!

Page 39: GraphLab  Tutorial

Delta Based Update Functors

struct pagerank : public iupdate_functor<graph, pagerank> {
  double delta;
  pagerank(double d) : delta(d) { }
  void operator+=(pagerank& other) { delta += other.delta; }
  void operator()(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    vdata.rank += delta;                       // apply the incoming rank mass
    if (fabs(delta) > EPSILON) {
      // split this vertex's change in rank evenly among its out-neighbors
      double out_delta = delta * (1 - RESET_PROB) / context.num_out_edges();
      context.schedule_out_neighbors(pagerank(out_delta));
    }
  }
};
// Initial Rank:     R[i] = 0;
// Initial Schedule: pagerank(RESET_PROB);

Page 40: GraphLab  Tutorial

Asynchronous Message Passing

Obviously not all computation can be written this way, but when it can, it can be extremely fast.

Page 41: GraphLab  Tutorial

Factorized Updates

Page 42: GraphLab  Tutorial

PageRank in GraphLab

struct pagerank : public iupdate_functor<graph, pagerank> {
  void operator()(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    double sum = 0;
    foreach(edge_type edge, context.in_edges())
      sum += context.const_edge_data(edge).weight *
             context.const_vertex_data(edge.source()).rank;
    double old_rank = vdata.rank;
    vdata.rank = RESET_PROB + (1 - RESET_PROB) * sum;
    double residual = fabs(vdata.rank - old_rank) / context.num_out_edges();
    if (residual > EPSILON)
      context.reschedule_out_neighbors(pagerank());
  }
};
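To run an update functor like this, the usual driver pattern looks roughly like the following. A hedged sketch only: it assumes the graph typedef from the Data Graph sketch earlier, and while core, schedule_all, and start follow the GraphLab engine-core pattern, the exact names and options may differ in your checkout:

#include <graphlab.hpp>

// Illustrative driver: build the graph, schedule every vertex once,
// then run update functors until the scheduler is empty.
int main(int argc, char** argv) {
  graphlab::core<graph, pagerank> core;
  // ... populate core.graph() with vertices and edges ...
  core.schedule_all(pagerank());   // put every vertex in the scheduler
  core.start();                    // run to completion
  return 0;
}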

Page 43: GraphLab  Tutorial

PageRank in GraphLab

The same update functor, annotated with the three phases it naturally decomposes into:
• The foreach accumulation over context.in_edges(): a parallel “Sum” Gather
• The update of vdata.rank: an atomic single-vertex Apply
• The rescheduling of out-neighbors: a parallel Scatter [Reschedule]

Page 44: GraphLab  Tutorial

Decomposable Update Functors

Decompose update functions into 3 phases:

Gather (user defined): Gather(Y) runs on each edge in the scope of the center vertex Y; the partial results Δ1, Δ2, Δ3, … are combined with a parallel sum into a single accumulator Δ.

Apply (user defined): Apply(Y, Δ) applies the accumulated value Δ to the center vertex.

Scatter (user defined): Scatter(Y) runs on each edge in the scope, updating adjacent edges and vertices.

Page 45: GraphLab  Tutorial

Factorized PageRank

struct pagerank : public iupdate_functor<graph, pagerank> {
  double accum, residual;
  pagerank() : accum(0), residual(0) { }
  void gather(icontext_type& context, const edge_type& edge) {
    accum += context.const_edge_data(edge).weight *
             context.const_vertex_data(edge.source()).rank;
  }
  void merge(const pagerank& other) { accum += other.accum; }
  void apply(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    double old_value = vdata.rank;
    vdata.rank = RESET_PROB + (1 - RESET_PROB) * accum;
    residual = fabs(vdata.rank - old_value) / context.num_out_edges();
  }
  void scatter(icontext_type& context, const edge_type& edge) {
    if (residual > EPSILON)
      context.schedule(edge.target(), pagerank());
  }
};

Page 46: GraphLab  Tutorial

Demo of *everything*

PageRank

Page 47: GraphLab  Tutorial

Ongoing Work

Extensions to improve performance on large graphs (see the GraphLab talk later!!):
• Better distributed graph representation methods
• Possibly better graph partitioning
• Off-core graph storage
• Continually changing graphs

An all-new rewrite of distributed GraphLab (come back in May!)
