Carnegie Mellon
GraphLab: A New Framework for Parallel Machine Learning
Yucheng Low, Aapo Kyrola, Carlos Guestrin, Joseph Gonzalez, Danny Bickson, Joe Hellerstein
Exponential Parallelism
[Figure: processor speed (GHz, log scale, 0.01 to 10) vs. release date, 1988 to 2010. Sequential performance increased exponentially until the mid-2000s, then flattened to constant; parallel performance continues to increase exponentially.]
13 million Wikipedia pages. 3.6 billion photos on Flickr.
Parallel Programming is Hard
Designing efficient parallel algorithms is hard:
Race conditions and deadlocks
Parallel memory bottlenecks
Architecture-specific concurrency
Difficult to debug
ML experts repeatedly address the same parallel design challenges.
Avoid these problems by using high-level abstractions.
(Or: graduate students.)
MapReduce – Map Phase
Embarrassingly parallel, independent computation across CPU 1 ... CPU 4. No communication needed.
[Figure: each CPU independently produces values such as 12.9, 42.3, 21.3, 25.8.]
MapReduce – Reduce Phase
Fold/Aggregation: CPU 1 and CPU 2 combine the mapped values.
[Figure: two CPUs aggregating the mapped values.]
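The two phases above can be sketched in a few lines of Python (an illustration of the pattern only, not Hadoop's API): an independent map over records on a pool of workers, followed by a fold over the results.

```python
from functools import reduce
from multiprocessing.dummy import Pool  # thread pool; stand-in for a cluster

def mapper(record):
    # any independent, communication-free computation per record
    return record * record

def reducer(acc, value):
    # fold/aggregation: combine mapped values into one result
    return acc + value

records = [12.9, 42.3, 21.3, 25.8]
with Pool(4) as pool:                     # "CPU 1 ... CPU 4"
    mapped = pool.map(mapper, records)    # map phase: fully parallel
total = reduce(reducer, mapped, 0.0)      # reduce phase: fold
```

Because each `mapper` call touches only its own record, the map phase needs no synchronization at all; that independence is exactly what "embarrassingly parallel" means here.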
Related Data
Interdependent computation: not MapReduce-able.
Parallel Computing and ML
Not all algorithms are efficiently data parallel.
Data-Parallel: Cross Validation, Feature Extraction
Complex Parallel Structure: Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, Sampling, Lasso
Common Properties
1) Sparse Data Dependencies: Sparse Primal SVM, Tensor/Matrix Factorization
2) Local Computations: Expectation Maximization, Optimization
3) Iterative Updates: Sampling, Belief Propagation
Gibbs Sampling
[Figure: Markov random field over variables X1 through X9, illustrating the sampler's sparse dependency structure.]
1) Sparse Data Dependencies
2) Local Computations
3) Iterative Updates
GraphLab is the Solution
Designed specifically for ML needs:
Expresses data dependencies
Iterative computation
Simplifies the design of parallel programs:
Abstracts away hardware issues
Automatic data synchronization
Addresses multiple hardware architectures
The implementation here is multi-core; a distributed implementation is in progress.
GraphLab Model
Data Graph
Shared Data Table
Update Functions and Scopes
Scheduling
Data Graph
A graph with data associated with every vertex and edge.
[Figure: graph over vertices X1 through X11; vertex data, e.g. x3: sample value, C(X3): sample counts; edge data, e.g. Φ(X6,X9): binary potential.]
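A data graph of this kind can be sketched minimally in Python (illustrative names only, not GraphLab's C++ API): arbitrary data attached to every vertex and every edge.

```python
# Minimal sketch of a data graph: data on vertices and on undirected edges.
class DataGraph:
    def __init__(self):
        self.vertex_data = {}    # vertex id -> attached data
        self.edge_data = {}      # frozenset({u, v}) -> attached data
        self.neighbors = {}      # vertex id -> set of adjacent vertex ids

    def add_vertex(self, v, data):
        self.vertex_data[v] = data
        self.neighbors.setdefault(v, set())

    def add_edge(self, u, v, data):
        self.edge_data[frozenset((u, v))] = data
        self.neighbors[u].add(v)
        self.neighbors[v].add(u)

# e.g. a sample value and counts on X3/X6/X9, a binary potential on (X6, X9)
g = DataGraph()
g.add_vertex("X3", {"sample": 1, "counts": [0, 1]})
g.add_vertex("X6", {"sample": 0, "counts": [1, 0]})
g.add_vertex("X9", {"sample": 1, "counts": [0, 1]})
g.add_edge("X6", "X9", {"potential": [[1.0, 0.5], [0.5, 1.0]]})
```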
Update Functions
Update functions are operations that are applied to a vertex and transform the data in the scope of that vertex.
Gibbs update:
- Read samples on adjacent vertices
- Read edge potentials
- Compute a new sample for the current vertex
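The three steps of the Gibbs update can be sketched for a binary pairwise model as follows. The scope layout (a dictionary of vertex data, neighbor samples, and edge potentials) is an assumption for illustration, not GraphLab's actual API.

```python
import random

def gibbs_update(vertex, scope, rng=random.random):
    """Sample a new binary value for `vertex` given only its scope."""
    # 1) read samples on adjacent vertices, 2) read edge potentials
    weights = [1.0, 1.0]  # unnormalized P(x_v = 0), P(x_v = 1)
    for nbr, potential in scope["edges"].items():
        s = scope["neighbor_samples"][nbr]
        for x in (0, 1):
            weights[x] *= potential[x][s]
    # 3) draw a new sample for the current vertex and write it back
    p1 = weights[1] / (weights[0] + weights[1])
    scope["vertex"]["sample"] = 1 if rng() < p1 else 0

scope = {
    "vertex": {"sample": 0},
    "neighbor_samples": {"X6": 1},
    "edges": {"X6": [[1.0, 0.5], [0.5, 1.0]]},  # attractive binary potential
}
gibbs_update("X9", scope)
```

With the attractive potential above and the neighbor sampled at 1, the update favors (probability 2/3) flipping X9 to 1, which is the local-computation behavior the slide describes.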
Update Function Schedule
[Figure: vertices a through k partitioned across CPU 1 and CPU 2; a queue of scheduled tasks (a, h, a, i, b, d) is consumed by the CPUs as updates complete.]
Static Schedule
The scheduler determines the order of update function evaluations.
Synchronous schedule: every vertex updated simultaneously.
Round-robin schedule: every vertex updated sequentially.
[Figure: some regions of the graph have converged while others are slowly converging; effort should focus on the slowly converging regions.]
Need for Dynamic Scheduling
Dynamic Schedule
Update functions can insert new tasks into the schedule.
[Figure: vertices a through k across CPU 1 and CPU 2; updating a vertex enqueues affected neighbors (e.g. a, h, a, b, i).]
FIFO Queue: Wildfire BP [Selvatici et al.]
Priority Queue: Residual BP [Elidan et al.]
Splash Schedule: Splash BP [Gonzalez et al.]
Obtain different algorithms simply by changing a flag!
--scheduler=fifo --scheduler=priority --scheduler=splash
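The effect of that flag can be sketched as interchangeable scheduler objects behind one interface (class names here are hypothetical); only the queue discipline changes, and with it the belief propagation variant you obtain.

```python
from collections import deque
import heapq

class FifoScheduler:
    """Tasks run in insertion order (the fifo / Wildfire-style discipline)."""
    def __init__(self):
        self.queue = deque()
    def add_task(self, vertex, priority=0.0):
        self.queue.append(vertex)          # priority is ignored
    def next_task(self):
        return self.queue.popleft() if self.queue else None

class PriorityScheduler:
    """Highest-priority (e.g. largest residual) task runs first."""
    def __init__(self):
        self.heap = []
    def add_task(self, vertex, priority=0.0):
        heapq.heappush(self.heap, (-priority, vertex))  # max-priority first
    def next_task(self):
        return heapq.heappop(self.heap)[1] if self.heap else None

sched = PriorityScheduler()
sched.add_task("a", priority=0.1)
sched.add_task("h", priority=0.9)  # large residual: updated first
```

Update functions stay unchanged; swapping the scheduler object is the whole "change a flag" story.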
Global Information
What if we need global information? The sum over all the vertices? Algorithm parameters? Sufficient statistics?
Shared Data Table (SDT)
Global constant parameters, e.g. Constant: Total # Samples, Constant: Temperature.
Sync Operation
Sync is a fold/reduce operation over the graph. Accumulate performs an aggregation over vertices; Apply makes a final modification to the accumulated data.
Example: compute the average of all the vertices. Accumulate function: add. Apply function: divide by |V|.
[Figure: Sync! folds the vertex values into a running sum, then the Apply step divides by |V|.]
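The averaging example can be written as a fold plus a finalization step (a minimal illustration of the Accumulate/Apply split, not GraphLab's Sync API):

```python
def sync(graph_values, accumulate, apply_fn, zero):
    """Fold `accumulate` over every vertex value, then finalize with `apply_fn`."""
    acc = zero
    for v in graph_values:          # Accumulate: aggregation over vertices
        acc = accumulate(acc, v)
    return apply_fn(acc, len(graph_values))  # Apply: final modification

values = [1, 3, 2, 1, 2, 1, 1, 3, 2, 5, 1]   # one value per vertex (sum = 22)
average = sync(values,
               accumulate=lambda acc, v: acc + v,   # Accumulate: add
               apply_fn=lambda acc, n: acc / n,     # Apply: divide by |V|
               zero=0)
# average == 22 / 11 == 2.0
```

Because Accumulate is an associative fold, the aggregation itself can be parallelized over the graph, which is why Sync fits naturally into the same framework as update functions.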
Shared Data Table (SDT)
Global constant parameters and global computation (Sync operation), e.g. Constant: Total # Samples, Constant: Temperature, Sync: Sample Statistics, Sync: Loglikelihood.
Safety and Consistency
Write-Write Race
If adjacent update functions write simultaneously, the final value depends on timing.
[Figure: the left update and the right update each write a value to shared data; the final value is whichever write lands last.]
Race Conditions + Deadlocks
This is just one of many possible races, and race-free code is extremely difficult to write.
The GraphLab design ensures race-free operation.
Scope Rules
Guaranteed safety for all update functions.
Full Consistency
Only update functions at least two vertices apart may run in parallel; this reduces the opportunities for parallelism.
Obtaining More Parallelism
Not all update functions modify the entire scope!
Belief propagation: only uses edge data.
Gibbs sampling: only needs to read adjacent vertices.
Edge Consistency
Obtaining More Parallelism
"Map" operations, e.g. feature extraction on vertex data.
Vertex Consistency
Sequential Consistency
GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of update functions that produces the same result.
[Figure: a parallel execution on CPU 1 and CPU 2 over time is equivalent to some single-CPU sequential execution.]
GraphLab Model
Data Graph
Shared Data Table
Update Functions and Scopes
Scheduling
Experiments
Experiments
Shared-memory implementation in C++ using Pthreads, tested on a 16-processor machine (4x Quad-Core AMD Opteron 8384, 64 GB RAM).
Algorithms: Belief Propagation + Parameter Learning, Gibbs Sampling, CoEM, Lasso, Compressed Sensing, SVM, PageRank, Tensor Factorization.
Graphical Model Learning
3D retinal image denoising. Data graph: 256x64x64 vertices.
Update function: belief propagation.
Sync: edge potential. Accumulate: compute inference statistics. Apply: take a gradient step.
Graphical Model Learning
[Figure: speedup vs. number of CPUs (1-16). The splash schedule scales near-optimally, reaching a 15.5x speedup on 16 CPUs; the approximate priority schedule scales slightly worse.]
Graphical Model Learning
Standard parameter learning takes a gradient step only after inference is complete. With GraphLab, gradient steps are taken while inference is still running (parallel inference + gradient steps).
[Figure: runtime falls from 2100 sec (iterated) to 700 sec (simultaneous): 3x faster!]
Gibbs Sampling
Two methods for sequential consistency:
Scopes: edge scope, graphlab(gibbs, edge, sweep);
Scheduling: graph coloring, graphlab(gibbs, vertex, colored);
[Figure: chromatic schedule; CPU 1 through CPU 3 update same-colored vertices in parallel across time steps t0 through t3.]
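The colored schedule can be sketched with a greedy coloring (a minimal illustration; GraphLab's actual coloring strategy may differ). No two vertices of the same color are adjacent, so each color class can be updated fully in parallel without locks, and sweeping the classes one after another yields a sequentially consistent Gibbs sweep.

```python
def greedy_color(neighbors):
    """Assign each vertex the smallest color unused by its colored neighbors."""
    colors = {}
    for v in sorted(neighbors):
        taken = {colors[n] for n in neighbors[v] if n in colors}
        colors[v] = next(c for c in range(len(neighbors)) if c not in taken)
    return colors

def color_classes(colors):
    """Group vertices by color: each group is one parallel phase (t0, t1, ...)."""
    classes = {}
    for v, c in colors.items():
        classes.setdefault(c, []).append(v)
    return [classes[c] for c in sorted(classes)]

# 4-cycle a-b-c-d: two colors suffice, giving two parallel phases
neighbors = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
phases = color_classes(greedy_color(neighbors))
```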
Gibbs Sampling
Protein-protein interaction networks [Elidan et al. 2006]: pairwise MRF, 14K vertices, 100K edges.
[Figure: speedup vs. number of CPUs (1-16). The colored schedule reaches roughly 10x speedup on 16 CPUs; scheduling reduces locking overhead relative to the round-robin schedule.]
CoEM (Rosie Jones, 2005)
Named entity recognition task: is "Dog" an animal? Is "Catalina" a place?
[Figure: bipartite graph linking noun phrases (the dog, Australia, Catalina Island) to contexts (<X> ran quickly, travelled to <X>, <X> is pleasant).]
Graph sizes: Small: 0.2M vertices, 20M edges. Large: 2M vertices, 200M edges.
Hadoop: 95 cores, 7.5 hrs.
CoEM (Rosie Jones, 2005)
[Figure: speedup vs. number of CPUs (1-16) for the small and large datasets, approaching optimal.]
Hadoop: 95 cores, 7.5 hrs. GraphLab: 16 cores, 30 min. 15x faster with 6x fewer CPUs!
Lasso
L1-regularized linear regression, solved with the shooting algorithm (coordinate descent). Due to the properties of the update, full consistency is needed.
Finance dataset from Kogan et al. [2009].
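The shooting algorithm itself can be sketched as cyclic coordinate descent with soft-thresholding (a minimal, dependency-free illustration, not the experimental code): each coordinate update reads the residual over all features sharing its rows, which is why the updates conflict and full consistency is needed.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; exact zeros give sparsity."""
    if rho < -lam:
        return rho + lam
    if rho > lam:
        return rho - lam
    return 0.0

def shooting_lasso(X, y, lam, sweeps=100):
    """Coordinate descent for (1/2)||y - Xw||^2 + lam * ||w||_1."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(sweeps):
        for j in range(d):
            # correlation of feature j with the residual excluding w[j]
            rho = sum(X[i][j] * (y[i] - sum(X[i][k] * w[k]
                                            for k in range(d) if k != j))
                      for i in range(n))
            norm_j = sum(X[i][j] ** 2 for i in range(n))
            w[j] = soft_threshold(rho, lam) / norm_j
    return w

# y depends only on the first feature; the penalty drives w[1] exactly to zero
X = [[1.0, 0.1], [2.0, -0.1], [3.0, 0.2], [4.0, 0.0]]
y = [1.0, 2.0, 3.0, 4.0]
w = shooting_lasso(X, y, lam=0.5)
```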
Full Consistency
[Figure: speedup vs. number of CPUs (1-16) under full consistency for the dense and sparse datasets; the sparse dataset scales considerably better.]
Relaxing Consistency
51Why does this work? (Open Question)
0 2 4 6 8 10 12 14 160
2
4
6
8
10
12
14
16
Number of CPUs
Spee
dup
Bette
r Optimal
Dense
Sparse
GraphLab
An abstraction tailored to machine learning, providing a parallel framework which compactly expresses:
Data/computational dependencies
Iterative computation
Achieves state-of-the-art parallel performance on a variety of problems, and is easy to use.
Future Work
Distributed GraphLab: load balancing, minimizing communication, latency hiding, distributed data consistency, distributed scalability.
GPU GraphLab: memory bus bottleneck, warp alignment.
State-of-the-art performance for <Your Algorithm Here>.
Parallel GraphLab 1.0
Available today: http://graphlab.ml.cmu.edu
Documentation... Code... Tutorials...