SySCD: A System-Aware Parallel Coordinate Descent Algorithm
Nikolas Ioannou*◇, Celestine Mendler-Dünner*▽, Thomas Parnell◇
◇IBM Research – Zurich  ▽UC Berkeley  *equal contribution
Contributions
✓ Identify bottlenecks in state-of-the-art parallel coordinate descent solvers
✓ Algorithmic improvements to reduce runtime
✓ Over 10x faster training compared to optimized system-agnostic parallel implementations
✓ SySCD inherits convergence guarantees from distributed methods and improves their sample efficiency through a dynamic repartitioning scheme
Baseline
all threads work on the same shared vector 𝑣, and collisions on 𝑣 are resolved opportunistically [1] (see the sketch below)
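To make the scheme concrete, here is a minimal sketch of such "wild" lock-free updates on a toy quadratic problem (illustrative only; the function names and the objective are our assumptions, not the Snap ML implementation):

```python
import numpy as np
from threading import Thread

def wild_cd(A, y, n_threads=4, epochs=10, lam=1.0):
    """'Wild' parallel coordinate descent in the spirit of PASSCoDe [1].

    Toy quadratic objective: min_a 0.5*a^T (A A^T + lam*I) a - y^T a,
    maintaining the auxiliary vector v = A^T a. All threads read and
    write the shared v without locks; racy updates are simply tolerated.
    """
    n, d = A.shape
    alpha = np.zeros(n)
    v = np.zeros(d)                       # shared vector, no locking

    def worker(rows):
        for _ in range(epochs):
            for i in np.random.permutation(rows):
                a_i = A[i]
                # exact coordinate minimizer, computed from a possibly stale v
                delta = (y[i] - a_i @ v - lam * alpha[i]) / (a_i @ a_i + lam)
                alpha[i] += delta
                v += delta * a_i          # racy write: collisions resolved opportunistically

    parts = np.array_split(np.arange(n), n_threads)
    threads = [Thread(target=worker, args=(p,)) for p in parts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return alpha, v
```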
Buckets
we train on a bucket of consecutive training examples at a time (see the ordering sketch below)
✓ 𝛼 is accessed in a cache-line-efficient manner
✓ fewer indices need to be randomized
✓ CPU prefetching is improved
✗ less randomness hurts convergence
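A minimal sketch of how such a bucketed visiting order can be generated (the helper name and the fixed bucket size are illustrative assumptions):

```python
import numpy as np

def bucketed_order(n, bucket_size=16):
    """Yield a coordinate visiting order in buckets of consecutive indices.

    Only the order of the buckets is randomized; within a bucket the
    coordinates are visited sequentially, so the model vector alpha is
    touched one cache line at a time and hardware prefetching stays
    effective. (Illustrative: bucket_size would be matched to the cache
    line, e.g. 16 float32 values on a 64-byte line.)
    """
    n_buckets = (n + bucket_size - 1) // bucket_size
    for b in np.random.permutation(n_buckets):
        start = b * bucket_size
        yield from range(start, min(start + bucket_size, n))
```

In the inner loop of the sketch above, `for i in bucketed_order(len(rows))` would then replace the fully random permutation.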
Data Parallelism
𝑣 is replicated across threads and synchronized periodically (inspired by [2]); see the sketch after this list
✓ reduced access to the shared vector 𝑣
✓ parallel shuffling of local coordinates
✗ local-only view hurts convergence

Dynamic Partitioning
repartitioning of the data across threads in each epoch
✓ increased exchange between parallel workers
✓ better convergence behavior
✗ shuffling of coordinates is not local to each thread

NUMA-Awareness
static partitioning across NUMA nodes, dynamic partitioning within NUMA nodes
✓ avoids the overhead of shuffling across NUMA nodes
✓ threads, 𝛼, and 𝑣 have NUMA affinity
✗ reduced shuffling hurts convergence
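A sequential sketch of one synchronization round with per-thread replicas of 𝑣, using conservative 1/P averaging in the spirit of CoCoA [2] on the same toy objective as the baseline sketch (names and the merging rule are illustrative assumptions):

```python
import numpy as np

def replicated_v_round(A, y, alpha, v, parts, lam=1.0):
    """One synchronization round with per-thread replicas of v.

    Each partition is processed against a private copy of v, so there
    are no shared-vector collisions; the per-thread updates are then
    merged back into the shared alpha and v with a safe 1/P scaling.
    Written sequentially for clarity; each iteration of the outer loop
    corresponds to one thread.
    """
    P = len(parts)
    new_alpha = alpha.copy()
    dv = np.zeros_like(v)
    for rows in parts:                    # conceptually: one task per thread
        v_loc = v.copy()                  # thread-local replica, no shared writes
        d_alpha = np.zeros(len(rows))
        for k, i in enumerate(rows):
            a_i = A[i]
            d = (y[i] - a_i @ v_loc - lam * (alpha[i] + d_alpha[k])) / (a_i @ a_i + lam)
            d_alpha[k] += d
            v_loc += d * a_i              # local-only view of concurrent progress
        new_alpha[rows] = alpha[rows] + d_alpha / P   # scaled merge keeps the step safe
        dv += (A[rows].T @ d_alpha) / P
    return new_alpha, v + dv
```

Dynamic partitioning then amounts to regenerating `parts` at every epoch from a fresh NUMA-local permutation, e.g. `parts = np.array_split(rng.permutation(local_coords), P)`.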
[Figure: train time (s) and # epochs to converge vs. #threads, without and with buckets]
[Figure: train time (s) and # epochs to converge vs. # CoCoA partitions]
[Figure: train time (s) and # epochs to converge vs. #threads, static vs. dynamic partitioning]
[Figure: train time (s) and # epochs to converge vs. #threads, without and with NUMA optimizations]
Performance Results
• 4-node Intel Xeon (E5-4620) with 512GB RAM
• 2-node IBM Power9 with 1TB RAM
[Figure: epsilon (x86_64): test LogLoss vs. time (s) for sklearn [lbfgs], sklearn [sag], sklearn [saga], sklearn [liblinear], snap.ml MT, snap.ml 1T, h2o, and vw]
References
[1] PASSCoDe: Parallel Asynchronous Stochastic Dual Coordinate Descent. C. Hsieh, H. Yu, I. Dhillon. ICML (2015)
[2] CoCoA: A General Framework for Communication-Efficient Distributed Optimization. V. Smith, S. Forte, C. Ma, M. Takac, M. Jordan, M. Jaggi. JMLR (2018)
[3] Snap ML: A Hierarchical Framework For Machine Learning. C. Dünner, T. Parnell, D. Sarigiannis, N. Ioannou, A. Anghel, G. Ravi, M. Kandasamy, H. Pozidis. NeurIPS (2018)
Baseline Bottlenecks
1) model access pattern (𝛼)
2) random shuffling of coordinates
3) shared vector updates (𝑣)
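The impact of bottlenecks 1) and 2) is easy to reproduce with a toy micro-benchmark (illustrative, not from the poster; absolute timings depend on the machine):

```python
import time
import numpy as np

# Illustrative micro-benchmark: the cost of generating a random
# permutation plus scattered access to alpha, compared with a
# sequential, prefetch-friendly sweep over the same array.
n = 10_000_000
alpha = np.zeros(n, dtype=np.float32)

t0 = time.perf_counter()
idx = np.random.permutation(n)        # random shuffling of coordinates
alpha[idx] += 1.0                     # cache-unfriendly scattered accesses
t1 = time.perf_counter()
alpha += 1.0                          # sequential access pattern
t2 = time.perf_counter()
print(f"random: {t1 - t0:.2f}s  sequential: {t2 - t1:.2f}s")
```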
[Figure: train time per epoch (s, log scale) vs. #threads for wild, no shared updates, and no shared updates + no shuffling]
[Figure: train time per epoch (s) for 𝛼 random + shuffling, 𝛼 sequential + shuffling, and 𝛼 sequential + no shuffling]
Convergence Analysis
SySCD can be analyzed as a hierarchical distributed method which locally implements a randomized block coordinate descent solver.

𝐾: # NUMA nodes
𝑃: # threads per NUMA node
𝐵: bucket size (chosen to match the cache line size)
design parameters:
𝑇₁: number of CD updates performed on each bucket
𝑇₂: number of buckets processed by each thread
𝑇₃: number of communication rounds on each NUMA node
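The role of these parameters can be pictured as the following loop nest (a pseudocode-level sketch with stub helpers of our own naming; in the real solver the per-thread loop runs in parallel and each of the 𝐾 NUMA nodes executes this independently):

```python
# Schematic of SySCD's loop hierarchy on one NUMA node (illustrative
# stubs, not the Snap ML implementation; each of the K nodes runs this).
def draw_local_bucket(node, thread):
    """Stub: pick the next NUMA-local bucket of B coordinates."""
    return (node, thread)

def update_one_coordinate(bucket):
    """Stub: one coordinate descent update within the bucket."""
    pass

def merge_thread_replicas(node):
    """Stub: average the P per-thread replicas of v on this node."""
    pass

def run_node(node, T1, T2, T3, P):
    for _round in range(T3):          # T3 communication rounds on the node
        for thread in range(P):       # P threads (parallel in the real solver)
            for _b in range(T2):      # T2 buckets processed by each thread
                bucket = draw_local_bucket(node, thread)
                for _u in range(T1):  # T1 CD updates on the bucket
                    update_one_coordinate(bucket)
        merge_thread_replicas(node)   # periodic sync of the v replicas
```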
SySCD is implemented in Snap ML [3] and results in average speedups of 5x vs. the baseline and 18x vs. sklearn.
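For context, a minimal training sketch against Snap ML's scikit-learn-style Python API (the import path and the threading parameter are assumptions; names differ across Snap ML releases):

```python
# Hypothetical usage sketch of training with Snap ML [3]; the import
# path and the n_jobs threading parameter are assumptions and may
# differ across Snap ML releases.
import numpy as np
from snapml import LogisticRegression

X = np.random.rand(10000, 100).astype(np.float32)
y = (np.random.rand(10000) > 0.5).astype(np.float32)

clf = LogisticRegression(max_iter=100, n_jobs=32)  # multi-threaded CPU training
clf.fit(X, y)
print(clf.predict_proba(X[:5]))
```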
[Figure: criteo-kaggle (x86_64): test LogLoss vs. time (s) for sklearn [lbfgs], sklearn [sag], sklearn [saga], sklearn [liblinear], snap.ml MT, snap.ml 1T, and vw]
[Figure: higgs: training time (s, log scale) vs. #threads for wild (P9), wild (x86_64), SySCD (P9), and SySCD (x86_64)]