
SySCD: A System-Aware Parallel Coordinate Descent Algorithm
Nikolas Ioannou*◇, Celestine Mendler-Dünner*▽, Thomas Parnell◇
◇ IBM Research – Zurich  ▽ UC Berkeley  * equal contribution

Contributions
✓ Identify bottlenecks in state-of-the-art parallel coordinate descent solvers
✓ Algorithmic improvements to reduce runtime
✓ Over 10x faster training compared to optimized system-agnostic parallel implementations
✓ SySCD inherits convergence guarantees from distributed methods and improves their sample efficiency through a dynamic repartitioning scheme

Baseline
All threads work on the same shared vector 𝑣, and collisions on 𝑣 are resolved opportunistically [1]; a minimal sketch follows.
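To make the access pattern concrete, here is an illustrative sketch of such a lock-free ("wild") parallel CD loop. It assumes a least-squares objective where the shared vector is the residual; the function name wild_cd and all details are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code) of lock-free parallel CD:
# every thread updates model coordinates against one shared residual
# vector without locks, so colliding updates are absorbed opportunistically.
import threading
import numpy as np

def wild_cd(X, y, n_threads=4, epochs=10, seed=0):
    n, d = X.shape
    w = np.zeros(d)                        # model (alpha in the poster's notation)
    r = y - X @ w                          # shared vector, kept in sync with w
    col_sq = (X ** 2).sum(axis=0) + 1e-12  # per-coordinate curvature

    def worker(tid):
        rng = np.random.default_rng(seed + tid)
        for _ in range(epochs):
            for j in rng.permutation(d):            # random coordinate order
                delta = (X[:, j] @ r) / col_sq[j]   # exact least-squares CD step
                w[j] += delta                       # model update
                r -= delta * X[:, j]                # unsynchronized shared update

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```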

Buckets
We train a bucket of consecutive training examples at a time (see the sketch after this list).
✓ 𝛼 is accessed in a cache-line-efficient manner
✓ Fewer indices need to be randomized
✓ CPU prefetching is improved
× Less randomness hurts convergence
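One way to realize this, sketched under the same illustrative setup: only bucket indices are shuffled, and the B coordinates of 𝛼 inside a bucket are visited consecutively.

```python
# Illustrative bucketed traversal: shuffle bucket indices, not coordinates.
import numpy as np

def bucketed_order(d, B, rng):
    n_buckets = -(-d // B)                   # ceil(d / B)
    order = []
    for b in rng.permutation(n_buckets):     # fewer indices to randomize
        start = b * B
        order.extend(range(start, min(start + B, d)))  # consecutive accesses
    return order

# Usage: iterate the coordinates of alpha in this order each epoch.
print(bucketed_order(10, 4, np.random.default_rng(0)))
```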

Data Parallelism
𝑣 is replicated across threads and synchronized periodically (inspired by [2]); a sketch of the synchronization follows the list.
✓ Reduce shared vector 𝑣 access
✓ Parallel shuffling of local coordinates
× Local-only view hurts convergence
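A minimal sketch of the replication scheme, with hypothetical helper names: each thread owns a replica of 𝑣, and the replicas are averaged periodically, loosely following the CoCoA-style synchronization of [2].

```python
# Illustrative periodic synchronization of per-thread replicas of v.
import numpy as np

def sync_replicas(replicas):
    """Average the per-thread copies of v and write the result back."""
    v_avg = np.mean(replicas, axis=0)
    for v_local in replicas:
        v_local[:] = v_avg

replicas = [np.random.default_rng(t).normal(size=8) for t in range(4)]
sync_replicas(replicas)   # all replicas now agree
```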

Dynamic Partitioning
The data is repartitioned across threads in each epoch (see the sketch after this list).
✓ Increase exchange between parallel workers
✓ Better convergence behavior
× Shuffling of coordinates is not local to each thread
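A sketch of the repartitioning step, with a hypothetical helper name: a fresh random split of the coordinates across threads is drawn at the start of every epoch.

```python
# Illustrative dynamic repartitioning: re-draw the coordinate-to-thread
# assignment at every epoch.
import numpy as np

def repartition(d, n_threads, rng):
    return np.array_split(rng.permutation(d), n_threads)

rng = np.random.default_rng(0)
for epoch in range(3):
    parts = repartition(12, 4, rng)   # each thread trains on its own part
    print(epoch, [p.tolist() for p in parts])
```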

NUMA-awareness
Static partitioning across NUMA nodes, dynamic partitioning within NUMA nodes (see the sketch after this list).
✓ Avoid overheads of shuffling across NUMA nodes
✓ Thread, 𝛼, and 𝑣 have NUMA affinity
× Reduced shuffling hurts convergence
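A sketch of the two-level scheme; actual memory and thread placement needs OS-level NUMA support (e.g. libnuma), which plain Python cannot express, so only the index logic is shown and all names are illustrative.

```python
# Illustrative two-level partitioning: static across NUMA nodes,
# dynamic (per-epoch) within each node.
import numpy as np

def numa_partitions(d, n_nodes, threads_per_node, rng):
    node_parts = np.array_split(np.arange(d), n_nodes)  # static, done once
    schedule = []
    for part in node_parts:                  # per epoch, per node:
        local = rng.permutation(part)        # shuffle within the node only
        schedule.append(np.array_split(local, threads_per_node))
    return schedule

print(numa_partitions(16, 2, 4, np.random.default_rng(0)))
```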

[Figure: Train time (s) and # epochs to converge vs. #Threads, without vs. with buckets]

[Figure: Train time (s) and # epochs to converge vs. #CoCoA partitions]

[Figure: Train time (s) and # epochs to converge vs. #Threads, static vs. dynamic partitioning]

[Figure: Train time (s) and # epochs to converge vs. #Threads, without vs. with numa-opts]

Performance Results
• 4-node Intel Xeon (E5-4620) with 512GB RAM
• 2-node IBM Power9 with 1TB RAM

[Figure: LogLoss (Test) vs. Time (s) on epsilon (x86_64) for sklearn (lbfgs, sag, saga, liblinear), snap.ml (MT, 1T), h2o, and vw]


[1] PASSCoDe: Parallel Asynchronous Stochastic Dual Coordinate Descent, C. Hsieh, H. Yu, I. Dhillon. ICML (2015)
[2] CoCoA: A General Framework for Communication-Efficient Distributed Optimization, V. Smith, S. Forte, C. Ma, M. Takac, M. Jordan, M. Jaggi. JMLR (2018)
[3] Snap ML: A Hierarchical Framework for Machine Learning, C. Dünner, T. Parnell, D. Sarigiannis, N. Ioannou, A. Anghel, G. Ravi, M. Kandasamy, H. Pozidis. NeurIPS (2018)

Baseline Bottlenecks
1) model access pattern (𝛼)
2) random shuffling of coordinates
3) shared vector 𝑣 updates
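A hypothetical micro-benchmark that makes bottlenecks 1) and 2) tangible: scattered accesses to 𝛼 driven by a random permutation vs. a sequential sweep. The sizes and timing harness are assumptions for illustration.

```python
# Hypothetical micro-benchmark: random vs. sequential access to alpha.
import time
import numpy as np

d = 10_000_000
alpha = np.zeros(d)
orders = {
    "sequential": np.arange(d),
    "random": np.random.default_rng(0).permutation(d),
}
for name, idx in orders.items():
    t0 = time.perf_counter()
    alpha[idx] += 1.0            # gather/scatter over the model vector
    print(f"{name}: {time.perf_counter() - t0:.3f}s")
```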

[Figure: Train time per epoch (s, log scale) vs. #Threads: wild, no shared updates, no shared updates + no shuffling]

[Figure: Train time per epoch (s): 𝛼 random shuffling vs. 𝛼 sequential shuffling vs. 𝛼 sequential no shuffling]


Convergence Analysis
SySCD can be analyzed as a hierarchical distributed method which locally implements a randomized block coordinate descent solver.

Design parameters:
𝐾 : # NUMA nodes
𝑃 : # threads per NUMA node
𝐵 : bucket size (chosen to match the cache line size)
𝑇₁ : number of CD updates performed on each bucket
𝑇₂ : number of buckets processed by each thread
𝑇₃ : number of communication rounds on each NUMA node

[Figure: convergence of the multi-threaded vs. the single-threaded implementation]
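The loop nest implied by these parameters, as a runnable skeleton; this is an interpretation of the analysis, not the authors' code, and in practice the node and thread loops run in parallel.

```python
# Skeleton of the hierarchy: T3 communication rounds, K NUMA nodes,
# P threads per node, T2 buckets per thread, T1 CD updates per bucket.
def syscd_schedule(K, P, B, T1, T2, T3):
    for _round in range(T3):             # communication rounds per node
        for node in range(K):            # NUMA nodes (parallel in practice)
            for thread in range(P):      # threads (parallel in practice)
                for _bucket in range(T2):      # buckets of size B per thread
                    for _update in range(T1):  # CD updates within the bucket
                        pass             # coordinate update would go here
        # here: aggregate the replicated v within and across NUMA nodes

syscd_schedule(K=2, P=8, B=8, T1=8, T2=64, T3=10)
```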

SySCD is implemented in Snap ML [3] and results in average speedups of 5x vs. baseline and 18x vs. sklearn.
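For context, a hedged usage sketch of Snap ML's sklearn-style estimator interface; the snapml package name, import path, and the n_jobs parameter are assumptions that may differ across versions.

```python
# Hedged sketch: training with Snap ML's sklearn-style API.
# Package name (snapml) and parameter names are assumptions.
from sklearn.datasets import make_classification
from snapml import LogisticRegression   # assumed import path

X, y = make_classification(n_samples=10_000, n_features=100, random_state=0)
clf = LogisticRegression(n_jobs=32)     # assumed multi-thread knob (snap.ml MT)
clf.fit(X, y)
print(clf.predict_proba(X[:5]))
```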


[Figure: LogLoss (Test) vs. Time (s) on criteo-kaggle (x86_64) for sklearn (lbfgs, sag, saga, liblinear), snap.ml (MT, 1T), and vw]

[Figure: Training time (s) vs. #Threads on higgs: wild vs. SySCD, on P9 and x86_64]