
SySCD: A System-Aware Parallel Coordinate Descent Algorithm
Nikolas Ioannou*◇, Celestine Mendler-Dünner*▽, Thomas Parnell◇
◇ IBM Research – Zurich  ▽ UC Berkeley  * equal contribution

Contributions
✓ Identify bottlenecks in state-of-the-art parallel coordinate descent solvers
✓ Algorithmic improvements to reduce runtime
✓ Over 10x faster training compared to optimized system-agnostic parallel implementations
✓ SySCD inherits convergence guarantees from distributed methods and improves their sample efficiency through a dynamic repartitioning scheme

Baseline
All threads work on the same shared vector 𝑣, and collisions on 𝑣 are resolved opportunistically [1]; a minimal sketch follows.
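To make the access pattern concrete, here is an illustrative sketch of such a lock-free ("wild") parallel CD loop. It assumes a least-squares objective where the shared vector is the residual; the function name wild_cd and all details are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code) of lock-free parallel CD:
# every thread updates model coordinates against one shared residual
# vector without locks, so colliding updates are absorbed opportunistically.
import threading
import numpy as np

def wild_cd(X, y, n_threads=4, epochs=10, seed=0):
    n, d = X.shape
    w = np.zeros(d)                        # model (alpha in the poster's notation)
    r = y - X @ w                          # shared vector, kept in sync with w
    col_sq = (X ** 2).sum(axis=0) + 1e-12  # per-coordinate curvature

    def worker(tid):
        rng = np.random.default_rng(seed + tid)
        for _ in range(epochs):
            for j in rng.permutation(d):            # random coordinate order
                delta = (X[:, j] @ r) / col_sq[j]   # exact least-squares CD step
                w[j] += delta                       # model update
                r -= delta * X[:, j]                # unsynchronized shared update

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```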

Buckets
We train a bucket of consecutive training examples at a time (see the sketch after this list).
✓ 𝛼 is accessed in a cache-line-efficient manner
✓ Fewer indices need to be randomized
✓ CPU prefetching is improved
× Less randomness hurts convergence
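One way to realize this, sketched under the same illustrative setup: only bucket indices are shuffled, and the B coordinates of 𝛼 inside a bucket are visited consecutively.

```python
# Illustrative bucketed traversal: shuffle bucket indices, not coordinates.
import numpy as np

def bucketed_order(d, B, rng):
    n_buckets = -(-d // B)                   # ceil(d / B)
    order = []
    for b in rng.permutation(n_buckets):     # fewer indices to randomize
        start = b * B
        order.extend(range(start, min(start + B, d)))  # consecutive accesses
    return order

# Usage: iterate the coordinates of alpha in this order each epoch.
print(bucketed_order(10, 4, np.random.default_rng(0)))
```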

Data Parallelism
𝑣 is replicated across threads and synchronized periodically (inspired by [2]); a sketch of the synchronization follows the list.
✓ Reduce shared vector 𝑣 access
✓ Parallel shuffling of local coordinates
× Local-only view hurts convergence
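A minimal sketch of the replication scheme, with hypothetical helper names: each thread owns a replica of 𝑣, and the replicas are averaged periodically, loosely following the CoCoA-style synchronization of [2].

```python
# Illustrative periodic synchronization of per-thread replicas of v.
import numpy as np

def sync_replicas(replicas):
    """Average the per-thread copies of v and write the result back."""
    v_avg = np.mean(replicas, axis=0)
    for v_local in replicas:
        v_local[:] = v_avg

replicas = [np.random.default_rng(t).normal(size=8) for t in range(4)]
sync_replicas(replicas)   # all replicas now agree
```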

Dynamic Partitioning
The data is repartitioned across threads in each epoch (see the sketch after this list).
✓ Increase exchange between parallel workers
✓ Better convergence behavior
× Shuffling of coordinates is not local to each thread
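A sketch of the repartitioning step, with a hypothetical helper name: a fresh random split of the coordinates across threads is drawn at the start of every epoch.

```python
# Illustrative dynamic repartitioning: re-draw the coordinate-to-thread
# assignment at every epoch.
import numpy as np

def repartition(d, n_threads, rng):
    return np.array_split(rng.permutation(d), n_threads)

rng = np.random.default_rng(0)
for epoch in range(3):
    parts = repartition(12, 4, rng)   # each thread trains on its own part
    print(epoch, [p.tolist() for p in parts])
```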

NUMA-awareness
Static partitioning across NUMA nodes, dynamic partitioning within NUMA nodes (see the sketch after this list).
✓ Avoid overheads of shuffling across NUMA nodes
✓ Thread, 𝛼, and 𝑣 have NUMA affinity
× Reduced shuffling hurts convergence
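A sketch of the two-level scheme; actual memory and thread placement needs OS-level NUMA support (e.g. libnuma), which plain Python cannot express, so only the index logic is shown and all names are illustrative.

```python
# Illustrative two-level partitioning: static across NUMA nodes,
# dynamic (per-epoch) within each node.
import numpy as np

def numa_partitions(d, n_nodes, threads_per_node, rng):
    node_parts = np.array_split(np.arange(d), n_nodes)  # static, done once
    schedule = []
    for part in node_parts:                  # per epoch, per node:
        local = rng.permutation(part)        # shuffle within the node only
        schedule.append(np.array_split(local, threads_per_node))
    return schedule

print(numa_partitions(16, 2, 4, np.random.default_rng(0)))
```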

[Figure: Train time (s) and # epochs to converge vs. #Threads, without vs. with buckets]

[Figure: Train time (s) and # epochs to converge vs. #CoCoA partitions]

[Figure: Train time (s) and # epochs to converge vs. #Threads, static vs. dynamic partitioning]

[Figure: Train time (s) and # epochs to converge vs. #Threads, without vs. with numa-opts]

Performance Results
• 4-node Intel Xeon (E5-4620) with 512GB RAM
• 2-node IBM Power9 with 1TB RAM

[Figure: LogLoss (Test) vs. Time (s) on epsilon (x86_64) for sklearn (lbfgs, sag, saga, liblinear), snap.ml (MT, 1T), h2o, and vw]


[1] PASSCoDe: Parallel Asynchronous Stochastic Dual Coordinate Descent, C. Hsieh, H. Yu, I. Dhillon. ICML (2015)
[2] CoCoA: A General Framework for Communication-Efficient Distributed Optimization, V. Smith, S. Forte, C. Ma, M. Takac, M. Jordan, M. Jaggi. JMLR (2018)
[3] Snap ML: A Hierarchical Framework for Machine Learning, C. Dünner, T. Parnell, D. Sarigiannis, N. Ioannou, A. Anghel, G. Ravi, M. Kandasamy, H. Pozidis. NeurIPS (2018)

Baseline Bottlenecks
1) model access pattern (𝛼)
2) random shuffling of coordinates
3) shared vector 𝑣 updates
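A hypothetical micro-benchmark that makes bottlenecks 1) and 2) tangible: scattered accesses to 𝛼 driven by a random permutation vs. a sequential sweep. The sizes and timing harness are assumptions for illustration.

```python
# Hypothetical micro-benchmark: random vs. sequential access to alpha.
import time
import numpy as np

d = 10_000_000
alpha = np.zeros(d)
orders = {
    "sequential": np.arange(d),
    "random": np.random.default_rng(0).permutation(d),
}
for name, idx in orders.items():
    t0 = time.perf_counter()
    alpha[idx] += 1.0            # gather/scatter over the model vector
    print(f"{name}: {time.perf_counter() - t0:.3f}s")
```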

[Figure: Train time per epoch (s, log scale) vs. #Threads: wild, no shared updates, no shared updates + no shuffling]

[Figure: Train time per epoch (s): 𝛼 random shuffling vs. 𝛼 sequential shuffling vs. 𝛼 sequential no shuffling]


Convergence Analysis
SySCD can be analyzed as a hierarchical distributed method which locally implements a randomized block coordinate descent solver.

Design parameters:
𝐾 : # NUMA nodes
𝑃 : # threads per NUMA node
𝐵 : bucket size (chosen to match the cache line size)
𝑇₁ : number of CD updates performed on each bucket
𝑇₂ : number of buckets processed by each thread
𝑇₃ : number of communication rounds on each NUMA node

[Figure: convergence of the multi-threaded vs. the single-threaded implementation]
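The loop nest implied by these parameters, as a runnable skeleton; this is an interpretation of the analysis, not the authors' code, and in practice the node and thread loops run in parallel.

```python
# Skeleton of the hierarchy: T3 communication rounds, K NUMA nodes,
# P threads per node, T2 buckets per thread, T1 CD updates per bucket.
def syscd_schedule(K, P, B, T1, T2, T3):
    for _round in range(T3):             # communication rounds per node
        for node in range(K):            # NUMA nodes (parallel in practice)
            for thread in range(P):      # threads (parallel in practice)
                for _bucket in range(T2):      # buckets of size B per thread
                    for _update in range(T1):  # CD updates within the bucket
                        pass             # coordinate update would go here
        # here: aggregate the replicated v within and across NUMA nodes

syscd_schedule(K=2, P=8, B=8, T1=8, T2=64, T3=10)
```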

SySCD is implemented in Snap ML [3] and results in average speedups of 5x vs. baseline and 18x vs. sklearn.
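For context, a hedged usage sketch of Snap ML's sklearn-style estimator interface; the snapml package name, import path, and the n_jobs parameter are assumptions that may differ across versions.

```python
# Hedged sketch: training with Snap ML's sklearn-style API.
# Package name (snapml) and parameter names are assumptions.
from sklearn.datasets import make_classification
from snapml import LogisticRegression   # assumed import path

X, y = make_classification(n_samples=10_000, n_features=100, random_state=0)
clf = LogisticRegression(n_jobs=32)     # assumed multi-thread knob (snap.ml MT)
clf.fit(X, y)
print(clf.predict_proba(X[:5]))
```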


[Figure: LogLoss (Test) vs. Time (s) on criteo-kaggle (x86_64) for sklearn (lbfgs, sag, saga, liblinear), snap.ml (MT, 1T), and vw]

[Figure: Training time (s) vs. #Threads on higgs: wild vs. SySCD, on P9 and x86_64]