19
Vivek Kumar 1 , Karthik Murthy 1 , Vivek Sarkar 1 , Yili Zheng 2 1 Rice University 2 Lawrence Berkeley National Laboratory Optimized Distributed Work-Stealing

Optimized Distributed Work-Stealing - Vivek Kumar · Vivek Kumar1, Karthik Murthy1, Vivek Sarkar1, Yili Zheng2 1 Rice University 2 Lawrence Berkeley National Laboratory Optimized

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Optimized Distributed Work-Stealing - Vivek Kumar · Vivek Kumar1, Karthik Murthy1, Vivek Sarkar1, Yili Zheng2 1 Rice University 2 Lawrence Berkeley National Laboratory Optimized

Vivek Kumar1, Karthik Murthy1, Vivek Sarkar1, Yili Zheng2

1 Rice University 2 Lawrence Berkeley National Laboratory

Optimized Distributed Work-Stealing

Page 2: Optimized Distributed Work-Stealing - Vivek Kumar · Vivek Kumar1, Karthik Murthy1, Vivek Sarkar1, Yili Zheng2 1 Rice University 2 Lawrence Berkeley National Laboratory Optimized

Cores/Socket System Share in Top500

Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016

1 (45%)

2 (55%)

Multicore Nodes in Supercomputers

Graph plotted using the data obtained from https://www.top500.org/statistics/list/

November 2006 June 2016

Page 3: Optimized Distributed Work-Stealing - Vivek Kumar · Vivek Kumar1, Karthik Murthy1, Vivek Sarkar1, Yili Zheng2 1 Rice University 2 Lawrence Berkeley National Laboratory Optimized

•  Productivity – Several existing APIs for scientific computing – Hard to parallelize complex irregular computations

using existing APIs •  Ideal candidate for runtime based global load-balancing

•  Performance on multicore nodes – Using a process per core (e.g., MPI everywhere)

on a node not scalable – Hybrid programming using thread pool per node

•  How to design a high performance implementation of global load-balancing

Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016

Problem Statement

3

Productivity and Performance Challenge

Page 4: Optimized Distributed Work-Stealing - Vivek Kumar · Vivek Kumar1, Karthik Murthy1, Vivek Sarkar1, Yili Zheng2 1 Rice University 2 Lawrence Berkeley National Laboratory Optimized

4

✔ Library-based API in a PGAS library to express irregular computations

C++11 lambda function based API that provides serial elision

✔ Novel implementation of distributed work-stealing That introduces a new victim selection policy that avoid all

inter-node failed steals

✔ Detailed performance study That demonstrates the benefit using scaling irregular applications up to

12k cores of Edison supercomputer

✔ Results That shows that our approach delivers performance benefits up to 7%

Contributions

Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016

Page 5: Optimized Distributed Work-Stealing - Vivek Kumar · Vivek Kumar1, Karthik Murthy1, Vivek Sarkar1, Yili Zheng2 1 Rice University 2 Lawrence Berkeley National Laboratory Optimized

Load Balancing using Work-Stealing

Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016

Motivating Analysis

5

w1

Work-stealing in a thread pool

pop tasks

push tasks

w2 w3 w4

steal tasks deque

tail

head

Core Core Core Core

•  Thread pool (intra-node) based implementations perform stealing using low overhead CAS operations

Page 6: Optimized Distributed Work-Stealing - Vivek Kumar · Vivek Kumar1, Karthik Murthy1, Vivek Sarkar1, Yili Zheng2 1 Rice University 2 Lawrence Berkeley National Laboratory Optimized

Distributed Work-Stealing

Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016

Motivating Analysis

6

Intra-node steals

Inter-node steals

•  Inter-node steals are much costlier than intra-node steals

Page 7: Optimized Distributed Work-Stealing - Vivek Kumar · Vivek Kumar1, Karthik Murthy1, Vivek Sarkar1, Yili Zheng2 1 Rice University 2 Lawrence Berkeley National Laboratory Optimized

Failed Steal Attempts

Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016

Motivating Analysis

7

•  Thief fails to steal a task from victim

Intra-node failed steals

Inter-node failed steals

Inter-node failed steals are more

costly than intra-node steals

Chances to fail with

same victim multiple times

Page 8: Optimized Distributed Work-Stealing - Vivek Kumar · Vivek Kumar1, Karthik Murthy1, Vivek Sarkar1, Yili Zheng2 1 Rice University 2 Lawrence Berkeley National Laboratory Optimized

Our Approach

•  Use HabaneroUPC++ PGAS library for multicore cluster [Kumar et. al., PGAS 2014] –  Several asynchronous tasking APIs

•  Provide a programming model to express irregular computation

•  Implement a high performance distributed work-stealing runtime that completely removes all inter-node failed steal attempts

Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016 8

Nod

e_A

Nod

e_X

One process with a thread pool at each node

Page 9: Optimized Distributed Work-Stealing - Vivek Kumar · Vivek Kumar1, Karthik Murthy1, Vivek Sarkar1, Yili Zheng2 1 Rice University 2 Lawrence Berkeley National Laboratory Optimized

HabaneroUPC++ Programming Model

•  C++11 lambda-function based API •  Provides serial elision and improves productivity

Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016

Implementation

9

asyncAny ( [=] {

irregular_computation();

}); //distributed work-stealing

Page 10: Optimized Distributed Work-Stealing - Vivek Kumar · Vivek Kumar1, Karthik Murthy1, Vivek Sarkar1, Yili Zheng2 1 Rice University 2 Lawrence Berkeley National Laboratory Optimized

Distributed Work-Stealing Runtime

•  Two different implementations in HabaneroUPC++ •  BaselineWS

–  Uses prior work + some optimizations

•  SuccessOnlyWS –  Extends BaselineWS by using a novel victim selection

policy that complete removes all inter-node failed steals

Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016

Implementation

10

Page 11: Optimized Distributed Work-Stealing - Vivek Kumar · Vivek Kumar1, Karthik Murthy1, Vivek Sarkar1, Yili Zheng2 1 Rice University 2 Lawrence Berkeley National Laboratory Optimized

Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016

Step-1: Failed steal at

intra-node level

Step-2: Local worker

request the leader for inter-node steal

Step-6: Remote victim can send tasks, else it’s a inter-node failed steal

BaselineWS in HabaneroUPC++

Step-4: Lock and wait for tasks from victim

Step-3: Leader finds a victim that has sufficient number of tasks (RDMA)

11

Implementation

Step-5: Remote leader attempts to steal tasks from its local thread-pool

Page 12: Optimized Distributed Work-Stealing - Vivek Kumar · Vivek Kumar1, Karthik Murthy1, Vivek Sarkar1, Yili Zheng2 1 Rice University 2 Lawrence Berkeley National Laboratory Optimized

Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016

Step-1: Failed steal at

intra-node level

Step-2: Local worker

request the leader for inter-node steal

SuccessOnlyWS in HabaneroUPC++

Step-3: Leader finds a victim that has sufficient number of tasks (RDMA)

12

Implementation

Step-5: Remote leader attempts to steal tasks from its local thread-pool

Step-6: Repeat Step-3 and Step-4

Step-4: Asynchronous task request

Step-7: One or more remote victim will send tasks. Break out of Step-5 (also if application terminates)

Page 13: Optimized Distributed Work-Stealing - Vivek Kumar · Vivek Kumar1, Karthik Murthy1, Vivek Sarkar1, Yili Zheng2 1 Rice University 2 Lawrence Berkeley National Laboratory Optimized

Methodology

•  Benchmarks –  Two UTS trees T1WL and T3WL –  NQueens

•  Computing infrastructure –  Edison supercomputer at NERSC

•  2x12 cores per node

Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016

Experimental Evaluation

13

Page 14: Optimized Distributed Work-Stealing - Vivek Kumar · Vivek Kumar1, Karthik Murthy1, Vivek Sarkar1, Yili Zheng2 1 Rice University 2 Lawrence Berkeley National Laboratory Optimized

Results

Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016

Experimental Evaluation

14

1

10

100

1000

10000

100000

1e+06

1536 3072 6144 12288

To

tal f

aile

d s

tea

ls (

log

sca

le)

Total cores

T1WL T3WL NQ

1

1.01

1.02

1.03

1.04

1.05

1.06

1.07

1.08

1.09

1.1

1536 3072 6144 12288

Sp

ee

du

p o

ver

Ba

selin

eW

S

Total cores

T1WL T3WL NQ

Note: More results are available in the paper

(a) Total inter-node failed steals in BaselineWS

(b) SuccessOnlyWS speedup over BaselineWS

Total cores (24 cores/node) Total cores (24 cores/node)

Higher inter-node failed steals in BaselineWS => Better performance in SuccessOnlyWS

T1WL T3WL NQueens

Page 15: Optimized Distributed Work-Stealing - Vivek Kumar · Vivek Kumar1, Karthik Murthy1, Vivek Sarkar1, Yili Zheng2 1 Rice University 2 Lawrence Berkeley National Laboratory Optimized

Summary and Conclusion •  Inter-node steals are costlier than intra-node steals •  Failed inter-node steals could hamper performance •  C++11 lambda function based API to in

HabaneroUPC++ to express complex irregular computation that can participate in distributed work-stealing

•  A novel implementation of distributed work-stealing runtime in HabaneroUPC++ PGAS library that completely removes all inter-node failed steals

•  Our novel runtime delivers performance benefits up to 7%

Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016 15

Summary

Page 16: Optimized Distributed Work-Stealing - Vivek Kumar · Vivek Kumar1, Karthik Murthy1, Vivek Sarkar1, Yili Zheng2 1 Rice University 2 Lawrence Berkeley National Laboratory Optimized

Backup Slides

Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016

Page 17: Optimized Distributed Work-Stealing - Vivek Kumar · Vivek Kumar1, Karthik Murthy1, Vivek Sarkar1, Yili Zheng2 1 Rice University 2 Lawrence Berkeley National Laboratory Optimized

Existing Techniques for Inter-node Stealing

Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016 17

•  Thread pool based hybrid runtimes [Lifflander et. al., HPDC’12, Paudel et. al., ICPP’13]

•  Communication worker maintain ready queue of tasks even before a remote request arrives [Paudel et. al., ICPP’13]

•  Load-aware steal attempts to reduce chances of failure [Dinan et. al., ICPP’08]

•  First try random victims and on failing contact set of victims (lifelines) that promises to send tasks whenever they have it ready [Saraswat et. al., PPoPP’11]

Related Work

Page 18: Optimized Distributed Work-Stealing - Vivek Kumar · Vivek Kumar1, Karthik Murthy1, Vivek Sarkar1, Yili Zheng2 1 Rice University 2 Lawrence Berkeley National Laboratory Optimized

Inter-node Steal Request from Thief

Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016

Implementation

18

procedure Steal_AsyncAny while (global termination is not detected) V = get a random remote rank if (V has declared task availability in PGAS space) // RDMA if (I did not try to steal from V) queue my rank at V if TryLock (V) is success

save my rank at V wait until V send tasks or decline

Unlock (V) break from while loop if I just received asyncAny tasks if I receive asyncAny tasks from any victim forget that I contacted this victim reset my task receiving status

BaselineWS Runtime

SuccessOnlyWS Runtime 1 2 3 4 5 6 7 8 9 10 11 12 13 14

Page 19: Optimized Distributed Work-Stealing - Vivek Kumar · Vivek Kumar1, Karthik Murthy1, Vivek Sarkar1, Yili Zheng2 1 Rice University 2 Lawrence Berkeley National Laboratory Optimized

Inter-node Task Transfer from Victim

Optimized Distributed Work-Stealing | Kumar et al. | IA3 2016

Implementation

19

procedure Send_AsyncAny while (there are pending inter-node steal requests) T = get rank of one of the queued remote thief steal tasks from my local workers and send to T forget that T contacted me break out of the while loop if local steal failed

T = get rank of the only waiting remote thief steal tasks from local workers and send to T declare that now I don’t have any waiting remote thief publish in PGAS space asyncAny count at my place

BaselineWS Runtime

SuccessOnlyWS Runtime

1 2 3 4 5 6 7 8 9 10 11