49
1 Alex Aiken Stanford Legion: Programming Heterogeneous, Distributed Parallel Machines Joint work involving LANL, NVIDIA & Stanford

Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

1

Alex Aiken Stanford

Legion: Programming Heterogeneous, Distributed Parallel Machines

Joint work involving LANL, NVIDIA & Stanford

Page 2: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

2

Modern Supercomputers

 Heterogeneity   Processor kinds   Relative performance

 Distributed Memory   Non-uniform   Distinct from processors

 Growing disparities   FLOPS >> bandwidth   bandwidth >> latency

Page 3: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

3

Programming System Goals

High Performance We must be fast

Performance Portability

Across many kinds of machines and over many generations

Programmability

Sequential semantics, parallel execution

Page 4: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

4

Can We Fulfill These Goals Today? Yes … at great cost:

Task graph for one time step on one node… … of a mini-app

Do you want to schedule that graph? (High Performance)

Do you want to re-schedule that graph for every new

machine? (Performance Portability)

Do you want to be responsible for generating that graph?

(Programmability)

Today: programmer’s responsibility

Tomorrow: programming system’s responsibility

Page 5: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

5

The Crux

 How do we describe the data?   Most programming systems focus on control   Minimal facilities for organization/structure of data

 Why?

 Answer: Solve the aliasing problem   Can two references refer to the same data?

 Answer: Decouple naming data from layout/location

Page 6: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

6

Legion Approach

 Capture the structure of program data

 Decouple specification from mapping

 Asynchronous tasking

 Automate   data movement   parallelism discovery   synchronization   hiding long latency

Page 7: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

7

Page 8: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

8

Example: Circuit Simulation

Page 9: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

9

Example: Circuit Simulation

task simulate_circuit(Region[Node] N, Region[Wires] W)

{ }

Tasks are the unit of parallel execution.

Logical regions are (typed) collections Logical: no implied layout no implied location

Page 10: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

10

Partitioning

Page 11: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

11

Partitioning task  simulate_circuit(Region[Node]  N,  Region[Wires]  W)    

   

 {                            [  P,  S  ]  =  par??on(ps_map,  N)                          

   }    

SP

N

Page 12: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

12

Partitioning task  simulate_circuit(Region[Node]  N,  Region[Wires]  W)    

   

 {              Array[Region[Node]]  private,  shared,  ghost;              …              [  P,  S  ]  =  par??on(ps_map,  N)              private  =  par??on(private_map,  P)              shared  =  par??on(shared_map,  S)                ghost  =  par??on(ghost_map,  S)                        

}    

N

s1 s3 …

SP

g1 g3 … p1 p3 …

Page 13: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

13

Region Trees

SP

p1 p3 … s1 s3 … g1 g3 …

N

Page 14: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

14

… g1 g3 … s1 s3 … p3

Region Trees

N

Locality

S

p1

disjoint

Independence

P

N

Page 15: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

15

Region Trees

N

Locality

N

SP

p1 p3 … s1 s3 … g1 g3 …

disjoint

Independence

Page 16: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

16

Region Trees

N

Locality

N

SP

p1 p3 … s1 s3 … g1 g3 …

Independence Aliasing

Page 17: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

17

Region Trees

N

Locality

N

SP

p1 p3 … s1 s3 … g1 g3 …

Independence Aliasing

possibly overlap

Page 18: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

18

Legion Tasks

task  calc_currents(Piece  p)  :          

 task  distribute_charge(Piece  p)  :                                                                                    

subtasks

task  simulate_circuit(Region[Node]  N,  Region[Wires]  W)  :          

{            …            calc_currents(piece[0]);            calc_currents(piece[1]);            distribute_charge(piece[0]);            distribute_charge(piece[1]);            …  }  

Page 19: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

19

Legion Tasks task  simulate_circuit(Region[Node]  N,  Region[Wires]  W)  :      

   {            …            calc_currents(piece[0],                  ,                ,                );            calc_currents(piece[1],                  ,                ,                );            distribute_charge(piece[0],                ,                  ,                );            distribute_charge(piece[1],                ,                  ,                );            …  }  

task  calc_currents(Piece  p)  :          

 task  distribute_charge(Piece  p)  :                                                                                    

N, W

p.private, p.shared, p.ghost, p.wires

p.private, p.shared, p.ghost, p.wires

A task must declare the regions it will use.

Subtask containment: A subtask can only use (sub)regions accessible to its parent task.

Tasks appear to execute in program order.

p0  

p1   s1  

s0   g0  

g1  

p0   s0   g0  

p1   s1   g1  

Page 20: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

20

Interference

If T1#T2 then T1 and T2 can execute in parallel.

Informal definition: Two tasks T1 and T2 are non-interfering T1#T2 if there is no dependence between them on their region arguments.

Page 21: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

21

task  simulate_circuit(Region[Node]  N,  Region[Wires]  W)  :      {            …            calc_currents(piece[0],                  ,                ,                );            calc_currents(piece[1],                  ,                ,                );            distribute_charge(piece[0],                ,                  ,                );            distribute_charge(piece[1],                ,                  ,                );            …  }  

Execution Model

Interferes?

Tasks are issued in program order.

Interferes? Interferes?

p0  

p1   s1  

s0   g0  

g1  

p0   s0   g0  

p1   s1   g1  

Page 22: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

22

task  simulate_circuit(Region[Node]  N,  Region[Wires]  W)  :          

{            …            calc_currents(piece[0],                  ,                ,                );            calc_currents(piece[1],                  ,                ,                );            distribute_charge(piece[0],                ,                  ,                );            distribute_charge(piece[1],                ,                  ,                );            …  }  

Privileges

N, W

p.private, p.shared, p.ghost, p.wires

p.private, p.shared, p.ghost, p.wires

ReadWrite(p.wires), Read(p.private, p.shared, p.ghost)

ReadOnly(p.wires), Reduce(p.private, p.shared, p.ghost)

ReadWrite(N,W)

task  calc_currents(Piece  p)  :          

 task  distribute_charge(Piece  p)  :                                                                                    

p0  

p1   s1  

s0   g0  

g1  

p0   s0   g0  

p1   s1   g1  

Page 23: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

23

Non-Interference Dimensions

 Several dimensions of # operator

  Entries (rows)   Privileges   Fields (columns)

 Logical regions are a relational data model

  Partitioning is selection (σ)   Field-slicing is projection (π)   Don’t support all relational operators

Node

Node

Node

Node

Node

Node

Node

Node

Node

Node

Voltage Capac. Induct. Charge

Page 24: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

24

Legion Summary  Logical regions: a relational data model

  Support partitioning and slicing   Convey locality, independence, aliasing

  Implicit task parallelism   Task may have arbitrary sub-tasks   Tasks declare region usage including privileges and fields

 Tasks appear to execute in program order   Execute in parallel when non-interference established

 Machine independent specification of application

Page 25: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

25

Page 26: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

26

Legion Runtime System

t0:: r5, r7 t1: r0, r2 t2: r1, r2 t3: r4, r6

Legion Runtime Parallel

Distributed Execution

Tasks : Regions :: Instructions : Registers

Dep. Analysis Distribute Map Execute Resolve

Spec. Complete Commit

A Distributed Hierarchical Out-of-Order Task Processor

T

T T T

T T T T

Parent Task

Page 27: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

27

Dependence Analysis

CC[0] CC[1]

task  simulate_circuit(Region[Node]  N,  Region[Wires]  W)  :        ReadWrite(N,W)  

{            …            calc_currents(piece[0],                                  );            calc_currents(piece[1]                                    );            distribute_charge(piece[0]                                  );            distribute_charge(piece[1]                                  );            …  }            task  calc_currents(Piece  p)  :    ReadWrite(p.wires),    Read(p.private,  p.shared,  p.ghost)  

   task  distribute_charge(Piece  p)  :    ReadOnly(p.wires),  Reduce(p.private,  p.shared,  p.ghost)  

N

SP

p0 p1 s0 s1 g0 g1

disjoint

read only

Page 28: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

28

Dependence Analysis

DC[0]

task  simulate_circuit(Region[Node]  N,  Region[Wires]  W)  :        ReadWrite(N,W)  

{            …            calc_currents(piece[0],                                  );            calc_currents(piece[1],                                  );            distribute_charge(piece[0],                                  );            distribute_charge(piece[1],                                  );            …  }            task  calc_currents(Piece  p)  :    ReadWrite(p.wires),    Read(p.private,  p.shared,  p.ghost)  

   task  distribute_charge(Piece  p)  :    ReadOnly(p.wires),  Reduce(p.private,  p.shared,  p.ghost)  

N

SP

p0 p1 s0 s1 g0 g1

WAR w/CC[1]

CC[1]

CC[0]

read-after-write of wires w/CC[0]

Page 29: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

29

Mapping Interface  Programmer selects:

  Where tasks run   Where regions are placed

 Mapping computed dynamically

 Decouple correctness from performance

29  

t1

t2

t3

t4t5

rc

rw

rw1 rw2

rn

rn1 rn2

$

$

$

$

N U M A

N U M A

FB

D R A M

x86

CUDA

x86

x86

x86 Dep.

Analysis Distribute Map Execute Resolve Spec. Complete Commit

Page 30: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

30

Correctness Independent of Mapping N

SP

p0 p1 s0 s1 g0 g1

Node 1 Node 0

CC[1]

CC[0]

DC[0]

copy

task  simulate_circuit(Region[Node]  N,  Region[Wires]  W)  :        ReadWrite(N,W)  

{            …            calc_currents(piece[0],                                  );            calc_currents(piece[1],                                  );            distribute_charge(piece[0],                                  );            distribute_charge(piece[1],                                  );            …  }            task  calc_currents(Piece  p)  :    ReadWrite(p.wires),    Read(p.private,  p.shared,  p.ghost)  

   task  distribute_charge(Piece  p)  :    ReadOnly(p.wires),  Reduce(p.private,  p.shared,  p.ghost)  

Page 31: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

31

Distribution

Dep. Analysis Distribute Map Execute Resolve

Spec. Complete Commit

Node 0 Node 1

T1 T2   After tasks are mapped they are distributed to target node

  Task execution can generate sub-tasks

  Do we need inter-node dependence checks?

Subtask containment: A subtask can only use (sub)regions accessible to its parent task.

t1 t2

Page 32: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

32

Independence Theorem

Let t1 be a subtask of T1 and t2 be a subtask of T2. Then

T1#T2 => t1#t2

T1 T2

t1 t2

#

#

Page 33: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

33

Independence Theorem

Let t1 be a subtask of T1 and t2 be a subtask of T2. Then

T1#T2 => t1#t2

Proof: Use subtask containment.

Note: Similar property holds in functional languages, but it holds in Legion even though we may imperatively mutate regions.

Observation: It is sufficient to test interference only of sibling tasks.

Page 34: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

34

Runtime Summary

 A distributed hierarchical out-of-order task processor   Analogous to hardware processors

 Can exploit parallelism implicitly:   Task-, data-, and nested-parallelism

 Runtime builds task graph ahead of execution to hide latency and costs of dynamic analysis

 Decouples mapping decisions from correctness   Enables efficient porting and (auto) tuning

Page 35: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

35

Page 36: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

36

S3D

 Production combustion simulation

 Written in ~200K lines of Fortran

 Direct numerical simulation using explicit methods

Page 37: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

37

S3D Versions  Supports many chemical mechanism

  DME (30 species) Heptane (52 species)

 Fortran + MPI Vectorizes well   MPI used for multi-core

  “Hybrid” OpenACC   Recent work by Cray/Nvidia/DoE

 Legion interoperates with MPI

25

Detailed chemical kinetics are expensive

component in the simulation of chemically reacting flows. It isimportant because the fidelity of all subsequent steps of mecha-nism reduction depends on the fidelity of the detailed mechanism.In other words, the comprehensiveness of a reduced mechanismcannot exceed that of the detailed mechanism from which it isdeduced. This is a challenging task because, firstly, it is difficult tobe certain that all possible important species and reactions areidentified and included in the detailed mechanism. Furthermore,the number of reactions and species involved is large, and thedetermination of the rate constants of each of the identified reac-tions, either experimentally or computationally, is not a trivial task.

Lacking a systematic, first-principle procedure to identify allrelevant species and reactions that would render a mechanismcomprehensive, comprehensiveness can be considered based on theability of the mechanism to describe combustion phenomena asextensively as possible. There are two levels of considerations. First,since the nature of the collision dynamics is determined by theidentity of the colliding molecules as well as the frequency andenergetics of the collision, a comprehensive chemical description interms of themacroscopic thermodynamic properties would requireextensive coverage in the range of temperature, pressure, andcomposition of the reacting mixture. Second, in terms of combus-tion phenomena, comprehensiveness would require considerationsof homogeneous and diffusive ignition which cover low-, interme-diate- and high-temperature chemistry, steady burning andextinction which cover high-temperature chemistry, and premixedand nonpremixed flames which cover the relative concentrationsand mixedness of fuel and oxidizer. The global combustionresponses of interest would include the laminar flame speed, igni-tion and extinction strain rates, detonation induction length,detailed thermal and concentration structures of flames and deto-nations, oscillatory and pulsed unsteady effects to potentiallydiscriminate reactions of different time scales, and pollutantchemistry.

A final requirement for comprehensiveness is fuel hierarchy. Forexample, since hydrogen and CO oxidation constitute a part ofmethane oxidation, a methane mechanism must degenerate tothose for hydrogen and CO when all elementary reactions notrelated to them are stripped away. Thus amechanism developed fora fuel must contain descriptions of its intermediates and simplerfuels as its sub-mechanisms.

It is clear that since the size of a mechanism depends on theextent of comprehensiveness, some reduction can be achieved forrestricted comprehensiveness. Perhaps the most obvious restriction

is to fix the pressure to atmospheric because many fundamentaland practical combustion phenomena and processes take placeunder atmospheric pressure. Other restrictions can also beimposed, such as lean combustion, high-temperature flameswithout considering the possible presence of ignition described bylow-temperature chemistry, and homogeneous charge combustion.However, except for well-controlled laboratory-scale experiments,the combustion mode is frequently a mixed one in most complexand practical combustion situations, involving for example bothpremixed and nonpremixed reactants, or both ignition and flames.Consequently it is more conservative to apply unrestrictedcomprehensive mechanisms in simulations of complex flows.

4. Overview of mechanism reduction andfacilitated computation

The availability of a comprehensive detailed reaction mecha-nism does not mean that it can be readily adopted for computa-tional simulation. In fact, except for the smallest of fuels such ashydrogen andmethane, and for such simple combustion systems asthe 1-D laminar flame, detailed mechanisms of the larger fuels aresimply too large for simulation without substantial reduction.Fig. 10 shows the size of more than 20 detailed and moderatelyreduced skeletal mechanisms for hydrocarbon fuels of variousmolecular complexities compiled over the last two decades [15].Several interesting observations can be made here. First, thenumber of species, K, and reactions, I, increase with the size of themolecule, roughly in an exponential trend. Specifically, it is seenthat while typical mechanisms for C1 and C2 species consist of lessthan about a hundred species, those for realistic engine fuels consistof hundreds of species and thousands of reactions. Mechanisms ofsuch sizes are even difficult to apply in 1-D flame simulations. As anextreme example, the size of the compiled detailed mechanism formethyl decanoate [16], a biomass fuel surrogate, consists of 3036species and 8555 reactions. Computation using this mechanism istime consuming even for 0-D simulations.

The second observation from Fig. 10 is that the size of themechanisms tends to grow with time, as new discoveries inchemical kinetics are continuously being made. Furthermore, theemergence of computer-aided automatic mechanism generation[17–20] and computer software for rate parameter evaluation, such

10-5

10-4

10-3

10-2

10-1

0.5 0.6 0.7 0.8 0.9 1.0

Methane/Air, φ=1.0, p=1 atm

GRI-Mech 1.212-Step10-Step4-Step

Auto

-Igni

tion

Del

ay (s

ec)

1000/T (1/K)

Fig. 9. Comparison of predicted ignition delay times of atmospheric, stoichiometricmethane–air mixtures using various reduced mechanisms and the detailed mecha-nism, showing the inadequacy of the four-step class of mechanisms.

101 102 103 104

102

103

104

before 20002000 to 2005after 2005

iso-octane (LLNL)

iso-octane (ENSIC-CNRS)

n-butane (LLNL)

CH4 (Konnov)

neo-pentane (LLNL)

C2H4 (San Diego)

CH4 (Leeds)

MethylDecanoate(LLNL)

C16 (LLNL)

C14 (LLNL)C12 (LLNL)

C10 (LLNL)

USC C1-C4USC C2H4

PRF

n-heptane (LLNL)

skeletal iso-octane (Lu & Law)skeletal n-heptane (Lu & Law)

1,3-ButadieneDME (Curran)C1-C3 (Qin et al)

GRI3.0

Num

ber o

f rea

ctio

ns, I

Number of species, K

GRI1.2

I = 5K

Fig. 10. Size of selected detailed and skeletal mechanisms for hydrocarbon fuels,together with the approximate years when the mechanisms were compiled.

T.F. Lu, C.K. Law / Progress in Energy and Combustion Science 35 (2009) 192–215196

From Lu and Law, PECS, 2009

•  Chemical source term evaluation is computationally intensive

•  Thousands of elementary reaction steps accumulated to global species reaction rates

•  Often the target for model reductions or algorithmic improvements

•  How fast can we compute detailed chemical kinetics on accelerators?

5

Planning the science simulation

•  Recent 3D simulation on Jaguar was used to extrapolate and plan a target Titan simulation

•  Planned simulation will have more grid points and/or larger chemistry

•  Will need a month on 12,000 hybrid nodes of Titan

Figure 5: Computational domain and grid to be used for simulations of the CRF HCCI engine.

Figure 6: Reaction and diffusion structures for OH radical during the third stage thermal explosion of a high-pressureDME fueled autoignition process.Recent 3D DNS of auto-ignition with 30-species

DME chemistry (Bansal et al. 2011)

Page 38: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

38

Parallelism in S3D

 Data is large 3D cartesian grid of cells

 Typical per-node subgrid is 483 or 643 cells   Nearly all kernels are per-cell   Embarrassingly data parallel

 Hundreds of tasks   Significant task-level parallelism

 Except…   Computational intensity is low   Large working sets per cell (1000s of temporaries)   Performance limiter is data, not compute

Page 39: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

39

S3D Tasks in Legion

rhs  =  rhsf(q)  

Page 40: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

40

S3D Task Parallelism

 One call to Right-Hand-Side-Function (RHSF) as seen by the Legion runtime

  Called 6 times per time step by Runge-Kutta solver   Width == task parallelism   H2 mechanism (only 9 species) Heptane (52 species) is significantly wider

 Manual task scheduling would be difficult!

Page 41: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

41

Mapping for Heptane 483

4 AMD Interlagos

Integer cores for Legion Runtime

8 AMD Interlagos FP

cores for application

NVIDIA Kepler K20

Dynamic Analysis for (rhsf+2) Clean-up/meta tasks

Page 42: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

42

Heptane Mapping for 963  Handle larger problem sizes per node

  Higher computation-to-communication ratios   More power efficient

 Not enough room in 6 GB GPU framebuffer OpenACC requires code changes

 Legion analysis is independent of problem size   Larger tasks -> fewer runtime cores

Page 43: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

43

Page 44: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

44

Legion S3D DME Performance

 1.71X - 2.33X faster between 1024 and 8192 nodes  Larger problem sizes have higher efficiency

1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192Nodes

0

50000

100000

150000

200000

250000

Thro

ughp

utPe

rNod

e(P

oint

s/s)

Titan Legion 483

Titan Legion 643

Titan Legion 963

Keeneland Legion 483

Keeneland Legion 643

Keeneland Legion 963

Titan OpenACC 483

Titan OpenACC 643

Page 45: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

45

Legion Heptane Performance  1.73X - 2.85X faster between 1024 and 8192 nodes  Higher throughput on Keeneland (balanced CPU+GPUs)

1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192Nodes

0

50000

100000

150000

200000

Thro

ughp

utPe

rNod

e(P

oint

s/s)

Titan Legion 483

Titan Legion 643

Titan Legion 963

Keeneland Legion 483

Keeneland Legion 643

Keeneland Legion 963

Titan OpenACC 483

Titan OpenACC 643

Page 46: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

46 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192

Nodes

0

10000

20000

30000

40000

50000

60000

70000

80000

Thro

ughp

utPe

rNod

e(P

oint

s/s)

Mixed CPU-GPU 483

Mixed CPU-GPU 643

All-GPU 323

All-GPU 483

Legion PRF Performance  116 species mechanism, >2X as large as heptane  Legion uses different mapping approach

Page 47: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

47

Page 48: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

48

Liszt Legion

local liszt InitLength(e : dragon.edges) var delta = e.head.pos - e.tail.pos e.rest_len = L.len(delta) end

dragon.edges:foreach(InitLength)

Phase Analysis Legion Permissions

edges.head : LEGION_READ edges.tail : LEGION_READ edges.rest_len : LEGION_READ_WRITE

CF AF IL CF AF

ME ME

Legion Task Graph

dragon.edges:map(InitLength) for i = 1,300 do dragon.vertices:foreach(ComputeForces) dragon.vertices:foreach(ApplyForces) dragon.vertices:foreach(MeasureEnergy) end

Seq. Bulk Data-Parallel

Page 49: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries

49

Legion

 Legion website: http://legion.stanford.edu

Github repo: http://github.com/stanfordlegion

 Questions?