Page 1:

E.T. International, Inc.

X-Stack: Programming Challenges, Runtime Systems, and Tools

Brandywine Team

May 2013

DynAX: Innovations in Programming Models, Compilers, and Runtime Systems for Dynamic Adaptive Event-Driven Execution Models

Page 2:

Objectives

Scalability: Expose, express, and exploit O(10^10) concurrency

Locality: Locality-aware data types, algorithms, and optimizations

Programmability: Easy expression of asynchrony, concurrency, and locality

Portability: Stack portability across heterogeneous architectures

Energy Efficiency: Maximize static and dynamic energy savings while managing the tradeoff between energy efficiency, resilience, and performance

Resilience: Gradual degradation in the face of many faults

Interoperability: Leverage legacy code through a gradual transformation toward exascale performance

Applications: Support NWChem

Page 3:

Brandywine X-Stack Software Stack

SWARM (Runtime System)

SCALE (Compiler)

HTA (Library)

R-Stream (Compiler)

NWChem + Co-Design Applications

Rescinded Primitive Data Types

Page 4:

SWARM vs. MPI, OpenMP, OpenCL

SWARM: asynchronous event-driven tasks, with dependencies, resources, active messages, and control migration

vs.

MPI, OpenMP, OpenCL: communicating sequential processes and bulk-synchronous message passing

[Timeline figure: threads stay active under SWARM, while bulk-synchronous execution leaves threads waiting]

Page 5:

SWARM

• Principles of Operation

Codelets (see the conceptual sketch below)
* Basic unit of parallelism
* Nonblocking tasks
* Scheduled upon satisfaction of precedent constraints

Hierarchical Locale Tree: spatial position, data locality

Lightweight Synchronization

Active Global Address Space (planned)

• Dynamics

Asynchronous Split-phase Transactions: latency hiding

Message-Driven Computation

Control-Flow and Dataflow Futures

Error Handling

Fault Tolerance (planned)
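
To make the codelet bullets concrete, here is a minimal conceptual sketch in plain C. It is not the SWARM API; the names (codelet_t, satisfy_dependence) are hypothetical. It models a nonblocking task guarded by a count of unsatisfied precedent constraints: the task fires only when its last dependence is satisfied, which is the scheduling rule the bullets describe.

    #include <stdio.h>
    #include <stdatomic.h>

    /* Conceptual model of a codelet (illustrative C, not the SWARM API):
       a nonblocking body plus a count of unsatisfied precedent constraints. */
    typedef struct {
        void (*fn)(void *arg);   /* nonblocking task body */
        void *arg;
        atomic_int unmet_deps;   /* precedent constraints still unsatisfied */
    } codelet_t;

    /* A predecessor calls this when it completes.  The codelet becomes
       runnable only when its last dependence is satisfied; a real runtime
       would enqueue it on a locale-aware scheduler rather than run it inline. */
    static void satisfy_dependence(codelet_t *c)
    {
        if (atomic_fetch_sub(&c->unmet_deps, 1) == 1)
            c->fn(c->arg);
    }

    static void body(void *arg) { printf("codelet ran: %s\n", (const char *)arg); }

    int main(void)
    {
        codelet_t c = { body, "two predecessors done", 2 };
        satisfy_dependence(&c);  /* first predecessor finishes: not runnable yet */
        satisfy_dependence(&c);  /* second predecessor finishes: codelet fires  */
        return 0;
    }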

Page 6:

Cholesky DAG

• POTRF → TRSM
• TRSM → GEMM, SYRK
• SYRK → POTRF

[DAG figure: the first three steps of the task graph over POTRF, TRSM, SYRK, and GEMM tiles]

• Implementations: OpenMP, SWARM (OpenMP sketch below)
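
A minimal sketch of how the OpenMP implementation can express this DAG, assuming a tiled lower-triangular matrix and LAPACKE/CBLAS tile kernels; the task depend clauses encode exactly the POTRF → TRSM → {SYRK, GEMM} → POTRF edges listed above. This is illustrative, not the code used for the measurements, and the SWARM version (which builds the same graph with codelets) is not shown.

    #include <cblas.h>
    #include <lapacke.h>

    /* Tiled Cholesky: A is an nt x nt array of pointers to ts x ts column-major
       tiles holding the lower triangle of a symmetric positive-definite matrix. */
    void cholesky_tiled(int nt, int ts, double *A[nt][nt])
    {
        #pragma omp parallel
        #pragma omp single
        for (int k = 0; k < nt; k++) {
            #pragma omp task depend(inout: A[k][k])                    /* POTRF */
            LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', ts, A[k][k], ts);

            for (int i = k + 1; i < nt; i++) {                         /* POTRF -> TRSM */
                #pragma omp task depend(in: A[k][k]) depend(inout: A[i][k])
                cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                            CblasNonUnit, ts, ts, 1.0, A[k][k], ts, A[i][k], ts);
            }
            for (int i = k + 1; i < nt; i++) {
                #pragma omp task depend(in: A[i][k]) depend(inout: A[i][i])  /* TRSM -> SYRK */
                cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                            ts, ts, -1.0, A[i][k], ts, 1.0, A[i][i], ts);

                for (int j = k + 1; j < i; j++) {                      /* TRSM -> GEMM */
                    #pragma omp task depend(in: A[i][k], A[j][k]) depend(inout: A[i][j])
                    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                                ts, ts, ts, -1.0, A[i][k], ts, A[j][k], ts,
                                1.0, A[i][j], ts);
                }
            }
        }  /* each step's SYRK on A[k+1][k+1] feeds the next step's POTRF */
    }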

Page 7:

Cholesky Decomposition: Xeon

[Line chart: speedup over serial vs. number of threads (1 to 12) for naïve OpenMP, tuned OpenMP, and SWARM]

Page 8:

Cholesky Decomposition: Xeon Phi

[Bar chart: OpenMP vs. SWARM on Xeon Phi with 240 threads]

OpenMP fork-join programming suffers on many-core chips (e.g. Xeon Phi).

SWARM removes these synchronizations.
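
For contrast with the task-graph formulation on page 6, the fork-join version looks roughly like the sketch below. This is a generic illustration, not the benchmarked code, and the tile kernels (potrf_tile, trsm_tile, update_tile) are hypothetical stand-ins for dpotrf/dtrsm/dsyrk/dgemm. Each outer step ends in implicit barriers, so on a 240-thread Xeon Phi every thread idles until the slowest tile of that phase completes.

    /* Hypothetical per-tile kernels standing in for dpotrf/dtrsm/dsyrk/dgemm. */
    void potrf_tile(int k);
    void trsm_tile(int k, int i);
    void update_tile(int k, int i, int j);

    /* Fork-join tiled Cholesky: one parallel loop (and one implicit barrier)
       per phase of every outer step, i.e. O(nt) global synchronizations. */
    void cholesky_forkjoin(int nt)
    {
        for (int k = 0; k < nt; k++) {
            potrf_tile(k);                       /* serial bottleneck each step */

            #pragma omp parallel for
            for (int i = k + 1; i < nt; i++)
                trsm_tile(k, i);
            /* implicit barrier: all threads wait for the slowest TRSM */

            #pragma omp parallel for schedule(dynamic)
            for (int i = k + 1; i < nt; i++)
                for (int j = k + 1; j <= i; j++)
                    update_tile(k, i, j);        /* SYRK on the diagonal, GEMM elsewhere */
            /* implicit barrier again before the next POTRF */
        }
    }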

Page 9:

Cholesky: SWARM vs. ScaLAPACK/MKL

16-node cluster: Intel Xeon E5-2670, 16 cores, 2.6 GHz

Asynchrony is key in large dense linear algebra.

[Line chart: GFLOPS vs. number of nodes (2 to 64) for ScaLAPACK/MKL and SWARM]

Page 10:

Code Transition to Exascale

1. Determine the application's execution, communication, and data access patterns.

2. Find ways to accelerate application execution directly.

3. Use the data access patterns to lay out data better across distributed heterogeneous nodes.

4. Convert single-node synchronization to asynchronous control-flow/data-flow (OpenMP → asynchronous scheduling).

5. Remove bulk-synchronous communication where possible (MPI → asynchronous communication; see the sketch below).

6. Integrate inter-node and intra-node code so the two levels work together.

7. Determine further optimizations afforded by the asynchronous model.

This method was successfully deployed for the NWChem code transition.
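
As a generic illustration of step 5 (this is not NWChem code), the sketch below replaces a bulk-synchronous halo exchange with nonblocking sends and receives so that computation on interior data overlaps communication; compute_interior and compute_boundary are hypothetical work routines.

    #include <mpi.h>

    /* Hypothetical work routines: interior work needs no remote data,
       boundary work needs the received halo. */
    void compute_interior(void);
    void compute_boundary(const double *halo);

    /* Bulk-synchronous: exchange first, then compute; every rank waits. */
    void step_bsp(double *send, double *recv, int n, int left, int right, MPI_Comm comm)
    {
        MPI_Sendrecv(send, n, MPI_DOUBLE, right, 0,
                     recv, n, MPI_DOUBLE, left,  0, comm, MPI_STATUS_IGNORE);
        compute_interior();                      /* could have overlapped the exchange */
        compute_boundary(recv);
    }

    /* Asynchronous: post the exchange, do independent work, then complete it. */
    void step_async(double *send, double *recv, int n, int left, int right, MPI_Comm comm)
    {
        MPI_Request reqs[2];
        MPI_Irecv(recv, n, MPI_DOUBLE, left,  0, comm, &reqs[0]);
        MPI_Isend(send, n, MPI_DOUBLE, right, 0, comm, &reqs[1]);
        compute_interior();                      /* overlaps the communication */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        compute_boundary(recv);                  /* now the halo has arrived */
    }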

Page 11:

Self-Consistent Field Module from NWChem

• NWChem is used by thousands of researchers.

• The code is designed to be highly scalable, up to the petaflop scale.

• Thousands of man-hours have been spent on tuning and performance.

• The Self-Consistent Field (SCF) module is a key component of NWChem.

• As part of the DOE X-Stack program, ETI has worked with PNNL to extract the SCF algorithm from NWChem and study how to improve it.

Page 12:

Serial Optimizations

[Bar chart: speedup from successive serial optimizations (original code, symmetry of g(), BLAS/LAPACK, precomputed g values, Fock matrix symmetry); y-axis: speedup, 0 to 18]
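
To illustrate the last bar (Fock matrix symmetry), and not as the actual NWChem kernel: the Fock matrix is symmetric, so only its lower triangle needs to be computed and the upper triangle can be mirrored, roughly halving the work of the matrix build. fock_element is a hypothetical stand-in for the contraction of the density matrix with the two-electron integrals g.

    /* Hypothetical kernel: contracts row/column (mu, nu) of the density
       matrix D with the two-electron integrals to give one Fock element. */
    double fock_element(int mu, int nu, const double *D, int n);

    /* Build only the lower triangle of the symmetric Fock matrix F (n x n,
       row-major) and mirror it, instead of computing every element. */
    void build_fock_symmetric(double *F, const double *D, int n)
    {
        for (int mu = 0; mu < n; mu++) {
            for (int nu = 0; nu <= mu; nu++) {
                double f = fock_element(mu, nu, D, n);
                F[mu * n + nu] = f;
                F[nu * n + mu] = f;   /* symmetry: F[mu][nu] == F[nu][mu] */
            }
        }
    }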

Page 13:

Single Node Parallelization

[Line chart: "Speedup of OpenMP versions and SWARM"; speedup vs. number of threads (1 to 32) for OpenMP with dynamic, guided, and static schedules (variants v1 to v3), SWARM, and an ideal-scaling curve]
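
For reference, the dynamic, guided, and static families in the chart differ in the OpenMP schedule clause on the parallel loop; the v1, v2, and v3 code variants themselves are not reproduced here. A generic sketch, with do_block and n_blocks as hypothetical placeholders for the SCF work units:

    void do_block(int b);   /* hypothetical unit of SCF work with uneven cost */

    void run(int n_blocks)
    {
        /* static: fixed contiguous chunks per thread; lowest overhead,
           but sensitive to load imbalance. */
        #pragma omp parallel for schedule(static)
        for (int b = 0; b < n_blocks; b++) do_block(b);

        /* dynamic: threads grab small chunks as they finish; balances load
           at the cost of scheduling overhead. */
        #pragma omp parallel for schedule(dynamic, 4)
        for (int b = 0; b < n_blocks; b++) do_block(b);

        /* guided: chunk size shrinks over time; a compromise between the two. */
        #pragma omp parallel for schedule(guided)
        for (int b = 0; b < n_blocks; b++) do_block(b);
    }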

Page 14:

Multi-Node Parallelization

[Chart: "SCF Multi-Node Execution Scaling"; execution time in seconds vs. number of cores (16 to 2048) for SWARM and MPI]

[Chart: "SCF Multi-Node Speedup"; speedup over a single node vs. number of cores (16 to 2048) for SWARM and MPI]

Page 15:

Information Repository

• All of this information is available in more detail on the X-Stack wiki: http://www.xstackwiki.com

Page 16:

Questions?

Page 17:

Acknowledgements

• Co-PIs:

Benoit Meister (Reservoir)

David Padua (Univ. Illinois)

John Feo (PNNL)

• Other team members:

ETI: Mark Glines, Kelly Livingston, Adam Markey

Reservoir: Rich Lethin

Univ. Illinois: Adam Smith

PNNL: Andres Marquez

• DOE: Sonia Sachs, Bill Harrod