E.T. International, Inc.
X-Stack: Programming Challenges, Runtime Systems, and Tools
Brandywine Team
May 2013
DynAX: Innovations in Programming Models, Compilers, and Runtime Systems for Dynamic Adaptive Event-Driven Execution Models
Objectives
• Scalability: expose, express, and exploit O(10^10) concurrency
• Locality: locality-aware data types, algorithms, and optimizations
• Programmability: easy expression of asynchrony, concurrency, and locality
• Portability: stack portability across heterogeneous architectures
• Energy Efficiency: maximize static and dynamic energy savings while managing the tradeoff between energy efficiency, resilience, and performance
• Resilience: gradual degradation in the face of many faults
• Interoperability: leverage legacy code through a gradual transformation toward exascale performance
• Applications: support NWChem
Brandywine X-Stack Software Stack
• SWARM (Runtime System)
• SCALE (Compiler)
• HTA (Library)
• R-Stream (Compiler)
• NWChem + Co-Design Applications
• Rescinded Primitive Data Types
SWARM vs. MPI, OpenMP, OpenCL
• SWARM: asynchronous event-driven tasks, with dependencies, resources, active messages, and control migration
• MPI, OpenMP, OpenCL: communicating sequential processes and bulk-synchronous message passing
[Figure: side-by-side timelines of the two models; the bulk-synchronous timeline shows threads waiting at synchronization points, while SWARM keeps threads active.]
SWARM
• Principles of Operation
  - Codelets: the basic unit of parallelism; nonblocking tasks scheduled upon satisfaction of their precedent constraints (see the sketch below)
  - Hierarchical locale tree: spatial position, data locality
  - Lightweight synchronization
  - Active Global Address Space (planned)
• Dynamics
  - Asynchronous split-phase transactions: latency hiding
  - Message-driven computation
  - Control-flow and dataflow futures
  - Error handling
  - Fault tolerance (planned)
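To make the codelet idea concrete, here is a minimal sketch in C. This is not the real SWARM API: codelet_t, dep_satisfy(), and the direct call in place of a scheduler queue are hypothetical stand-ins for illustration only.

    /* Hypothetical codelet sketch (NOT the SWARM API): a codelet is a
       nonblocking task that becomes runnable once its dependence count
       reaches zero. */
    #include <stdatomic.h>
    #include <stdio.h>

    typedef struct codelet {
        void (*fire)(void *arg);   /* nonblocking body; runs to completion */
        void *arg;
        atomic_int unmet_deps;     /* precedent constraints still unmet */
    } codelet_t;

    /* A predecessor calls this when it satisfies one dependence; the last
       satisfaction fires the codelet (a real runtime would enqueue it on a
       locale-aware scheduler rather than call it inline). */
    static void dep_satisfy(codelet_t *c) {
        if (atomic_fetch_sub(&c->unmet_deps, 1) == 1)
            c->fire(c->arg);
    }

    static void hello(void *arg) { printf("codelet fired: %s\n", (const char *)arg); }

    int main(void) {
        codelet_t c = { hello, "both inputs ready", 2 };
        dep_satisfy(&c);   /* one dependence left; nothing happens yet */
        dep_satisfy(&c);   /* count hits zero; the codelet fires */
        return 0;
    }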
Cholesky DAG
• POTRF → TRSM
• TRSM → GEMM, SYRK
• SYRK → POTRF
[Figure: the first three levels of the tiled-Cholesky task DAG over POTRF, TRSM, SYRK, and GEMM tasks.]
• Implementations: OpenMP, SWARM (see the sketch below)
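One way to express this DAG in the OpenMP implementation is with task dependences; the sketch below assumes an NT x NT array of tiles and potrf/trsm/syrk/gemm wrappers over the corresponding LAPACK/BLAS kernels (the tile sizes, the array, and the wrappers are assumptions, not code from the slides).

    /* Tiled Cholesky: the DAG above falls out of the depend() clauses. */
    enum { NT = 8, T = 64 };              /* tile count and tile edge (assumed) */
    typedef double tile_t[T * T];
    extern tile_t A[NT][NT];              /* the tiled matrix */
    void potrf(tile_t d);                 /* assumed LAPACK/BLAS wrappers */
    void trsm(const tile_t d, tile_t b);
    void syrk(const tile_t a, tile_t c);
    void gemm(const tile_t a, const tile_t b, tile_t c);

    void cholesky_tiled(void) {
        #pragma omp parallel
        #pragma omp single
        for (int k = 0; k < NT; k++) {
            #pragma omp task depend(inout: A[k][k])
            potrf(A[k][k]);                          /* POTRF -> TRSM */
            for (int i = k + 1; i < NT; i++) {
                #pragma omp task depend(in: A[k][k]) depend(inout: A[i][k])
                trsm(A[k][k], A[i][k]);              /* TRSM -> GEMM, SYRK */
            }
            for (int i = k + 1; i < NT; i++) {
                #pragma omp task depend(in: A[i][k]) depend(inout: A[i][i])
                syrk(A[i][k], A[i][i]);              /* SYRK -> next POTRF */
                for (int j = k + 1; j < i; j++) {
                    #pragma omp task depend(in: A[i][k], A[j][k]) depend(inout: A[i][j])
                    gemm(A[i][k], A[j][k], A[i][j]);
                }
            }
        }
    }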
Cholesky Decomposition: Xeon
[Figure: speedup over serial vs. number of threads (1-12) for naïve OpenMP, tuned OpenMP, and SWARM.]
Cholesky Decomposition: Xeon Phi
[Figure: OpenMP vs. SWARM throughput on Xeon Phi at 240 threads.]
OpenMP's fork-join model suffers on many-core chips such as Xeon Phi because every parallel region ends in a global barrier; SWARM removes these synchronizations (the contrast is sketched below).
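The difference shows up directly in code. A fork-join loop pays an implicit barrier at the end of every parallel region, so 240 threads repeatedly wait on the slowest iteration; a dependence-driven version (sketched here with OpenMP tasks standing in for SWARM codelets) waits only on the data each task actually uses. NSTEPS, N, x[], and update() are hypothetical.

    enum { N = 4096, NSTEPS = 100 };    /* assumed sizes */
    extern double x[N];
    void update(int step, int i);       /* assumed to touch only x[i] */

    /* Fork-join: an implicit barrier ends every step. */
    void forkjoin_version(void) {
        for (int step = 0; step < NSTEPS; step++) {
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                update(step, i);        /* all threads barrier here, NSTEPS times */
        }
    }

    /* Dependence-driven: each i-chain orders itself through x[i];
       there is no global barrier between steps. */
    void dataflow_version(void) {
        #pragma omp parallel
        #pragma omp single
        for (int step = 0; step < NSTEPS; step++)
            for (int i = 0; i < N; i++) {
                #pragma omp task depend(inout: x[i])
                update(step, i);
            }
    }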
Cholesky: SWARM vs. ScaLAPACK/MKL
[Figure: GFLOPS vs. number of nodes for ScaLAPACK/MKL and SWARM; 16-node cluster of Intel Xeon E5-2670, 16 cores at 2.6 GHz.]
Asynchrony is key in large dense linear algebra.
Code Transition to Exascale
1. Determine application execution, communication, and data access patterns.
2. Find ways to accelerate application execution directly.
3. Consider the data access pattern to better lay out data across distributed heterogeneous nodes.
4. Convert single-node synchronization to asynchronous control flow/dataflow (OpenMP → asynchronous scheduling).
5. Remove bulk-synchronous communications where possible (MPI → asynchronous communication; see the sketch below).
6. Synergize inter-node and intra-node code.
7. Determine further optimizations afforded by the asynchronous model.
This method was successfully deployed for the NWChem code transition.
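Step 5 is the classic overlap transformation. A minimal sketch in C with MPI, assuming a 1-D halo exchange; compute_interior() and compute_boundary() are hypothetical application routines.

    #include <mpi.h>

    void compute_interior(double *buf, int n);   /* hypothetical */
    void compute_boundary(double *buf, int n);   /* hypothetical */

    /* Post nonblocking halo traffic, compute the interior while messages
       are in flight, then finish the cells that need the halos. */
    void exchange_and_compute(double *buf, int n, int left, int right) {
        MPI_Request reqs[4];
        MPI_Irecv(&buf[0],     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&buf[n - 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
        MPI_Isend(&buf[1],     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]);
        MPI_Isend(&buf[n - 2], 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);

        compute_interior(buf, n);                /* overlaps communication */

        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
        compute_boundary(buf, n);                /* needs the received halos */
    }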
Self-Consistent Field Module from NWChem
• NWChem is used by thousands of researchers.
• The code is designed to be highly scalable, up to petaflop scale.
• Thousands of person-hours have been spent on tuning and performance.
• The Self-Consistent Field (SCF) module is a key component of NWChem.
• As part of the DOE X-Stack program, ETI has worked with PNNL to extract the algorithm from NWChem to study how to improve it (the iteration is outlined below).
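For orientation, the SCF iteration itself has a simple shape. This is a sketch of the textbook algorithm, not NWChem's code; build_fock(), diagonalize(), and make_density() are hypothetical helpers.

    /* Rebuild the Fock matrix from the current density, solve the
       eigenproblem, form a new density, and repeat until self-consistent. */
    void build_fock(double *F, const double *D, int n);     /* two-electron work dominates */
    void diagonalize(const double *F, double *C, int n);    /* eigensolve, e.g. LAPACK dsyev */
    double make_density(double *D, const double *C, int n); /* returns ||D_new - D_old|| */

    void scf_loop(double *D, double *F, double *C, int n, double tol) {
        double delta;
        do {
            build_fock(F, D, n);
            diagonalize(F, C, n);
            delta = make_density(D, C, n);
        } while (delta > tol);
    }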
Serial Optimizations
[Figure: serial speedup from successive optimizations: original, symmetry of g(), BLAS/LAPACK, precomputed g values, and Fock matrix symmetry.]
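One optimization from the chart, Fock matrix symmetry, is easy to illustrate: the Fock matrix is symmetric, so only the upper triangle needs computing and the lower triangle can be mirrored. fock_element() is a hypothetical per-element kernel, not the NWChem routine.

    double fock_element(const double *D, int n, int i, int j);  /* hypothetical */

    /* Build only the upper triangle (roughly half the element work),
       then mirror it into the lower triangle. */
    void build_fock_symmetric(double *F, const double *D, int n) {
        for (int i = 0; i < n; i++)
            for (int j = i; j < n; j++) {
                double fij = fock_element(D, n, i, j);
                F[i * n + j] = fij;
                F[j * n + i] = fij;
            }
    }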
Single-Node Parallelization
[Figure: speedup vs. number of threads (1-31) for OpenMP variants (dynamic, guided, and static scheduling, versions 1-3 of each), SWARM, and the ideal line.]
Multi-Node Parallelization
[Figure, left: SCF multi-node execution time in seconds (log scale) vs. number of cores (16-2048) for SWARM and MPI. Figure, right: SCF multi-node speedup over a single node vs. number of cores for SWARM and MPI.]
Information Repository
• All of this information is available in more detail at the X-Stack wiki: http://www.xstackwiki.com
Questions?
Acknowledgements
• Co-PIs:
  - Benoit Meister (Reservoir)
  - David Padua (Univ. Illinois)
  - John Feo (PNNL)
• Other team members:
  - ETI: Mark Glines, Kelly Livingston, Adam Markey
  - Reservoir: Rich Lethin
  - Univ. Illinois: Adam Smith
  - PNNL: Andres Marquez
• DOE: Sonia Sachs, Bill Harrod