Application Challenges for Sustained Petascale
William Gropp
www.cs.illinois.edu/~wgropp

Performance, then Productivity

• Note the “then” – not “instead of”
  – For “easier” problems, it is correct to invert these
• For the very hardest problems, we must focus on getting the best performance possible
  – Rely on other approaches to manage the complexity of the codes
  – Performance can be understood and engineered (note I did not say predicted)
• We need to start now, to get practice
  – “Vector” instructions, GPUs, extreme-scale networks
  – Because Exascale platforms will be even more complex and harder to use effectively

A “Bottom Up” Look at the Problem

• Focus on the features of the Cray XE/XK system as a model of petascale and trans-petascale systems
  – Heterogeneous at multiple levels:
    • Node functional units, node types
  – Network and network bandwidth may (but only may) be more representative of future directions
• A major challenge is to handle details in a way that is portable and efficient (OpenACC, www.openacc-standard.org/, may not solve all of our problems …)
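As a concrete illustration of the directive style OpenACC offers (and of how much of the mapping problem is still left to the programmer), here is a minimal, hypothetical offload of a scaled vector update; the routine name, data clauses, and compiler flags are assumptions about a typical OpenACC toolchain, not material from the original slides.

/* Minimal OpenACC sketch: offload a scaled vector update.
   Compile with an OpenACC-capable compiler, e.g. nvc -acc.
   Illustrative only. */
void daxpy_acc(double *restrict y, const double *restrict x,
               double alpha, int n)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}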

Node: Getting the most performance out of a NUMA SMP

• Process/thread mapping to chip/core (a pinning sketch follows this list)
  – What is the quantitative model that permits reasoning about and automating this process?
  – Describing a “mapping” (almost) assumes a static mapping. How does more dynamic behavior fit with current techniques?
• Efficient use of node/chip to memory
  – Prefetch, memory hierarchy optimizations, tuning for dynamic data patterns
• Efficient use of core/node computational resources
  – 8/16 core split; vector instructions
• Both algorithm and programming model implications
  – Algorithm can reflect the execution model; realization requires a programming model
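For the thread-to-core mapping question above, a minimal sketch of explicit pinning on Linux is shown below; it assumes one OpenMP thread per core and contiguous core numbering, which NUMA layouts often violate, so in practice the map would come from a tool such as hwloc or from the batch system.

/* Minimal sketch: pin each OpenMP thread to a core on Linux.
   The naive 1:1 thread-to-core map is an assumption for illustration. */
#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(tid, &set);          /* assume core numbers are contiguous */
        if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
            fprintf(stderr, "pinning failed for thread %d\n", tid);
        else
            printf("thread %d pinned to core %d\n", tid, tid);
    }
    return 0;
}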

Many Nodes

• Process mapping to nodes (see the sketch below)
  – Topology-sensitive mapping
  – Quantitative reasoning about mapping
  – What changes if work is dynamic?
  – Relationship to multi-component applications
• Application is heterogeneous – how does that impact mapping, both initial and over time? What are the right software models?
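One portable handle on topology-sensitive mapping is to let the MPI library reorder ranks for a Cartesian communicator; the sketch below assumes a 2-D periodic halo-exchange pattern and that the implementation actually exploits the reorder flag, which is not guaranteed.

/* Minimal sketch: let MPI reorder ranks to fit the machine topology. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int nprocs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int dims[2] = {0, 0}, periods[2] = {1, 1};   /* periodic 2-D mesh */
    MPI_Dims_create(nprocs, 2, dims);            /* factor nprocs into a grid */

    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1 /* reorder */, &cart);

    MPI_Comm_rank(cart, &rank);
    int coords[2];
    MPI_Cart_coords(cart, rank, 2, coords);
    printf("rank %d -> grid (%d,%d)\n", rank, coords[0], coords[1]);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}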

Discovering Performance Opportunities

• Let’s look at a single process sending to its neighbors. We expect the rate to be roughly twice that for the halo exchange (since this test is only sending, not sending and receiving)

System      4 neighbors   4 neighbors,   8 neighbors   8 neighbors,
                          periodic                     periodic
BG/L             488           490            389           389
BG/L, VN         294           294            239           239
BG/P            1139          1136            892           892
BG/P, VN         468           468            600           601
XT3             1005          1007           1053          1045
XT4             1634          1620           1773          1770
XT4 SN          1701          1701           1811          1808

Discovering Performance Opportunities

• Ratios of the single-sender rate to the rate when all processes send
• Expect a factor of roughly 2 (since processes must also receive)

System      4 neighbors   4 neighbors,   8 neighbors   8 neighbors,
                          periodic                     periodic
BG/L            2.24            -            2.01            -
BG/L, VN        1.46            -            1.81            -
BG/P            3.8             -            2.2             -
BG/P, VN        2.6             -            5.5             -
XT3             7.5            8.1           9.08           9.41
XT4            10.7          10.7           13.0           13.7
XT4 SN          5.47           5.56          6.73           7.06

Interconnect

• Need more general approaches for avoiding contention
  – Recent example: shuffle data in collectives to reduce contention (with Paul Sack, PPoPP 2012)
• Overlap communication and computation (see the sketch below)
• Exploit one-sided programming models
• Avoid alltoall in algorithms
  – Global FFTs move too much data for the value received
  – Need a better understanding of the accuracy requirements of the application
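A minimal sketch of the communication/computation overlap point, using nonblocking point-to-point operations; the neighbor ranks, buffer sizes, and the split into interior versus halo-dependent work are placeholders.

/* Minimal sketch: start a halo exchange, do independent work, then wait. */
#include <mpi.h>

void exchange_and_compute(double *sendbuf, double *recvbuf, int n,
                          int left, int right, MPI_Comm comm)
{
    MPI_Request reqs[2];
    /* start the exchange with the two (placeholder) neighbors */
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, right, 0, comm, &reqs[1]);

    /* ... interior work that does not need the halo goes here ... */

    /* wait before touching the received data */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}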

A “Top Down” Look At The Problem

• Consider the application and the mapping of the problem (not just current algorithm) to current and future hardware

• One example: the use of FFTs for DNS simulations or for particle-mesh Ewald is just one of many possible choices – and other choices may provide sufficient accuracy at lower cost on large-scale platforms
  – Data motion is costly, not floating-point operations

Need for Adaptivity

• Uniform meshes are rarely optimal
  – More work than necessary
  – Note that minimizing floating-point operations will not minimize running time – the perfect irregular mesh is also not optimal
• Once adaptive meshing/model approximations are used, need to address load balance and avoid the use of synchronizing operations
  – No barriers
  – Nothing that looks like a barrier (MPI_Allreduce)
    • See MPI_Iallreduce, likely to appear in MPI 3 (a sketch follows this list)
  – Care with operations that are weakly synchronizing
    • e.g., neighbor communication (it synchronizes, just not as tightly)
    • Using MPI_Send synchronizes
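A minimal sketch of replacing a blocking MPI_Allreduce with MPI_Iallreduce (standardized in MPI-3) so that independent local work can hide the reduction; the dot-product use case and helper name are illustrative, not taken from a specific application in the talk.

/* Minimal sketch: overlap a global reduction with local work. */
#include <mpi.h>

double overlapped_dot(double local_dot, MPI_Comm comm)
{
    double global_dot;
    MPI_Request req;
    MPI_Iallreduce(&local_dot, &global_dot, 1, MPI_DOUBLE,
                   MPI_SUM, comm, &req);

    /* ... independent work (e.g., a local matrix-vector product) ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* result valid only after the wait */
    return global_dot;
}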

Sharing an SMP

• Having many cores available makes everyone think that they can use them to solve other problems (“no one would use all of them all of the time”)
• However, compute-bound scientific calculations are often written as if all compute resources are owned by the application
• Such static scheduling leads to performance loss
• Pure dynamic scheduling adds overhead, but is better
• Careful mixed strategies are even better (a sketch follows this list)
• Recent results give 10-16% performance improvements on large, scalable systems
• Thanks to Vivek Kale
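A minimal sketch of the mixed static/dynamic idea: schedule most iterations statically and leave a small dynamically scheduled tail to absorb interference and load imbalance. The 80/20 split and the chunk size are arbitrary illustrations, not the tuned parameters behind the results cited above.

/* Minimal sketch of a mixed static/dynamic loop schedule with OpenMP. */
void mixed_schedule(double *a, const double *b, int n)
{
    int split = (int)(0.8 * n);          /* static portion (assumed ratio) */

    #pragma omp parallel
    {
        /* statically partitioned bulk of the work */
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < split; i++)
            a[i] += 2.0 * b[i];

        /* dynamically scheduled tail absorbs imbalance and interference */
        #pragma omp for schedule(dynamic, 64)
        for (int i = split; i < n; i++)
            a[i] += 2.0 * b[i];
    }
}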

Need for Aggregation

• Functional units are cheap
  – Small amount of area, relatively small amount of power
  – Memory motion is expensive
  – Easy to arrange many floating-point units, in different patterns
    • Classic vectors (“Classic” Cray, NEC SX)
    • Commodity vectors (2 or 4 elements)
    • Streams
    • GPUs
  – All have different requirements on both the algorithms (e.g., work with full vectors) and the programming (e.g., satisfy alignment rules); see the sketch below
  – Compilers will be able to help but will not solve the problem
• Need better ways to generate fast and maintainable code
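A minimal sketch of the kind of help a compiler needs to produce full vectors: restrict-qualified pointers, an alignment promise, and a simd hint. The 64-byte alignment is an assumption about the target's vector width and cache-line size, not a requirement stated in the slides.

/* Minimal vectorization sketch: alignment plus an OpenMP simd hint. */
#include <stdlib.h>

void scaled_add(double *restrict y, const double *restrict x,
                double alpha, int n)
{
    #pragma omp simd aligned(x, y : 64)
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

double *alloc_aligned(size_t n)
{
    /* C11 aligned_alloc: size must be a multiple of the alignment */
    return aligned_alloc(64, ((n * sizeof(double) + 63) / 64) * 64);
}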

Need for Appropriate Data Structures

• Choice of data structure strongly affects the ability of the system to provide good performance
  – Key is to work with the hardware provided for improving memory system performance, rather than using it as a crutch
  – This choice often requires a large-scale view of the problem and is not susceptible to typical autotuning approaches
    • Refactoring tools may help existing applications
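As one common instance of the data-structure choice described above, the sketch below contrasts an array-of-structures layout with a structure-of-arrays layout; the particle example is purely illustrative and not taken from the talk.

/* Array of structures: updating x alone still drags every field's
   cache lines through the memory system. */
typedef struct { double x, y, z, mass; } ParticleAoS;

void push_aos(ParticleAoS *p, double dt, int n)
{
    for (int i = 0; i < n; i++)
        p[i].x += dt;            /* stride of 4 doubles */
}

/* Structure of arrays: contiguous, prefetch- and vector-friendly access. */
typedef struct { double *x, *y, *z, *mass; } ParticlesSoA;

void push_soa(ParticlesSoA *p, double dt, int n)
{
    for (int i = 0; i < n; i++)
        p->x[i] += dt;           /* unit stride */
}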

Effective Sparse Matrix-Vector Implementation

• We have modified the S-CSR and S-BCSR formats to match the requirements for vectorization
• We can use OSKI to optimize “within the loops”
• Need a corresponding approach for x86 and GPU; a method to hide details

[Figure: SpMV on BlueBiou – performance ratio for the stream_un2, BLK12-VSX, S-CSR-2, S-CSR-4, S-CSR-2-VSX, and S-CSR-4-VSX formats]
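For reference, a plain CSR sparse matrix-vector product is sketched below; the S-CSR/S-BCSR variants discussed above restructure this storage and loop to present full vectors to the hardware, which this baseline deliberately does not attempt.

/* Baseline CSR SpMV, for reference only. */
void spmv_csr(int nrows, const int *rowptr, const int *colind,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
            sum += val[j] * x[colind[j]];
        y[i] = sum;
    }
}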

Memory Locality and Multiphysics

• Vertically integrated (all modules within the same “locality domain”)
  – Not horizontally in processor blocks
  – Adapt for load balance
• Challenges
  – Minimize memory motion
  – Work within limited memory
  – Likely approach: interleave components in regions (nodes, if nodes have 1000s of cores)

Locality Domains

• In hardware, the memory is in a hierarchy – core, memory stick, chip, node, module, rack, …
• Algorithm/implementation needs to respect this hierarchy (see the sketch below)
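A minimal sketch of querying that hierarchy with hwloc so an implementation can respect it; it assumes hwloc is installed (link with -lhwloc) and only prints the levels rather than acting on them.

/* Print the hardware locality hierarchy discovered by hwloc. */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int depth = hwloc_topology_get_depth(topo);
    for (int d = 0; d < depth; d++) {
        hwloc_obj_t obj = hwloc_get_obj_by_depth(topo, d, 0);
        printf("level %d: %u x %s\n", d,
               hwloc_get_nbobjs_by_depth(topo, d),
               hwloc_obj_type_string(obj->type));
    }

    hwloc_topology_destroy(topo);
    return 0;
}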

Summary

• The new Blue Waters is a good test bed for extreme scale
  – Heterogeneous at all levels
  – Algorithms need to be more flexible, dynamic
  – Programming models need to be more flexible with details, but with a realistic execution model
  – Applications need to reconsider choice of algorithms to match changing costs
  – Quantification of performance is one way to tie it all together

Thanks

• Torsten Hoefler
  – Performance modeling lead, Blue Waters; MPI datatypes
• David Padua, Maria Garzaran, Saeed Maleki
  – Compiler vectorization
• Dahai Guo
  – Streamed format exploiting prefetch
• Vivek Kale
  – SMP work partitioning
• Hormozd Gahvari
  – AMG application modeling
• Marc Snir and William Kramer
  – Performance model advocates
• Abhinav Bhatele
  – Process/node mapping
• Elena Caraba
  – Nonblocking Allreduce in CG
• Van Bui
  – Performance model-based evaluation of programming models
• Paul Sack
  – Collectives in the presence of contention
• Funding provided by:
  – Blue Waters project (State of Illinois and the University of Illinois)
  – Department of Energy, Office of Science
  – National Science Foundation