
Page 1: Techniques for Developing Efficient Petascale Applications

Techniques for Developing Efficient Petascale Applications

Laxmikant (Sanjay) Kale

http://charm.cs.uiuc.edu

Parallel Programming Laboratory

Department of Computer Science

University of Illinois at Urbana Champaign

Page 2: Techniques for Developing Efficient Petascale Applications

Application Oriented Parallel Abstractions

[Diagram: Charm++ at the center, linking applications (NAMD, ChaNGa, LeanCP, space-time meshing, rocket simulation, other applications) with the issues they raise and the techniques & libraries that address them]

Synergy between Computer Science research and applications has been beneficial to both.


Page 4: Techniques for Developing Efficient Petascale Applications


Enabling CS technology of parallel objects and intelligent runtime systems (Charm++ and AMPI) has led to several collaborative applications in CSE

Page 5: Techniques for Developing Efficient Petascale Applications


Outline

• Techniques for petascale programming:
  – My biases: isoefficiency analysis, overdecomposition, automated resource management, sophisticated performance analysis tools
• BigSim: early development of applications on future machines
• Dealing with multicore processors and SMP nodes
• Novel parallel programming models

Page 6: Techniques for Developing Efficient Petascale Applications

Techniques for Scaling to Multi-PetaFLOP/s

Page 7: Techniques for Developing Efficient Petascale Applications

1. Analyze Scalability of Algorithms (say, via the isoefficiency metric)

Page 8: Techniques for Developing Efficient Petascale Applications


Isoefficiency Analysis

• An algorithm (*) is scalable if:
  – When you double the number of processors available to it, it can retain the same parallel efficiency by increasing the problem size by some amount
  – Not all algorithms are scalable in this sense
  – Isoefficiency is the rate at which the problem size must be increased, as a function of the number of processors, to keep the same efficiency

Parallel efficiency E = T1 / (P · Tp), where T1 is the time on one processor and Tp is the time on P processors.

[Figure: equal-efficiency curves in the plane of processors vs. problem size]


Vipin Kumar (circa 1990)
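To make the metric concrete, here is the standard derivation (a worked illustration, not from the slides; W and T_o are the usual problem-size and total-overhead notation):

```latex
% Let W = T_1 denote problem size (total useful work), and
% T_o(W,P) = P T_p - T_1 the total parallel overhead.
E = \frac{T_1}{P\,T_p} = \frac{W}{W + T_o(W,P)}
\qquad\Longrightarrow\qquad
W = \frac{E}{1-E}\, T_o(W,P)
```

Holding E fixed, W must grow at the same rate as the overhead. For example, a replicated-data scheme whose overhead is Θ(P log P) (as on the NAMD slide below) has isoefficiency W = Θ(P log P): each doubling of P requires slightly more than doubling the problem size.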

Page 9: Techniques for Developing Efficient Petascale Applications


NAMD: A Production MD Program

• Fully featured program
• NIH-funded development
• Distributed free of charge (~20,000 registered users)
• Binaries and source code
• Installed at NSF centers
• User training and support
• Large published simulations

Page 10: Techniques for Developing Efficient Petascale Applications


Molecular Dynamics in NAMD

• Collection of [charged] atoms, with bonds
  – Newtonian mechanics
  – Tens of thousands to millions of atoms (10,000 – 5,000,000)
• At each time-step:
  – Calculate forces on each atom
    • Bonded
    • Non-bonded: electrostatic and van der Waals
      – Short-range: every timestep
      – Long-range: using PME (3D FFT)
      – Multiple time stepping: PME every 4 timesteps
  – Calculate velocities and advance positions
• Challenge: femtosecond time-step, millions of steps needed!

Collaboration with K. Schulten, R. Skeel, and coworkers.
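The per-timestep structure above can be sketched as a multiple-time-stepping loop. This is a minimal, self-contained illustration; the force routines are stubs, and none of the names are NAMD's actual API:

```cpp
#include <cstdio>
#include <vector>

// Minimal sketch of NAMD-style multiple time stepping (illustrative names).
struct Atom { double x[3], v[3], f[3]; };

static void clearForces(std::vector<Atom>& a) { for (auto& at : a) at.f[0] = at.f[1] = at.f[2] = 0; }
static void addBondedForces(std::vector<Atom>&) { /* bonds, angles, dihedrals: every step */ }
static void addShortRangeNonbonded(std::vector<Atom>&) { /* cutoff electrostatics + van der Waals */ }
static void addLongRangePME(std::vector<Atom>&) { /* 3D-FFT long-range part */ }

int main() {
  std::vector<Atom> atoms(1000);   // zero-initialized positions/velocities
  const double dt = 1e-15;         // femtosecond timestep
  const int pmePeriod = 4;         // "PME every 4 timesteps"
  for (int step = 0; step < 100; ++step) {
    clearForces(atoms);
    addBondedForces(atoms);
    addShortRangeNonbonded(atoms);                  // every timestep
    if (step % pmePeriod == 0) addLongRangePME(atoms); // only every 4th step
    for (auto& a : atoms)          // update velocities, advance positions (unit mass)
      for (int d = 0; d < 3; ++d) { a.v[d] += dt * a.f[d]; a.x[d] += dt * a.v[d]; }
  }
  std::printf("done\n");
}
```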

Page 11: Techniques for Developing Efficient Petascale Applications


Traditional Approaches: Non-Isoefficient

• Replicated data:
  – All atom coordinates stored on each processor
  – Communication-to-computation ratio: O(P log P)
• Partition the atoms array across processors:
  – Nearby atoms may not be on the same processor
  – C/C ratio: O(P)
• Distribute the force matrix to processors:
  – Matrix is sparse and non-uniform
  – C/C ratio: O(sqrt(P))


Page 12: Techniques for Developing Efficient Petascale Applications


Spatial Decomposition via Charm

• Atoms distributed to cubes based on their location
• Size of each cube: just a bit larger than the cut-off radius
• Communicate only with neighbors
• Work: for each pair of neighboring objects
• C/C ratio: O(1)
• However:
  – Load imbalance
  – Limited parallelism

[Figure: cells, cubes, or "patches"]

Charm++ is useful to handle this.
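A sketch of the patch assignment just described: each atom goes to the cube containing it, with cube side slightly larger than the cutoff. The 1.05 padding factor and function names are illustrative, not NAMD's:

```cpp
#include <cmath>
#include <cstdio>

// Assign an atom to a cubic "patch" whose side is a bit larger than the
// cutoff radius, so force computation only involves neighboring patches.
struct PatchIndex { int i, j, k; };

PatchIndex patchOf(double x, double y, double z, double cutoff) {
  const double side = 1.05 * cutoff;   // "just a bit larger than cut-off radius"
  return { (int)std::floor(x / side),
           (int)std::floor(y / side),
           (int)std::floor(z / side) };
}

int main() {
  PatchIndex p = patchOf(25.3, 7.9, 41.0, 12.0);  // 12 Å cutoff (illustrative)
  std::printf("patch (%d, %d, %d)\n", p.i, p.j, p.k);
}
```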

Page 13: Techniques for Developing Efficient Petascale Applications


Object-Based Parallelization for MD: Force Decomposition + Spatial Decomposition

• Now we have many objects to load balance:
  – Each diamond can be assigned to any processor
• Number of diamonds (3D): 14 × number of patches (each patch pairs with its 26 neighbors plus itself; counting each pair once gives 26/2 + 1 = 14)
• 2-away variation:
  – Half-size cubes
  – 5x5x5 interactions
• 3-away interactions: 7x7x7

Page 14: Techniques for Developing Efficient Petascale Applications

Performance of NAMD: STMV

[Chart: NAMD performance vs. number of cores on the STMV benchmark (~1 million atoms)]

Page 15: Techniques for Developing Efficient Petascale Applications

2. Decouple Decomposition from Physical Processors

Page 16: Techniques for Developing Efficient Petascale Applications


Migratable Objects (aka Processor Virtualization)

[Figure: user view of many virtual processors vs. the system implementation, which maps them — user-level migratable threads or MPI processes — onto real processors]

Programmer: [over]decomposition into virtual processors
Runtime: assigns VPs to processors
Enables adaptive runtime strategies
Implementations: Charm++, AMPI

Benefits:
• Software engineering
  – Number of virtual processors can be controlled independently
  – Separate VPs for different modules
• Message-driven execution
  – Adaptive overlap of communication
  – Predictability: automatic out-of-core, prefetch to local stores
  – Asynchronous reductions
• Dynamic mapping
  – Heterogeneous clusters: vacate, adjust to speed, share
  – Automatic checkpointing
  – Change the set of processors used
  – Automatic dynamic load balancing
  – Communication optimization
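What overdecomposition looks like in Charm++ itself, as a minimal sketch (the module name, Worker chare, and 8-VPs-per-PE ratio are invented for illustration):

```cpp
// overdecomp.ci -- Charm++ interface file (minimal sketch)
mainmodule overdecomp {
  mainchare Main {
    entry Main(CkArgMsg* m);
  };
  array [1D] Worker {
    entry Worker();
    entry void doWork();
  };
};
```

```cpp
// overdecomp.C -- create many more Worker objects than processors;
// the Charm++ runtime maps (and may later migrate) them.
#include "overdecomp.decl.h"

class Main : public CBase_Main {
 public:
  Main(CkArgMsg* m) {
    delete m;
    int nWorkers = 8 * CkNumPes();                 // overdecompose: 8 VPs per PE
    CProxy_Worker workers = CProxy_Worker::ckNew(nWorkers);
    workers.doWork();                              // broadcast to all elements
    CkExitAfterQuiescence();                       // exit once all messages drain
  }
};

class Worker : public CBase_Worker {
 public:
  Worker() {}
  Worker(CkMigrateMessage*) {}                     // enables migration
  void doWork() {
    CkPrintf("Worker %d on PE %d\n", thisIndex, CkMyPe());
  }
};

#include "overdecomp.def.h"
```

The program never mentions physical processors except to decide how far to overdecompose; the runtime is free to place, and later migrate, the Worker elements.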


Page 18: Techniques for Developing Efficient Petascale Applications


Parallel Decomposition and Processors

• MPI style encourages:
  – Decomposition into P pieces, where P is the number of physical processors available
  – If your natural decomposition is a cube, then the number of processors must be a cube
  – …
• Charm++/AMPI style "virtual processors":
  – Decompose into the natural objects of the application
  – Let the runtime map them to processors
  – Decouple decomposition from load balancing


Page 19: Techniques for Developing Efficient Petascale Applications


Decomposition independent of numCores

• Rocket simulation example under the traditional MPI vs. Charm++/AMPI framework
  – Benefit: load balance, communication optimizations, modularity

[Diagram: MPI style pairs one Solid and one Fluid piece per processor (1, 2, …, P); with Charm++/AMPI, n Solid and m Fluid virtual processors are decomposed independently of the processor count]

Page 20: Techniques for Developing Efficient Petascale Applications


Fault Tolerance

• Automatic checkpointing
  – Migrate objects to disk
  – In-memory checkpointing as an option
  – Automatic fault detection and restart
• "Impending fault" response
  – Migrate objects to other processors
  – Adjust processor-level parallel data structures
• Scalable fault tolerance
  – When one processor out of 100,000 fails, the other 99,999 shouldn't have to roll back to their checkpoints!
  – Sender-side message logging
  – Latency tolerance helps mitigate costs
  – Restart can be sped up by spreading the failed processor's objects across other processors

Page 21: Techniques for Developing Efficient Petascale Applications


3. Use Dynamic Load Balancing Based on the Principle of Persistence

Principle of persistence: computational loads and communication patterns tend to persist, even in dynamic computations. So the recent past is a good predictor of the near future.

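Persistence is what makes measurement-based balancing work: loads measured in recent iterations are treated as predictions for the next ones. A minimal sketch of a greedy centralized strategy (illustrative; not the actual implementation of Charm++'s greedy balancers):

```cpp
#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Greedy measurement-based load balancing: repeatedly assign the heaviest
// remaining object to the currently least-loaded processor, using measured
// loads as the prediction for the next phase (principle of persistence).
std::vector<int> greedyAssign(const std::vector<double>& objLoad, int numProcs) {
  std::vector<int> order(objLoad.size());
  for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
  std::sort(order.begin(), order.end(),
            [&](int a, int b) { return objLoad[a] > objLoad[b]; });

  using ProcLoad = std::pair<double, int>;   // (accumulated load, processor)
  std::priority_queue<ProcLoad, std::vector<ProcLoad>, std::greater<ProcLoad>> pq;
  for (int p = 0; p < numProcs; ++p) pq.push({0.0, p});

  std::vector<int> assignment(objLoad.size());
  for (int obj : order) {
    auto [load, proc] = pq.top();            // least-loaded processor so far
    pq.pop();
    assignment[obj] = proc;
    pq.push({load + objLoad[obj], proc});
  }
  return assignment;
}

int main() {
  std::vector<double> loads = {3.0, 1.0, 4.0, 1.5, 9.0, 2.6};  // measured (ms)
  auto a = greedyAssign(loads, 2);
  for (size_t i = 0; i < a.size(); ++i)
    std::printf("object %zu -> proc %d\n", i, a[i]);
}
```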

Page 22: Techniques for Developing Efficient Petascale Applications

Two-Step Balancing

• Computational entities (nodes, structured grid points, particles, …) are partitioned into objects
  – Say, a few thousand FEM elements per chunk, on average
• Objects are migrated across processors to balance load
  – A much smaller problem than repartitioning the entire mesh, for example
  – Occasional repartitioning for some problems

Page 23: Techniques for Developing Efficient Petascale Applications

Load Balancing Steps

[Timeline: regular timesteps, then instrumented timesteps, followed by detailed/aggressive load balancing, with occasional refinement load balancing thereafter]

Page 24: Techniques for Developing Efficient Petascale Applications


ChaNGa: Parallel Gravity

• Collaborative project (NSF ITR) with Prof. Tom Quinn, Univ. of Washington
• Components: gravity, gas dynamics
• Barnes-Hut tree codes
  – The oct-tree is the natural decomposition: the geometry has better aspect ratios, so you "open" fewer nodes
  – But it is often not used because it leads to bad load balance, under the assumption of a one-to-one map between subtrees and processors; binary trees are considered better balanced
  – With Charm++: use the oct-tree, and let Charm++ map subtrees to processors

Page 25: Techniques for Developing Efficient Petascale Applications


Page 26: Techniques for Developing Efficient Petascale Applications

ChaNGa Load Balancing

dwarf 5M on 1,024 BlueGene/L processors: 5.6 s/step before load balancing, 6.1 s/step after.

[Bar chart: messages (x1000) and bytes transferred (MB), before vs. after load balancing]

Simple balancers worsened performance!

Page 27: Techniques for Developing Efficient Petascale Applications

Load Balancing with OrbRefineLB

dwarf 5M on 1,024 BlueGene/L processors: 5.6 s/step before, 5.0 s/step with OrbRefineLB.

Need sophisticated balancers, and the ability to choose the right ones automatically.

Page 28: Techniques for Developing Efficient Petascale Applications

ChaNGa: Parallel Gravity Code

Developed in collaboration with Tom Quinn (Univ. of Washington) using Charm++.

[Chart: ChaNGa preliminary performance]

Page 29: Techniques for Developing Efficient Petascale Applications

Load Balancing for Large Machines: I

• Centralized balancers achieve the best balance
  – Collect the object-communication graph on one processor
  – But won't scale beyond tens of thousands of nodes
• Fully distributed load balancers
  – Avoid the bottleneck, but achieve poor load balance
  – Not adequately agile
• Hierarchical load balancers
  – Careful control of what information goes up and down the hierarchy can lead to fast, high-quality balancers

Page 30: Techniques for Developing Efficient Petascale Applications

Load Balancing for Large Machines: II

• Interconnection topology starts to matter again
  – It was hidden for a while due to wormhole routing etc.
  – Latency variation is still small…
  – But bandwidth occupancy is a problem
• Topology-aware load balancers
  – Some general heuristics have shown good performance (a common objective is sketched below)
  – But they may require too much compute power
  – Also, special-purpose heuristics work fine when applicable
  – Still, many open challenges
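One objective such balancers commonly minimize is "hop-bytes": communication volume weighted by network distance, a proxy for link (bandwidth) occupancy. A sketch for a 3D torus (dimensions and message size are illustrative):

```cpp
#include <algorithm>
#include <cstdio>
#include <cstdlib>

// Hop-bytes on a 3D torus: bytes weighted by shortest torus distance.
// Placing heavily communicating objects few hops apart reduces the
// total bandwidth occupancy on the network links.
int torusHops(const int a[3], const int b[3], const int dims[3]) {
  int hops = 0;
  for (int d = 0; d < 3; ++d) {
    int diff = std::abs(a[d] - b[d]);
    hops += std::min(diff, dims[d] - diff);   // account for wrap-around links
  }
  return hops;
}

int main() {
  int dims[3] = {8, 8, 16};                   // e.g., a small torus partition
  int src[3] = {0, 0, 0}, dst[3] = {7, 4, 15};
  double bytes = 2e6;                         // per-iteration traffic (illustrative)
  std::printf("hop-bytes: %.0f\n", bytes * torusHops(src, dst, dims));
}
```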

Page 31: Techniques for Developing Efficient Petascale Applications

OpenAtom: Car-Parrinello ab initio MD

NSF ITR 2001-2007, IBM
G. Martyna, M. Tuckerman, L. Kale

[Images: molecular clusters, nanowires, semiconductor surfaces, 3D solids/liquids]

Page 32: Techniques for Developing Efficient Petascale Applications

Goal: the accurate treatment of complex heterogeneous nanoscale systems

[Illustration: magnetic recording device — write head, read head, bit, recording medium, soft under-layer]

Courtesy: G. Martyna

Page 33: Techniques for Developing Efficient Petascale Applications

New OpenAtom Collaboration (DOE)

• Principal investigators
  – G. Martyna (IBM TJ Watson)
  – M. Tuckerman (NYU)
  – L.V. Kale (UIUC)
  – K. Schulten (UIUC)
  – J. Dongarra (UTK/ORNL)
• Current effort focus
  – QM/MM via integration with NAMD2
  – ORNL Cray XT4 Jaguar (31,328 cores)
• A unique parallel decomposition of the Car-Parrinello method
• Using Charm++ virtualization, we can efficiently scale small (32-molecule) systems to thousands of processors

Page 34: Techniques for Developing Efficient Petascale Applications

Decomposition and Computation Flow


Page 35: Techniques for Developing Efficient Petascale Applications

OpenAtom Performance

[Charts: scaling on Blue Gene/L, Blue Gene/P, and Cray XT3]

Page 36: Techniques for Developing Efficient Petascale Applications

Topology-aware mapping of objects

Page 37: Techniques for Developing Efficient Petascale Applications

Improvements from network-topology-aware mapping of computational work to processors

The left-panel simulation maps computational work to processors taking the network connectivity into account; the right-panel simulation does not. The "black" (idle) time processors spend waiting for computational work to arrive is significantly reduced at left. (256 waters, 70R, on 4,096 BG/L cores)

Page 38: Techniques for Developing Efficient Petascale Applications

OpenAtom: Topology Aware Mapping on BG/L

Page 39: Techniques for Developing Efficient Petascale Applications

OpenAtom: Topology Aware Mapping on Cray XT3

Page 40: Techniques for Developing Efficient Petascale Applications

4. Analyze Performance with Sophisticated Tools

Page 41: Techniques for Developing Efficient Petascale Applications

Exploit Sophisticated Performance Analysis Tools

• We use a tool called "Projections"
• Many other tools exist
• There is a need for scalable analysis
• A recent example: trying to identify the next performance obstacle for NAMD
  – Running on 8,192 processors, with a 92,000-atom simulation
  – Test scenario: without PME
  – Time is 3 ms per step, but the lower bound is about 1.6 ms

Page 42: Techniques for Developing Efficient Petascale Applications


Page 43: Techniques for Developing Efficient Petascale Applications


Page 44: Techniques for Developing Efficient Petascale Applications


Page 45: Techniques for Developing Efficient Petascale Applications

5. Model and Predict Performance via Simulation

Page 46: Techniques for Developing Efficient Petascale Applications

How to tune performance for a future machine?

• For example, Blue Waters will arrive in 2011
  – But we need to prepare applications for it starting now
• Even for extant machines:
  – The full-size machine may not be available as often as needed for tuning runs
• A simulation-based approach is needed
• Our approach: BigSim
  – Full-scale emulation
  – Trace-driven simulation

Page 47: Techniques for Developing Efficient Petascale Applications

BigSim: Emulation

• Emulation:
  – Run an existing, full-scale MPI, AMPI, or Charm++ application
  – Using an emulation layer that pretends to be (say) 100k cores
    • Leverages Charm++'s object-based virtualization
  – Generates traces:
    • Characteristics of SEBs (Sequential Execution Blocks)
    • Dependences between SEBs and messages

Page 48: Techniques for Developing Efficient Petascale Applications

BigSim: Trace-Driven Simulation

• Trace-driven parallel simulation
  – Typically on tens to hundreds of processors
  – Multiple-resolution simulation of sequential execution: from simple scaling to cycle-accurate modeling
  – Multiple-resolution simulation of the network: from a simple latency/bandwidth model (sketched below) to a detailed packet- and switching-port-level simulation
  – Generates timing traces just as a real application would on the full-scale machine
• Phase 3: analyze performance
  – Identify bottlenecks, even without predicting exact performance
  – Carry out various "what-if" analyses
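At the simplest network resolution, message cost reduces to a latency/bandwidth model. A sketch (the alpha and beta values are illustrative placeholders, not BigSim defaults):

```cpp
#include <cstdio>

// Simplest network model: msgTime = alpha + bytes / beta, where alpha is
// per-message latency and beta is link bandwidth. Higher-resolution BigSim
// modes replace this with packet- and port-level simulation.
double messageTime(double bytes,
                   double alphaSec = 2e-6,          // 2 us latency (illustrative)
                   double betaBytesPerSec = 1e9) {  // 1 GB/s bandwidth (illustrative)
  return alphaSec + bytes / betaBytesPerSec;
}

int main() {
  std::printf("1 KB message: %.2f us\n", messageTime(1024.0) * 1e6);
  std::printf("1 MB message: %.2f us\n", messageTime(1048576.0) * 1e6);
}
```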

Page 49: Techniques for Developing Efficient Petascale Applications

Projections: Performance visualization

Page 50: Techniques for Developing Efficient Petascale Applications

BigSim: Challenges

• This simple picture hides many complexities
• Emulation:
  – Automatic out-of-core support for large-memory-footprint applications
• Simulation:
  – Accuracy vs. cost tradeoffs
  – Interpolation mechanisms
  – Memory optimizations
• An active area of research
  – But early versions will be available to application developers for Blue Waters next month

Page 51: Techniques for Developing Efficient Petascale Applications


Validation

[Chart: NAMD ApoA1 — actual vs. predicted execution time (seconds) for 128, 256, 512, 1024, and 2250 simulated processors]

Page 52: Techniques for Developing Efficient Petascale Applications

Challenges of Multicore Chips and SMP Nodes

Page 53: Techniques for Developing Efficient Petascale Applications

Challenges of Multicore Chips

• SMP nodes have been around for a while
  – Typically, users just run separate MPI processes on each core
  – This is seen to be faster (OpenMP + MPI has not succeeded much)
• Yet, you cannot ignore the shared memory
  – E.g., for gravity calculations (cosmology), a processor may request a set of particles from remote nodes
  – With shared memory, one can maintain a node-level cache, minimizing communication (see the sketch below)
• We need a model without the pitfalls of Pthreads/OpenMP
  – Costly locks, false sharing, no respect for locality
  – E.g., Charm++ avoids these pitfalls, IF its RTS can avoid them
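A sketch of the node-level cache idea from the gravity example (illustrative; a production version would also coalesce concurrent requests for the same bucket and manage eviction):

```cpp
#include <cstdio>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <vector>

// Node-level software cache: all cores on an SMP node share one copy of
// remotely fetched particle data, so each remote bucket is requested once
// per node instead of once per core.
struct Particle { double x, y, z, mass; };

class NodeParticleCache {
  std::mutex mtx;   // one coarse lock for clarity; a real cache would shard
  std::unordered_map<long, std::shared_ptr<std::vector<Particle>>> cache;
 public:
  std::shared_ptr<std::vector<Particle>> get(long remoteBucketId) {
    std::lock_guard<std::mutex> g(mtx);
    auto it = cache.find(remoteBucketId);
    if (it != cache.end()) return it->second;         // hit: no communication
    auto data = fetchFromRemoteNode(remoteBucketId);  // miss: one fetch per node
    cache.emplace(remoteBucketId, data);
    return data;
  }
 private:
  static std::shared_ptr<std::vector<Particle>> fetchFromRemoteNode(long) {
    // Stand-in for real communication (e.g., a Charm++ message round-trip).
    return std::make_shared<std::vector<Particle>>(64, Particle{0, 0, 0, 1.0});
  }
};

int main() {
  NodeParticleCache cache;
  auto a = cache.get(42);   // first request on the node: remote fetch
  auto b = cache.get(42);   // later requests from any core: cache hit
  std::printf("same shared copy: %s\n", (a == b) ? "yes" : "no");
}
```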

Page 54: Techniques for Developing Efficient Petascale Applications

Charm++ Experience

• Ping-pong performance [chart]

Page 55: Techniques for Developing Efficient Petascale Applications

Raising the Level of Abstraction

Page 56: Techniques for Developing Efficient Petascale Applications

Raising the Level of Abstraction

• Clearly (?) we need new programming models
  – Many opinions and ideas exist
• Two meta-points:
  – Need for exploration (don't standardize too soon)
  – Interoperability
    • Allows a "beachhead" for novel paradigms
    • Long-term: allow each module to be coded using the paradigm best suited for it
  – Interoperability requires concurrent composition
    • This may require message-driven execution

Page 57: Techniques for Developing Efficient Petascale Applications

Simplifying Parallel Programming

• By giving up completeness!
  – A paradigm may be simple, but not suitable for expressing all parallel patterns
  – Yet, if it can cover a significant class of patterns (applications, modules), it is useful
  – A collection of incomplete models, backed by a few complete ones, will do the trick
• Our own examples (both outlaw non-determinism):
  – Multiphase Shared Arrays (MSA): restricted shared memory [LCPC '04]
  – Charisma: static data flow among collections of objects [LCR '04, HPDC '07]

Page 58: Techniques for Developing Efficient Petascale Applications


MSA

• In the simple model, a program consists of:
  – A collection of Charm threads, and
  – Multiple collections of data arrays, partitioned into pages (user-specified)
• Each array is in one mode at a time
  – But its mode may change from phase to phase
• Modes:
  – Write-once
  – Read-only
  – Accumulate
  – Owner-computes

[Figure: threads accessing shared arrays A, B, C]
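A hypothetical sketch of the phase/mode discipline (the types and method names are invented for illustration and are not the actual MSA API):

```cpp
#include <cassert>
#include <vector>

// Sketch of the MSA idea: each shared array is in exactly one mode per
// phase; switching modes is a (here trivialized) collective sync. The
// mode discipline is what rules out data races and non-determinism.
enum class Mode { WriteOnce, ReadOnly, Accumulate };

class SharedArray {
  std::vector<double> data;
  Mode mode;
 public:
  explicit SharedArray(int n) : data(n, 0.0), mode(Mode::WriteOnce) {}
  void sync(Mode next) { mode = next; }  // in MSA, a collective phase change
  void write(int i, double v) { assert(mode == Mode::WriteOnce); data[i] = v; }
  void accumulate(int i, double v) { assert(mode == Mode::Accumulate); data[i] += v; }
  double read(int i) const { assert(mode == Mode::ReadOnly); return data[i]; }
};

int main() {
  SharedArray a(100);
  a.write(0, 3.0);                  // phase 1: write-once
  a.sync(Mode::Accumulate);
  a.accumulate(0, 1.5);             // phase 2: commutative accumulation
  a.sync(Mode::ReadOnly);
  return a.read(0) == 4.5 ? 0 : 1;  // phase 3: read-only
}
```

Because every phase allows a single mode, and accumulation is commutative, no interleaving of threads can produce a race or a nondeterministic result.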

Page 59: Techniques for Developing Efficient Petascale Applications

A View of an Interoperable Future


Page 60: Techniques for Developing Efficient Petascale Applications


Summary

• Petascale applications will require a large toolbox; my pet tools/techniques:
  – Scalable algorithms with isoefficiency analysis
  – Object-based decomposition
  – Dynamic load balancing
  – Scalable performance analysis
  – Early performance tuning via BigSim
• Multicore chips, SMP nodes, accelerators
  – Require extra programming effort with locality-respecting models
• Raise the level of abstraction
  – Via "incomplete yet simple" paradigms, and
  – Domain-specific frameworks

More info: http://charm.cs.uiuc.edu/