
Page 1: How Charm works its magic

How Charm works its magic
Laxmikant Kale
http://charm.cs.uiuc.edu
Parallel Programming Laboratory
Dept. of Computer Science
University of Illinois at Urbana-Champaign

Page 2: How Charm works its magic

Parallel Programming Environment
• Charm++ and AMPI
  – Embody the idea of processor virtualization
• Processor virtualization
  – Divide the computation into a large number of pieces
    • Independent of the number of processors
    • Typically larger than the number of processors
  – Let the system map objects to processors
[Figure: user view (interacting objects) vs. system implementation (objects mapped onto processors)]

Page 3: How Charm works its magic

Charm++ and AMPI
• Charm++
  – Parallel C++
  – "Arrays" of objects
  – Automatic load balancing
  – Prioritization
  – Mature system
  – Available on all parallel machines we know
• Several applications:
  – Molecular dynamics
  – QM/MM
  – Cosmology
  – Materials/processes
  – Operations research
• AMPI = MPI + virtualization
  – A migration path for MPI codes
  – Automatic dynamic load balancing for MPI applications
  – Uses Charm++ object arrays and migratable threads
  – Bindings for C, C++, Fortran90
• Porting MPI applications
  – Minimal modifications needed
  – Automated via AMPizer
• AMPI progress
  – Ease of use: automatic packing
  – Asynchronous communication
    • Split-phase interfaces

Page 4: How Charm works its magic

Charm++
• Parallel C++ with data-driven objects
• Object arrays / object collections
• Object groups:
  – Global object with a "representative" on each PE
• Asynchronous method invocation
• Prioritized scheduling
• Mature, robust, portable
• http://charm.cs.uiuc.edu

Page 5: How Charm works its magic


AMPI:

7 MPI processes

Page 6: How Charm works its magic

AMPI:
[Figure: 7 MPI "processes", implemented as virtual processors (user-level migratable threads), mapped onto real processors]

Page 7: How Charm works its magic

Benefits of Virtualization
• Software engineering
  – Number of virtual processors can be independently controlled
  – Separate VPs for different modules
• Message-driven execution
  – Adaptive overlap of communication
  – Modularity
  – Predictability: automatic out-of-core
  – Asynchronous reductions
• Dynamic mapping
  – Heterogeneous clusters: vacate, adjust to speed, share
  – Automatic checkpointing
  – Change the set of processors used
• Principle of persistence
  – Enables runtime optimizations
  – Automatic dynamic load balancing
  – Communication optimizations
  – Other runtime optimizations
More: http://charm.cs.uiuc.edu
We will illustrate:
  – An application breakthrough
  – Cluster performance optimization
  – A communication optimization

Page 8: How Charm works its magic

Technology Demonstration
• Recent breakthrough in molecular dynamics performance
  – NAMD, implemented using Charm++
  – Demonstrates the power of these techniques, applicable to CSAR
• Collection of charged atoms, with bonds
  – Thousands of atoms (10,000 - 500,000)
  – 1 femtosecond time-step; millions needed!
• At each time-step
  – Bond forces
  – Non-bonded: electrostatic and van der Waals
    • Short-distance: every timestep
    • Long-distance: every 4 timesteps using PME (3D FFT)
    • Multiple time stepping
  – Calculate velocities and advance positions
Collaboration with K. Schulten, R. Skeel, and coworkers.

Page 9: How Charm works its magic

Virtualized Approach to Parallelization using Charm++
[Figure: NAMD decomposition into 700 VPs, 192 + 144 VPs, and 30,000 VPs]
These 30,000+ virtual processors (VPs) are mapped to real processors by the Charm runtime system.

Page 10: How Charm works its magic

Asynchronous reductions and message-driven execution in Charm allow applications to tolerate random variations.

Page 11: How Charm works its magic

Performance: NAMD on Lemieux
[Plot: speedup vs. number of processors (up to 2,500) for Cutoff, PME, and MTS runs]
ATPase: 320,000+ atoms including water
15.6 ms/step, 0.8 TF
To be published in SC2002: Gordon Bell Award finalist

Page 12: How Charm works its magic

Component Frameworks
Motivation:
• Reduce the tedium of parallel programming for commonly used paradigms and parallel data structures
• Encapsulate parallel data structures and algorithms
• Provide an easy-to-use interface; sequential programming style preserved
• Use the adaptive load balancing framework
• Used to build parallel components
Frameworks:
• Unstructured grids
  – Generalized ghost regions
  – Used in the RocFrac version, RocFlu, and outside CSAR
  – Fast collision detection
• Multiblock framework
  – Structured grids
  – Automates communication
• AMR
  – Common for both of the above
• Particles
  – Multiphase flows
  – MD, tree codes

Page 13: How Charm works its magic

Component Frameworks
Objective:
• For commonly used structures, application scientists
  – Shouldn't have to deal with parallel implementation issues
  – Should be able to reuse code
Components and challenges:
• Unstructured meshes: Unmesh
  – Dynamic refinement support for FEM
  – Solver interfaces
  – Multigrid support
• Structured meshes: Mblock
  – Multigrid support
  – Study applications
• Particles
• Adaptive mesh refinement:
  – Shrinking and growing trees
  – Applicable to the three above

Page 14: How Charm works its magic

[Architecture diagram: application components (A, B, C, D) sit on orchestration/integration support; framework components (Unmesh, MBlock, Particles, AMR support) and parallel standard libraries (solvers, data transfer) are built on Charm/AMPI, which runs on MPI and lower layers]

Page 15: How Charm works its magic

Outline
• There is magic:
  – Overview of Charm capabilities
  – Virtualization paper
  – Summary of Charm features:
• Converse
  – Machine model
  – Scheduler and general messages
  – Communication
  – Threads
• Proxies and generated code
• Object Groups (BOCs)
• Support for migration
  – Seed balancing
  – Migration support with reductions
• How migration is used
  – Vacating workstations and adjusting to speed
  – Adaptive scheduler
• Principle of persistence:
  – Measurement-based load balancing
    • Centralized strategies, refinement, CommLB
    • Distributed strategies (nbr)
  – Collective communication optimizations
• Delegation
• Converse client-server interface
• Libraries: liveViz, FFT, ...

Page 16: How Charm works its magic

Converse
• Converse is the layer on which Charm++ is built
  – Provides machine-dependent code
  – Provides "utilities" needed by the RTS of many parallel programming languages
  – Used for implementing many mini-languages
• Main components of Converse:
  – Machine model
  – Scheduler and general messages
  – Communication
  – Threads
• Machine model:
  – Collections of nodes; each node is a collection of processes
  – Processes on a node can share memory
  – Macros for supporting node-level (shared) and processor-level globals (see the sketch below)
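A minimal sketch of these macros (CpvDeclare/CpvInitialize/CpvAccess and the Csv node-level counterparts are the actual Converse API; the counter itself is illustrative):

  #include "converse.h"

  CpvDeclare(int, msgCount);      /* processor-private global */
  CsvDeclare(double, nodeTotal);  /* node-level (shared) global */

  void initPerPe(void) {
    CpvInitialize(int, msgCount);
    CpvAccess(msgCount) = 0;      /* each PE gets its own copy */
  }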

Page 17: How Charm works its magic

Data-driven execution
[Figure: two processors, each running a scheduler that picks messages off its message queue and invokes the corresponding handler]

Page 18: How Charm works its magic

Converse Scheduler
• The core of Converse is message-driven execution
  – But scheduled entities are not just "messages" from remote processors
• Generalized notion of messages: any schedulable entity
• From the scheduler's point of view, a message is a block of memory
  – The first few bytes encode a handler function (as an index into a table)
• The scheduler, in each iteration:
  – Polls the network, enqueuing messages in a FIFO
  – Selects a message from either the FIFO or the local queue
  – Executes the handler of the selected message
    • This may result in enqueuing of messages in the local queue
  – The local queue is a prioritized LIFO/FIFO
  – Priorities may be integers (smaller: higher) or bitvectors (lexicographic)
A handler registration/send sketch follows.
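For concreteness, a hedged sketch of registering a handler and sending a Converse message (the Cmi calls are the standard Converse API; the "hello" logic is illustrative):

  #include "converse.h"

  int helloIdx;                     /* index into the handler table */

  void helloHandler(void *msg) {    /* runs when the scheduler picks the msg */
    CmiPrintf("[%d] hello received\n", CmiMyPe());
    CmiFree(msg);
  }

  void sendHello(int destPe) {
    void *msg = CmiAlloc(CmiMsgHeaderSizeBytes);
    CmiSetHandler(msg, helloIdx);   /* encode handler index in the header */
    CmiSyncSendAndFree(destPe, CmiMsgHeaderSizeBytes, msg);
  }

  /* at startup, on every PE: helloIdx = CmiRegisterHandler(helloHandler); */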

Page 19: How Charm works its magic

Converse: Communication and Threads
• Communication support:
  – "Send" a Converse message to a remote processor
  – The message must have a handler index encoded at the beginning
  – A variety of send variations (sync/async, memory deallocation) and broadcasts are supported
• Threads: a bigger topic
  – User-level threads
  – Migratable threads
  – Scheduled via the Converse scheduler
  – Suspend and awaken: low-level thread package (sketch below)
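A hedged sketch of the low-level thread calls (CthCreate, CthAwaken, and CthSuspend are the Converse thread API; the worker logic, and the assumption that a stack size of 0 requests the default, are illustrative):

  #include "converse.h"

  CthThread worker;

  void workerFn(void *arg) {
    CmiPrintf("[%d] worker started\n", CmiMyPe());
    CthSuspend();                 /* park until someone calls CthAwaken(worker) */
    CmiPrintf("[%d] worker resumed\n", CmiMyPe());
  }

  void spawnWorker(void) {
    worker = CthCreate((CthVoidFn)workerFn, NULL, 0);  /* 0: default stack (assumed) */
    CthAwaken(worker);            /* enqueue it with the scheduler */
  }

  /* later, e.g. from a message handler: CthAwaken(worker); */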

Page 20: How Charm works its magic

Communication Architecture
[Diagram: the communication API (send/recv) sits on interchangeable layers: Net (UDP: machine-eth.c; TCP: machine-tcp.c; Myrinet: machine-gm.c), MPI, and Shmem]

Page 21: How Charm works its magic

Parallel Program Startup
• Net version: nodelist
[Diagram: charmrun starts each compute node via rsh/ssh, passing its own (IP, port); each compute node reports its (IP, port) back, and charmrun broadcasts the full table to all nodes]
A sample nodelist appears below.
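A minimal nodelist for the net version might look like this (the "group main"/"host" syntax is the standard charmrun nodelist format; the host names are placeholders):

  group main
  host node0
  host node1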

Page 22: How Charm works its magic

Converse Initialization
• ConverseInit
  – Global variable initialization
  – Starts worker threads
• ConverseRunPE, for each Charm PE
  – Per-thread initialization
  – Loops into the scheduler: CsdScheduler()
A bare-Converse skeleton appears below.
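Seen from a bare Converse program, the startup sequence is roughly this (ConverseInit and the scheduler are the real entry points; the start function is illustrative):

  #include "converse.h"

  void startFn(int argc, char **argv) {
    CmiPrintf("[%d] initialized\n", CmiMyPe());
    /* register handlers, kick off initial messages ... */
  }

  int main(int argc, char **argv) {
    /* usched=0: Converse runs the scheduler for us; initret=0: don't return */
    ConverseInit(argc, argv, (CmiStartFn)startFn, 0, 0);
    return 0;
  }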

Page 23: How Charm works its magic

Message Formats
• Net version:
  #define CMK_MSG_HEADER_BASIC { CmiUInt2 d0,d1,d2,d3,d4,d5,hdl,d7; }
  [Layout: dgram header | length | handler | xhandler]
• MPI version:
  #define CMK_MSG_HEADER_BASIC { CmiUInt2 rank, root, hdl, xhdl, info, d3; }
  [Layout: rank | root | handler | xhandler | info | d3]

Page 24: How Charm works its magic

SMP Support
• MPI-smp as an example
  – Create threads: CmiStartThreads
  – Worker threads' work cycle: see code in machine-smp.c
  – Communication thread's work cycle: see code in machine-smp.c

Page 25: How Charm works its magic

Outline
• There is magic:
  – Overview of Charm capabilities
  – Virtualization paper
  – Summary of Charm features:
• Converse
  – Machine model
  – Scheduler and general messages
  – Communication
  – Threads
• Proxies and generated code
• Support for migration
  – Seed balancing
  – Migration support with reductions
• How migration is used
  – Vacating workstations and adjusting to speed
  – Adaptive scheduler
• Principle of persistence:
  – Measurement-based load balancing
    • Centralized strategies, refinement, CommLB
    • Distributed strategies (nbr)
  – Collective communication optimizations
• Delegation
• Converse client-server interface
• Libraries: liveViz, FFT, ...

Page 26: How Charm works its magic

Charm++ Translator
[Diagram: the Charm++ translator turns the interface (.ci) file into generated proxy code, which is included by the program]

Page 27: How Charm works its magic

Need for Proxies
• Consider:
  – Object x of class A wants to invoke method f of object y of class B
  – x and y are on different processors
  – What should the syntax be?
    • y->f(...)? Doesn't work, because y is not a local pointer
• Needed:
  – Instead of "y", we must use an ID that is valid across processors
  – Method invocation should use this ID
  – Some part of the system must pack the parameters and send them
  – Some part of the system on the remote processor must invoke the right method on the right object with the parameters supplied

Page 28: How Charm works its magic

Charm++ Solution: Proxy Classes
• Classes with remotely invokable methods
  – Inherit from the "chare" class (system-defined)
  – Entry methods can only have one parameter: a subclass of message
• For each chare class D whose methods we want to invoke remotely:
  – The system automatically generates a proxy class CProxy_D
  – Proxy objects know where the real object is
  – Methods invoked on this class simply put the data in an "envelope" and send it out to the destination
• Each chare object has a proxy
  – CProxy_D thisProxy; // thisProxy inherited from "CBase_D"
  – You can also get a proxy for a chare when you create it:
    • CProxy_D myNewChare = CProxy_D::ckNew(arg);

Page 29: How Charm works its magic

Generation of Proxy Classes
• How does Charm generate the proxy classes?
  – It needs help from the programmer
  – Name the classes and methods that can be remotely invoked
  – Declare them in a special "charm interface" file (pgm.ci)
  – Include the generated code in your program

pgm.ci:
  mainmodule PiMod {
    mainchare main {
      entry main();
      entry results(int pc);
    };
    chare piPart {
      entry piPart(void);
    };
  };

This generates PiMod.decl.h and PiMod.def.h, included as:
  pgm.h: #include "PiMod.decl.h" ...
  pgm.c: #include "PiMod.def.h"

A sketch of the matching C++ side follows.
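A hedged sketch of the C++ side matching the interface file above (the class layout follows pgm.ci; the pi-estimation details are elided and the wiring between main and piPart is illustrative):

  #include "PiMod.decl.h"

  class main : public CBase_main {
    int received;
  public:
    main() : received(0) {             // entry main()
      for (int i = 0; i < 10; i++)
        CProxy_piPart::ckNew();        // seed 10 piPart chares
    }
    void results(int pc) {             // called back by each piPart
      if (++received == 10) CkExit();
    }
  };

  class piPart : public CBase_piPart {
  public:
    piPart() { /* sample points; send the count back to the mainchare ... */ }
  };

  #include "PiMod.def.h"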

Page 30: How Charm works its magic

Object Groups
• A group of objects (chares)
  – With exactly one representative on each processor
  – A single proxy for the group as a whole
  – Invoke methods in one branch (asynchronously), in all branches (broadcast), or in the local branch
  – Creation:
    • agroup = CProxy_C::ckNew(msg);
  – Remote invocation:
    • p.methodName(msg); // or p.methodName(msg, peNum);
    • p.ckLocalBranch()->f(...);
A usage sketch follows.
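Putting the pieces together, a hedged sketch of declaring and using a group (the group keyword, ckNew, and ckLocalBranch are standard Charm++; the Counter class is illustrative):

  // In the .ci file:
  //   group Counter {
  //     entry Counter();
  //     entry void add(int n);
  //   };

  CProxy_Counter g = CProxy_Counter::ckNew();  // one branch per processor
  g.add(5);                          // broadcast: runs on every branch
  g[2].add(5);                       // only the branch on processor 2
  Counter *c = g.ckLocalBranch();    // plain pointer to my local branch
  c->add(5);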

Page 31: How Charm works its magic

Information Sharing Abstractions
• Observation:
  – Information is shared in several specific modes in parallel programs
• Other models support only a limited set of modes:
  – Shared memory: everything is shared (the sledgehammer approach)
  – Message passing: messages are the only method
• Charm++ identifies and supports several modes:
  – Readonly / writeonce
  – Tables (hash tables)
  – Accumulators
  – Monotonic variables

Page 32: How Charm works its magic

Outline
• There is magic:
  – Overview of Charm capabilities
  – Virtualization paper
  – Summary of Charm features:
• Converse
  – Machine model
  – Scheduler and general messages
  – Communication
  – Threads
• Proxies and generated code
• Support for migration
  – Seed balancing
  – Migration support with reductions
• How migration is used
  – Vacating workstations and adjusting to speed
  – Adaptive scheduler
• Principle of persistence:
  – Measurement-based load balancing
    • Centralized strategies, refinement, CommLB
    • Distributed strategies (nbr)
  – Collective communication optimizations
• Delegation
• Converse client-server interface
• Libraries: liveViz, FFT, ...

Page 33: How Charm works its magic

Seed Balancing
• Applies (currently) to singleton chares
  – Not chare array elements
  – When a new chare is created, the system has the freedom to assign it to any processor
  – The "seed" message (containing the constructor parameters) may be moved among processors until it takes root
• Use
  – Tree-structured computations
  – State-space search, divide and conquer
  – Early applications of Charm
  – See papers


Page 35: How Charm works its magic

Object Arrays
• A collection of data-driven objects
  – With a single global name for the collection
  – Each member addressed by an index
    • [sparse] 1D, 2D, 3D, tree, string, ...
  – Mapping of element objects to processors handled by the system
[Figure: user's view: elements A[0], A[1], A[2], A[3], ...; system view: the elements distributed across processors]
A declaration/usage sketch follows.
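A hedged sketch of a 1D chare array (the array declaration syntax, ckNew, indexing, and broadcast are standard Charm++; the Hello class is illustrative):

  // In the .ci file:
  //   array [1D] Hello {
  //     entry Hello();
  //     entry void greet(int from);
  //   };

  CProxy_Hello a = CProxy_Hello::ckNew(16);  // 16 elements, placed by the RTS
  a[3].greet(0);     // async invocation on element 3, wherever it lives
  a.greet(0);        // broadcast to every element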


Page 37: How Charm works its magic

Migration Support
• Forwarding of messages
  – Optimized by hop counts: if a message took multiple hops, the receiver sends the current address to the sender
  – The array manager maintains a cache of known addresses
• Home processor for each element
  – Defined by a hash function on the index
• Reductions and broadcasts
  – Must work in the presence of migrations, and of deletions/insertions
  – Communication time proportional to the number of processors, not objects
  – Uses a spanning tree over processors, handling migrated "stragglers" separately
A sketch of the serialization hook that makes migration possible follows.
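Migration presumes objects can be serialized; a hedged sketch of the standard pup interface (PUP::er and the migration constructor are real Charm++ mechanisms; the Worker class and its members are illustrative):

  class Worker : public CBase_Worker {
    int step;
    double state[64];
  public:
    Worker() : step(0) {}
    Worker(CkMigrateMessage *m) {}   // required migration constructor
    void pup(PUP::er &p) {           // pack/unpack for migration/checkpoint
      p | step;
      p(state, 64);
    }
  };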

Page 38: How Charm works its magic

Outline
• There is magic:
  – Overview of Charm capabilities
  – Virtualization paper
  – Summary of Charm features:
• Converse
  – Machine model
  – Scheduler and general messages
  – Communication
  – Threads
• Proxies and generated code
• Support for migration
  – Seed balancing
  – Migration support with reductions
• How migration is used
  – Vacating workstations and adjusting to speed
  – Adaptive scheduler
• Principle of persistence:
  – Measurement-based load balancing
    • Centralized strategies, refinement, CommLB
    • Distributed strategies (nbr)
  – Collective communication optimizations
• Delegation
• Converse client-server interface
• Libraries: liveViz, FFT, ...

Page 39: How Charm works its magic

Flexible Mapping: Using Migratability
• Vacating workstations
  – If the "owner" starts using a workstation, migrate objects away
  – Detection:
• Adjusting to speed:
  – Static: measure speeds at the beginning; use speed ratios in load balancing
  – Dynamic: in a time-shared environment, measure the Unix "load" and migrate a proportional set of objects if a machine is loaded
• Adaptive job scheduler
  – Tells jobs to change the set of processors used
  – Each job maintains a bit-vector of processors it can use
  – The scheduler can change the bit-vector
  – Jobs obey by migrating objects
  – Forwarding, reductions: a residual process is left behind

Page 40: How Charm works its magic

Faucets: Optimizing Utilization Within/Across Clusters
[Diagram: job specs and file upload flow from the submitter to candidate clusters; clusters return bids, and the chosen cluster returns a job ID; a job monitor tracks the running job]
http://charm.cs.uiuc.edu/research/faucets

Page 41: How Charm works its magic

Job Monitoring: Appspector
When you attach to a job:
• Live performance data (bottom)
• Application data: app-supplied, live

Page 42: How Charm works its magic

Inefficient Utilization Within a Cluster
[Diagram: a 16-processor system allocates 10 processors to Job A; Job B needs 8 processors, so it conflicts and is queued]
Current job schedulers can yield low system utilization: a competitive problem in Faucets-like systems. Parallel servers are "profit centers" in Faucets and need high utilization.

Page 43: How Charm works its magic

Two Adaptive Jobs
[Diagram: on a 16-processor system, Job A (min_pe = 1, max_pe = 10) runs; when Job B (min_pe = 8, max_pe = 16) arrives, A shrinks and B is allocated; when B finishes, A expands again]
Adaptive jobs can shrink or expand the number of processors they use at runtime, by migrating virtual processors in Charm/AMPI.

Page 44: How Charm works its magic

AQS: Adaptive Queuing System
• AQS: a scheduler for clusters
  – Has the ability to manage adaptive jobs (currently, those implemented in Charm++ and AMPI)
  – Handles regular (non-adaptive) MPI jobs
  – Experimental results on the CSE Turing cluster
[Plots: system utilization (%) and mean response time (s) vs. system load (12-108%), comparing traditional jobs with adaptive jobs]

Page 45: How Charm works its magic

Outline
• There is magic:
  – Overview of Charm capabilities
  – Virtualization paper
  – Summary of Charm features:
• Converse
  – Machine model
  – Scheduler and general messages
  – Communication
  – Threads
• Proxies and generated code
• Support for migration
  – Seed balancing
  – Migration support with reductions
• How migration is used
  – Vacating workstations and adjusting to speed
  – Adaptive scheduler
• Principle of persistence:
  – Measurement-based load balancing
    • Centralized strategies, refinement, CommLB
    • Distributed strategies (nbr)
  – Collective communication optimizations
• Delegation
• Converse client-server interface
• Libraries: liveViz, FFT, ...

Page 46: How Charm works its magic

IV: Principle of Persistence
• Once the application is expressed in terms of interacting objects:
  – Object communication patterns and computational loads tend to persist over time
  – In spite of dynamic behavior
    • Abrupt and large, but infrequent changes (e.g., AMR)
    • Slow and small changes (e.g., particle migration)
• A parallel analog of the principle of locality
  – A heuristic that holds for most CSE applications
  – Enables learning / adaptive algorithms
  – Adaptive communication libraries
  – Measurement-based load balancing

Page 47: How Charm works its magic

Measurement-Based Load Balancing
• Based on the principle of persistence
• Runtime instrumentation
  – Measures communication volume and computation time
• Measurement-based load balancers
  – Use the instrumented database periodically to make new decisions
  – Many alternative strategies can use the database
    • Centralized vs. distributed
    • Greedy improvements vs. complete reassignments
    • Taking communication into account
    • Taking dependences into account (more complex)
A sketch of the application-side hooks follows.
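From the application's side, a hedged sketch of the standard hooks (usesAtSync, AtSync, and ResumeFromSync are real Charm++ calls; the Block array element and its loop are illustrative):

  class Block : public CBase_Block {
    int iter;
  public:
    Block() : iter(0) { usesAtSync = true; }   // opt in to instrumentation
    Block(CkMigrateMessage *m) {}
    void pup(PUP::er &p) { p | iter; }
    void iterate() {
      /* ... one compute/communicate step; the RTS measures it ... */
      if (++iter % 10 == 0)
        AtSync();                        // let the balancer migrate objects
      else
        thisProxy[thisIndex].iterate();  // continue asynchronously
    }
    void ResumeFromSync() { thisProxy[thisIndex].iterate(); }
  };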

Page 48: How Charm works its magic

Load Balancer in Action
[Plot: number of iterations per second vs. iteration number (1-91) for crack propagation: (1) elements are added and throughput drops; (2) the load balancer is invoked; (3) chunks are migrated and throughput recovers]
Automatic load balancing in crack propagation.

Page 49: How Charm works its magic


Measurement based load balancing

Page 50: How Charm works its magic

Outline
• There is magic:
  – Overview of Charm capabilities
  – Virtualization paper
  – Summary of Charm features:
• Converse
  – Machine model
  – Scheduler and general messages
  – Communication
  – Threads
• Proxies and generated code
• Support for migration
  – Seed balancing
  – Migration support with reductions
• How migration is used
  – Vacating workstations and adjusting to speed
  – Adaptive scheduler
• Principle of persistence:
  – Measurement-based load balancing
    • Centralized strategies, refinement, CommLB
    • Distributed strategies (nbr)
  – Collective communication optimizations
• Delegation
• Converse client-server interface
• Libraries: liveViz, FFT, ...

Page 51: How Charm works its magic

Optimizing for Communication Patterns
• The parallel-objects runtime system can observe, instrument, and measure communication patterns
  – Communication is from/to objects, not processors
  – Load balancers can use this to optimize object placement
  – Communication libraries can optimize
    • By substituting the most suitable algorithm for each operation
    • Learning at runtime
V. Krishnan, MS Thesis, 1996

Page 52: How Charm works its magic

Delegation Mechanism
• Messages to array elements
  – Normally, an array message is handled by the BOC responsible for that array
  – But you can delegate message sending to another library (BOC)
• All proxies support the following delegation routine:
  – void CProxy::ckDelegate(CkGroupID delMgr);
  – This is useful for constructing communication libraries
    • Suppose you want to combine many small messages
• Use:
  – At the beginning: proxy.ckDelegate(libID);
  – In loops: { lib.start(); proxy[i].f(m1); ...; lib.end(); }
  – The library can maintain internal data structures for buffering messages
  – The user's interface remains unchanged (except for the lib calls)


Page 55: How Charm works its magic

Communication Optimizations
• Persistence principle
  – Observe communication behavior across iterations
  – Communication patterns: who sends to whom; size of messages
• "Collective" communication
  – Substitute optimal algorithms
  – Tune parameters
[Plot: time/step (ms) vs. processors (256, 512, 1024), comparing Mesh, Direct, and MPI strategies]
This approach makes a difference in applications, e.g., the transpose in NAMD.

Page 56: How Charm works its magic

Outline
• There is magic:
  – Overview of Charm capabilities
  – Virtualization paper
  – Summary of Charm features:
• Converse
  – Machine model
  – Scheduler and general messages
  – Communication
  – Threads
• Proxies and generated code
• Support for migration
  – Seed balancing
  – Migration support with reductions
• How migration is used
  – Vacating workstations and adjusting to speed
  – Adaptive scheduler
• Principle of persistence:
  – Measurement-based load balancing
    • Centralized strategies, refinement, CommLB
    • Distributed strategies (nbr)
  – Collective communication optimizations
• Delegation
• Converse client-server interface
• Libraries: liveViz, FFT, ...


Page 58: How Charm works its magic

Object-Based Parallelization
[Figure: user view (interacting objects) vs. system implementation (objects mapped onto processors)]
The user is only concerned with the interaction between objects.

Page 59: How Charm works its magic

Data-driven execution
[Figure: two processors, each running a scheduler that picks messages off its message queue]


Page 61: How Charm works its magic

Scaling to 64K/128K Processors of BG/L
• What issues will arise?
  – Communication
    • Bandwidth use more important than processor overhead
    • Locality
  – Global synchronizations
    • Costly, but not because they take longer
    • Rather, small "jitters" have a large impact
    • Sum of max vs. max of sum
  – Load imbalance important, but low grainsize is crucial
  – Critical paths gain importance

Page 62: How Charm works its magic

Runtime Optimization Challenges
• Full exploitation of object decomposition
  – Load balance within individual phases
  – Automatic identification of phases
• 10K-100K processors
  – Inadequate load balancing strategies
  – Need topology-aware balancing
  – Also fully distributed decision making, but based on "reasonable" global information
• Communication optimizations
  – Identify patterns, develop algorithms suitable for each, and implement "learning" strategies
• Fault tolerance and incremental checkpointing
  – Objects as the unit of checkpointing
  – In-memory checkpoints
  – Tradeoff between forward-path overhead and fault-handling cost