MPI for BG/L
George Almási
IBM Research
BG/L Day, Feb 6 2004 © 2004 IBM Corporation
Outline
Preliminaries
BG/L MPI Software Architecture
Optimization framework
Status & Future direction
BG/L MPI Who’s who
Users:
Watson: John Gunnels, BlueMatter team
NCAR: John Dennis, Henry Tufo
LANL: Adolfy Hoisie, Fabrizio Petrini, Darren Kerbyson
IBM India: Meeta Sharma, Rahul Garg
LLNL, Astron
Testers:
Functionality testing: Glenn Leckband, Jeff Garbisch (Rochester)
Performance testing: Kurt Pinnow, Joe Ratterman (Rochester)
Performance analysis: Jesus Labarta (UPC), Nils Smeds, Bob Walkup, Gyan Bhanot, Frank Suits
Developers:
MPICH2 framework: Bill Gropp, Rusty Lusk, Brian Toonen, Rajeev Thakur, others (ANL)
BG/L port, library core: Charles Archer (Rochester), George Almasi, Xavier Martorell
Torus primitives: Nils Smeds, Philip Heidelberger
Tree primitives: Chris Erway, Burk Steinmacher
Enablers: System software group (you know who you are)
The BG/L MPI Design Effort
Started off with constraints and ideas from everywhere, pulling in every direction
Use algorithm X for HW feature Y; MPI package choice, battle over required functionality; operating system and job start/management constraints
90% of the work was figuring out which ideas made immediate sense:
Implement immediately
Implement in the long term, but ditch for the first year
Evaluate only when hardware becomes available
Forget it
Development framework established by January 2003; the project grew alarmingly:
January 2003: 1 full-time + 1 postdoc + 1 summer student
January 2004: ~30 people (implementation, testing, performance)
MPICH2-based BG/L Software Architecture

[Diagram: the MPICH2 stack as adapted for BG/L. Message passing: the MPI layer (collectives, pt2pt, datatype, topo) sits on the Abstract Device Interface, whose devices include CH3 (socket, MM), simple (uniprocessor), and the BG/L-specific bgltorus device. Process management: PMI, with mpd and a bgltorus process manager. The bgltorus device is built on the Message Layer (torus, tree, GI), whose components TorusDevice, TreeDevice, GIDevice, and CIOProtocol sit on the Packet Layer.]
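A minimal, self-contained C sketch of this layering, with invented names (MsgLayerDevice, bgltorus_progress, and the stubs are illustrative only, not the actual BG/L source): one ADI-level device drives the per-network message-layer devices from a single progress call.

/* Illustrative sketch of the layering above (invented names; not the
 * real BG/L code): an ADI-level "bgltorus" device dispatching a progress
 * call to per-network message-layer devices (torus, tree, GI), each of
 * which would sit on its own packet layer.                              */
#include <stdio.h>

typedef struct MsgLayerDevice {
    const char *name;
    int (*advance)(struct MsgLayerDevice *self);  /* poll FIFOs, move packets */
} MsgLayerDevice;

static int noop_advance(MsgLayerDevice *self)     /* stub packet layer */
{
    printf("advancing %s device\n", self->name);
    return 0;                                     /* 0 = no packets moved */
}

static MsgLayerDevice torus = { "torus", noop_advance };
static MsgLayerDevice tree  = { "tree",  noop_advance };
static MsgLayerDevice gi    = { "GI",    noop_advance };

/* The bgltorus ADI device drives all three networks from one call. */
static int bgltorus_progress(void)
{
    int moved = 0;
    moved += torus.advance(&torus);   /* point-to-point traffic       */
    moved += tree.advance(&tree);     /* collectives / reductions     */
    moved += gi.advance(&gi);         /* global interrupts (barriers) */
    return moved;
}

int main(void)
{
    bgltorus_progress();  /* MPICH2's progress engine would call this */
    return 0;
}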
Architecture Detail: Message Layer

[Diagram: the Connection Manager keeps one virtual connection per MPI rank, mapping rank 0 (0,0,0), rank 1 (0,0,1), rank 2 (0,0,2), ... rank n (x,y,z) to torus coordinates; each connection has its own send queue and receive state. A Progress Engine drives a Dispatcher and a Send Manager. Each send queue holds messages msg1 ... msgP; per-message data includes the (un)packetizer, the user buffer, protocol & state information, and the associated MPID_Request.]
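A toy, self-contained sketch of the structures in this diagram, with invented names and a fake in-memory "delivery" standing in for real torus packets: one virtual connection per rank, each with its own send queue, driven by a send manager that packetizes messages in fixed-size chunks.

/* Toy sketch of the message-layer structures above (illustrative only,
 * not the actual BG/L implementation).                                  */
#include <stdio.h>

#define NRANKS  4
#define PAYLOAD 240            /* assumed per-packet payload (illustrative) */

typedef struct Msg {           /* one entry on a connection's send queue */
    const char *buf;           /* user buffer                            */
    size_t      len, sent;     /* total length / bytes already packetized*/
    struct Msg *next;
} Msg;

typedef struct {               /* per-rank virtual connection            */
    Msg   *send_queue;         /* pending outgoing messages              */
    size_t bytes_received;     /* trivial stand-in for receive-side state*/
} Connection;

static Connection conn[NRANKS];

/* "Send manager": packetize up to one chunk per connection per call.   */
static void advance_sends(void)
{
    for (int r = 0; r < NRANKS; r++) {
        Msg *m = conn[r].send_queue;
        if (!m) continue;
        size_t chunk = m->len - m->sent;
        if (chunk > PAYLOAD) chunk = PAYLOAD;
        /* a real send manager would write a torus packet here */
        conn[NRANKS - 1 - r].bytes_received += chunk;   /* toy "delivery" */
        m->sent += chunk;
        if (m->sent == m->len) conn[r].send_queue = m->next;
    }
}

int main(void)
{
    static char data[1000];
    Msg m = { data, sizeof data, 0, NULL };
    conn[0].send_queue = &m;                    /* rank 0 sends 1000 bytes */
    while (conn[0].send_queue) advance_sends(); /* progress until drained  */
    printf("rank %d received %zu bytes\n", NRANKS - 1,
           conn[NRANKS - 1].bytes_received);
    return 0;
}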
Performance Limiting Factors in the MPI Design
Hardware:
Torus network link bandwidth: 0.25 Bytes/cycle/link (theoretical), 0.22 Bytes/cycle/link (effective); 12 * 0.22 = 2.64 Bytes/cycle/node
Streaming memory bandwidth: 4.3 Bytes/cycle/CPU; memory copies are expensive
Dual-core setup, memory coherency: explicit coherency management via "blind device" and cache flush primitives; requires communication between processors; best done in large chunks; the coprocessor cannot manage MPI data structures
Network order semantics and routing: deterministic routing is in order but gives bad torus performance; adaptive routing gives excellent network performance but out-of-order packets; in-order semantics is expensive
CPU/network interface: 204 cycles to read a packet, 50-100 cycles to write a packet; alignment restrictions make handling badly aligned data expensive
Short FIFOs: the network needs frequent attention
Software:
Only tree channel 1 is available to MPI
CNK is single-threaded; MPICH2 is not thread safe
Context switches are expensive
Interrupt-driven execution is slow
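The ~90 cycles/packet budget quoted on a later slide follows directly from these numbers, assuming roughly 240 bytes of MPI payload per torus packet (an assumed figure, not stated on this slide):

\[
  12 \times 0.22 \approx 2.64\ \text{Bytes/cycle/node},
  \qquad
  \frac{240\ \text{Bytes/packet}}{2.64\ \text{Bytes/cycle}} \approx 91\ \text{cycles/packet}
\]

That budget is well below the measured 204 cycles needed just to read one packet from the torus, which is why per-packet CPU overhead dominates the design.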
Optimizing short-message latency
The thing to watch is overhead
Bandwidth, CPU load, co-processor, network load
Memory copies take care of alignment; deterministic routing ensures MPI semantics
Adaptive routing would double message layer overhead; the balance here may change as we scale to 64k nodes
Today: half of the nearest-neighbor roundtrip latency is 3000 cycles
About 6 µs @ 500 MHz; within SOW specs @ 700 MHz
Can improve 20-25% by shortening packets
Composition of roundtrip latency: HW 32%, message layer 13%, per-packet overhead 29%, high level (MPI) 26%. Network load is not a factor: not enough network traffic.
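Figures like the 3000-cycle half roundtrip above are typically measured with a ping-pong microbenchmark; a minimal sketch using only standard MPI calls (nothing BG/L-specific):

/* Minimal MPI ping-pong for measuring half roundtrip latency between
 * ranks 0 and 1 (standard MPI; run with at least 2 processes).         */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, iters = 10000;
    char buf[1] = {0};                              /* 1-byte payload */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double half_rt = (MPI_Wtime() - t0) / (2.0 * iters);
    if (rank == 0)
        printf("half roundtrip latency: %.2f us\n", half_rt * 1e6);
    MPI_Finalize();
    return 0;
}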
Optimizing MPI for High CPU/Network Traffic (neighbor-to-neighbor communication)
Most important thing to optimize for: CPU overhead per packet
At maximum torus utilization, only 90 CPU cycles are available to prepare/handle a packet!
Sad (measured) reality: READ 204 cycles, WRITE 50-100 cycles, plus MPI overhead
Packet overhead reduction with "cooked" packets:
Contain the destination address
Assume an initial dialog (rendezvous)
Rendezvous costs 3000 cycles, saves 100 cycles/packet
Allows adaptively routed packets
Permits coprocessor mode
Coprocessor mode essential (allows 180 cycles/CPU/packet)
Explicit cache management: 5000 cycles/message; system support necessary (coprocessor library, scratchpad library); lingering RIT1 memory issues
Adaptive routing essential
MPI semantics achieved by an initial deterministically routed scout packet
Packet alignment issues handled with 0 memory copies, overlapping realignment with torus reading
Drawback: only works well for long messages (10 KBytes+)
Per-node asymptotic bandwidth in MPI
[Chart: per-node bandwidth in coprocessor mode — bandwidth (Bytes/cycle, 0 to 2) vs. number of senders and receivers (0 to 6)]
[Chart: per-node bandwidth in heater mode — bandwidth (Bytes/cycle, 0 to 2) vs. number of senders and receivers (0 to 6)]
The cost of packet re-alignment
The cost (cycles) of reading a packet from the torus into un-aligned memory
[Chart: cycles (0-600) vs. alignment offset (0-15), comparing non-aligned receive, receive + copy, and ideal]
Optimizing for high network traffic, short messages
High network traffic: adaptive routing is an absolute necessity
Short messages: cannot use the rendezvous protocol
CPU load is not a limiting factor; the coprocessor is irrelevant
Message reordering solution: worst case up to 1000 cycles/packet; per-CPU bandwidth limited to 10% of nominal peak
Flow control solution: quasi-sync protocol, with ack packets for each unordered message; only works for messages long enough that Tmsg > latency
This situation is not prevalent on the 8x8x8 network, but it will be one of the scaling problems: cross-section bandwidth increases with n^2 while the number of CPUs increases with n^3
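The last point can be made concrete with a back-of-envelope relation for an n x n x n partition:

\[
  \text{cross-section bandwidth} \propto n^{2}, \qquad
  \#\text{CPUs} \propto n^{3}
  \quad\Longrightarrow\quad
  \text{cross-section bandwidth per CPU} \propto \tfrac{1}{n}
\]

so traffic that has to cross the machine gets relatively worse as the partition grows from 8x8x8 toward the full 64k-node system, even though per-node link bandwidth stays constant.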
MPI communication protocols
A mechanism to optimize MPI behavior based on communication requirements
Protocol     Status    Routing   NN BW    Dyn BW  Latency  Copro.  Range
Eager        Deployed  Det.      High     Low     Good     No      0.2-10 KB
Short        Deployed  Det.      Low      Low     V. good  No      0-240 B
Rendezvous   Deployed  Adaptive  V. high  Max.    Bad      Yes     3 KB-
Quasi-sync   Planned   Hybrid    Good     High    ?        No      0.5-3 KB
(NN BW = nearest-neighbor bandwidth; Copro. = coprocessor mode support)
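As an illustration only (the names, thresholds, and structure are invented for this sketch, not taken from the BG/L MPI source), a protocol chooser driven by the Range column of the table might look like:

/* Illustrative protocol chooser based on the table above (invented
 * names and simplified thresholds; not the actual BG/L MPI code).      */
typedef enum { PROTO_SHORT, PROTO_EAGER,
               PROTO_QUASI_SYNC, PROTO_RENDEZVOUS } Protocol;

Protocol choose_protocol(unsigned long nbytes, int quasi_sync_enabled)
{
    if (nbytes <= 240)                               /* fits in one packet  */
        return PROTO_SHORT;
    if (nbytes < 3 * 1024) {
        if (quasi_sync_enabled && nbytes >= 512)     /* planned: 0.5-3 KB   */
            return PROTO_QUASI_SYNC;
        return PROTO_EAGER;                          /* deterministic, eager */
    }
    return PROTO_RENDEZVOUS;                         /* adaptive, 3 KB and up */
}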
MPI communication protocols and their uses
[Diagram: protocol choice as a function of message size and CPU/network load — the short protocol for the smallest messages, then the eager protocol, the quasi-sync protocol, and the rendezvous protocol for the largest messages, with the coprocessor and rendezvous limits marked on the message-size axis]
MPI in Virtual Node Mode
Splitting resources between CPUs: 50% each of memory and cache, 50% each of the torus hardware; tree channel 0 is used by CNK, tree channel 1 is shared by the CPUs; common memory via the scratchpad
Virtual node mode is good for computationally intensive codes with a small memory footprint and small/medium network traffic
Deployed, used by the BlueMatter team
[Chart: effect of L3 sharing on virtual node mode — NAS performance measure (MOps/s/processor, 0-140) for cg, ep, ft, lu, bt, sp, comparing heater mode and virtual node mode]
Optimal MPI task->torus mapping
NAS BT has a 2D mesh communication pattern; how do we map it onto a 3D mesh/torus?
By folding and inverting planes in the 3D mesh (sketched below)
NAS BT scaling: computation scales down with n^-2, communication scales down with n^-1
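A minimal sketch of the fold-and-invert idea, assuming a 2D X x (Y*Z) process grid laid onto an X x Y x Z mesh (the actual BG/L mapping machinery and the NAS BT mapping may differ): even strips run forward in z, odd strips are inverted, so 2D nearest neighbors stay nearest neighbors on the mesh.

/* Illustrative fold-and-invert mapping of a 2D X x (Y*Z) process grid
 * onto a 3D X x Y x Z mesh (a sketch of the idea only).                */
#include <stdio.h>

void map2d_to_3d(int i, int j, int Z, int *x, int *y, int *z)
{
    *x = i;                                /* first 2D dimension unchanged */
    *y = j / Z;                            /* which Z-wide strip we are in */
    *z = (*y % 2 == 0) ? j % Z             /* even strips run forward ...  */
                       : Z - 1 - (j % Z);  /* ... odd strips are inverted  */
}

int main(void)                             /* tiny demo: 1 x 8 grid, Z = 4 */
{
    for (int j = 0; j < 8; j++) {
        int x, y, z;
        map2d_to_3d(0, j, 4, &x, &y, &z);
        printf("2D (0,%d) -> 3D (%d,%d,%d)\n", j, x, y, z);
    }
    return 0;
}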
[Chart: NAS BT scaling in virtual node mode — per-CPU performance (MOps/s/CPU, 0-100) vs. number of processors (121, 169, 225, 289, 361, 441, 529, 625, 729, 841, 961), comparing the naïve and optimized mappings]
Optimizing MPI Collective Operations
MPICH2 comes with default collective algorithms: functionally we are covered, but the default algorithms are not suitable for the torus topology; they were written with Ethernet-like networks in mind
Work has started on optimized collectives:
For the torus network: broadcast, alltoall
For the tree network: barrier, broadcast, allreduce
Work on testing for functionality and performance has just begun (Rochester performance testing team)
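For a flavor of what a topology-aware collective looks like, here is a minimal sketch of a dimension-by-dimension broadcast over a 3D Cartesian communicator, using only standard MPI calls; this is a generic mesh scheme for illustration, not the optimized BG/L torus broadcast shown on the next slide.

/* Sketch: dimension-ordered broadcast on a 3D Cartesian communicator.  */
#include <mpi.h>

void mesh_bcast(void *buf, int count, MPI_Datatype type,
                const int root[3], MPI_Comm cart3d)
{
    int rank, dims[3], periods[3], me[3];
    MPI_Comm_rank(cart3d, &rank);
    MPI_Cart_get(cart3d, 3, dims, periods, me);

    for (int d = 0; d < 3; d++) {
        /* Participate in step d only if our coordinates agree with the
           root's in every dimension after d; this guarantees that each
           step-d line root already holds the data.                      */
        int participate = 1;
        for (int e = d + 1; e < 3; e++)
            if (me[e] != root[e]) participate = 0;

        /* Group participants into lines along dimension d; a line is
           identified by the coordinates in the dimensions before d.     */
        int color = MPI_UNDEFINED;
        if (participate) {
            color = 0;
            for (int e = 0; e < d; e++) color = color * dims[e] + me[e];
        }
        MPI_Comm line;
        MPI_Comm_split(cart3d, color, /*key=*/me[d], &line);
        if (line != MPI_COMM_NULL) {
            MPI_Bcast(buf, count, type, /*line root=*/root[d], line);
            MPI_Comm_free(&line);
        }
    }
}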
Broadcast on a mesh (torus)
[Diagram: mesh broadcast schedule with per-node phases labeled 0S+2R, 1S+2R, 2S+2R, 3S+2R, 4S+2R (sends + receives per node)]
Based on ideas from Vernon Austel, John Gunnels, Phil Heidelberger, Nils Smeds
Implemented & measured by Nils Smeds
Optimized Tree Collectives

[Chart: tree broadcast bandwidth — bandwidth (Bytes/s, roughly 2.40e8) vs. message size (256 B to 4 MB) and processor count (8 to 512)]
[Chart: tree integer allreduce bandwidth — bandwidth (Bytes/s, 0 to 2.5e8) vs. message size (256 B to 4 MB) and processor count (8 to 512)]

Implementation with Chris Erway & Burk Steinmacher; measurements from Kurt Pinnow
BG/L MPI: Status Today (2/6/2004)
MPI-1 compliant; passes the large majority of the Intel/ANL MPI test suite
Coprocessor mode available: 50-70% improvement in bandwidth; regularly tested; not fully deployed (hampered by BLC 1.0 bugs)
Virtual node mode available: deployed, but not tested regularly
Process management: user-defined process-to-torus mappings available
Optimized collectives:
Optimized torus broadcast: ready for deployment pending code review and optimizations
Optimized tree broadcast, barrier, allreduce: almost ready for deployment
Functionality: OK
Performance: a good foundation
Where are we going to hurt next?
Anticipating this year: 4 racks in the near (?) future; we don't anticipate major scaling problems
CEO milestone at the end of the year: we are up to 2^9 of 2^16 nodes. That's halfway on a log scale.
We have not hit any "unprecedented" sizes yet; LLNL can run MPI jobs on more machines than we have.
Fear factor: the combination of a congested network and short messages
Lessons from last year:
Alignment problems
Co-processor mode: a coding nightmare; overlapping computation with communication; the coprocessor cannot touch data without the main processor cooperating
Excessive CPU load is hard to handle: even with the coprocessor, we still cannot handle 2.6 Bytes/cycle/node (yet)
Flow control: unexpected messages slow reception down
Conclusion
We are in the middle of moving from functionality mode to a performance-centric mode
Rochester is taking over functionality and routine performance testing; teams in Watson & Rochester are collaborating on collective performance
We don't know how to run 64k MPI processes; it is imperative to keep the design fluid enough to counter surprises; we are establishing a large community for measuring and analyzing behavior
A lot of performance work is needed: new protocol(s), collectives on the torus and tree