MPI for BG/L
George Almási
IBM Research
BG/L Day, Feb 6 2004 © 2004 IBM Corporation
Outline
Preliminaries
BG/L MPI Software Architecture
Optimization framework
Status & Future direction
BG/L MPI Who’s who
Users:
Watson: John Gunnels, BlueMatter team
NCAR: John Dennis, Henry Tufo
LANL: Adolfy Hoisie, Fabrizio Petrini, Darren Kerbyson
IBM India: Meeta Sharma, Rahul Garg
LLNL, Astron
Testers:
Functionality testing: Glenn Leckband, Jeff Garbisch (Rochester)
Performance testing: Kurt Pinnow, Joe Ratterman (Rochester)
Performance analysis: Jesus Labarta (UPC), Nils Smeds, Bob Walkup, Gyan Bhanot, Frank Suits
Developers:
MPICH2 framework: Bill Gropp, Rusty Lusk, Brian Toonen, Rajeev Thakur, others (ANL)
BG/L port, library core: Charles Archer (Rochester), George Almasi, Xavier Martorell
Torus primitives: Nils Smeds, Philip Heidelberger
Tree primitives: Chris Erway, Burk Steinmacher
Enablers: System software group (you know who you are)
The BG/L MPI Design Effort
Started off with constraints and ideas from everywhere, pulling in every direction
Use algorithm X for HW feature Y; MPI package choice, battle over required functionality; operating system and job start/management constraints
90% of the work was figuring out which ideas made immediate sense:
Implement immediately
Implement in the long term, but ditch for the first year
Evaluate only when hardware becomes available
Forget it
Development framework established by January 2003; the project grew alarmingly:
January 2003: 1 full-time + 1 postdoc + 1 summer student
January 2004: ~30 people (implementation, testing, performance)
MPICH2-based BG/L Software Architecture

[Diagram: the MPICH2 stack as adapted for BG/L. Message passing: the MPI layer (collectives, pt2pt, datatype, topo) sits on the Abstract Device Interface, whose devices include CH3 (socket, MM), simple (uniprocessor), and the BG/L-specific bgltorus device. Process management: PMI, with mpd and a bgltorus process manager. The bgltorus device is built on the Message Layer (torus, tree, GI), whose components TorusDevice, TreeDevice, GIDevice, and CIOProtocol sit on the Packet Layer.]
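A minimal, self-contained C sketch of this layering, with invented names (MsgLayerDevice, bgltorus_progress, and the stubs are illustrative only, not the actual BG/L source): one ADI-level device drives the per-network message-layer devices from a single progress call.

/* Illustrative sketch of the layering above (invented names; not the
 * real BG/L code): an ADI-level "bgltorus" device dispatching a progress
 * call to per-network message-layer devices (torus, tree, GI), each of
 * which would sit on its own packet layer.                              */
#include <stdio.h>

typedef struct MsgLayerDevice {
    const char *name;
    int (*advance)(struct MsgLayerDevice *self);  /* poll FIFOs, move packets */
} MsgLayerDevice;

static int noop_advance(MsgLayerDevice *self)     /* stub packet layer */
{
    printf("advancing %s device\n", self->name);
    return 0;                                     /* 0 = no packets moved */
}

static MsgLayerDevice torus = { "torus", noop_advance };
static MsgLayerDevice tree  = { "tree",  noop_advance };
static MsgLayerDevice gi    = { "GI",    noop_advance };

/* The bgltorus ADI device drives all three networks from one call. */
static int bgltorus_progress(void)
{
    int moved = 0;
    moved += torus.advance(&torus);   /* point-to-point traffic       */
    moved += tree.advance(&tree);     /* collectives / reductions     */
    moved += gi.advance(&gi);         /* global interrupts (barriers) */
    return moved;
}

int main(void)
{
    bgltorus_progress();  /* MPICH2's progress engine would call this */
    return 0;
}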
Architecture Detail: Message Layer

[Diagram: the Connection Manager keeps one virtual connection per MPI rank, mapping rank 0 (0,0,0), rank 1 (0,0,1), rank 2 (0,0,2), ... rank n (x,y,z) to torus coordinates; each connection has its own send queue and receive state. A Progress Engine drives a Dispatcher and a Send Manager. Each send queue holds messages msg1 ... msgP; per-message data includes the (un)packetizer, the user buffer, protocol & state information, and the associated MPID_Request.]
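A toy, self-contained sketch of the structures in this diagram, with invented names and a fake in-memory "delivery" standing in for real torus packets: one virtual connection per rank, each with its own send queue, driven by a send manager that packetizes messages in fixed-size chunks.

/* Toy sketch of the message-layer structures above (illustrative only,
 * not the actual BG/L implementation).                                  */
#include <stdio.h>

#define NRANKS  4
#define PAYLOAD 240            /* assumed per-packet payload (illustrative) */

typedef struct Msg {           /* one entry on a connection's send queue */
    const char *buf;           /* user buffer                            */
    size_t      len, sent;     /* total length / bytes already packetized*/
    struct Msg *next;
} Msg;

typedef struct {               /* per-rank virtual connection            */
    Msg   *send_queue;         /* pending outgoing messages              */
    size_t bytes_received;     /* trivial stand-in for receive-side state*/
} Connection;

static Connection conn[NRANKS];

/* "Send manager": packetize up to one chunk per connection per call.   */
static void advance_sends(void)
{
    for (int r = 0; r < NRANKS; r++) {
        Msg *m = conn[r].send_queue;
        if (!m) continue;
        size_t chunk = m->len - m->sent;
        if (chunk > PAYLOAD) chunk = PAYLOAD;
        /* a real send manager would write a torus packet here */
        conn[NRANKS - 1 - r].bytes_received += chunk;   /* toy "delivery" */
        m->sent += chunk;
        if (m->sent == m->len) conn[r].send_queue = m->next;
    }
}

int main(void)
{
    static char data[1000];
    Msg m = { data, sizeof data, 0, NULL };
    conn[0].send_queue = &m;                    /* rank 0 sends 1000 bytes */
    while (conn[0].send_queue) advance_sends(); /* progress until drained  */
    printf("rank %d received %zu bytes\n", NRANKS - 1,
           conn[NRANKS - 1].bytes_received);
    return 0;
}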
Performance Limiting Factors in the MPI Design
Hardware:
Torus network link bandwidth: 0.25 Bytes/cycle/link (theoretical), 0.22 Bytes/cycle/link (effective); 12 * 0.22 = 2.64 Bytes/cycle/node
Streaming memory bandwidth: 4.3 Bytes/cycle/CPU; memory copies are expensive
Dual-core setup, memory coherency: explicit coherency management via "blind device" and cache flush primitives; requires communication between processors; best done in large chunks; the coprocessor cannot manage MPI data structures
Network order semantics and routing: deterministic routing is in order but gives bad torus performance; adaptive routing gives excellent network performance but out-of-order packets; in-order semantics is expensive
CPU/network interface: 204 cycles to read a packet, 50-100 cycles to write a packet; alignment restrictions make handling badly aligned data expensive
Short FIFOs: the network needs frequent attention
Software:
Only tree channel 1 is available to MPI
CNK is single-threaded; MPICH2 is not thread safe
Context switches are expensive
Interrupt-driven execution is slow
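The ~90 cycles/packet budget quoted on a later slide follows directly from these numbers, assuming roughly 240 bytes of MPI payload per torus packet (an assumed figure, not stated on this slide):

\[
  12 \times 0.22 \approx 2.64\ \text{Bytes/cycle/node},
  \qquad
  \frac{240\ \text{Bytes/packet}}{2.64\ \text{Bytes/cycle}} \approx 91\ \text{cycles/packet}
\]

That budget is well below the measured 204 cycles needed just to read one packet from the torus, which is why per-packet CPU overhead dominates the design.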
Optimizing short-message latency
The thing to watch is overhead
Bandwidth, CPU load, co-processor, network load
Memory copies take care of alignment; deterministic routing ensures MPI semantics
Adaptive routing would double message layer overhead; the balance here may change as we scale to 64k nodes
Today: half of the nearest-neighbor roundtrip latency is 3000 cycles
About 6 µs @ 500 MHz; within SOW specs @ 700 MHz
Can improve 20-25% by shortening packets
Composition of roundtrip latency: HW 32%, message layer 13%, per-packet overhead 29%, high level (MPI) 26%. Network load is not a factor: not enough network traffic.
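Figures like the 3000-cycle half roundtrip above are typically measured with a ping-pong microbenchmark; a minimal sketch using only standard MPI calls (nothing BG/L-specific):

/* Minimal MPI ping-pong for measuring half roundtrip latency between
 * ranks 0 and 1 (standard MPI; run with at least 2 processes).         */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, iters = 10000;
    char buf[1] = {0};                              /* 1-byte payload */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double half_rt = (MPI_Wtime() - t0) / (2.0 * iters);
    if (rank == 0)
        printf("half roundtrip latency: %.2f us\n", half_rt * 1e6);
    MPI_Finalize();
    return 0;
}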
Optimizing MPI for High CPU/Network Traffic (neighbor-to-neighbor communication)
Most important thing to optimize for: CPU overhead per packet
At maximum torus utilization, only 90 CPU cycles are available to prepare/handle a packet!
Sad (measured) reality: READ 204 cycles, WRITE 50-100 cycles, plus MPI overhead
Packet overhead reduction with "cooked" packets:
Contain the destination address
Assume an initial dialog (rendezvous)
Rendezvous costs 3000 cycles, saves 100 cycles/packet
Allows adaptively routed packets
Permits coprocessor mode
Coprocessor mode essential (allows 180 cycles/CPU/packet)
Explicit cache management: 5000 cycles/message; system support necessary (coprocessor library, scratchpad library); lingering RIT1 memory issues
Adaptive routing essential
MPI semantics achieved by an initial deterministically routed scout packet
Packet alignment issues handled with 0 memory copies, overlapping realignment with torus reading
Drawback: only works well for long messages (10 KBytes+)
Per-node asymptotic bandwidth in MPI
[Chart: per-node bandwidth in coprocessor mode — bandwidth (Bytes/cycle, 0 to 2) vs. number of senders and receivers (0 to 6)]
[Chart: per-node bandwidth in heater mode — bandwidth (Bytes/cycle, 0 to 2) vs. number of senders and receivers (0 to 6)]
The cost of packet re-alignment
The cost (cycles) of reading a packet from the torus into un-aligned memory
[Chart: cycles (0-600) vs. alignment offset (0-15), comparing non-aligned receive, receive + copy, and ideal]
Optimizing for high network traffic, short messages
High network traffic: adaptive routing is an absolute necessity
Short messages: cannot use the rendezvous protocol
CPU load is not a limiting factor; the coprocessor is irrelevant
Message reordering solution: worst case up to 1000 cycles/packet; per-CPU bandwidth limited to 10% of nominal peak
Flow control solution: quasi-sync protocol, with ack packets for each unordered message; only works for messages long enough that Tmsg > latency
This situation is not prevalent on the 8x8x8 network, but it will be one of the scaling problems: cross-section bandwidth increases with n^2 while the number of CPUs increases with n^3
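The last point can be made concrete with a back-of-envelope relation for an n x n x n partition:

\[
  \text{cross-section bandwidth} \propto n^{2}, \qquad
  \#\text{CPUs} \propto n^{3}
  \quad\Longrightarrow\quad
  \text{cross-section bandwidth per CPU} \propto \tfrac{1}{n}
\]

so traffic that has to cross the machine gets relatively worse as the partition grows from 8x8x8 toward the full 64k-node system, even though per-node link bandwidth stays constant.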
MPI communication protocols
A mechanism to optimize MPI behavior based on communication requirements
Protocol     Status    Routing   NN BW    Dyn BW  Latency  Copro.  Range
Eager        Deployed  Det.      High     Low     Good     No      0.2-10 KB
Short        Deployed  Det.      Low      Low     V. good  No      0-240 B
Rendezvous   Deployed  Adaptive  V. high  Max.    Bad      Yes     3 KB-
Quasi-sync   Planned   Hybrid    Good     High    ?        No      0.5-3 KB
(NN BW = nearest-neighbor bandwidth; Copro. = coprocessor mode support)
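As an illustration only (the names, thresholds, and structure are invented for this sketch, not taken from the BG/L MPI source), a protocol chooser driven by the Range column of the table might look like:

/* Illustrative protocol chooser based on the table above (invented
 * names and simplified thresholds; not the actual BG/L MPI code).      */
typedef enum { PROTO_SHORT, PROTO_EAGER,
               PROTO_QUASI_SYNC, PROTO_RENDEZVOUS } Protocol;

Protocol choose_protocol(unsigned long nbytes, int quasi_sync_enabled)
{
    if (nbytes <= 240)                               /* fits in one packet  */
        return PROTO_SHORT;
    if (nbytes < 3 * 1024) {
        if (quasi_sync_enabled && nbytes >= 512)     /* planned: 0.5-3 KB   */
            return PROTO_QUASI_SYNC;
        return PROTO_EAGER;                          /* deterministic, eager */
    }
    return PROTO_RENDEZVOUS;                         /* adaptive, 3 KB and up */
}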
MPI communication protocols and their uses
[Diagram: protocol choice as a function of message size and CPU/network load — the short protocol for the smallest messages, then the eager protocol, the quasi-sync protocol, and the rendezvous protocol for the largest messages, with the coprocessor and rendezvous limits marked on the message-size axis]
MPI in Virtual Node Mode
Splitting resources between CPUs: 50% each of memory and cache, 50% each of the torus hardware; tree channel 0 is used by CNK, tree channel 1 is shared by the CPUs; common memory via the scratchpad
Virtual node mode is good for computationally intensive codes with a small memory footprint and small/medium network traffic
Deployed, used by the BlueMatter team
[Chart: effect of L3 sharing on virtual node mode — NAS performance measure (MOps/s/processor, 0-140) for cg, ep, ft, lu, bt, sp, comparing heater mode and virtual node mode]
Optimal MPI task->torus mapping
NAS BT has a 2D mesh communication pattern; how do we map it onto a 3D mesh/torus?
By folding and inverting planes in the 3D mesh (sketched below)
NAS BT scaling: computation scales down with n^-2, communication scales down with n^-1
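A minimal sketch of the fold-and-invert idea, assuming a 2D X x (Y*Z) process grid laid onto an X x Y x Z mesh (the actual BG/L mapping machinery and the NAS BT mapping may differ): even strips run forward in z, odd strips are inverted, so 2D nearest neighbors stay nearest neighbors on the mesh.

/* Illustrative fold-and-invert mapping of a 2D X x (Y*Z) process grid
 * onto a 3D X x Y x Z mesh (a sketch of the idea only).                */
#include <stdio.h>

void map2d_to_3d(int i, int j, int Z, int *x, int *y, int *z)
{
    *x = i;                                /* first 2D dimension unchanged */
    *y = j / Z;                            /* which Z-wide strip we are in */
    *z = (*y % 2 == 0) ? j % Z             /* even strips run forward ...  */
                       : Z - 1 - (j % Z);  /* ... odd strips are inverted  */
}

int main(void)                             /* tiny demo: 1 x 8 grid, Z = 4 */
{
    for (int j = 0; j < 8; j++) {
        int x, y, z;
        map2d_to_3d(0, j, 4, &x, &y, &z);
        printf("2D (0,%d) -> 3D (%d,%d,%d)\n", j, x, y, z);
    }
    return 0;
}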
[Chart: NAS BT scaling in virtual node mode — per-CPU performance (MOps/s/CPU, 0-100) vs. number of processors (121, 169, 225, 289, 361, 441, 529, 625, 729, 841, 961), comparing the naïve and optimized mappings]
Optimizing MPI Collective Operations
MPICH2 comes with default collective algorithms: functionally we are covered, but the default algorithms are not suitable for the torus topology; they were written with Ethernet-like networks in mind
Work has started on optimized collectives:
For the torus network: broadcast, alltoall
For the tree network: barrier, broadcast, allreduce
Work on testing for functionality and performance has just begun (Rochester performance testing team)
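For a flavor of what a topology-aware collective looks like, here is a minimal sketch of a dimension-by-dimension broadcast over a 3D Cartesian communicator, using only standard MPI calls; this is a generic mesh scheme for illustration, not the optimized BG/L torus broadcast shown on the next slide.

/* Sketch: dimension-ordered broadcast on a 3D Cartesian communicator.  */
#include <mpi.h>

void mesh_bcast(void *buf, int count, MPI_Datatype type,
                const int root[3], MPI_Comm cart3d)
{
    int rank, dims[3], periods[3], me[3];
    MPI_Comm_rank(cart3d, &rank);
    MPI_Cart_get(cart3d, 3, dims, periods, me);

    for (int d = 0; d < 3; d++) {
        /* Participate in step d only if our coordinates agree with the
           root's in every dimension after d; this guarantees that each
           step-d line root already holds the data.                      */
        int participate = 1;
        for (int e = d + 1; e < 3; e++)
            if (me[e] != root[e]) participate = 0;

        /* Group participants into lines along dimension d; a line is
           identified by the coordinates in the dimensions before d.     */
        int color = MPI_UNDEFINED;
        if (participate) {
            color = 0;
            for (int e = 0; e < d; e++) color = color * dims[e] + me[e];
        }
        MPI_Comm line;
        MPI_Comm_split(cart3d, color, /*key=*/me[d], &line);
        if (line != MPI_COMM_NULL) {
            MPI_Bcast(buf, count, type, /*line root=*/root[d], line);
            MPI_Comm_free(&line);
        }
    }
}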
Broadcast on a mesh (torus)
[Diagram: mesh broadcast schedule with per-node phases labeled 0S+2R, 1S+2R, 2S+2R, 3S+2R, 4S+2R (sends + receives per node)]
Based on ideas from Vernon Austel, John Gunnels, Phil Heidelberger, Nils Smeds
Implemented & measured by Nils Smeds
Optimized Tree Collectives

[Chart: tree broadcast bandwidth — bandwidth (Bytes/s, roughly 2.40e8) vs. message size (256 B to 4 MB) and processor count (8 to 512)]
[Chart: tree integer allreduce bandwidth — bandwidth (Bytes/s, 0 to 2.5e8) vs. message size (256 B to 4 MB) and processor count (8 to 512)]

Implementation with Chris Erway & Burk Steinmacher; measurements from Kurt Pinnow
BG/L MPI: Status Today (2/6/2004)
MPI-1 compliant; passes the large majority of the Intel/ANL MPI test suite
Coprocessor mode available: 50-70% improvement in bandwidth; regularly tested; not fully deployed (hampered by BLC 1.0 bugs)
Virtual node mode available: deployed, but not tested regularly
Process management: user-defined process-to-torus mappings available
Optimized collectives:
Optimized torus broadcast: ready for deployment pending code review and optimizations
Optimized tree broadcast, barrier, allreduce: almost ready for deployment
Functionality: OK
Performance: a good foundation
Where are we going to hurt next?
Anticipating this year: 4 racks in the near (?) future; we don't anticipate major scaling problems
CEO milestone at the end of the year: we are up to 2^9 of 2^16 nodes. That's halfway on a log scale.
We have not hit any "unprecedented" sizes yet; LLNL can run MPI jobs on more machines than we have.
Fear factor: the combination of a congested network and short messages
Lessons from last year:
Alignment problems
Co-processor mode: a coding nightmare; overlapping computation with communication; the coprocessor cannot touch data without the main processor cooperating
Excessive CPU load is hard to handle: even with the coprocessor, we still cannot handle 2.6 Bytes/cycle/node (yet)
Flow control: unexpected messages slow reception down
Conclusion
We are in the middle of moving from functionality mode to a performance-centric mode
Rochester is taking over functionality and routine performance testing; teams in Watson & Rochester are collaborating on collective performance
We don't know how to run 64k MPI processes; it is imperative to keep the design fluid enough to counter surprises; we are establishing a large community for measuring and analyzing behavior
A lot of performance work is needed: new protocol(s), collectives on the torus and tree