The Role of InfiniBand Technologies in
High Performance Computing
Contributors
Gil Bloch
Noam Bloch
Hillel Chapman
Manjunath Gorentla-Venkata
Richard Graham
Michael Kagan
Josh Ladd
Vasily Philipov
Steve Poole
Ishai Rabinovich
Ariel Shahar
Gilad Shainer
Pavel Shamis
Outline
Spider file system
CORE-Direct
– InfiniBand overview
– New InfiniBand capabilities
– Software design for collective operations
– Results
Spider File System at the Oak Ridge
Leadership Computing Facility
Motivation for Spider File System
Building dedicated file systems for each platform does not scale operationally
– Storage often 10% or more of new system cost
– Bundled storage often not poised to grow independently of attached machine
– Different curves for storage and compute technology
– Data needs to be moved between different compute islands
For example: from a simulation platform to a visualization platform
– Dedicated storage is only accessible when its machine is available
– Managing multiple file systems requires more manpower
[Diagram: data-sharing path – Jaguar XT5, Jaguar XT4, Ewok, Lens, and Smoky all reach the Spider system over the SION network]
Spider: A System At Scale
Over 10.7 PB of RAID 6 Capacity
13,440 1TB drives
192 storage servers
Over 3 TB of memory (Lustre OSS)
Available to many compute systems through high-speed network:
– Over 3,000 IB ports
– Over 5 kilometers of cables
Over 26,000 client mounts for I/O
Demonstrated I/O performance: 240 GB/s
Current Status
– in production use on all major OLCF computing platforms
Spider: Couplet and Scalable Cluster
Each Scalable Cluster (SC):
– 280 1 TB disks in 5 disk trays
– DDN couplet (2 controllers)
– OSS (4 Dell nodes), 24 IB ports
– Flextronics switch, uplinked to the Cisco core switch
16 SC units on the floor, 2 racks for each SC
Snapshot of Technical Challenges
Solved
Performance
– Asynchronous journaling
– Network congestion avoidance (topology aware I/O)
Scalability
– 26,000 clients
– 7 OSTs per OSS
– Lessons from server-side client statistics
Fault Tolerance and Reliability
– Network, I/O server, Storage Array
[Plot: congestion on the SeaStar torus network as a function of the number of clients]
Spider - How Did We Get Here?
A 4-year project
We didn’t just pick up the phone and order a center-wide file system
– No single vendor could deliver this system
– Trail blazing was required
Collaborative effort was key to success
– ORNL
– Cray
– DDN
– Cisco
– CFS, SUN, Oracle, and now Whamcloud
CORE-Direct Technology
Problems Being Addressed – Collective
Operations
Collective communication characteristics at scale
– Overlapping computation with communication – true asynchronous communications
– System noise
– Performance
– Scalability
Goal: avoid using the CPU for communication processing
Offload communication management to the network
Collective Communications
Communication pattern involving multiple processes (in MPI, all ranks in the communicator are involved)
Optimized collectives involve a communicator-wide data-dependent communication pattern
Data needs to be manipulated at intermediate stages of a collective operation
Collective operations limit application scalability
Collective operations magnify the effects of system noise
Scalability of Collective Operations
[Diagram: two panels – an ideal collective algorithm vs. the impact of system noise on a four-process exchange]
Scalability of Collective Operations - II
[Diagram: two panels – an offloaded algorithm vs. a nonblocking algorithm; shading marks communication processing]
Approach to solving the problem
Co-design
– Network stack design (Mellanox)
– Hardware development (Mellanox)
– Application-level requirements (ORNL)
– MPI/SHMEM-level implementation (joint)
InfiniBand Collective Offload – Key idea
Create local description of the communication patterns
Hand the description to the HCA
Manage collective communications at the network level
Poll for collective completion
Add new support for:
– Synchronization primitives (hardware): Send Enable, Receive Enable, and Wait tasks
– Multiple Work Request (MWR): a sequence of network tasks
– Management Queue (MQ)
InfiniBand Hardware Changes
Tasks defined in the current standard
• Send
• Receive
• Read
• Write
• Atomic
New support
Synchronization primitives (hardware)
– Send Enable task
– Receive Enable task
– Wait task
Multiple Work Request (MWR)
– A sequence of network tasks
Management Queue
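To make the new task types concrete, below is a rough sketch in C of how an offloaded sequence might be encoded. The enum, struct, and task-list layout are hypothetical illustrations of the idea, not the actual Mellanox verbs interface.

    /* Hypothetical encoding of a CORE-Direct-style task list.
     * All names here are illustrative; the real interface is the
     * vendor-specific verbs API. */

    enum task_type {
        TASK_SEND,        /* post a send on a peer QP                */
        TASK_RECV_WAIT,   /* wait for a completion on a receive CQ   */
        TASK_SEND_ENABLE  /* release a previously posted, gated send */
    };

    struct task {
        enum task_type type;
        int            peer;  /* rank whose QP/CQ the task targets   */
    };

    /* A Multiple Work Request (MWR): the whole collective described
     * once and handed to the HCA, which progresses it without CPU
     * involvement until the final completion is raised. */
    static const struct task mwr_rank0_barrier[] = {
        { TASK_SEND,        1 },  /* step 1: exchange with rank 1    */
        { TASK_RECV_WAIT,   1 },
        { TASK_SEND_ENABLE, 2 },  /* step 2 send fires only after    */
        { TASK_RECV_WAIT,   2 },  /* the step 1 wait clears          */
    };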
Standard InfiniBand Connected Queue
Design
Queue Structure
[Diagram: per-peer resources – a send/recv QP pair per peer, each with a receive CQ; per-communicator resources – a collective Management Queue (MQ) with its service MQ and MQ CQ, a single send CQ shared by all send queues, and QPs for small data, large data, credit exchange, and resource recycling]
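As a rough illustration of the split pictured above, the per-peer and per-communicator resources might be grouped as below; all names are hypothetical, chosen only to mirror the diagram.

    /* Hypothetical grouping of the offload resources in the diagram. */
    struct qp;  /* queue pair       */
    struct cq;  /* completion queue */
    struct mq;  /* management queue */

    struct peer_resources {        /* one per peer in the communicator */
        struct qp *send_qp;
        struct qp *recv_qp;
        struct cq *recv_cq;        /* completions for receives         */
    };

    struct comm_resources {        /* one per MPI communicator          */
        struct mq *collective_mq;  /* orders and gates collective tasks */
        struct cq *mq_cq;          /* completions for MQ service tasks  */
        struct cq *send_cq;        /* shared by all send queues         */
        struct qp *small_data_qp;
        struct qp *large_data_qp;
        struct qp *credit_qp;      /* credit exchange, resource recycling */
        struct peer_resources *peers;
        int npeers;
    };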
Collectives – Software Layers
[Diagram: Open MPI (OMPI) Modular Component Architecture – the collective framework holds the Tuned (pt2pt) collectives component and the ML hierarchical collectives component; beneath them, basic collectives and subgroup frameworks provide Pt2Pt, SM (shared memory), Socket, and IBNET/IB OFFLOAD paths, the last two built on Mellanox OFED (MLNX OFED)]
Example – 4 Process Recursive Doubling
[Diagram: recursive doubling among 4 processes – in step 1 each process exchanges with its neighbor at distance 1, in step 2 with its neighbor at distance 2]
4 Process Barrier Example
Algorithm (recursive doubling):
– Step 1: proc 0 exchanges with proc 1; proc 2 exchanges with proc 3
– Step 2: proc 0 exchanges with proc 2; proc 1 exchanges with proc 3

MWR, per process (proc 0 shown; the other ranks are symmetric):
– Send to proc 1
– Wait on recv from proc 1
– Send to proc 2
– Wait on recv from proc 2
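For reference, the same exchange pattern on the host side, as a minimal sketch in plain MPI (assuming, as in the example, that the communicator size is a power of two):

    #include <mpi.h>

    /* Recursive-doubling barrier: at distance d, each rank exchanges a
     * zero-byte message with rank^d, doubling d each step. */
    static void rd_barrier(MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        for (int dist = 1; dist < size; dist <<= 1) {
            int peer = rank ^ dist;  /* step 1: distance 1; step 2: distance 2 */
            /* "Send to peer" and "wait on recv from peer" in one call */
            MPI_Sendrecv(NULL, 0, MPI_BYTE, peer, 0,
                         NULL, 0, MPI_BYTE, peer, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }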
4 Process Barrier Example – Queue view
Each process posts both of its sends up front on the per-peer send QPs, but only the first-step send is enabled; the MQ holds the tasks that order and release the rest.

MQ (proc 0 shown; the other ranks are symmetric):
– Recv wait from 1
– Send enable (releases the pending send to proc 2)
– Recv wait from 2

Send QP (proc 0):
– Send to proc 1 – enabled
– Send to proc 2 – not enabled until the MQ’s send-enable task fires

Completion is reported once the MQ drains.
8 Process Barrier Example – Queue view
– no MQ, View at rank 0
With no MQ, the ordering tasks sit on the QPs themselves:

QP 1         QP 2         QP 4
Send QP 1    Wait QP 1    Wait QP 1
             Send QP 2    Wait QP 2
                          Send QP 4
                          Wait QP 4
System Hierarchy
[Diagram: the system hierarchy – cores (occupied or unused) within sockets, sockets within nodes, nodes on the network, forming the full system]
Benchmarks
System setup
8 node cluster
Node Architecture
– 3 GHz Intel Xeon
– Dual socket
– Quad core
Network
– ConnectX-2 HCA
– 36-port QDR switch running pre-release firmware
Barrier Data
8 Node Blocking MPI Barrier
MPI Barrier - Offloaded
MPI Barrier – Comparison with PtP
MPIX_Ibarrier Performance
Nonblocking Barrier – Overlap –
Multiple Work Quanta
Nonblocking Barrier – Overlap –
One Work Quantum
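The overlap results on these slides come from posting a nonblocking barrier and doing host work while it progresses. A minimal sketch of that measurement loop, using the standardized MPI-3 name MPI_Ibarrier (this deck predates MPI-3 and uses the MPIX_ prefix); do_work_quantum is an illustrative stand-in for one unit of computation:

    #include <mpi.h>

    /* Post a nonblocking barrier, then interleave work quanta with
     * completion polling: the pattern behind the overlap plots. */
    static void barrier_with_overlap(MPI_Comm comm,
                                     void (*do_work_quantum)(void))
    {
        MPI_Request req;
        int done = 0;

        MPI_Ibarrier(comm, &req);      /* MPIX_Ibarrier in the deck */
        while (!done) {
            do_work_quantum();         /* overlapped computation    */
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        }
    }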
Barrier Data – Hierarchy
Flat Barrier Algorithm
[Diagram: flat barrier – ranks 1-4 span Host 1 and Host 2, and exchanges in both step 1 and step 2 cross the inter-host link]
Hierarchical Barrier Algorithm
[Diagram: hierarchical barrier – step 1 completes within each host, step 2 is a single inter-host exchange between the hosts’ leaders, and step 3 releases the remaining ranks on each host]
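A minimal host-side sketch of the hierarchical idea in plain MPI: barrier within each node, one exchange among node leaders, then release the node. The communicator split and leader choice are illustrative of the approach, not the ML component’s actual implementation (a real implementation would build the communicators once, not per call):

    #include <mpi.h>

    static void hierarchical_barrier(MPI_Comm comm)
    {
        MPI_Comm node_comm, leader_comm;
        int node_rank;

        /* Group the ranks that share a node (MPI-3) */
        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        MPI_Comm_rank(node_comm, &node_rank);

        /* Node leaders (local rank 0) form their own communicator */
        MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED,
                       0, &leader_comm);

        MPI_Barrier(node_comm);        /* step 1: fan-in on each host */
        if (node_rank == 0) {
            MPI_Barrier(leader_comm);  /* step 2: inter-host exchange */
            MPI_Comm_free(&leader_comm);
        }
        MPI_Barrier(node_comm);        /* step 3: release the host    */

        MPI_Comm_free(&node_comm);
    }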
MPI Barrier timings
Barrier timings – blocking vs.
nonblocking
Nonblocking Barrier Overlap
Broadcast Data
IB – Large Message Algorithm
[Diagram: Process I and Process J each hold a data QP (sends, receives, and a wait task) and a credit QP]
1) Register receive memory
2) Notify sender
3) Wait on credit message
4) Send user data
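The same credit handshake at the host level, sketched with plain MPI point-to-point calls instead of IB verbs; the tags are arbitrary stand-ins for the credit and data QPs:

    #include <mpi.h>

    #define TAG_CREDIT 100  /* stands in for the credit QP */
    #define TAG_DATA   101  /* stands in for the data QP   */

    /* Receiver: make the buffer ready ("register receive memory"),
     * then grant the sender a credit. */
    static void recv_large(void *buf, int count, int from, MPI_Comm comm)
    {
        MPI_Request rreq;
        MPI_Irecv(buf, count, MPI_BYTE, from, TAG_DATA, comm, &rreq); /* 1 */
        MPI_Send(NULL, 0, MPI_BYTE, from, TAG_CREDIT, comm);          /* 2 */
        MPI_Wait(&rreq, MPI_STATUS_IGNORE);
    }

    /* Sender: wait on the credit, then ship the user data. */
    static void send_large(const void *buf, int count, int to, MPI_Comm comm)
    {
        MPI_Recv(NULL, 0, MPI_BYTE, to, TAG_CREDIT, comm,
                 MPI_STATUS_IGNORE);                                  /* 3 */
        MPI_Send(buf, count, MPI_BYTE, to, TAG_DATA, comm);           /* 4 */
    }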
Broadcast Latency – usec per call
Msg size   IBOff + SM   IBOff    P2P + SM   Open MPI (default)   MVAPICH
16 B       3.48         16.11    2.55       5.58                 5.81
1 KB       4.87         23.96    5.66       12.20                10.46
8 MB       25244        40735    28288      37343                41439
Nonblocking Broadcast Latency – usec
per call
Msg size   IBOff + SM   IBOff    P2P + SM
16 B       3.58         19.79    2.57
1 KB       4.96         27.44    5.70
8 MB       26100        37855    28781
Broadcast – small data - hierarchical
Broadcast – large data - hierarchical
Overlap Measurement
Benchmark steps:

Polling method:
1. Post broadcast
2. Do work and poll for completion
3. Continue until broadcast completion

Post-work-wait method:
1. Post broadcast
2. Do work
3. Wait for broadcast completion
4. Compare the time of steps 1-3 with post-wait
5. Increase the work and repeat steps 1-4 until the time for post-work-wait is greater than post-wait
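A sketch of the post-work-wait loop in plain MPI, for broadcast; busy_work is a hypothetical stand-in for a tunable amount of computation, and the all-reduce keeps every rank iterating in lockstep:

    #include <mpi.h>

    /* Grow the work quantum until post + work + wait exceeds the
     * plain post + wait time: the largest fully hidden quantum
     * measures the achievable overlap. */
    static double max_overlapped_work(void *buf, int count, MPI_Comm comm,
                                      void (*busy_work)(double quanta))
    {
        MPI_Request req;

        /* Reference: post-wait with no intervening work */
        double t0 = MPI_Wtime();
        MPI_Ibcast(buf, count, MPI_BYTE, 0, comm, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        double base = MPI_Wtime() - t0;

        for (double work = 1.0; ; work *= 2.0) {
            t0 = MPI_Wtime();
            MPI_Ibcast(buf, count, MPI_BYTE, 0, comm, &req);
            busy_work(work);                  /* overlapped computation */
            MPI_Wait(&req, MPI_STATUS_IGNORE);

            /* Every rank must take the same branch */
            int exceeded = (MPI_Wtime() - t0 > base), stop;
            MPI_Allreduce(&exceeded, &stop, 1, MPI_INT, MPI_LOR, comm);
            if (stop)
                return work / 2.0;  /* last quantum that was hidden */
        }
    }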
Nonblocking Broadcast – Overlap - Poll
Nonblocking Broadcast – Overlap - Wait
All-To-All Data
All-To-All: 1 Byte
All-To-All: 64 Bytes
All-To-All: 128 Bytes
All-To-All: 4 MB/process
Allgather Data
All-Gather: 1 Byte
All-Gather: 128 Bytes
All-Gather: 131072 Bytes
Summary
Added hardware support for offloading broadcast operations
Developed MPI-level support for one-copy asynchronous transfer of large contiguous data
Good collective performance
Good overlap capabilities