© 2012 MELLANOX TECHNOLOGIES 1
The Exascale Interconnect Technology
Rich Graham – Sr. Solutions Architect
Leading Server and Storage Interconnect Provider

Comprehensive end-to-end 10/40/56Gb/s Ethernet and 56Gb/s InfiniBand portfolio: ICs, adapter cards, switches/gateways, cables, and software

Scalability, Reliability, Power, Performance
HCA Roadmap of Interconnect Innovations
InfiniHost
World’s first InfiniBand HCA
10Gb/s InfiniBand, PCI-X host interface, 1 million msg/sec
InfiniHost III
World’s first PCIe InfiniBand HCA
20Gb/s InfiniBand, PCIe 1.0, 2 million msg/sec
ConnectX (1,2,3)
World’s first Virtual Protocol Interconnect (VPI) Adapter
40Gb/s & 56Gb/s, PCIe 2.0/3.0 x8, 33 million msg/sec
Connect-IB
The Exascale Foundation
Timeline: 2002 (InfiniHost) · 2005 (InfiniHost III) · 2008–11 (ConnectX) · June 2012 (Connect-IB)
A new interconnect architecture for compute-intensive applications
World’s fastest server and storage interconnect solution providing 100Gb/s injection bandwidth
Enables unlimited clustering scalability with new Dynamically Connected Transport service
Accelerates compute-intensive and parallel-intensive applications with over 130 million msg/sec
Optimized for multi-tenant environments of 100s of Virtual Machines per server
Announcing Connect-IB: The Exascale Foundation
New innovative transport – Dynamically Connected Transport service
• The new transport service combines the best of:
  - Reliable Connected (RC) service – transport reliability
  - Unreliable Datagram (UD) – no resource reservation
• Scales out to unlimited cluster sizes of compute and storage
• Eliminates overhead and reduces memory footprint

CORE-Direct collective hardware offloads
• Provides ‘state’ to work-queue mechanisms for collective offloading in the HCA
• Frees the CPU to do meaningful computation in parallel with collective operations

Derived data types
• Hardware support for non-contiguous ‘strided’ memory access
• Scatter/gather optimizations
Connect-IB Advanced HPC Features
New Transport Mechanism for Unlimited Scalability
Dynamically Connected Transport Service
Transport scalability
• RC requires a connection per remote peer – strains resource requirements at large scale (O(N))
• XRC requires a connection per remote node – still strains resource requirements at large scale (O(N))

Transport performance
• UD supports only send/receive semantics – no RDMA or atomic operations

Problems the New Capability Addresses
Dynamically Connected (DC) hardware entities
• DC Initiator (DCI) – data source
• DC Target (DCT) – data destination

Key concepts
• Reliable communication – supports RDMA and atomics
• A single initiator can send to multiple destinations
• Resource footprint scales with:
  - the application’s communication patterns
  - single-node communication characteristics

Dynamically Connected Transport Service Basics
Communication Time Line – Common Case
CORE-Direct Enhanced Support
Collective communication scalability
• For many HPC applications, the scalability of collective communication determines application scalability

System noise
• Uncoordinated system activity causes a slowdown in one process to be magnified at other processes
• The effect grows as the system size increases

Collective communication performance

Problems the New Capability Addresses
Scalability of Collective Operations
[Figure: timeline of an ideal collective algorithm vs. the impact of system noise, where a delay in one process stalls the others]
Scalability of Collective Operations
[Figure: offloaded and nonblocking collective algorithms – communication processing overlaps with computation]
• A managed QP progresses via a separate counter (instead of by doorbell)
• A ‘wait’ work-queue entry waits until a specified completion queue (CQ) reaches a specified producer-index value
• ‘Enable’ tasks manage which QPs are executed by the hardware
• Receive CQs can be set to remain active if they overflow – wait events monitor progress
• Lists of tasks can be submitted to multiple QPs – sufficient to describe collective operations
• A special completion queue can be set up to monitor list completion – request a CQE from the relevant task

Key Hardware Features
Collective communication optimizations
• A communication pattern involving multiple processes
• Optimized collectives involve a communicator-wide, data-dependent communication pattern
• Data needs to be manipulated at intermediate stages of a collective operation
• Collective operations limit application scalability – for example, through system noise

CORE-Direct – key ideas
• Create a local description of the communication pattern
• Pass the description to the HCA
• Manage the collective operation on the network, freeing the CPU to do meaningful computation
• Poll for collective completion

Collective Communication Methodology
Barrier Collective
Alltoall Collective (128 Bytes)
Nonblocking Allgather (Overlap Post-Work-Wait)
Nonblocking Alltoall (Overlap-Wait)
Non-Contiguous Data Type Support
Transfer of non-contiguous data
• Often triggers data packing in main memory, adding to the communication overhead
• Increases CPU involvement in communication pre- and post-processing

Problems the New Capability Addresses
Combining Contiguous Memory Regions
Supports non-contiguous strided memory access, scatter/gather
[Figure: a 3-D (x, y, z) strided region accessed in place, without packing]
Non-Contiguous Memory Access – Regular Access
THANK YOU