The Role of InfiniBand Technologies in
High Performance Computing
Contributors
Gil Bloch
Noam Bloch
Hillel Chapman
Manjunath Gorentla-Venkata
Richard Graham
Michael Kagan
Josh Ladd
Vasily Philipov
Steve Poole
Ishai Rabinovich
Ariel Shahar
Gilad Shainer
Pavel Shamis
Outline
Spider file system
CORE-Direct
– InfiniBand overview
– New InfiniBand capabilities
– Software design for collective operations
– Results
Spider File System at the Oak Ridge
Leadership Computing Facility
Motivation for Spider File System
Building dedicated file systems for each platform does not scale operationally
– Storage often 10% or more of new system cost
– Bundled storage often not poised to grow independently of attached machine
– Different curves for storage and compute technology
– Data needs to be moved between different compute islands
For example: from a simulation platform to a visualization platform
– Dedicated storage is only accessible when its machine is available
– Managing multiple file systems requires more manpower
[Diagram: data-sharing path – Jaguar XT5, Jaguar XT4, Ewok, Lens, and Smoky all reach the Spider system over the SION network]
Spider: A System At Scale
Over 10.7 PB of RAID 6 Capacity
13,440 1TB drives
192 storage servers
Over 3 TB of memory (Lustre OSS)
Available to many compute systems through high-speed network:
– Over 3,000 IB ports
– Over 5 kilometers of cables
Over 26,000 client mounts for I/O
Demonstrated I/O performance: 240 GB/s
Current Status
– in production use on all major OLCF computing platforms
Spider: Couplet and Scalable Cluster
Each Scalable Cluster (SC):
– 280 1 TB disks in 5 disk trays
– DDN couplet (2 controllers)
– OSS (4 Dell nodes), 24 IB ports
– Flextronics switch, uplinked to the Cisco core switch
16 SC units on the floor, 2 racks for each SC
Snapshot of Technical Challenges
Solved
Performance
– Asynchronous journaling
– Network congestion avoidance (topology aware I/O)
Scalability
– 26,000 clients
– 7 OSTs per OSS
– Lessons from server-side client statistics
Fault Tolerance and Reliability
– Network, I/O server, Storage Array
[Plot: congestion on the SeaStar torus network as a function of the number of clients]
Spider - How Did We Get Here?
A 4-year project
We didn’t just pick up the phone and order a center-wide file system
– No single vendor could deliver this system
– Trail blazing was required
Collaborative effort was key to success
– ORNL
– Cray
– DDN
– Cisco
– CFS, SUN, Oracle, and now Whamcloud
CORE-Direct Technology
Problems Being Addressed – Collective
Operations
Collective communication characteristics at scale
– Overlapping computation with communication – true asynchronous communications
– System noise
– Performance
– Scalability
Goal: avoid using the CPU for communication processing
Offload communication management to the network
Collective Communications
Communication pattern involving multiple processes (in MPI, all ranks in the communicator are involved)
Optimized collectives involve a communicator-wide data-dependent communication pattern
Data needs to be manipulated at intermediate stages of a collective operation
Collective operations limit application scalability
Collective operations magnify the effects of system noise
Scalability of Collective Operations
[Diagram: two panels – an ideal collective algorithm vs. the impact of system noise on a four-process exchange]
Scalability of Collective Operations - II
[Diagram: two panels – an offloaded algorithm vs. a nonblocking algorithm; shading marks communication processing]
Approach to solving the problem
Co-design
– Network stack design (Mellanox)
– Hardware development (Mellanox)
– Application-level requirements (ORNL)
– MPI/SHMEM-level implementation (joint)
InfiniBand Collective Offload – Key idea
Create local description of the communication patterns
Hand the description to the HCA
Manage collective communications at the network level
Poll for collective completion
Add new support for:
– Synchronization primitives (hardware): Send Enable, Receive Enable, and Wait tasks
– Multiple Work Request (MWR): a sequence of network tasks
– Management Queue (MQ)
InfiniBand Hardware Changes
Tasks defined in the current standard
• Send
• Receive
• Read
• Write
• Atomic
New support
Synchronization primitives (hardware)
– Send Enable task
– Receive Enable task
– Wait task
Multiple Work Request (MWR)
– A sequence of network tasks
Management Queue
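To make the new task types concrete, below is a rough sketch in C of how an offloaded sequence might be encoded. The enum, struct, and task-list layout are hypothetical illustrations of the idea, not the actual Mellanox verbs interface.

    /* Hypothetical encoding of a CORE-Direct-style task list.
     * All names here are illustrative; the real interface is the
     * vendor-specific verbs API. */

    enum task_type {
        TASK_SEND,        /* post a send on a peer QP                */
        TASK_RECV_WAIT,   /* wait for a completion on a receive CQ   */
        TASK_SEND_ENABLE  /* release a previously posted, gated send */
    };

    struct task {
        enum task_type type;
        int            peer;  /* rank whose QP/CQ the task targets   */
    };

    /* A Multiple Work Request (MWR): the whole collective described
     * once and handed to the HCA, which progresses it without CPU
     * involvement until the final completion is raised. */
    static const struct task mwr_rank0_barrier[] = {
        { TASK_SEND,        1 },  /* step 1: exchange with rank 1    */
        { TASK_RECV_WAIT,   1 },
        { TASK_SEND_ENABLE, 2 },  /* step 2 send fires only after    */
        { TASK_RECV_WAIT,   2 },  /* the step 1 wait clears          */
    };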
Standard InfiniBand Connected Queue
Design
Queue Structure
[Diagram: per-peer resources – a send/recv QP pair per peer, each with a receive CQ; per-communicator resources – a collective Management Queue (MQ) with its service MQ and MQ CQ, a single send CQ shared by all send queues, and QPs for small data, large data, credit exchange, and resource recycling]
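As a rough illustration of the split pictured above, the per-peer and per-communicator resources might be grouped as below; all names are hypothetical, chosen only to mirror the diagram.

    /* Hypothetical grouping of the offload resources in the diagram. */
    struct qp;  /* queue pair       */
    struct cq;  /* completion queue */
    struct mq;  /* management queue */

    struct peer_resources {        /* one per peer in the communicator */
        struct qp *send_qp;
        struct qp *recv_qp;
        struct cq *recv_cq;        /* completions for receives         */
    };

    struct comm_resources {        /* one per MPI communicator          */
        struct mq *collective_mq;  /* orders and gates collective tasks */
        struct cq *mq_cq;          /* completions for MQ service tasks  */
        struct cq *send_cq;        /* shared by all send queues         */
        struct qp *small_data_qp;
        struct qp *large_data_qp;
        struct qp *credit_qp;      /* credit exchange, resource recycling */
        struct peer_resources *peers;
        int npeers;
    };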
Collectives – Software Layers
[Diagram: Open MPI (OMPI) Modular Component Architecture – the collective framework holds the Tuned (pt2pt) collectives component and the ML hierarchical collectives component; beneath them, basic collectives and subgroup frameworks provide Pt2Pt, SM (shared memory), Socket, and IBNET/IB OFFLOAD paths, the last two built on Mellanox OFED (MLNX OFED)]
Example – 4 Process Recursive Doubling
[Diagram: recursive doubling among 4 processes – in step 1 each process exchanges with its neighbor at distance 1, in step 2 with its neighbor at distance 2]
4 Process Barrier Example
Algorithm (recursive doubling):
– Step 1: proc 0 exchanges with proc 1; proc 2 exchanges with proc 3
– Step 2: proc 0 exchanges with proc 2; proc 1 exchanges with proc 3

MWR, per process (proc 0 shown; the other ranks are symmetric):
– Send to proc 1
– Wait on recv from proc 1
– Send to proc 2
– Wait on recv from proc 2
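For reference, the same exchange pattern on the host side, as a minimal sketch in plain MPI (assuming, as in the example, that the communicator size is a power of two):

    #include <mpi.h>

    /* Recursive-doubling barrier: at distance d, each rank exchanges a
     * zero-byte message with rank^d, doubling d each step. */
    static void rd_barrier(MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        for (int dist = 1; dist < size; dist <<= 1) {
            int peer = rank ^ dist;  /* step 1: distance 1; step 2: distance 2 */
            /* "Send to peer" and "wait on recv from peer" in one call */
            MPI_Sendrecv(NULL, 0, MPI_BYTE, peer, 0,
                         NULL, 0, MPI_BYTE, peer, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }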
4 Process Barrier Example – Queue view
Each process posts both of its sends up front on the per-peer send QPs, but only the first-step send is enabled; the MQ holds the tasks that order and release the rest.

MQ (proc 0 shown; the other ranks are symmetric):
– Recv wait from 1
– Send enable (releases the pending send to proc 2)
– Recv wait from 2

Send QP (proc 0):
– Send to proc 1 – enabled
– Send to proc 2 – not enabled until the MQ’s send-enable task fires

Completion is reported once the MQ drains.
8 Process Barrier Example – Queue view
– no MQ, View at rank 0
With no MQ, the ordering tasks sit on the QPs themselves:

QP 1         QP 2         QP 4
Send QP 1    Wait QP 1    Wait QP 1
             Send QP 2    Wait QP 2
                          Send QP 4
                          Wait QP 4
System Hierarchy
[Diagram: the system hierarchy – cores (occupied or unused) within sockets, sockets within nodes, nodes on the network, forming the full system]
Benchmarks
System setup
8 node cluster
Node Architecture
– 3 GHz Intel Xeon
– Dual socket
– Quad core
Network
– ConnectX-2 HCA
– 36-port QDR switch running pre-release firmware
Barrier Data
8 Node Blocking MPI Barrier
MPI Barrier - Offloaded
MPI Barrier – Comparison with PtP
MPIX_Ibarrier Performance
Nonblocking Barrier – Overlap –
Multiple Work Quanta
Nonblocking Barrier – Overlap –
One Work Quantum
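The overlap results on these slides come from posting a nonblocking barrier and doing host work while it progresses. A minimal sketch of that measurement loop, using the standardized MPI-3 name MPI_Ibarrier (this deck predates MPI-3 and uses the MPIX_ prefix); do_work_quantum is an illustrative stand-in for one unit of computation:

    #include <mpi.h>

    /* Post a nonblocking barrier, then interleave work quanta with
     * completion polling: the pattern behind the overlap plots. */
    static void barrier_with_overlap(MPI_Comm comm,
                                     void (*do_work_quantum)(void))
    {
        MPI_Request req;
        int done = 0;

        MPI_Ibarrier(comm, &req);      /* MPIX_Ibarrier in the deck */
        while (!done) {
            do_work_quantum();         /* overlapped computation    */
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        }
    }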
Barrier Data – Hierarchy
Flat Barrier Algorithm
[Diagram: flat barrier – ranks 1-4 span Host 1 and Host 2, and exchanges in both step 1 and step 2 cross the inter-host link]
Hierarchical Barrier Algorithm
[Diagram: hierarchical barrier – step 1 completes within each host, step 2 is a single inter-host exchange between the hosts’ leaders, and step 3 releases the remaining ranks on each host]
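A minimal host-side sketch of the hierarchical idea in plain MPI: barrier within each node, one exchange among node leaders, then release the node. The communicator split and leader choice are illustrative of the approach, not the ML component’s actual implementation (a real implementation would build the communicators once, not per call):

    #include <mpi.h>

    static void hierarchical_barrier(MPI_Comm comm)
    {
        MPI_Comm node_comm, leader_comm;
        int node_rank;

        /* Group the ranks that share a node (MPI-3) */
        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        MPI_Comm_rank(node_comm, &node_rank);

        /* Node leaders (local rank 0) form their own communicator */
        MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED,
                       0, &leader_comm);

        MPI_Barrier(node_comm);        /* step 1: fan-in on each host */
        if (node_rank == 0) {
            MPI_Barrier(leader_comm);  /* step 2: inter-host exchange */
            MPI_Comm_free(&leader_comm);
        }
        MPI_Barrier(node_comm);        /* step 3: release the host    */

        MPI_Comm_free(&node_comm);
    }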
MPI Barrier timings
Barrier timings – blocking vs.
nonblocking
Nonblocking Barrier Overlap
Broadcast Data
IB – Large Message Algorithm
[Diagram: Process I and Process J each hold a data QP (sends, receives, and a wait task) and a credit QP]
1) Register receive memory
2) Notify sender
3) Wait on credit message
4) Send user data
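The same credit handshake at the host level, sketched with plain MPI point-to-point calls instead of IB verbs; the tags are arbitrary stand-ins for the credit and data QPs:

    #include <mpi.h>

    #define TAG_CREDIT 100  /* stands in for the credit QP */
    #define TAG_DATA   101  /* stands in for the data QP   */

    /* Receiver: make the buffer ready ("register receive memory"),
     * then grant the sender a credit. */
    static void recv_large(void *buf, int count, int from, MPI_Comm comm)
    {
        MPI_Request rreq;
        MPI_Irecv(buf, count, MPI_BYTE, from, TAG_DATA, comm, &rreq); /* 1 */
        MPI_Send(NULL, 0, MPI_BYTE, from, TAG_CREDIT, comm);          /* 2 */
        MPI_Wait(&rreq, MPI_STATUS_IGNORE);
    }

    /* Sender: wait on the credit, then ship the user data. */
    static void send_large(const void *buf, int count, int to, MPI_Comm comm)
    {
        MPI_Recv(NULL, 0, MPI_BYTE, to, TAG_CREDIT, comm,
                 MPI_STATUS_IGNORE);                                  /* 3 */
        MPI_Send(buf, count, MPI_BYTE, to, TAG_DATA, comm);           /* 4 */
    }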
Broadcast Latency – usec per call
Msg size   IBOff + SM   IBOff    P2P + SM   Open MPI (default)   MVAPICH
16 B       3.48         16.11    2.55       5.58                 5.81
1 KB       4.87         23.96    5.66       12.20                10.46
8 MB       25244        40735    28288      37343                41439
Nonblocking Broadcast Latency – usec
per call
Msg size   IBOff + SM   IBOff    P2P + SM
16 B       3.58         19.79    2.57
1 KB       4.96         27.44    5.70
8 MB       26100        37855    28781
Broadcast – small data - hierarchical
Broadcast – large data - hierarchical
Overlap Measurement
Benchmark steps:

Polling method:
1. Post broadcast
2. Do work and poll for completion
3. Continue until broadcast completion

Post-work-wait method:
1. Post broadcast
2. Do work
3. Wait for broadcast completion
4. Compare the time of steps 1-3 with post-wait
5. Increase the work and repeat steps 1-4 until the time for post-work-wait is greater than post-wait
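A sketch of the post-work-wait loop in plain MPI, for broadcast; busy_work is a hypothetical stand-in for a tunable amount of computation, and the all-reduce keeps every rank iterating in lockstep:

    #include <mpi.h>

    /* Grow the work quantum until post + work + wait exceeds the
     * plain post + wait time: the largest fully hidden quantum
     * measures the achievable overlap. */
    static double max_overlapped_work(void *buf, int count, MPI_Comm comm,
                                      void (*busy_work)(double quanta))
    {
        MPI_Request req;

        /* Reference: post-wait with no intervening work */
        double t0 = MPI_Wtime();
        MPI_Ibcast(buf, count, MPI_BYTE, 0, comm, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        double base = MPI_Wtime() - t0;

        for (double work = 1.0; ; work *= 2.0) {
            t0 = MPI_Wtime();
            MPI_Ibcast(buf, count, MPI_BYTE, 0, comm, &req);
            busy_work(work);                  /* overlapped computation */
            MPI_Wait(&req, MPI_STATUS_IGNORE);

            /* Every rank must take the same branch */
            int exceeded = (MPI_Wtime() - t0 > base), stop;
            MPI_Allreduce(&exceeded, &stop, 1, MPI_INT, MPI_LOR, comm);
            if (stop)
                return work / 2.0;  /* last quantum that was hidden */
        }
    }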
Nonblocking Broadcast – Overlap - Poll
Nonblocking Broadcast – Overlap - Wait
All-To-All Data
All-To-All: 1 Byte
All-To-All: 64 Bytes
All-To-All: 128 Bytes
All-To-All: 4 MB/process
Allgather Data
All-Gather: 1 Byte
All-Gather: 128 Bytes
All-Gather: 131072 Bytes
Summary
Added hardware support for offloading broadcast operations
Developed MPI-level support for one-copy asynchronous transfer of large contiguous data
Good collective performance
Good overlap capabilities