
The Role of InfiniBand Technologies in High Performance Computing


Page 1: The Role of InfiniBand Technologies in High Performance Computing


The Role of InfiniBand Technologies in High Performance Computing

Page 2: The Role of InfiniBand Technologies in High Performance Computing


Contributors

Gil Bloch

Noam Bloch

Hillel Chapman

Manjunath Gorentla-Venkata

Richard Graham

Michael Kagan

Josh Ladd

Vasily Philipov

Steve Poole

Ishai Rabinovich

Ariel Shahar

Gilad Shainer

Pavel Shamis

Page 3: The Role of InfiniBand Technologies in High Performance Computing


Outline

Spider file system

CORE-Direct

– InfiniBand overview

– New InfiniBand capabilities

– Software design for collective operations

– Results

Page 4: The Role of InfiniBand Technologies in High Performance Computing


Spider File System at the Oak Ridge Leadership Computing Facility

Page 5: The Role of InfiniBand Technologies in High Performance Computing


Motivation for Spider File System

Building a dedicated file system for each platform does not scale operationally

– Storage often 10% or more of new system cost

– Bundled storage often cannot grow independently of the attached machine

– Storage and compute technology follow different technology curves

– Data needs to be moved between different compute islands

For example: Simulation platform to visualization platform

– Dedicated storage is only accessible when its machine is available

– Managing multiple file systems requires more manpower

[Diagram: Jaguar XT5, Jaguar XT4, Ewok, Lens, and Smoky sharing data through the SION network and the Spider system (data sharing path)]

Page 6: The Role of InfiniBand Technologies in High Performance Computing


Spider: A System At Scale

Over 10.7 PB of RAID 6 Capacity

13,440 1TB drives

192 storage servers

Over 3 TB of memory (Lustre OSS)

Available to many compute systems through high-speed network:

– Over 3,000 IB ports

– Over 5 kilometers of cable

Over 26,000 client mounts for I/O

Demonstrated I/O performance: 240 GB/s

Current Status

– in production use on all major OLCF computing platforms

Page 7: The Role of InfiniBand Technologies in High Performance Computing


Spider: Couplet and Scalable Cluster

[Diagram: one Scalable Cluster (SC) building block – 280 1TB disks in 5 disk trays behind a DDN couplet (2 controllers), served by 4 Dell OSS nodes with 24 IB ports on a Flextronics switch, uplinked to the Cisco core switch. 16 SC units on the floor, 2 racks per SC]

Page 8: The Role of InfiniBand Technologies in High Performance Computing


Snapshot of Technical Challenges Solved

Performance

– Asynchronous journaling

– Network congestion avoidance (topology aware I/O)

Scalability

– 26,000 clients

– 7 OSTs per OSS

– Lessons from server-side client statistics

Fault Tolerance and Reliability

– Network, I/O server, Storage Array

[Plot: congestion on the SeaStar torus]

Page 9: The Role of InfiniBand Technologies in High Performance Computing


Spider - How Did We Get Here?

A four-year project

We didn't just pick up the phone and order a center-wide file system

– No single vendor could deliver this system

– Trail blazing was required

Collaborative effort was key to success

– ORNL

– Cray

– DDN

– Cisco

– CFS, SUN, Oracle, and now Whamcloud

Page 10: The Role of InfiniBand Technologies in High Performance Computing


CORE-Direct Technology

Page 11: The Role of InfiniBand Technologies in High Performance Computing


Problems Being Addressed – Collective Operations

Collective communication characteristics at scale

– Overlapping computation with communication – true asynchronous communications

– System noise

– Performance

– Scalability

Goal: Avoid using the CPU for communication processing

Offload communication management to the network (see the sketch below)
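
For reference, the usage pattern this offload targets, as a minimal sketch: post a nonblocking barrier, keep computing, and complete it later. The sketch uses the MPI-3 name MPI_Ibarrier; the prototype measured later in this deck exposed the same call as MPIX_Ibarrier, and do_local_work() is a hypothetical stand-in for application computation.

#include <mpi.h>

/* Hypothetical stand-in for application computation. */
void do_local_work(void);

void overlapped_barrier(void)
{
    MPI_Request req;

    /* Post the barrier; the call returns immediately and, with offload,
       the HCA progresses the collective without involving the CPU. */
    MPI_Ibarrier(MPI_COMM_WORLD, &req);

    /* Computation proceeds while the collective is in flight. */
    do_local_work();

    /* Complete the barrier. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}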

Page 12: The Role of InfiniBand Technologies in High Performance Computing


Collective Communications

Communication pattern involving multiple processes (in MPI, all ranks in the communicator are involved)

Optimized collectives involve a communicator-wide data-dependent communication pattern

Data needs to be manipulated at intermediate stages of a collective operation

Collective operations limit application scalability

Collective operations magnify the effects of system-noise

Page 13: The Role of InfiniBand Technologies in High Performance Computing


Scalability of Collective Operations

[Diagram: an ideal collective algorithm across 4 processes vs. the same collective under the impact of system noise]

Page 14: The Role of InfiniBand Technologies in High Performance Computing


Scalability of Collective Operations - II

[Diagram: offloaded algorithm vs. nonblocking algorithm, with the communication-processing portions marked]

Page 15: The Role of InfiniBand Technologies in High Performance Computing


Approach to solving the problem

Co-design

– Network stack design (Mellanox)

– Hardware development (Mellanox)

– Application-level requirements (ORNL)

– MPI/SHMEM-level implementation (joint)

Page 16: The Role of InfiniBand Technologies in High Performance Computing


InfiniBand Collective Offload – Key idea

Create a local description of the communication pattern

Hand the description to the HCA

Manage collective communications at the network level

Poll for collective completion

Add new support for:

– Synchronization primitives (hardware): Send Enable task, Receive Enable task, Wait task

– Multiple Work Request: a sequence of network tasks

– Management Queue

Page 17: The Role of InfiniBand Technologies in High Performance Computing


InfiniBand Hardware Changes

Tasks defined in the current standard

• Send

• Receive

• Read

• Write

• Atomic

New support

Synchronization primitives (hardware)

– Send Enable task

– Receive Enable task

– Wait task

Multiple Work Request

– A sequence of network tasks

Management Queue
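
To make the flow concrete, here is a minimal sketch, in deliberately hypothetical C, of how these primitives compose: the host pre-posts its sends, builds a task list describing its part of the collective, hands the list to the HCA through the management queue, and then only polls for the final completion. The task/mq types and the mq_post/mq_poll calls below are illustrative stand-ins, not the actual verbs extension API.

/* Illustrative task descriptors mirroring the primitives listed above
   (hypothetical types, not the real interface). */
enum task_type {
    TASK_SEND,          /* a pre-posted, already enabled send       */
    TASK_SEND_ENABLE,   /* enable a send that was posted disabled   */
    TASK_WAIT           /* wait for a receive completion            */
};

struct task {
    enum task_type type;
    int            peer;    /* rank whose QP the task refers to */
};

struct mq;                                                 /* opaque management queue (hypothetical) */
void mq_post(struct mq *q, const struct task *t, int n);   /* hypothetical: post the whole list       */
int  mq_poll(struct mq *q);                                /* hypothetical: non-zero when list is done */

/* Rank 0's piece of a 4-process recursive-doubling barrier:
   exchange with rank 1 first, and only then with rank 2. */
void offloaded_barrier_rank0(struct mq *q)
{
    struct task tasks[] = {
        { TASK_SEND,        1 },   /* send to rank 1 (enabled)              */
        { TASK_WAIT,        1 },   /* wait for the message from rank 1      */
        { TASK_SEND_ENABLE, 2 },   /* only now release the send to rank 2   */
        { TASK_WAIT,        2 },   /* wait for the message from rank 2      */
    };

    mq_post(q, tasks, 4);          /* hand the whole description to the HCA */
    while (!mq_poll(q)) {
        /* the CPU is free: overlapped computation goes here */
    }
}

Once posted, the HCA walks the list itself, so progress does not depend on the host re-entering the MPI library.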

Page 18: The Role of InfiniBand Technologies in High Performance Computing


Standard InfiniBand Connected Queue Design

Page 19: The Role of InfiniBand Technologies in High Performance Computing


Queue Structure

[Diagram: per-peer resources – send/recv QPs with receive CQs for small data, large data, a credit QP, and resource recycling; per-communicator resources – a collective MQ, a service MQ with its MQ CQ, and a single send CQ shared by all send queues]

Page 20: The Role of InfiniBand Technologies in High Performance Computing


Collectives – Software Layers

[Diagram: Open MPI (OMPI) Modular Component Architecture – the collective framework holds the tuned (pt2pt) collectives component and the ML hierarchical collectives component; the basic collectives framework and subgroup framework sit over point-to-point, shared-memory, socket, IB, and IB-offload modules, all layered on MLNX OFED]

Page 21: The Role of InfiniBand Technologies in High Performance Computing


Example – 4 Process Recursive Doubling

[Diagram: recursive doubling among 4 processes – step 1 pairs processes 1↔2 and 3↔4; step 2 pairs 1↔3 and 2↔4]
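
The pairing rule behind the diagram: in step s each rank exchanges with the rank whose s-th address bit differs. A minimal MPI sketch of that loop, assuming the number of processes is a power of two; for a reducing collective the partner's data would be combined locally after each exchange.

#include <mpi.h>

/* Barrier-style recursive doubling: log2(P) exchange steps,
   assuming the number of processes P is a power of two. */
void recursive_doubling_barrier(MPI_Comm comm)
{
    int rank, nprocs;
    char token = 0, incoming;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    for (int step = 1; step < nprocs; step <<= 1) {
        int partner = rank ^ step;   /* flip one address bit per step */
        MPI_Sendrecv(&token, 1, MPI_BYTE, partner, 0,
                     &incoming, 1, MPI_BYTE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        /* for an allreduce, the partner's contribution would be
           combined into the local result here */
    }
}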

Page 22: The Role of InfiniBand Technologies in High Performance Computing


4 Process Barrier Example

[Diagram: the 4-process barrier, shown as an algorithm and as a Multiple Work Request (MWR). Algorithm view – step 1: proc 0 exchanges with proc 1 and proc 2 with proc 3; step 2: proc 0 exchanges with proc 2 and proc 1 with proc 3. MWR view, e.g. at proc 0: send to proc 1, wait on recv from 1, send to proc 2, wait on recv from 2 (symmetric at the other ranks)]

Page 23: The Role of InfiniBand Technologies in High Performance Computing


4 Process Barrier Example – Queue view

[Diagram: queue view of the 4-process barrier. Management queue (MQ) at each process: a recv-wait on the first-step peer, a send-enable that releases the second-step send, and a recv-wait on the second-step peer. Send QP at each process: the first-step send pre-posted as enabled, the second-step send pre-posted as not enabled, then completion. E.g. at proc 0: recv-wait from 1, send-enable, recv-wait from 2; send to proc 1 (enabled), send to proc 2 (not enabled)]

Page 24: The Role of InfiniBand Technologies in High Performance Computing


8 Process Barrier Example – Queue view – no MQ, view at rank 0

[Diagram: with no MQ, the wait tasks sit in the peer QPs themselves – QP 1: send; QP 2: wait on QP 1, then send; QP 4: wait on QP 1, wait on QP 2, then send and a final wait on QP 4]

Page 25: The Role of InfiniBand Technologies in High Performance Computing


System Hierarchy

[Diagram: system hierarchy – system, network, node, socket; unused and occupied cores are marked]

Page 26: The Role of InfiniBand Technologies in High Performance Computing


Benchmarks

Page 27: The Role of InfiniBand Technologies in High Performance Computing


System setup

8 node cluster

Node Architecture

– 3 GHz Intel Xeon

– Dual socket

– Quad core

Network

– ConnectX-2 HCA

– 36 port QDR switch running pre-release firmware

Page 28: The Role of InfiniBand Technologies in High Performance Computing


Barrier Data

Page 29: The Role of InfiniBand Technologies in High Performance Computing


8 Node Blocking MPI Barrier

Page 30: The Role of InfiniBand Technologies in High Performance Computing


MPI Barrier - Offloaded

Page 31: The Role of InfiniBand Technologies in High Performance Computing


MPI Barrier – Comparison with PtP

Page 32: The Role of InfiniBand Technologies in High Performance Computing


MPIX_Ibarrier Performance

Page 33: The Role of InfiniBand Technologies in High Performance Computing


Nonblocking Barrier – Overlap – Multiple Work Quanta

Page 34: The Role of InfiniBand Technologies in High Performance Computing


Nonblocking Barrier – Overlap – 1 Work Quantum

Page 35: The Role of InfiniBand Technologies in High Performance Computing


Barrier Data – Hierarchy

Page 36: The Role of InfiniBand Technologies in High Performance Computing


Flat Barrier Algorithm

[Diagram: flat barrier algorithm across two hosts – all processes on both hosts take part in the inter-host communication in steps 1 and 2]

Page 37: The Role of InfiniBand Technologies in High Performance Computing


Hierarchical Barrier Algorithm

[Diagram: hierarchical barrier algorithm across two hosts – step 1: intra-host fan-in to a leader rank; step 2: inter-host communication among leaders only; step 3: intra-host release]
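
A minimal sketch of the three steps above, written here with plain MPI sub-communicators rather than the shared-memory and offload components of the actual implementation. node_comm is assumed to group the ranks of one host and leader_comm to connect the per-host leader ranks (MPI_COMM_NULL elsewhere); both could be built with MPI_Comm_split.

#include <mpi.h>

/* Hierarchical barrier: intra-host fan-in, inter-host exchange among
   leaders only, intra-host release. node_comm groups the ranks on one
   host; leader_comm contains rank 0 of every node_comm. */
void hierarchical_barrier(MPI_Comm node_comm, MPI_Comm leader_comm)
{
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    MPI_Barrier(node_comm);          /* step 1: everyone reports to the local leader */

    if (node_rank == 0)
        MPI_Barrier(leader_comm);    /* step 2: leaders exchange across hosts        */

    MPI_Barrier(node_comm);          /* step 3: leaders release their local ranks    */
}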

Page 38: The Role of InfiniBand Technologies in High Performance Computing


MPI Barrier timings

Page 39: The Role of InfiniBand Technologies in High Performance Computing


Barrier timings – blocking vs. nonblocking

Page 40: The Role of InfiniBand Technologies in High Performance Computing


Nonblocking Barrier Overlap

Page 41: The Role of InfiniBand Technologies in High Performance Computing


Broadcast Data

Page 42: The Role of InfiniBand Technologies in High Performance Computing


IB – Large Message Algorithm

[Diagram: large-message exchange between Process I and Process J – each side has a data QP with send/wait/recv entries and a separate credit QP]

1) Register receive memory

2) Notify sender

3) Wait on credit message

4) Send user data
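
A hedged sketch of those four steps on the two sides of the transfer. ibv_reg_mr is the standard verbs registration call; post_recv, send_credit, wait_for_credit, and post_send are hypothetical wrappers around the usual post and poll verbs, not the actual implementation.

#include <infiniband/verbs.h>

/* Hypothetical helpers wrapping ibv_post_recv/ibv_post_send and CQ polling. */
void post_recv(struct ibv_qp *qp, void *buf, size_t len, struct ibv_mr *mr);
void send_credit(struct ibv_qp *credit_qp);
void wait_for_credit(struct ibv_qp *credit_qp);
void post_send(struct ibv_qp *qp, void *buf, size_t len);

/* Receiver side: the payload lands directly in the user buffer. */
void recv_large(struct ibv_pd *pd, struct ibv_qp *data_qp,
                struct ibv_qp *credit_qp, void *user_buf, size_t len)
{
    /* 1) Register receive memory */
    struct ibv_mr *mr = ibv_reg_mr(pd, user_buf, len, IBV_ACCESS_LOCAL_WRITE);

    post_recv(data_qp, user_buf, len, mr);   /* pre-post the receive      */
    send_credit(credit_qp);                  /* 2) notify the sender      */
    /* ... later: poll the CQ for the data completion, then deregister mr */
}

/* Sender side */
void send_large(struct ibv_qp *data_qp, struct ibv_qp *credit_qp,
                void *user_buf, size_t len)
{
    wait_for_credit(credit_qp);              /* 3) wait on the credit message */
    post_send(data_qp, user_buf, len);       /* 4) send the user data         */
}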

Page 43: The Role of InfiniBand Technologies in High Performance Computing


Broadcast Latency – usec per call

Msg size   IBOff + SM   IBOff    P2P + SM   Open MPI (default)   MVAPICH
16B        3.48         16.11    2.55       5.58                 5.81
1KB        4.87         23.96    5.66       12.20                10.46
8MB        25244        40735    28288      37343                41439

Page 44: The Role of InfiniBand Technologies in High Performance Computing


Nonblocking Broadcast Latency – usec per call

Msg size   IBOff + SM   IBOff    P2P + SM
16B        3.58         19.79    2.57
1KB        4.96         27.44    5.70
8MB        26100        37855    28781

Page 45: The Role of InfiniBand Technologies in High Performance Computing


Broadcast – small data - hierarchical

Page 46: The Role of InfiniBand Technologies in High Performance Computing


Broadcast – large data - hierarchical

Page 47: The Role of InfiniBand Technologies in High Performance Computing


Overlap Measurement

Benchmark steps:

Polling method
1. Post broadcast
2. Do work and poll for completion
3. Continue until the broadcast completes

Post-work-wait method (see the sketch below)
1. Post broadcast
2. Do work
3. Wait for broadcast completion
4. Compare the time of steps 1–3 with the plain post-wait time
5. Increase the work and repeat steps 1–4 until the post-work-wait time exceeds the post-wait time
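
A minimal sketch of one post-work-wait iteration, using the MPI-3 name MPI_Ibcast (the prototype measured here exposed the call with an MPIX_ prefix). work() is a hypothetical compute kernel whose duration can be dialed up; the caller compares the returned time with the pure post-wait time.

#include <mpi.h>

void work(double microseconds);   /* hypothetical tunable compute kernel */

/* Time one post-work-wait iteration of a nonblocking broadcast. */
double post_work_wait(void *buf, int count, int root, double work_us)
{
    MPI_Request req;
    double t0 = MPI_Wtime();

    MPI_Ibcast(buf, count, MPI_BYTE, root, MPI_COMM_WORLD, &req);  /* 1) post    */
    work(work_us);                                                 /* 2) do work */
    MPI_Wait(&req, MPI_STATUS_IGNORE);                             /* 3) wait    */

    /* 4)-5) the caller compares this against the pure post-wait time and
       increases work_us until post-work-wait takes longer than post-wait */
    return MPI_Wtime() - t0;
}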

Page 48: The Role of InfiniBand Technologies in High Performance Computing


Nonblocking Broadcast – Overlap - Poll

Page 49: The Role of InfiniBand Technologies in High Performance Computing


Nonblocking Broadcast – Overlap - Wait

Page 50: The Role of InfiniBand Technologies in High Performance Computing


All-To-All Data

Page 51: The Role of InfiniBand Technologies in High Performance Computing


All-To-All: 1 Byte

Page 52: The Role of InfiniBand Technologies in High Performance Computing


All-To-All: 64 Bytes

Page 53: The Role of InfiniBand Technologies in High Performance Computing


All-To-All: 128 Bytes

Page 54: The Role of InfiniBand Technologies in High Performance Computing


All-To-All: 4 MB/process

Page 55: The Role of InfiniBand Technologies in High Performance Computing


Allgather Data

Page 56: The Role of InfiniBand Technologies in High Performance Computing


All-Gather: 1 Byte

Page 57: The Role of InfiniBand Technologies in High Performance Computing


All-Gather: 128 Bytes

Page 58: The Role of InfiniBand Technologies in High Performance Computing


All-Gather: 131072 Bytes

Page 59: The Role of InfiniBand Technologies in High Performance Computing


Summary

Added hardware support for offloading broadcast operations

Developed MPI-level support for one-copy asynchronous contiguous large-data transfer

Good collective performance

Good overlap capabilities