23
© 2019 Mellanox Technologies 1 Paving the Road to Exascale September 2019 Interconnect Your Future

Interconnect Your Future - Hewlett Packard Enterprise...The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait for the Data Creates

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Interconnect Your Future - Hewlett Packard Enterprise...The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait for the Data Creates

© 2019 Mellanox Technologies 1

Paving the Road to ExascaleSeptember 2019

Interconnect Your Future

Page 2: Interconnect Your Future - Hewlett Packard Enterprise...The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait for the Data Creates

© 2019 Mellanox Technologies 2

InfiniBand Accelerates 6 of Top 10 Supercomputers

1 2 3 5 8 10

Page 3: Interconnect Your Future - Hewlett Packard Enterprise...The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait for the Data Creates

© 2019 Mellanox Technologies 3

5 62 166 264

India’s National Supercomputing

Program

World’s FirstHDR InfiniBandSupercomputer

HDR 200G InfiniBand Accelerated Supercomputers

Page 4: Interconnect Your Future - Hewlett Packard Enterprise...The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait for the Data Creates

© 2019 Mellanox Technologies 4

Higher Data SpeedsFaster Data Processing

Better Data Security

Adapters SwitchesCables &

Transceivers

SmartNIC System on a Chip

HPC and AI Needs the Most Intelligent Interconnect

Page 5: Interconnect Your Future - Hewlett Packard Enterprise...The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait for the Data Creates

© 2019 Mellanox Technologies 5

The Need for Intelligent and Faster Interconnect

CPU-Centric (Onload) Data-Centric (Offload)

Must Wait for the DataCreates Performance Bottlenecks

Faster Data Speeds and In-Network Computing Enable Higher Performance and Scale

GPU

CPU

GPU

CPU

Onload Network In-Network Computing

GPU

CPU

CPU

GPU

GPU

CPU

GPU

CPU

GPU

CPU

CPU

GPU

Analyze Data as it Moves!Higher Performance and Scale

Page 6: Interconnect Your Future - Hewlett Packard Enterprise...The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait for the Data Creates

© 2019 Mellanox Technologies 6

Data Centric Architecture to Overcome Latency Bottlenecks

CPU-Centric (Onload) Data-Centric (Offload)

Communications Latencies of 30-40us

Intelligent Interconnect Paves the Road to Exascale Performance

GPU

CPU

GPU

CPU

GPU

CPU

CPU

GPU

GPU

CPU

GPU

CPU

GPU

CPU

CPU

GPU

Communications Latenciesof 3-4us

Page 7: Interconnect Your Future - Hewlett Packard Enterprise...The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait for the Data Creates

© 2019 Mellanox Technologies 7

GPUDirect

RDMA

Network

Communication

Application Data Analysis

Real Time

Deep Learning

Mellanox SHARP In-Network Computing

MPI Tag Matching

MPI Rendezvous

Software Defined Virtual Devices

Network Transport Offload

RDMA and GPU-Direct RDMA

SHIELD (Self-Healing Network)

Enhanced Adaptive Routing and Congestion Control

Connectivity Multi-Host Technology

Socket-Direct Technology

Enhanced Topologies

Accelerating All Levels of HPC / AI Frameworks

Page 8: Interconnect Your Future - Hewlett Packard Enterprise...The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait for the Data Creates

© 2019 Mellanox Technologies 8

Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)

Page 9: Interconnect Your Future - Hewlett Packard Enterprise...The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait for the Data Creates

© 2019 Mellanox Technologies 9

Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)

Reliable Scalable General Purpose Primitive In-network Tree based aggregation mechanism Large number of groups Multiple simultaneous outstanding operations

Applicable to Multiple Use-cases HPC Applications using MPI / SHMEM Distributed Machine Learning applications

Scalable High Performance Collective Offload Barrier, Reduce, All-Reduce, Broadcast and more Sum, Min, Max, Min-loc, max-loc, OR, XOR, AND Integer and Floating-Point, 16/32/64 bits

DataAggregated

AggregatedResult

Aggregated Result

Data

Host Host Host Host Host

SwitchSwitch

Switch

Page 10: Interconnect Your Future - Hewlett Packard Enterprise...The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait for the Data Creates

© 2019 Mellanox Technologies 10

SHARP AllReduce Performance Advantages (128 Nodes)

SHARP enables 75% Reduction in LatencyProviding Scalable Flat LatencyScalable Hierarchical

Aggregation and Reduction Protocol

Page 11: Interconnect Your Future - Hewlett Packard Enterprise...The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait for the Data Creates

© 2019 Mellanox Technologies 11

SHARP AllReduce Performance Advantages 1500 Nodes, 60K MPI Ranks, Dragonfly+ Topology

SHARP Enables Highest PerformanceScalable Hierarchical Aggregation and

Reduction Protocol

Page 12: Interconnect Your Future - Hewlett Packard Enterprise...The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait for the Data Creates

© 2019 Mellanox Technologies 12

Oak Ridge National Laboratory – Coral Summit SupercomputerSHARP AllReduce Performance Advantages

SHARP Enables Highest PerformanceScalable Hierarchical Aggregation and

Reduction Protocol

Page 13: Interconnect Your Future - Hewlett Packard Enterprise...The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait for the Data Creates

© 2019 Mellanox Technologies 13

Performs the Gradient AveragingReplaces all physical parameter serversAccelerate AI Performance

SHARP Accelerates AI Performance

The CPU in a parameter server becomes the bottleneck

Page 14: Interconnect Your Future - Hewlett Packard Enterprise...The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait for the Data Creates

© 2019 Mellanox Technologies 14

NCCL-SHARP Delivers Highest Performance

Page 15: Interconnect Your Future - Hewlett Packard Enterprise...The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait for the Data Creates

© 2019 Mellanox Technologies 15

Adaptive Routing

Page 16: Interconnect Your Future - Hewlett Packard Enterprise...The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait for the Data Creates

© 2019 Mellanox Technologies 16

InfiniBand Proven Adaptive Routing Performance

Oak Ridge National Laboratory – Coral Summit supercomputer Bisection bandwidth benchmark, based on mpiGraph

Explores the bandwidth between possible MPI process pairs

AR results demonstrate an average performance of 96% of the maximum bandwidth measured

mpiGraph explores the bandwidth

between possible MPI process pairs. In

the histograms, the single cluster with AR

indicates that all pairs achieve nearly

maximum bandwidth while single-path

static routing has nine clusters as

congestion limits bandwidth, negatively impacting overall application performance.

“The Design, Deployment, and Evaluation of the CORAL Pre-Exascale Systems”, Sudharshan S. Vazhkudai, Arthur S. Bland, Al Geist, Christopher J. Zimmer, Scott Atchley, Sarp Oral, Don E. Maxwell, Veronica G. Vergara Larrea, Wayne Joubert, Matthew A. Ezell, Dustin Leverman, James H. Rogers, Drew Schmidt, Mallikarjun Shankar, Feiyi Wang, Junqi Yin (Oak Ridge National Laboratory) and Bronis R. de Supinski, Adam Bertsch, Robin Goldstone, Chris Chambreau, Ben Casses, Elsa Gonsiorowski, Ian Karlin, Matthew L. Leininger, Adam Moody, Martin Ohmacht, Ramesh Pankajakshan, Fernando Pizzano, Py Watson, Lance D. Weems (Lawrence Livermore National Laboratory) and James Sexton, Jim Kahle, David Appelhans, Robert Blackmore, George Chochia, Gene Davison, Tom Gooding, Leopold Grinberg, Bill Hanson, Bill Hartner, Chris Marroquin, Bryan Rosenburg, Bob Walkup (IBM)

InfiniBand High Network Efficiency - mpiGraph

Oak Ridge National Lab Summit Supercomputer

Static Routing Adaptive Routing

Page 17: Interconnect Your Future - Hewlett Packard Enterprise...The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait for the Data Creates

© 2019 Mellanox Technologies 17

Network Topologies

Page 18: Interconnect Your Future - Hewlett Packard Enterprise...The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait for the Data Creates

© 2019 Mellanox Technologies 18

Supporting Variety of Topologies

Torus DragonflyFat Tree

Hypercube HyperX

Page 19: Interconnect Your Future - Hewlett Packard Enterprise...The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait for the Data Creates

© 2019 Mellanox Technologies 19

HDR InfiniBand

Page 20: Interconnect Your Future - Hewlett Packard Enterprise...The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait for the Data Creates

© 2019 Mellanox Technologies 20

Highest-Performance 200Gb/s InfiniBand Solutions

TransceiversActive Optical and Copper Cables(10 / 25 / 40 / 50 / 56 / 100 / 200Gb/s)

40 HDR (200Gb/s) InfiniBand Ports80 HDR100 InfiniBand PortsThroughput of 16Tb/s, <90ns Latency

200Gb/s Adapter, 0.6us latency215 million messages per second(10 / 25 / 40 / 50 / 56 / 100 / 200Gb/s)

MPI, SHMEM/PGAS, UPCFor Commercial and Open Source ApplicationsLeverages Hardware Accelerations

System on Chip and SmartNICProgrammable adapterSmart Offloads

Page 21: Interconnect Your Future - Hewlett Packard Enterprise...The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait for the Data Creates

© 2019 Mellanox Technologies 21

HDR InfiniBand Switches

40 ports of HDR, 200G 80 ports of HDR100, 100G

40 QSFP56 ports

800 ports of HDR, 200G 1600 ports of HDR100, 100G

800 QSFP56 ports

Page 22: Interconnect Your Future - Hewlett Packard Enterprise...The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait for the Data Creates

© 2019 Mellanox Technologies 22

Highest Performance and Scalability for Exascale Platforms

7X Higher Performance

96%Network Utilization

FlatLatency

5000X HigherResiliency

2X Higher Performance

DeepLearning

HDR 200G

NDR 400G

XDR 1000G

Page 23: Interconnect Your Future - Hewlett Packard Enterprise...The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait for the Data Creates

© 2019 Mellanox Technologies 23

Thank You