Page 1

Overview of Parallel and Distributed Systems: Petaflop to Exaflop &

Trends in Networking Technologies

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

Page 2

• Different Kinds of Parallel and Distributed Systems

• Petaflops to Exaflops, Trends of Commodity Clusters and Networking Technologies

• Defining Exascale Systems and Challenges

• Challenges in Designing Various Systems

Presentation Overview

Page 3

• Growth of High Performance Computing
  – Growth in processor performance
    • Chip density doubles every 18 months
  – Growth in commodity networking
    • Increase in speed/features + reducing cost
• Clusters: popular choice for HPC
  – Scalability, Modularity and Upgradeability

Current and Next Generation Applications and Computing Systems

Page 4

• Scientific Computing
  – Message Passing Interface (MPI), including MPI + OpenMP, is the dominant programming model
  – Many discussions towards Partitioned Global Address Space (PGAS)
    • UPC, OpenSHMEM, CAF, UPC++, etc.
  – Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC); a minimal hybrid MPI + OpenMP sketch follows this slide
• Deep Learning
  – Caffe, CNTK, TensorFlow, and many more
• Big Data/Enterprise/Commercial Computing
  – Focuses on large data and data analysis
  – Spark and Hadoop (HDFS, HBase, MapReduce)
  – Memcached is also used for Web 2.0

Three Major Computing Categories
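To make the hybrid model above concrete, here is a minimal MPI + OpenMP sketch in C (not from the original slides; the build and run commands in the comments are only examples):

```c
/* Minimal hybrid MPI + OpenMP sketch (illustrative, not from the original deck).
 * Build (e.g.): mpicc -fopenmp hybrid.c -o hybrid
 * Run   (e.g.): mpirun -np 4 ./hybrid
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nprocs;

    /* Request thread support so OpenMP threads may make MPI calls if needed */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each MPI process (typically one per node or socket) spawns OpenMP threads
       for on-node parallelism; MPI handles communication between nodes. */
    #pragma omp parallel
    {
        printf("MPI rank %d of %d, OpenMP thread %d of %d\n",
               rank, nprocs, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```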

Page 5

Integrated High-End Computing Environments
[Diagram: a compute cluster (frontend and compute nodes on a LAN) connected over a LAN/WAN to a storage cluster (meta-data manager holding metadata, and I/O server nodes holding data)]

Enterprise Multi-tier Datacenter for Visualization and Mining
[Diagram: Tier1 routers/servers connected through switches to Tier2 application servers and Tier3 database servers]

Page 6

Cloud Computing Environments
[Diagram: physical machines, each hosting multiple virtual machines, connected over a LAN/WAN to a virtual network file system backed by a physical meta-data manager (metadata) and physical I/O server nodes (data)]

Page 7

Big Data Processing with Hadoop Components

• Major components included in this tutorial:
  – MapReduce (Batch)
  – HBase (Query)
  – HDFS (Storage)
  – RPC
• The underlying Hadoop Distributed File System (HDFS) is used by both MapReduce and HBase
• The model scales, but the large amount of communication during intermediate phases can be further optimized

[Diagram: Hadoop framework stack - user applications on top of MapReduce and HBase, which sit on HDFS and Hadoop Common (RPC)]
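Hadoop itself is JVM-based, but Hadoop Streaming lets any executable act as a mapper or reducer by reading records on stdin and writing tab-separated key/value pairs on stdout. To stay in one language for this document's examples, here is a minimal word-count mapper sketch in C; the jar name and job options in the comment are placeholders:

```c
/* Word-count mapper for Hadoop Streaming (illustrative sketch).
 * Reads text from stdin, emits "word<TAB>1" per word to stdout.
 * Example job submission (paths are placeholders):
 *   hadoop jar hadoop-streaming.jar -input /in -output /out \
 *     -mapper ./wc_map -reducer ./wc_reduce
 */
#include <stdio.h>
#include <ctype.h>

int main(void)
{
    int c, in_word = 0;
    char word[256];
    size_t len = 0;

    while ((c = getchar()) != EOF) {
        if (isalnum(c)) {
            if (len < sizeof(word) - 1)
                word[len++] = (char)tolower(c);
            in_word = 1;
        } else if (in_word) {
            word[len] = '\0';
            printf("%s\t1\n", word);   /* key<TAB>value record */
            len = 0;
            in_word = 0;
        }
    }
    if (in_word) {              /* flush the last word, if any */
        word[len] = '\0';
        printf("%s\t1\n", word);
    }
    return 0;
}
```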

Page 8

Spark Architecture Overview

• An in-memory data-processing framework
  – Iterative machine learning jobs
  – Interactive data analytics
  – Scala-based implementation
  – Standalone, YARN, Mesos
• Scalable and communication intensive
  – Wide dependencies between Resilient Distributed Datasets (RDDs)
  – MapReduce-like shuffle operations to repartition RDDs
  – Sockets-based communication

http://spark.apache.org

Page 9

Memcached Architecture

• Distributed caching layer
  – Allows spare memory from multiple nodes to be aggregated
  – General purpose
• Typically used to cache database queries and the results of API calls (a minimal client sketch follows)
• Scalable model, but typical usage is very network intensive
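As a concrete illustration of the caching layer, here is a minimal client sketch using the libmemcached C API; the server hostname, key, value, and TTL are arbitrary placeholders:

```c
/* Minimal Memcached client sketch using libmemcached (illustrative).
 * Build (e.g.): gcc memc_demo.c -lmemcached -o memc_demo
 */
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(void)
{
    /* Connect to one memcached server (hostname/port are placeholders) */
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "cache-node1", 11211);

    /* Cache the result of an expensive query under a key */
    const char *key = "user:42:profile";
    const char *val = "{\"name\":\"alice\"}";
    memcached_return_t rc = memcached_set(memc, key, strlen(key),
                                          val, strlen(val),
                                          (time_t)300 /* TTL */, 0 /* flags */);
    if (rc != MEMCACHED_SUCCESS)
        fprintf(stderr, "set failed: %s\n", memcached_strerror(memc, rc));

    /* Later lookups hit the cache instead of the database */
    size_t len; uint32_t flags;
    char *cached = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (cached) {
        printf("cache hit: %.*s\n", (int)len, cached);
        free(cached);
    }

    memcached_free(memc);
    return 0;
}
```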

Page 10

• Deep Learning is going through a resurgence
  – Excellent accuracy for deep/convolutional neural networks
  – Public availability of versatile datasets like MNIST, CIFAR, and ImageNet
  – Widespread popularity of accelerators like NVIDIA GPUs
• DL frameworks and applications
  – Caffe, Microsoft CNTK, Google TensorFlow, and many more
  – Most of the frameworks exploit GPUs to accelerate training
  – Diverse range of applications: image recognition, cancer detection, self-driving cars, speech processing, etc.
• Can MPI runtimes like MVAPICH2 provide efficient support for Deep Learning workloads? (a minimal gradient-averaging sketch follows the chart below)
  – MPI runtimes typically deal with
    • relatively small messages (on the order of kilobytes)
    • CPU-based communication buffers

Deep Learning and MPI: State-of-the-art

https://cntk.ai  https://www.tensorflow.org  https://github.com/BVLC/caffe

[Chart: Google Trends relative search interest for "deep learning", December 2003 to November 2015]
Source: http://www.computervisionblog.com/2015/11/the-deep-learning-gold-rush-of-2015.html
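The question above largely comes down to collectives over very large, often GPU-resident buffers: data-parallel training averages gradients across all ranks every iteration, typically with MPI_Allreduce. A minimal CPU-buffer sketch follows (the gradient size and values are made up; a CUDA-aware MPI such as MVAPICH2-GDR could pass GPU pointers instead, which is not shown here):

```c
/* Gradient averaging for data-parallel training (illustrative sketch).
 * Each rank holds its local gradients; Allreduce sums them, then we scale.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const size_t n = 1 << 20;               /* ~1M parameters: a large message */
    float *grad = malloc(n * sizeof(float));
    for (size_t i = 0; i < n; i++)          /* placeholder local gradients */
        grad[i] = (float)rank;

    /* Sum gradients from all ranks in place, then divide by the rank count */
    MPI_Allreduce(MPI_IN_PLACE, grad, (int)n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    for (size_t i = 0; i < n; i++)
        grad[i] /= (float)nprocs;

    if (rank == 0)
        printf("averaged gradient[0] = %f\n", grad[0]);

    free(grad);
    MPI_Finalize();
    return 0;
}
```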

Page 11

• Different Kinds of Parallel and Distributed Systems

• Petaflops to Exaflops, Trends of Commodity Clusters and Networking Technologies

• Defining Exascale Systems and Challenges

• Challenges in Designing Various Systems

Presentation Overview

Page 12

High-End Computing (HEC): Towards Exascale

Expected to have an ExaFlop system in 2020-2021!

100 PFlops in 2016

1 EFlops in 2020-2021?

122 PFlops in 2018

Page 13

Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

[Chart: number and percentage of clusters in the Top500 over time; clusters now account for 87.4% of the list]

Page 14

Drivers of Modern HPC Cluster Architectures

• Multi-core/many-core technologies

• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)

• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD

• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)

• Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.

[Figure: Multi-core Processors; High Performance Interconnects - InfiniBand (<1 usec latency, 100 Gbps bandwidth); Accelerators / Coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); SSD, NVMe-SSD, NVRAM]
[Systems pictured: Tianhe-2, Titan, K Computer, Sunway TaihuLight]

Page 15

All Interconnects and Protocols (IB, HSE, Omni-Path & RoCE)

Application / Middleware sits on one of three interfaces (Sockets, Verbs, OFI); each path below lists its protocol, adapter, and switch (a minimal verbs sketch follows the list):
– 1/10/25/40/50/100 GigE: TCP/IP in kernel space (Ethernet driver) / Ethernet Adapter / Ethernet Switch
– IPoIB: IPoIB in kernel space / InfiniBand Adapter / InfiniBand Switch
– 10/40 GigE-TOE: TCP/IP with hardware offload / Ethernet Adapter / Ethernet Switch
– RSockets: RSockets in user space / InfiniBand Adapter / InfiniBand Switch
– SDP: SDP over RDMA / InfiniBand Adapter / InfiniBand Switch
– iWARP: TCP/IP offload in user space / iWARP Adapter / Ethernet Switch
– RoCE: RDMA in user space / RoCE Adapter / Ethernet Switch
– IB Native: RDMA in user space / InfiniBand Adapter / InfiniBand Switch
– 100 Gb/s Omni-Path (OFI): RDMA in user space / Omni-Path Adapter / Omni-Path Switch
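The RDMA paths in the list above (RSockets, SDP, iWARP, RoCE, IB native) all rest on user-space verbs: the application opens the adapter, registers memory, and then drives communication without the kernel on the data path. A minimal setup sketch using the ibverbs API follows (device choice and buffer size are arbitrary; queue-pair creation and connection establishment are omitted):

```c
/* Minimal user-space verbs setup (illustrative; connection setup omitted).
 * Build (e.g.): gcc verbs_demo.c -libverbs -o verbs_demo
 */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Discover RDMA-capable devices (IB, RoCE, iWARP all appear here) */
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) { fprintf(stderr, "no RDMA device\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);          /* protection domain */

    /* Register a buffer so the adapter can DMA to/from it directly */
    size_t len = 4096;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n",
           len, mr->lkey, mr->rkey);

    /* A real application would now create a completion queue and queue pair,
       exchange addresses/rkeys with the peer, and post RDMA work requests. */
    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}
```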

Page 16

Trends of Networking Technologies in TOP500 Systems Interconnect Family – Systems Share

Page 17

InfiniBand in the Top500 (June 2018)

Interconnect family share (systems count / performance share):
– 10G: 34% / 13%
– InfiniBand: 28% / 36%
– Gigabit Ethernet: 15% / 8%
– Custom Interconnect: 14% / 33%
– Omni-Path: 8% / 9%
– Proprietary Network: 1% / 1%

Page 18

Large-scale InfiniBand Installations

• 139 IB clusters (27.8%) in the Jun'18 Top500 list (http://www.top500.org)
• Installations in the Top 50 (19 systems):
  – 2,282,544 cores (Summit) at ORNL (1st)
  – 1,572,480 cores (Sierra) at LLNL (3rd)
  – 391,680 cores (ABCI) at AIST/Japan (5th)
  – 253,600 cores (HPC4) in Italy (13th)
  – 114,480 cores (Juwels Module 1) at FZJ/Germany (23rd)
  – 241,108 cores (Pleiades) at NASA/Ames (24th)
  – 220,800 cores (Pangea) in France (30th)
  – 144,900 cores (Cheyenne) at NCAR/USA (31st)
  – 72,000 cores (ITO - Subsystem A) in Japan (32nd)
  – 79,488 cores (JOLIOT-CURIE SKL) at CEA/France (34th)
  – 155,150 cores (JURECA) at FZJ/Germany (38th)
  – 72,800 cores, Cray CS-Storm in US (40th)
  – 72,800 cores, Cray CS-Storm in US (41st)
  – 78,336 cores (Electra) at NASA/Ames (43rd)
  – 124,200 cores (Topaz) at ERDC DSRC/USA (44th)
  – 60,512 cores, NVIDIA DGX-1 at Facebook/USA (45th)
  – 60,512 cores (DGX Saturn V) at NVIDIA/USA (46th)
  – 113,832 cores (Damson) at AWE/UK (47th)
  – 72,000 cores (HPC2) in Italy (49th)
  – and many more!
• The #2 system (Sunway TaihuLight) also uses InfiniBand
• The upcoming NSF Frontera system will also use InfiniBand

Page 19

Large-scale Omni-Path Installations

• 39 Omni-Path clusters (7.8%) in the Jun'18 Top500 list (http://www.top500.org)
  – 570,020 cores (Nurion) at KISTI/South Korea (11th)
  – 556,104 cores (Oakforest-PACS) at JCAHPC in Japan (12th)
  – 367,024 cores (Stampede2) at TACC in USA (15th)
  – 312,936 cores (Marconi XeonPhi) at CINECA in Italy (18th)
  – 135,828 cores (Tsubame 3.0) at TiTech in Japan (19th)
  – 153,216 cores (MareNostrum) at BSC in Spain (22nd)
  – 127,520 cores (Cobra) in Germany (28th)
  – 55,296 cores (Mustang) at AFRL/USA (48th)
  – 95,472 cores (Quartz) at LLNL in USA (63rd)
  – 95,472 cores (Jade) at LLNL in USA (64th)
  – 53,300 cores (Makman-3) at Saudi Aramco/Saudi Arabia (78th)
  – 34,560 cores (Gaffney) at Navy DSRC/USA (85th)
  – 34,560 cores (Koehr) at Navy DSRC/USA (86th)
  – 49,432 cores (Mogon II) in Germany (87th)
  – 38,553 cores (Molecular Simulator) in Japan (93rd)
  – 35,280 cores (Quriosity) at BASF in Germany (94th)
  – 54,432 cores (Marconi Xeon) at CINECA in Italy (98th)
  – 46,464 cores (Peta4) at Cambridge/UK (101st)
  – 53,352 cores (Grizzly) at LANL in USA (136th)
  – and many more!

Page 20

HSE Scientific Computing Installations

• 171 HSE compute systems with ranking in the Jun'18 Top500 list
  – 38,400-core installation in China (#95) – new
  – 38,400-core installation in China (#96) – new
  – 38,400-core installation in China (#97) – new
  – 39,680-core installation in China (#99)
  – 66,560-core installation in China (#157)
  – 66,280-core installation in China (#159)
  – 64,000-core installation in China (#160)
  – 64,000-core installation in China (#161)
  – 72,000-core installation in China (#164)
  – 64,320-core installation in China (#185) – new
  – 78,000-core installation in China (#187)
  – 75,776-core installation in China (#188) – new
  – 59,520-core installation in China (#192)
  – 59,520-core installation in China (#193)
  – 28,800-core installation in China (#195) – new
  – 62,400-core installation in China (#197) – new
  – 64,800-core installation in China (#198)
  – 66,000-core installation in China (#209) – new
  – and many more!

Page 21

• Different Kinds of Parallel and Distributed Systems

• Petaflops to Exaflops, Trends of Commodity Clusters and Networking Technologies

• Defining Exascale Systems and Challenges

• Challenges in Designing Various Systems

Presentation Overview

Page 22

• Exascale Systems
  – Computing capability of 10^18 Flops (worked out below)
  – 1,000 times more than a PetaFlop (10^15)
  – 1,000,000 times more than a TeraFlop (10^12)

What Do Exascale Systems Mean?
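A short worked conversion (my arithmetic; the 200 PFlop/s figure is Summit's peak, quoted in the comparison two slides later):

```latex
1~\mathrm{EFlop/s} = 10^{18}~\mathrm{Flop/s}
                   = 10^{3} \times 10^{15}~\mathrm{Flop/s} = 1{,}000~\mathrm{PFlop/s}
                   = 10^{6} \times 10^{12}~\mathrm{Flop/s} = 10^{6}~\mathrm{TFlop/s},
\qquad
\frac{10^{18}~\mathrm{Flop/s}}{200 \times 10^{15}~\mathrm{Flop/s}} = 5
\quad \text{(about 5x Summit's peak).}
```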

Page 23

Towards Exascale System (2015 and Target)

Metric: 2015 (Tianhe-2) → 2018 original target / 2020-2024 → Difference (today vs. exascale)
– System peak: 55 PFlop/s → 1 EFlop/s → ~20x
– Power: 18 MW (3 Gflops/W) → ~20 MW (50 Gflops/W) → O(1), ~15x
– System memory: 1.4 PB (1.024 PB CPU + 0.384 PB CoP) → 32-64 PB → ~50x
– Node performance: 3.43 TF/s (0.4 CPU + 3 CoP) → 1.2 or 15 TF → O(1)
– Node concurrency: 24-core CPU + 171-core CoP → O(1k) or O(10k) → ~5x - ~50x
– Total node interconnect BW: 6.36 GB/s → 200-400 GB/s → ~40x - ~60x
– System size (nodes): 16,000 → O(100,000) or O(1M) → ~6x - ~60x
– Total concurrency: 3.12M (12.48M threads, 4/core) → O(billion) for latency hiding → ~100x
– MTTI: few/day → many/day → O(?)

Courtesy: Prof. Jack Dongarra

Page 24

Towards Exascale System (2016-17 and Target)

Metric: 2016-17 (Sunway TaihuLight) → 2020-2021 new target → Difference (today vs. exascale)
– System peak: 125.4 PFlop/s → 1 EFlop/s → ~10x
– Power: 15 MW (6 Gflops/W) → ~20 MW (50 Gflops/W) → O(1), ~8x
– System memory: 1.31 PB → 32-64 PB → ~50x
– Node performance: 3.0624 TF/s → 1.2 or 15 TF → O(1)
– Node concurrency: 260-core CPU → O(1k) or O(10k) → ~4x - ~40x
– Total node interconnect BW: 16 GB/s → 200-400 GB/s → ~12x - ~25x
– System size (nodes): 40,960 → O(100,000) or O(1M) → ~2.5x - ~25x
– Total concurrency: 10M (1 thread/core) → O(billion) for latency hiding → ~100x
– MTTI: few/day → many/day → O(?)

Page 25

Towards Exascale System (2018 and Target)

Metric: 2018 (Summit) → 2020-2021 new target → Difference (today vs. exascale)
– System peak: 200 PFlop/s → 1 EFlop/s → ~5x
– Power: 13 MW (15 PFlops/MW) → ~20 MW (50 PFlops/MW) → O(1), ~4x
– System memory: 10.2 PB → 32-64 PB → ~3-6x
– Node performance: 42 TF/s (6 x 7 TF) → 1.2 or 15 TF → O(1)
– Node concurrency: 44-core CPU → O(1k) or O(10k) → ~25x - ~250x
– Total node interconnect BW: 25 GB/s → 200-400 GB/s → ~8x - ~16x
– System size (nodes): 4,608 → O(100,000) or O(1M) → ~25x - ~250x
– Total concurrency: 202K (4 threads/core) → O(billion) for latency hiding → ~5000x
– MTTI: few/day → many/day → O(?)

Page 26

• Energy and Power Challenge
  – Hard to meet the power requirements of data movement
• Memory and Storage Challenge
  – Hard to achieve high capacity and high data rate
• Concurrency and Locality Challenge
  – Management of a very large amount of concurrency (billions of threads)
• Resiliency Challenge
  – Low-voltage devices (for low power) introduce more faults

Basic Design Challenges for Exascale Systems

Page 27

• Supercomputers require a lot of energy
  – Power consumed by processors for computation
  – Power required by memory and other devices to move data
  – Power required to cool the system
• The power requirement of the current generation is already high
• Design constraint on Exaflop systems: must reach an Exaflop using a total of around 20 MW of power
  – This constraint is gradually being relaxed

Power Constraints for Exascale Systems

Page 28

What Will a Typical Exascale System Look Like?

• 50,000-100,000 nodes
• Each node might be heterogeneous
  – 256-1,024 CPU cores
  – 1K-4K accelerator cores
• New kinds of memory technology (High Bandwidth Memory, NVRAM)
• Running hybrid programming
  – MPI + Partitioned Global Address Space (UPC, OpenSHMEM, CAF, etc.)
• A significant fraction of the nodes might have SSDs for storage

Page 29

Broad Set of Challenges in Designing Exascale Systems

[Diagram: challenge areas span CPUs and Accelerators; Memory & Storage; Networking; I/O and File Systems; Programming Models; Compilers; Languages; Runtime; Resource Management; Fault-Tolerance; Tools & Debugging; Algorithms & Applications]

Page 30

• Different Kinds of Parallel and Distributed Systems

• Petaflops to Exaflops, Trends of Commodity Clusters and Networking Technologies

• Defining Exascale Systems and Challenges

• Challenges in Designing Various Systems

Presentation Overview

Page 31

Designing Communication Middleware for Multi-Petaflop and Exaflop Systems: Challenges

[Diagram: co-design stack]
– Application kernels/applications
– Programming models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
– Communication library or runtime for programming models: point-to-point communication, collective communication, energy awareness, synchronization and locks, I/O and file systems, fault tolerance
– Networking technologies (InfiniBand, 40/100 GigE, Aries, and Omni-Path); multi/many-core architectures; accelerators (NVIDIA and MIC)
– Cross-cutting concerns: middleware co-design opportunities and challenges across the various layers; performance, scalability, fault-resilience

Page 32

• Hardware components
  – Processing cores and memory subsystem
  – I/O bus or links
  – Network adapters/switches
• Software components
  – Communication stack
• Bottlenecks can artificially limit the network performance the user perceives

Major Components in Computing Systems
[Diagram: a node with two processors (P0 and P1, four cores each) and memory, connected over an I/O bus to a network adapter and a network switch; processing, I/O interface, and network bottlenecks are marked]

Page 33

• Ex: TCP/IP, UDP/IP

• Generic architecture for all networks

• Host processor handles almost all aspects of communication

– Data buffering (copies on sender and receiver)

– Data integrity (checksum)

– Routing aspects (IP routing)

• Signaling between different layers
  – Hardware interrupt on packet arrival or transmission
  – Software signals between different layers to handle protocol processing at different priority levels

Processing Bottlenecks in Traditional Protocols
[Diagram: the same node architecture, with the processing bottleneck (host protocol processing) highlighted]
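For contrast with the RDMA stacks shown earlier, here is a plain kernel-TCP sender sketch: every send() below is a system call, and the host CPU does the buffering, checksumming, and IP processing described above. The peer address, port, and message sizes are placeholders:

```c
/* Plain kernel-based TCP send (illustrative): each send() involves a system
 * call, a copy into kernel socket buffers, checksumming, and IP processing
 * on the host CPU -- the "processing bottleneck" of traditional protocols.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in peer = {0};
    peer.sin_family = AF_INET;
    peer.sin_port = htons(5000);                        /* placeholder port */
    inet_pton(AF_INET, "192.168.1.10", &peer.sin_addr); /* placeholder address */

    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) != 0) {
        perror("connect");
        return 1;
    }

    char msg[4096];
    memset(msg, 'x', sizeof(msg));
    for (int i = 0; i < 1000; i++)
        send(fd, msg, sizeof(msg), 0);   /* host CPU does the protocol work */

    close(fd);
    return 0;
}
```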

Page 34

• Traditionally relied on bus-based technologies (last-mile bottleneck)
  – E.g., PCI, PCI-X
  – One bit per wire
  – Performance increase through:
    • Increasing clock speed
    • Increasing bus width
  – Not scalable:
    • Cross talk between bits
    • Skew between wires
    • Signal integrity makes it difficult to increase bus width significantly, especially at high clock speeds

Bottlenecks in Traditional I/O Interfaces and Networks
– PCI (1990): 33 MHz / 32-bit: 1.05 Gbps (shared bidirectional)
– PCI-X (1998 v1.0, 2003 v2.0): 133 MHz / 64-bit: 8.5 Gbps; 266-533 MHz / 64-bit: 17 Gbps (shared bidirectional)
[Diagram: the same node architecture, with the I/O interface bottleneck highlighted]
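The PCI/PCI-X figures above follow directly from clock rate times bus width (my arithmetic):

```latex
\mathrm{PCI}:\ 33~\mathrm{MHz} \times 32~\mathrm{bit} \approx 1.05~\mathrm{Gbit/s}, \qquad
\mathrm{PCI\text{-}X}:\ 133~\mathrm{MHz} \times 64~\mathrm{bit} \approx 8.5~\mathrm{Gbit/s}, \quad
266~\mathrm{MHz} \times 64~\mathrm{bit} \approx 17~\mathrm{Gbit/s}
```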

Page 35

• Network speeds saturated at around 1 Gbps
  – Features provided were limited
  – Commodity networks were not considered scalable enough for very large-scale systems

Bottlenecks on Traditional Networks
– Ethernet (1979 - ): 10 Mbit/sec
– Fast Ethernet (1993 - ): 100 Mbit/sec
– Gigabit Ethernet (1995 - ): 1000 Mbit/sec
– ATM (1995 - ): 155/622/1024 Mbit/sec
– Myrinet (1993 - ): 1 Gbit/sec
– Fibre Channel (1994 - ): 1 Gbit/sec
[Diagram: the same node architecture, with the network bottleneck highlighted]

Page 36

Network Speed Acceleration with IB and HSE
– Ethernet (1979 - ): 10 Mbit/sec
– Fast Ethernet (1993 - ): 100 Mbit/sec
– Gigabit Ethernet (1995 - ): 1000 Mbit/sec
– ATM (1995 - ): 155/622/1024 Mbit/sec
– Myrinet (1993 - ): 1 Gbit/sec
– Fibre Channel (1994 - ): 1 Gbit/sec
– InfiniBand (2001 - ): 2 Gbit/sec (1X SDR)
– 10-Gigabit Ethernet (2001 - ): 10 Gbit/sec
– InfiniBand (2003 - ): 8 Gbit/sec (4X SDR)
– InfiniBand (2005 - ): 16 Gbit/sec (4X DDR), 24 Gbit/sec (12X SDR)
– InfiniBand (2007 - ): 32 Gbit/sec (4X QDR)
– 40-Gigabit Ethernet (2010 - ): 40 Gbit/sec
– InfiniBand (2011 - ): 54.6 Gbit/sec (4X FDR)
– InfiniBand (2012 - ): 2 x 54.6 Gbit/sec (4X Dual-FDR)
– 25-/50-Gigabit Ethernet (2014 - ): 25/50 Gbit/sec
– 100-Gigabit Ethernet (2015 - ): 100 Gbit/sec
– InfiniBand (2015 - ): 100 Gbit/sec (4X EDR)
– InfiniBand (2016 - ): 200 Gbit/sec (4X HDR)

100 times in the last 15 years

Page 37

• Recent trends in I/O interfaces show that they nearly match network speeds head-to-head (though they still lag slightly)

Trends in I/O Interfaces with Servers
– PCI (1990): 33 MHz / 32-bit: 1.05 Gbps (shared bidirectional)
– PCI-X (1998 v1.0, 2003 v2.0): 133 MHz / 64-bit: 8.5 Gbps; 266-533 MHz / 64-bit: 17 Gbps (shared bidirectional)
– AMD HyperTransport (HT) (2001 v1.0, 2004 v2.0, 2006 v3.0, 2008 v3.1): 102.4 Gbps (v1.0), 179.2 Gbps (v2.0), 332.8 Gbps (v3.0), 409.6 Gbps (v3.1), at 32 lanes
– PCI-Express (PCIe) by Intel (2003 Gen1, 2007 Gen2, 2009 Gen3 standard, 2017 Gen4 standard):
  Gen1: 4X (8 Gbps), 8X (16 Gbps), 16X (32 Gbps)
  Gen2: 4X (16 Gbps), 8X (32 Gbps), 16X (64 Gbps)
  Gen3: 4X (~32 Gbps), 8X (~64 Gbps), 16X (~128 Gbps)
  Gen4: 4X (~64 Gbps), 8X (~128 Gbps), 16X (~256 Gbps)
– Intel QuickPath Interconnect (QPI) (2009): 153.6-204.8 Gbps (20 lanes)
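The PCIe numbers follow from the per-lane signaling rate, the encoding overhead, and the lane count; a worked example for Gen3 16X (my arithmetic, assuming the standard 8 GT/s lane rate with 128b/130b encoding):

```latex
8~\mathrm{GT/s} \times \tfrac{128}{130} \times 16~\mathrm{lanes} \approx 126~\mathrm{Gbit/s}
\quad \text{(the} \ {\sim}128~\mathrm{Gbps}\ \text{listed above for Gen3 16X).}
```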

Page 38

Common Challenges for Large-Scale Installations

– Adapters and their interactions: I/O bus, multi-port adapters, NUMA
– Switches: topologies, switching / routing
– Bridges: IB interoperability

Page 39

System Specific Challenges for HPC Systems

Common challenges (as above): adapters and interactions (I/O bus, multi-port adapters, NUMA); switches (topologies, switching / routing); bridges (IB interoperability)

HPC:
– MPI: multi-rail, collectives scalability, application scalability, energy awareness
– PGAS: programmability with performance, optimized resource utilization
– GPU / MIC: programmability with performance, hiding data movement costs, heterogeneity-aware design, streaming, deep learning

Page 40

System Specific Challenges for Storage and File Systems

Common challenges (as above), plus the HPC challenges (MPI, PGAS, GPU / MIC)

Storage and File Systems:
– High-throughput I/O
– Taking advantage of RDMA
– Checkpointing with aggregation
– Hierarchical data staging
– QoS-aware checkpointing
– Decentralized metadata

Page 41

System Specific Challenges for Big Data Processing

Common challenges (as above), plus the HPC challenges (MPI, PGAS, GPU / MIC)

Big Data:
– Taking advantage of RDMA
– Performance and scalability
– Backward compatibility

Page 42

System Specific Challenges for Cloud Computing

Common challenges (as above), plus the HPC (MPI, PGAS, GPU / MIC), Storage and File Systems, and Big Data challenges

Cloud Computing:
– SR-IOV support
– Virtualization
– Containers