Overview of Parallel and Distributed Systems: Petaflop to Exaflop &
Trends in Networking Technologies
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Presentation Overview
• Different Kinds of Parallel and Distributed Systems
• Petaflops to Exaflops, Trends of Commodity Clusters and Networking Technologies
• Defining Exascale Systems and Challenges
• Challenges in Designing Various Systems
Current and Next Generation Applications and Computing Systems
• Growth of High Performance Computing
  – Growth in processor performance
    • Chip density doubles every 18 months
  – Growth in commodity networking
    • Increase in speed/features and reduced cost
• Clusters: popular choice for HPC
  – Scalability, Modularity, and Upgradeability
Three Major Computing Categories
• Scientific Computing
  – Message Passing Interface (MPI), including MPI + OpenMP, is the dominant programming model (a minimal MPI sketch follows this list)
  – Many discussions towards Partitioned Global Address Space (PGAS)
    • UPC, OpenSHMEM, CAF, UPC++, etc.
  – Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC)
• Deep Learning
  – Caffe, CNTK, TensorFlow, and many more
• Big Data/Enterprise/Commercial Computing
  – Focuses on large data and data analysis
  – Spark and Hadoop (HDFS, HBase, MapReduce)
  – Memcached is also used for Web 2.0
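As a point of reference for the MPI model mentioned above, here is a minimal sketch using the mpi4py binding (chosen here purely for illustration; production HPC codes typically use the C/C++ or Fortran MPI interfaces): every rank contributes a value to a collective reduction and rank 0 prints the result.

```python
# Minimal MPI sketch (illustrative only) using the mpi4py binding.
# Run with, e.g.: mpirun -np 4 python mpi_hello.py
from mpi4py import MPI

comm = MPI.COMM_WORLD          # communicator spanning all ranks
rank = comm.Get_rank()         # this process's rank
size = comm.Get_size()         # total number of ranks

# Each rank contributes its rank id; allreduce sums across all ranks.
total = comm.allreduce(rank, op=MPI.SUM)

if rank == 0:
    print(f"{size} ranks, sum of ranks = {total}")
```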
Integrated High-End Computing Environments
[Figure: a compute cluster (frontend and compute nodes on a LAN) connected over LAN/WAN to a storage cluster (a meta-data manager plus I/O server nodes holding meta-data and data), and to an enterprise multi-tier datacenter for visualization and mining with Tier1 routers/servers, Tier2 application servers, and Tier3 database servers connected by switches.]
Cloud Computing Environments
[Figure: multiple physical machines, each hosting several virtual machines, connected over LAN/WAN to a virtual network file system backed by a physical meta-data manager and physical I/O server nodes holding meta-data and data.]
Big Data Processing with Hadoop Components
• Major components included in this tutorial:
  – MapReduce (Batch)
  – HBase (Query)
  – HDFS (Storage)
  – RPC
• Underlying Hadoop Distributed File System (HDFS) used by both MapReduce and HBase
• Model scales, but the high amount of communication during intermediate phases can be further optimized (a minimal MapReduce sketch follows)
[Figure: Hadoop framework stack – User Applications on top of MapReduce and HBase, which run over HDFS and Hadoop Common (RPC).]
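To make the MapReduce model concrete, here is a minimal word-count sketch for Hadoop Streaming (a sketch under stated assumptions: the file name wordcount.py and the jar path in the usage note are illustrative placeholders, not part of the tutorial):

```python
#!/usr/bin/env python3
# Minimal Hadoop Streaming word count (illustrative sketch).
# Invoked either as the mapper ("map") or the reducer ("reduce");
# Hadoop pipes text through stdin/stdout and sorts mapper output
# by key before the reduce phase (the intermediate-phase shuffle).
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")                 # emit (word, 1)

def reducer():
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")    # flush previous key
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

It would be launched with the streaming jar, e.g. `hadoop jar hadoop-streaming.jar -mapper "python3 wordcount.py map" -reducer "python3 wordcount.py reduce" -input <in> -output <out>` (paths are placeholders); the shuffle between map and reduce is exactly the intermediate-phase communication referred to above.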
Spark Architecture Overview
• An in-memory data-processing framework
  – Iterative machine learning jobs
  – Interactive data analytics
  – Scala-based implementation
  – Standalone, YARN, Mesos
• Scalable and communication intensive
  – Wide dependencies between Resilient Distributed Datasets (RDDs)
  – MapReduce-like shuffle operations to repartition RDDs (see the sketch below)
  – Sockets-based communication
http://spark.apache.org
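To see where the shuffle arises, here is a minimal PySpark sketch (illustrative; the input path and application name are assumptions): reduceByKey creates a wide dependency, so intermediate (word, count) pairs are repartitioned across executors over the network before the reduction.

```python
# Minimal PySpark word count (illustrative sketch).
# reduceByKey introduces a wide RDD dependency: a MapReduce-like
# shuffle repartitions intermediate pairs across executors.
from pyspark import SparkContext

sc = SparkContext(appName="wordcount-sketch")

counts = (sc.textFile("hdfs:///path/to/input")    # placeholder path
            .flatMap(lambda line: line.split())   # narrow dependency
            .map(lambda w: (w, 1))                # narrow dependency
            .reduceByKey(lambda a, b: a + b))     # wide dependency -> shuffle

print(counts.take(10))
sc.stop()
```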
Memcached Architecture
• Distributed Caching Layer
  – Allows aggregating spare memory from multiple nodes
  – General purpose
• Typically used to cache database queries and results of API calls
• Scalable model, but typical usage is very network intensive (a cache-aside sketch follows)
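A minimal cache-aside sketch, assuming a memcached server on localhost:11211 and the pymemcache client library (both assumptions; the hypothetical query_database helper stands in for a real database call):

```python
# Cache-aside pattern with memcached (illustrative sketch).
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))    # assumed memcached endpoint

def query_database(key):
    # Hypothetical placeholder for an expensive database query.
    return f"value-for-{key}"

def get_with_cache(key):
    cached = cache.get(key)             # fast path: served from spare RAM
    if cached is not None:
        return cached.decode()          # pymemcache returns bytes
    value = query_database(key)         # miss: fall back to the database
    cache.set(key, value, expire=300)   # cache for 5 minutes
    return value

print(get_with_cache("user:42"))
```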
Deep Learning and MPI: State-of-the-art
• Deep Learning is going through a resurgence
  – Excellent accuracy for deep/convolutional neural networks
  – Public availability of versatile datasets like MNIST, CIFAR, and ImageNet
  – Widespread popularity of accelerators like NVIDIA GPUs
• DL frameworks and applications
  – Caffe, Microsoft CNTK, Google TensorFlow, and many more
  – Most of the frameworks exploit GPUs to accelerate training
  – Diverse range of applications: Image Recognition, Cancer Detection, Self-Driving Cars, Speech Processing, etc.
• Can MPI runtimes like MVAPICH2 provide efficient support for Deep Learning workloads? (see the sketch below)
  – MPI runtimes typically deal with
    • relatively small-size messages (on the order of kilobytes)
    • CPU-based communication buffers
https://github.com/BVLC/caffe    https://cntk.ai    https://www.tensorflow.org
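To illustrate why DL workloads stress MPI differently, here is a data-parallel gradient-averaging sketch with mpi4py and NumPy (an assumption chosen for illustration; the slide does not prescribe a particular runtime or binding): the reduced buffer holds tens of millions of elements, i.e., messages of megabytes rather than the kilobytes typical of scientific codes.

```python
# Data-parallel gradient averaging (illustrative sketch).
# Note the message size: DL gradients are MB-scale dense buffers,
# unlike the KB-scale messages typical of scientific MPI codes.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()

# Pretend these are local gradients of a 25M-parameter model (~100 MB in fp32).
local_grad = np.random.rand(25_000_000).astype(np.float32)
avg_grad = np.empty_like(local_grad)

# Sum gradients from all ranks, then divide to get the average.
comm.Allreduce(local_grad, avg_grad, op=MPI.SUM)
avg_grad /= size
```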
[Chart: Google Trends relative search interest for "Deep Learning", late 2003 through late 2015, rising sharply in recent years.]
http://www.computervisionblog.com/2015/11/the-deep-learning-gold-rush-of-2015.html
Presentation Overview
• Different Kinds of Parallel and Distributed Systems
• Petaflops to Exaflops, Trends of Commodity Clusters and Networking Technologies
• Defining Exascale Systems and Challenges
• Challenges in Designing Various Systems
High-End Computing (HEC): Towards Exascale
Expected to have an ExaFlop system in 2020-2021!
[Chart: performance growth toward exascale – 100 PFlops reached in 2016, 122 PFlops in 2018, 1 EFlops projected for 2020-2021.]
Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)
[Chart: number and percentage of commodity clusters in the Top500 over time; clusters now account for 87.4% of the list.]
Drivers of Modern HPC Cluster Architectures
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
• Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.
[Figure: a modern HPC node combines multi-core processors; accelerators/coprocessors with high compute density and high performance/watt (>1 TFlop DP on a chip); high-performance interconnects such as InfiniBand (<1 usec latency, 100 Gbps bandwidth); and SSD, NVMe-SSD, and NVRAM storage. Example systems: Tianhe-2, Titan, K Computer, Sunway TaihuLight.]
All Interconnects and Protocols (IB, HSE, Omni-Path & RoCE)
[Figure: the protocol stacks available to an application or middleware, each characterized by its interface, protocol, adapter, and switch:
• Sockets over kernel-space TCP/IP with the Ethernet driver: Ethernet adapter and switch (1/10/25/40/50/100 GigE)
• Sockets over hardware-offloaded TCP/IP: Ethernet adapter and switch (10/40 GigE-TOE)
• Sockets over IPoIB: InfiniBand adapter and switch
• Sockets over RSockets (user space): InfiniBand adapter and switch
• Sockets over SDP (RDMA): InfiniBand adapter and switch
• Verbs over iWARP (RDMA over TCP/IP, user space): iWARP adapter, Ethernet switch
• Verbs over RoCE (RDMA, user space): RoCE adapter, Ethernet switch
• Verbs over native InfiniBand (RDMA, user space): InfiniBand adapter and switch
• OFI over Omni-Path (RDMA, user space, 100 Gb/s): Omni-Path adapter and switch]
Trends of Networking Technologies in TOP500 Systems: Interconnect Family – Systems Share
InfiniBand in the Top500 (June 2018)
[Pie charts: share of Top500 systems by interconnect family (10G Ethernet, InfiniBand, Gigabit Ethernet, Custom Interconnect, Omni-Path, Proprietary Network). System-count shares of 34%, 28%, 15%, 14%, 8%, and 1%, and aggregate-performance shares of 13%, 36%, 8%, 33%, 9%, and 1%, across these families.]
Large-scale InfiniBand Installations
• 139 IB Clusters (27.8%) in the Jun’18 Top500 list (http://www.top500.org)
• Installations in the Top 50 (19 systems):
  – 2,282,544 cores (Summit) at ORNL (1st)
  – 1,572,480 cores (Sierra) at LLNL (3rd)
  – 391,680 cores (ABCI) at AIST/Japan (5th)
  – 253,600 cores (HPC4) in Italy (13th)
  – 114,480 cores (Juwels Module 1) at FZJ/Germany (23rd)
  – 241,108 cores (Pleiades) at NASA/Ames (24th)
  – 220,800 cores (Pangea) in France (30th)
  – 144,900 cores (Cheyenne) at NCAR/USA (31st)
  – 72,000 cores (ITO – Subsystem A) in Japan (32nd)
  – 79,488 cores (JOLIOT-CURIE SKL) at CEA/France (34th)
  – 155,150 cores (JURECA) at FZJ/Germany (38th)
  – 72,800 cores, Cray CS-Storm in US (40th)
  – 72,800 cores, Cray CS-Storm in US (41st)
  – 78,336 cores (Electra) at NASA/Ames (43rd)
  – 124,200 cores (Topaz) at ERDC DSRC/USA (44th)
  – 60,512 cores, NVIDIA DGX-1 at Facebook/USA (45th)
  – 60,512 cores (DGX Saturn V) at NVIDIA/USA (46th)
  – 113,832 cores (Damson) at AWE/UK (47th)
  – 72,000 cores (HPC2) in Italy (49th)
  – and many more!
• The #2 system (Sunway TaihuLight) also uses InfiniBand
• The upcoming NSF Frontera system will also use InfiniBand
Large-scale Omni-Path Installations
• 39 Omni-Path Clusters (7.8%) in the Jun’18 Top500 list (http://www.top500.org)
  – 570,020 cores (Nurion) at KISTI/South Korea (11th)
  – 556,104 cores (Oakforest-PACS) at JCAHPC in Japan (12th)
  – 367,024 cores (Stampede2) at TACC in USA (15th)
  – 312,936 cores (Marconi XeonPhi) at CINECA in Italy (18th)
  – 135,828 cores (Tsubame 3.0) at TiTech in Japan (19th)
  – 153,216 cores (MareNostrum) at BSC in Spain (22nd)
  – 127,520 cores (Cobra) in Germany (28th)
  – 55,296 cores (Mustang) at AFRL/USA (48th)
  – 95,472 cores (Quartz) at LLNL in USA (63rd)
  – 95,472 cores (Jade) at LLNL in USA (64th)
  – 53,300 cores (Makman-3) at Saudi Aramco/Saudi Arabia (78th)
  – 34,560 cores (Gaffney) at Navy DSRC/USA (85th)
  – 34,560 cores (Koehr) at Navy DSRC/USA (86th)
  – 49,432 cores (Mogon II) in Germany (87th)
  – 38,553 cores (Molecular Simulator) in Japan (93rd)
  – 35,280 cores (Quriosity) at BASF in Germany (94th)
  – 54,432 cores (Marconi Xeon) at CINECA in Italy (98th)
  – 46,464 cores (Peta4) at Cambridge/UK (101st)
  – 53,352 cores (Grizzly) at LANL in USA (136th)
  – and many more!
HSE Scientific Computing Installations
• 171 HSE compute systems with ranking in the Jun’18 Top500 list
  – 38,400-core installation in China (#95) – new
  – 38,400-core installation in China (#96) – new
  – 38,400-core installation in China (#97) – new
  – 39,680-core installation in China (#99)
  – 66,560-core installation in China (#157)
  – 66,280-core installation in China (#159)
  – 64,000-core installation in China (#160)
  – 64,000-core installation in China (#161)
  – 72,000-core installation in China (#164)
  – 64,320-core installation in China (#185) – new
  – 78,000-core installation in China (#187)
  – 75,776-core installation in China (#188) – new
  – 59,520-core installation in China (#192)
  – 59,520-core installation in China (#193)
  – 28,800-core installation in China (#195) – new
  – 62,400-core installation in China (#197) – new
  – 64,800-core installation in China (#198)
  – 66,000-core installation in China (#209) – new
  – and many more!
Presentation Overview
• Different Kinds of Parallel and Distributed Systems
• Petaflops to Exaflops, Trends of Commodity Clusters and Networking Technologies
• Defining Exascale Systems and Challenges
• Challenges in Designing Various Systems
What Do Exascale Systems Mean?
• Exascale Systems
  – Computing capability of 10^18 Flops
  – 1,000 times more than a PetaFlop (10^15)
  – 1,000,000 times more than a TeraFlop (10^12)
Towards Exascale System (2015 and Target)
(2015 column: Tianhe-2; target column: exascale, originally targeted for 2018, now 2020-2024; parentheses give the gap between today and exascale)
• System peak: 55 PFlop/s → 1 EFlop/s (~20x)
• Power: 18 MW (3 Gflops/W) → ~20 MW (50 Gflops/W) (O(1) in power, ~15x in efficiency)
• System memory: 1.4 PB (1.024 PB CPU + 0.384 PB CoP) → 32-64 PB (~50x)
• Node performance: 3.43 TF/s (0.4 CPU + 3 CoP) → 1.2 or 15 TF (O(1))
• Node concurrency: 24-core CPU + 171-core CoP → O(1k) or O(10k) (~5x - ~50x)
• Total node interconnect BW: 6.36 GB/s → 200-400 GB/s (~40x - ~60x)
• System size (nodes): 16,000 → O(100,000) or O(1M) (~6x - ~60x)
• Total concurrency: 3.12M cores, 12.48M threads (4/core) → O(billion) for latency hiding (~100x)
• MTTI: few/day → many/day (O(?))
Courtesy: Prof. Jack Dongarra
Towards Exascale System (2016-17 and Target)
(2016-17 column: Sunway TaihuLight; target column: exascale, new target 2020-2021; parentheses give the gap between today and exascale)
• System peak: 125.4 PFlop/s → 1 EFlop/s (~10x)
• Power: 15 MW (6 Gflops/W) → ~20 MW (50 Gflops/W) (O(1) in power, ~8x in efficiency)
• System memory: 1.31 PB → 32-64 PB (~50x)
• Node performance: 3.0624 TF/s → 1.2 or 15 TF (O(1))
• Node concurrency: 260-core CPU → O(1k) or O(10k) (~4x - ~40x)
• Total node interconnect BW: 16 GB/s → 200-400 GB/s (~12x - ~25x)
• System size (nodes): 40,960 → O(100,000) or O(1M) (~2.5x - ~25x)
• Total concurrency: 10M (1 thread/core) → O(billion) for latency hiding (~100x)
• MTTI: few/day → many/day (O(?))
Towards Exascale System (2018 and Target)
(2018 column: Summit; target column: exascale, new target 2020-2021; parentheses give the gap between today and exascale)
• System peak: 200 PFlop/s → 1 EFlop/s (~5x)
• Power: 13 MW (15 Pflops/MW) → ~20 MW (50 Pflops/MW) (O(1) in power, ~4x in efficiency)
• System memory: 10.2 PB → 32-64 PB (~3-6x)
• Node performance: 42 TF/s (6 x 7 TF) → 1.2 or 15 TF (O(1))
• Node concurrency: 44-core CPU → O(1k) or O(10k) (~25x - ~250x)
• Total node interconnect BW: 25 GB/s → 200-400 GB/s (~8x - ~16x)
• System size (nodes): 4,608 → O(100,000) or O(1M) (~25x - ~250x)
• Total concurrency: 202K (4 threads/core) → O(billion) for latency hiding (~5000x)
• MTTI: few/day → many/day (O(?))
Basic Design Challenges for Exascale Systems
• Energy and Power Challenge
  – Hard to meet power requirements for data movement
• Memory and Storage Challenge
  – Hard to achieve high capacity and high data rate
• Concurrency and Locality Challenge
  – Management of a very large amount of concurrency (billions of threads)
• Resiliency Challenge
  – Low-voltage devices (for low power) introduce more faults
Power Constraints for Exascale Systems
• Supercomputers require a lot of energy
  – Power consumed by processors for computation
  – Power required by memory and other devices to move data
  – Power required to cool the system
• Power requirements of the current generation are already high
• Design constraint on exaflop systems: must reach an exaflop using a total of around 20 MW of power (a worked-out efficiency target follows)
  – This constraint is gradually being relaxed
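To make the 20 MW constraint concrete, the implied power efficiency can be worked out directly (the Summit comparison figure is taken from the table earlier in this section):

```latex
\[
\frac{10^{18}\ \mathrm{Flop/s}}{20\ \mathrm{MW}}
  = \frac{10^{18}\ \mathrm{Flop/s}}{2\times 10^{7}\ \mathrm{W}}
  = 50\ \mathrm{GFlops/W},
\qquad
\text{vs. Summit (2018): } \frac{200\ \mathrm{PFlop/s}}{13\ \mathrm{MW}} \approx 15\ \mathrm{GFlops/W}.
\]
```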
What Will a Typical Exascale System Look Like?
• 50,000-100,000 nodes (a back-of-the-envelope concurrency estimate follows this list)
• Each node might be heterogeneous
  – 256-1,024 CPU cores
  – 1K-4K accelerator cores
• New kinds of memory technology (High Bandwidth Memory, NVRAM)
• Running hybrid programming
  – MPI + Partitioned Global Address Space (UPC, OpenSHMEM, CAF, etc.)
• A significant fraction of the nodes might have SSDs for storage
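As an illustrative estimate based on the upper end of the ranges above (an assumption, not a prediction), such a machine approaches the billion-way concurrency cited in the exascale comparison tables:

```latex
\[
10^{5}\ \text{nodes} \times \bigl(1{,}024\ \text{CPU cores} + 4{,}096\ \text{accelerator cores}\bigr)
  \approx 5\times 10^{8}\ \text{cores},
\]
```

i.e., around half a billion cores before any latency-hiding threads per core are counted.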
Broad Set of Challenges in Designing Exascale Systems
[Figure: the design space spans CPUs and Accelerators; Memory & Storage; Networking; I/O and File Systems; Programming Models; Compilers; Languages; Runtime; Resource Management; Fault-Tolerance; Tools & Debugging; and Algorithms & Applications.]
Presentation Overview
• Different Kinds of Parallel and Distributed Systems
• Petaflops to Exaflops, Trends of Commodity Clusters and Networking Technologies
• Defining Exascale Systems and Challenges
• Challenges in Designing Various Systems
Designing Communication Middleware for Multi-Petaflop and Exaflop Systems: Challenges
[Figure: a layered co-design view.
• Application Kernels/Applications
• Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
• Communication Library or Runtime for Programming Models: point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance
• Networking Technologies (InfiniBand, 40/100 GigE, Aries, and Omni-Path), Multi/Many-core Architectures, and Accelerators (NVIDIA and MIC)
Middleware co-design opportunities and challenges cut across these layers, targeting performance, scalability, and fault-resilience.]
Major Components in Computing Systems
• Hardware components
  – Processing cores and memory subsystem
  – I/O bus or links
  – Network adapters/switches
• Software components
  – Communication stack
• Bottlenecks can artificially limit the network performance the user perceives
[Figure: a node with two processors (P0, P1), each with four cores, and their memory, connected through the I/O bus to a network adapter and network switch; bottlenecks can arise in processing, at the I/O interface, and in the network.]
Processing Bottlenecks in Traditional Protocols
• Examples: TCP/IP, UDP/IP
• Generic architecture for all networks
• Host processor handles almost all aspects of communication
  – Data buffering (copies on sender and receiver)
  – Data integrity (checksum)
  – Routing aspects (IP routing)
• Signaling between different layers
  – Hardware interrupt on packet arrival or transmission
  – Software signals between different layers to handle protocol processing at different priority levels
(a minimal sockets sketch follows)
[Figure: the node diagram, highlighting the processing bottleneck on the host CPU.]
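For contrast with the RDMA-capable stacks covered earlier, here is a minimal kernel-mediated sockets exchange in Python (the loopback address and port are illustrative choices): every send/recv traverses the kernel TCP/IP stack, so the host CPU performs the copies, checksums, and interrupt handling listed above.

```python
# Minimal TCP exchange over the kernel sockets path (illustrative sketch).
# Each send()/recv() crosses the user/kernel boundary; the host CPU copies
# the data and runs the TCP/IP protocol processing (checksums, routing).
import socket, threading

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 50007))                   # illustrative port
srv.listen(1)

def echo_once():
    conn, _ = srv.accept()
    with conn:
        data = conn.recv(4096)                   # copy: kernel -> user buffer
        conn.sendall(data)                       # copy: user buffer -> kernel

threading.Thread(target=echo_once, daemon=True).start()

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
    cli.connect(("127.0.0.1", 50007))
    cli.sendall(b"hello over TCP/IP")
    print(cli.recv(4096))

srv.close()
```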
Bottlenecks in Traditional I/O Interfaces and Networks
• Traditionally relied on bus-based technologies (the last-mile bottleneck)
  – E.g., PCI, PCI-X
  – One bit per wire
  – Performance increase through:
    • Increasing clock speed
    • Increasing bus width
  – Not scalable:
    • Cross talk between bits
    • Skew between wires
    • Signal integrity makes it difficult to increase bus width significantly, especially at high clock speeds
• PCI (1990): 33 MHz / 32 bit: 1.05 Gbps (shared bidirectional)
• PCI-X (1998 v1.0, 2003 v2.0): 133 MHz / 64 bit: 8.5 Gbps; 266-533 MHz / 64 bit: 17 Gbps (shared bidirectional)
[Figure: the node diagram, highlighting the I/O interface bottleneck on the bus between memory and the network adapter.]
(a worked-out bus bandwidth example follows)
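The PCI figure above follows directly from clock rate times bus width (and because the bus is shared and bidirectional, this raw rate is split among all attached devices and both directions):

```latex
\[
33\ \mathrm{MHz} \times 32\ \mathrm{bit}
  = 1.056 \times 10^{9}\ \mathrm{bit/s}
  \approx 1.05\ \mathrm{Gbps}.
\]
```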
Bottlenecks on Traditional Networks
• Network speeds saturated at around 1 Gbps
  – Features provided were limited
  – Commodity networks were not considered scalable enough for very large-scale systems
• Ethernet (1979 - ): 10 Mbit/sec
• Fast Ethernet (1993 - ): 100 Mbit/sec
• Gigabit Ethernet (1995 - ): 1000 Mbit/sec
• ATM (1995 - ): 155/622/1024 Mbit/sec
• Myrinet (1993 - ): 1 Gbit/sec
• Fibre Channel (1994 - ): 1 Gbit/sec
[Figure: the node diagram, highlighting the network bottleneck at the adapter and switch.]
Network Speed Acceleration with IB and HSE
• Ethernet (1979 - ): 10 Mbit/sec
• Fast Ethernet (1993 - ): 100 Mbit/sec
• Gigabit Ethernet (1995 - ): 1000 Mbit/sec
• ATM (1995 - ): 155/622/1024 Mbit/sec
• Myrinet (1993 - ): 1 Gbit/sec
• Fibre Channel (1994 - ): 1 Gbit/sec
• InfiniBand (2001 - ): 2 Gbit/sec (1X SDR)
• 10-Gigabit Ethernet (2001 - ): 10 Gbit/sec
• InfiniBand (2003 - ): 8 Gbit/sec (4X SDR)
• InfiniBand (2005 - ): 16 Gbit/sec (4X DDR), 24 Gbit/sec (12X SDR)
• InfiniBand (2007 - ): 32 Gbit/sec (4X QDR)
• 40-Gigabit Ethernet (2010 - ): 40 Gbit/sec
• InfiniBand (2011 - ): 54.6 Gbit/sec (4X FDR)
• InfiniBand (2012 - ): 2 x 54.6 Gbit/sec (4X Dual-FDR)
• 25-/50-Gigabit Ethernet (2014 - ): 25/50 Gbit/sec
• 100-Gigabit Ethernet (2015 - ): 100 Gbit/sec
• InfiniBand (2015 - ): 100 Gbit/sec (4X EDR)
• InfiniBand (2016 - ): 200 Gbit/sec (4X HDR)
Network speeds have grown by roughly 100 times in the last 15 years (the lane arithmetic behind the 4X figures is shown below).
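The 1X/4X/12X labels in this list denote the number of physical lanes aggregated per link; EDR and HDR signal at roughly 25 and 50 Gbit/s per lane, so (an arithmetic illustration):

```latex
\[
\underbrace{25\ \mathrm{Gbit/s}}_{\text{per EDR lane}} \times 4\ \text{lanes} = 100\ \mathrm{Gbit/s}\ (4\mathrm{X\ EDR}),
\qquad
\underbrace{50\ \mathrm{Gbit/s}}_{\text{per HDR lane}} \times 4\ \text{lanes} = 200\ \mathrm{Gbit/s}\ (4\mathrm{X\ HDR}).
\]
```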
Trends in I/O Interfaces with Servers
• Recent trends in I/O interfaces show that they are nearly matching network speeds head-to-head (though they still lag a little)
• PCI (1990): 33 MHz / 32 bit: 1.05 Gbps (shared bidirectional)
• PCI-X (1998 v1.0, 2003 v2.0): 133 MHz / 64 bit: 8.5 Gbps; 266-533 MHz / 64 bit: 17 Gbps (shared bidirectional)
• AMD HyperTransport (HT) (2001 v1.0, 2004 v2.0, 2006 v3.0, 2008 v3.1): 102.4 Gbps (v1.0), 179.2 Gbps (v2.0), 332.8 Gbps (v3.0), 409.6 Gbps (v3.1) (32 lanes)
• PCI-Express (PCIe) by Intel (2003 Gen1, 2007 Gen2, 2009 Gen3 standard, 2017 Gen4 standard):
  – Gen1: 4X (8 Gbps), 8X (16 Gbps), 16X (32 Gbps)
  – Gen2: 4X (16 Gbps), 8X (32 Gbps), 16X (64 Gbps)
  – Gen3: 4X (~32 Gbps), 8X (~64 Gbps), 16X (~128 Gbps)
  – Gen4: 4X (~64 Gbps), 8X (~128 Gbps), 16X (~256 Gbps)
• Intel QuickPath Interconnect (QPI) (2009): 153.6-204.8 Gbps (20 lanes)
(a worked-out PCIe bandwidth example follows)
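As an arithmetic check on the Gen3 16X figure above (PCIe Gen3 signals at 8 GT/s per lane with 128b/130b encoding):

```latex
\[
8\ \mathrm{GT/s} \times 16\ \text{lanes} \times \frac{128}{130}
  \approx 126\ \mathrm{Gbit/s}
  \approx 15.75\ \mathrm{GB/s}\ \text{per direction}.
\]
```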
Common Challenges for Large-Scale Installations
• Adapters and interactions: I/O bus, multi-port adapters, NUMA
• Switches: topologies, switching/routing
• Bridges: IB interoperability
System Specific Challenges for HPC Systems
Common challenges:
• Adapters and interactions: I/O bus, multi-port adapters, NUMA
• Switches: topologies, switching/routing
• Bridges: IB interoperability
HPC-specific challenges:
• MPI: multi-rail, collectives scalability, application scalability, energy awareness
• PGAS: programmability with performance, optimized resource utilization
• GPU / MIC: programmability with performance, hiding data movement costs, heterogeneity-aware design, streaming, deep learning
System Specific Challenges for Storage and File Systems
Common challenges:
• Adapters and interactions: I/O bus, multi-port adapters, NUMA
• Switches: topologies, switching/routing
• Bridges: IB interoperability
HPC-specific challenges (as above): MPI, PGAS, GPU / MIC
Storage and file system challenges:
• High-throughput I/O, taking advantage of RDMA, checkpointing with aggregation, hierarchical data staging, QoS-aware checkpointing, decentralized metadata
System Specific Challenges for Big Data Processing
Common challenges:
• Adapters and interactions: I/O bus, multi-port adapters, NUMA
• Switches: topologies, switching/routing
• Bridges: IB interoperability
HPC-specific challenges (as above): MPI, PGAS, GPU / MIC
Big Data challenges:
• Taking advantage of RDMA, performance, scalability, backward compatibility
System Specific Challenges for Cloud Computing
Common challenges:
• Adapters and interactions: I/O bus, multi-port adapters, NUMA
• Switches: topologies, switching/routing
• Bridges: IB interoperability
HPC, Storage and File System, and Big Data challenges (as above)
Cloud Computing challenges:
• SR-IOV support, virtualization, containers