Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Title 44pt sentence case
Affiliations 24pt sentence case
20pt sentence case
© ARM 2017
HPC Network Stack on ARM
Pavel Shamis (Pasha) Principal Research Engineer ARM Research
ExaComm 2017 06/22/2017
© ARM 2017 2
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
HPC network stack on ARM ?
© ARM 2017 3
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
Serious ARM HPC deployments starting in 2017
§ ARM – Emerging CPU architecture in HPC and server space § Future deployments – Islamabad Cray CS-400, Japan Post-K ARMv8
© ARM 2017 4
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
An introduction to ARM
ARM is the world's leading semiconductor intellectual property supplier. We license to over 350 partners, are present in 95% of smart phones, 80% of digital cameras, 35% of all electronic devices, and a total of 60 billion ARM cores have been shipped since 1990.
Our CPU business model: License technology to partners, who use it to create their own system-on-chip (SoC) products. § We may license an instruction set architecture (ISA) such as
“ARMv8-A”) § or a specific implementation, such as “Cortex-A72”. § Partners who license an ISA can create their own
implementation, as long as it passes the compliance tests.
…and our IP extends beyond the CPU
© ARM 2017 5
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
Range of SoCs addressing infrastructure
Highly Accelerated Massively Multicore
One size does not fit all
QorIQ®Layerscape2080A
© ARM 2017 6
Text 54pt sentence case Integration with Network Interconnects
© ARM 2017 7
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
CCIX
§ Accelerators and Network (NIC/HCA/etc.) as a first class “citizen” in the system
§ Seamless process and accelerator hardware cache coherence support
§ Low-latency and high-bandwidth § Allow in-line acceleration
§ Bump in the wire processing (network packet processing, storage acceleration, etc.)
§ Allows “off-line” acceleration (co-processor model) § Driver-less / interrupt-less usage model § http://www.ccixconsortium.com
© ARM 2017 8
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
Scale-up server node
DMC-620
DMC-620 DMC-620
DMC-620
CoreLink CMN-600
Shared virtual memory system
DMC-620
DMC-620 DMC-620
DMC-620
CoreLink CMN-600
Accelerator
Smart Network
Persistent Memory
© ARM 2017 9
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
CCIX multichip connectivity and topologies
§ New class of interconnect providing high performance, low latency for new accelerators use cases § CCIX defines 25GT/s (3x performance*) § Examining 56GT/s (7x performance*) and beyond § Enabling low latency via light transaction layer
§ Flexible, scalable interconnect topologies § Flexible point-to-point, daisy chained and switched topologies
§ Simplified deployment by leveraging existing PCIe
hardware and software infrastructure § Runs on existing PCIe transport layer and management stack § Coexist with legacy PCIe designs
Compute Node
Accelerator
Smart Network
Persistent Memory
Switch
* Note: Based on PCIe Gen3 Performance
© ARM 2017 10
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
Building CCIX devices
§ Cadence® IP for CCIX § Built upon silicon proven PCIe solutions
§ IP products: § Controller IP – Provides the CCIX transaction and
data link layers. § PHY IP – Provides the high performance SERDES
physical layer supporting speeds up to 25Gpbs. § Verification IP – Provides the necessary test
infrastructure to verify CCIX designs.
Cadence Controller & PHY IP
© ARM 2017 11
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
Cadence CCIX integration
PCIe
Tr
ansa
ctio
n La
yer
CC
IX
Tran
sact
ion
Laye
r
Dat
a Li
nk L
ayer
PH
Y (
up t
o 25
Gpb
s)
16 Lanes
DMC-620
DMC-620 DMC-620
DMC-620
CoreLink CMN-600 XP
CM
L R
NI
CXS
AXI
Cadence IP
Example CMN-600 mesh design Low latency CCIX transaction layer
Support up to 25Gbps vs 16Gbps PCIe Gen4
CML converts CHI to CCIX messages
PCIe IP connects to a CMN IO interface via AXI
© ARM 2017 12
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
Gen-Z
§ All data is accessed by some form of a Read or a Write § Example of reads: DDR Row + Column Read, PCI DMA Read, SCSI Write,
Socket Read, File Read, RDMA Read § Example of writes: DDR Row + Column Write, PCI DMA Write, SCSI Read,
Socket Write, File Write, RDMA Write § The Goal: Simplify world to memory semantic Reads & Writes
© ARM 2017 13
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
Gen-Z Overview
§ An open, standards-based, scalable, system interconnect and protocol. § Optimized to support memory semantic communications § Breaks Processor-Memory Interlock § Split controller model
§ Memory controller § Initiates high-level requests—Read, Write, Atomic, Put / Get, etc. § Enforces ordering, reliability, path selection, etc.
§ Media controller § Abstracts memory media § Supports volatile / non-volatile / mixed-media
§ Performs media-specific operations § Executes requests and returns responses § Enables data-centric computing (accelerator, compute, etc.)
© ARM 2017 14
Text 54pt sentence case Software Stack Overview
© ARM 2017 15
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
Linux / FreeBSD w/ AARCH64 support
ß Engaged with FreeBSD foundation / Semi-half & Cavium to get FreeBSD on ARMv8 FreeBSD Beta version demo’d by Semihalf – Nov. 2015
12.04LTS & 14.04LTS released
ß Also 14.10 & 15.04 releases
Red Hat Enterprise Linux Server for ARM 7.2 BETA – Sept, 2015
Fedora 22 released – May 2015 Fedora 23 released – Nov 2015
CentOS Linux 7 for AArch64 GA – August 2015
OpenSUSE 13.2 – Nov 2014 SUSE Launches Partner Program to Bring SUSE Linux Enterprise 12 to 64-bit ARM
July 2015 @ ISC
Debian 8 adds AARCH64 – April 2015
© ARM 2017 16
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
Open source and commercial compilers
§ LLVM § C, C++ § OpenMP 3.1, (4.0 coming soon) § Fortran coming Q1 2017
§ PathScale § C, C++ Fortran § OpenACC § OpenMP 4.0
§ NAG § Fortran § OpenMP 3.1
§ GCC § C, C++, Fortran § OpenMP 4.0
§ ARM C/C++ Compiler § LLVM based § Includes SVE
© ARM 2017 17
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
Released
Planned
Concept
2016
Hardware
Open-Source software
ARM HPC tools
ARM HPC ecosystem roadmap
2017 Future
Cavium ThunderX
Allinea DDT and MAP Rogue Wave TotalView
AppliedMicro X-Gene 1 & 2
Fujitsu – Post K (SVE)
ISV software
AMD Seattle
Altair PBS Pro
Cavium ThunderX2
Qualcomm Centriq
NAG Library & Compiler
ARM Optimized Routines
GCC (gcc/g++/gfortran) LLVM - clang LLVM – Flang
ISV software
ARM Code Advisor (Beta) ARM Code Advisor (Full release)
ARM Optimized Routines – vector versions
Phytium Mars
OpenHPC 1.2
ARM Performance Libraries
PathScale ENZO
ARM Fortran Compiler ARM C/C++ Compiler – ahead of LLVM trunk
ARM Instruction Emulator
AppliedMicro X-Gene 3
© ARM 2017 18
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
RDMA Networks
§ Remote Direct Memory Access (RDMA) – popular hardware network technology § InfiniBand – 37% of systems in TOP500
https://www.top500.org
© ARM 2017 19
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
RDMA Support
§ Mellanox OFED 2.4 and above supports ARM § Linux Kernel 4.5.0 and above (maybe even earlier) § Rdma-Core – runs on ARM § OFED – No support § Linux Distribution – on going process
© ARM 2017 20
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
OpenUCX v1.2
• The first official release from OpenUCX community • https://github.com/openucx/ucx/releases/tag/v1.2.0
• Features • Support for InfiniBand and RoCE • Transports RC, UD, DC
• Support for Accelerated Verbs – 40% speedup on ARM compared to vanilla Verbs
• Support for Cray Aries and Gemini • Support for Shared Memory: KNEM, CMA, XPMEM, Posix, SySV • Support for x86, ARMv8, Power • Efficient memory polling – 36% increase in efficiency on ARM • UCX interface is integrated with MPICH, OpenMPI, OSHMEM, ORNL-
SHMEM, etc.
Pavel Shamis, M. Graham Lopez, and Gilad Shainer. “Enabling One-sided Communication Semantics on ARM“, HIPS 2017
© ARM 2017 21
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
Programing models
§ Open MPI – compiles and runs on ARMv8 § Continues integration with HPCAC ARMv8 server
§ MPICH – compiles and runs on ARMv8 § MVAPICH – compiles (with patches) and runs § OSHMEM – compiles and runs
§ Continues integration with HPCAC ARMv8 server
© ARM 2017 22
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
Example: MPI+SHMEM+OpenUCX on InfiniBand
© ARM 2017 23
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
Lessons Learned
§ Memory Barriers § Multithread environment § Software-hardware interaction § Examples https://github.com/openucx/ucx/blob/master/src/ucs/arch/aarch64/cpu.h#L25 § You can “fish” for these bugs in MPI implementations around Eager-RDMA and shared
memory protocols Write Payload
Write Notify
Busy-wait Read Notify
Read Payload
Write Barrier Read Barrier
RDMA
RDMA
Maranget, Luc, Susmit Sarkar, and Peter Sewell. "A tutorial introduction to the ARM and POWER relaxed memory models." Draft available from http://www. cl. cam. ac. uk/~ pes20/ppc-supplemental/test7. pdf (2012).
© ARM 2017 24
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
More About Barriers
§ There are multiple types of barriers § DSB
§ Completion semantics § Interaction with external devices (PCIe doorbells) § Device drivers
§ DMB § ISH* domain on Linux § Poll-flag, barrier, data
§ ISB
© ARM 2017 25
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
Lessons Learned - continued
§ Low-level timers § Typically found in benchmarks and MPI § Code examples https://github.com/openucx/ucx/blob/master/src/ucs/arch/aarch64/cpu.h#L35
© ARM 2017 26
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
Lessons Learned – continued
§ Not all cache-lines are 64Byte ! § Implementation dependent § 128Byte and 64Byte
http://xeroxnostalgia.com/duplicators/xerox-9200/
© ARM 2017 27
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
Optimizations
§ AVX => Neon § Mostly found around communication request initialization codes
https://github.com/openucx/ucx/blob/891e20ef90257d1e2721da52461b0261220c82d8/src/uct/ib/mlx5/ib_mlx5.inl#L160
§ Busy-wait loop § See Wait-For-Event (WFE)
Pavel Shamis, M. Graham Lopez, and Gilad Shainer. “Enabling One-sided Communication Semantics on ARM“, HIPS 2017
© ARM 2017 28
Text 54pt sentence case Preliminary Results
© ARM 2017 29
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
Testbed
• 2 x Softiron Overdrive 3000 servers with AMD Opteron A1100 / 2GHz • ConnectX-4 IB/VPI EDR (PCIe gen2 x8) • Ubuntu 16.04 • MOFED 3.3-1.5.0.0 • UCX [0558b41] • XPMEM [bdfcc52] • OSHMEM/OPEN-MPI [fed4849]
© ARM 2017 30
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
OpenUCX IB: MLX5 vs Verbs
�
�
�
�
��
��
��
���
���
���
����
� � � �� �� �� ���
���
���
����
����
����
����
�����
�����
�����
������
������
������
�������
�������
�����������������
��������
��� �������
���� �������� ������� �������� ���
���
���
���
���
���
���
� � � � �� �� �� ���
���
���
����
����������
����������
�����
��������
��� ��� ������� ����
���� �������� ���
���
����
����
����
����
����
����
� � � �� �� �� ���
���
���
����
����
����
����
�����
�����
�����
������
������
������
�������
�������
����
��������
��� ��� ���������
���� �������� ���
���
�
���
�
���
�
���
�����
������
����
��
�������
�����
������
����
��
�������
������������
������������
��� ������ ������ ���������� �����
���������
40%
© ARM 2017 31
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
OpenUCX: XPMEM
�����������
�������
����
������
���������
����
� � � �� �� �� ���
���
���
����
����
����
����
�����
�����
�����
������
������
������
�������
�������
�����������������
��������
��� �������
����� �������� ���
�����
�����
�����
�����
�����
�����
�����
�����
� � � � �� �� �� ���
���
���
����
���������������
��������
��� ������� ����
����� �������� ���
����
����
����
����
�����
�����
�����
�����
� � � �� �� �� ���
���
���
����
����
����
����
�����
�����
�����
������
������
������
�������
�������
����
��������
��� ���������
����� �������� ���
����
����
����
����
����
����
����
����
����
���
�����
������
������
�������
�����
������
������
�������
������������
������������
��� ������ ������ ���������� �����
© ARM 2017 32
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
SHMEM_WAIT()
���
�
���
�
���
��� ���������������������
�������������������
���������������������
���
������� �� ����� ������������� ������������
�
�����
�����
�������
�����
��� ���������������������
�������������������
����������������
���
����� �������������������� �����
�
�����
�����
�����
�����
�����
�������
�������
�������
��� ���������������������
�������������������
����������
���
����� �������������� �����
73%
35%
© ARM 2017 33
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
OpenSHMEM SSCA
�
���
���
���
���
���
� � � ��
�������������
���
���� ���������
��� �������� ����
��� �����
7-30%
© ARM 2017 34
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
OpenSHMEM GUPs
Pavel Shamis, M. Graham Lopez, and Gilad Shainer. “Enabling One-sided Communication Semantics on ARM“
21%
�������
�������
�������
�������
�������
�������
�������
� � � ��
��������������
����
�����
���
��������� ���� ���������
��� ���� ���������� ����� �������
��� ���� ���������������� ����� �������������
© ARM 2017 35
Title 40pt sentence case
Bullets 24pt sentence case
Sub-bullets 20pt sentence case
Summary
§ Linux RDMA community is doing great job ! § A lot of progress was made in ARM HPC/server software
eco-system
The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. Copyright © 2017 ARM Limited © ARM 2017