Prof. John Goodacre
The University of Manchester
Department of Computer Science
EuroEXA: Co-designed Innovation and System Architecture for Resilient Exascale Computing
ExaScale: What does it really mean?
© 2017 EuroEXA and Consortia Member Rights Holders Project ID: 754337
Target: 1 billion billion (10^18) floating-point operations per second, or equivalent
Limitation #1: €100M – €500M per system
Limitation #2: 20 MW – 60 MW of power
…and then a few physical limits, such as the speed of light.
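As a derived back-of-envelope check (not part of the original slides), the power limit above directly dictates the sustained energy efficiency an exascale machine must reach:

/* Derived check: the efficiency that a 20-60 MW power budget implies
 * for a 10^18 FLOP/s machine. */
#include <stdio.h>

int main(void) {
    const double exaflops  = 1e18;            /* target: 10^18 FLOP/s  */
    const double budgets[] = { 20e6, 60e6 };  /* power limits in watts */

    for (int i = 0; i < 2; i++)
        printf("At %2.0f MW the machine must average %.1f GFLOPS per watt\n",
               budgets[i] / 1e6, exaflops / budgets[i] / 1e9);
    return 0;   /* prints 50.0 and 16.7 GFLOPS per watt */
}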
The EU route to ExaScale
• 2015 – 2018: First H2020 ExaScale Projects (subject area focus)
• 2017 – 2020: Co-design Projects
• 2017: EuroHPC declaration signed by seven European countries
• 2019: First ExaScale Demonstrators
• 2020: Pre-ExaScale
• 2023 – 2024: ExaScale
Decade of Innovation
• Encore: investigated alignment of HPC requirements and Arm technology
• Mont Blanc: investigated using existing Arm devices and porting HPC S/W
• EuroServer: investigated a modular physical/logical approach to compute
• ExaNoDe: furthered modularity and node-level integration
• ExaNeSt: created the cluster and distributed storage while increasing density
• EcoScale: investigated delivering performance through FPGA
• EuroEXA: unifies the research and co-designs a modular platform with applications
• 10^18…: to pull industry in, together with the HPC market, to benefit from ExaScale
• Maturing research and innovation around the concept of delivering high-performance computing outside of the constraints of commodity hardware
Delivering European Exascale
[Diagram: project lineage, with Encore, Mont Blanc, EuroServer, ExaNoDe, ExaNeSt and EcoScale feeding into EuroEXA, which in turn connects to the HPC CoE, EPI, industry, applications and the 10^18 goal]
EuroEXA: Abstract
To achieve the demands of extreme scale and the delivery of exascale, we embrace the computing platform as a whole, not just component optimization or fault resilience. EuroEXA brings a holistic foundation from multiple European HPC projects and partners, together with the industrial SME focus of MAX for FPGA data-flow, ICE for infrastructure, ALLIN for HPC tooling and ZPT to collapse the memory bottleneck, to co-design a ground-breaking platform capable of scaling peak performance to 400 PFLOP in a peak system power envelope of 30 MW; over four times the performance at four times the energy efficiency of today’s HPC platforms. Further, we target a PUE parity rating of 1.0 through use of renewables and immersion-based cooling.
We co-design a balanced architecture for both compute- and data-intensive applications using a cost-efficient, modular integration approach enabled by novel inter-die links and the tape-out of a resulting EuroEXA processing unit with integration of FPGA for data-flow acceleration. We provide a homogenised software platform offering heterogeneous acceleration with scalable shared-memory access, and create a unique hybrid geographically-addressed, switching and topology interconnect within the rack, while enabling the adoption of low-cost Ethernet switches offering low latency and high switching bandwidth.
Working together with a rich mix of key HPC applications from across the climate/weather, physics/energy and life-science/bioinformatics domains, we will demonstrate the results of the project through the deployment of an integrated and operational petaflop-level prototype hosted at STFC. Supported by run-to-completion platform-wide resilience mechanisms, components will manage local failures while communicating with higher levels of the stack. Monitored and controlled by advanced runtime capabilities, EuroEXA will demonstrate its co-design solution supporting both existing pre-exascale and project-developed exascale applications.
What is EuroEXA?
EuroEXA is a European Union ExaScale Supercomputer Development Project. It aims to provide the template for an upcoming exascale system by co-designing and implementing a petascale-level prototype with ground-breaking characteristics.
• €20m in funding from FET-HPC
• €50m in total funding from the EU
• Collaboration between 16 organisations across 8 countries
• 2019 – First ExaScale demonstrators deployed – open call
• 2021-24 – EU target for the first ExaScale machine to be turned on
Start: Sep. 2017
Duration: 42 months
Funded under: H2020-EU.1.2.2
The EuroEXA Consortium
[Logos: Commercial Partners · Academic/Gov. Partners · Supporters]
EuroEXA: The Vision
• Testbed architecture will be evidenced and projected to be capable of scaling to world-class peak performance in excess of 400 PFLOPS with an estimated system power of around 30 MW peak
• A compute-centric 250 PFLOPS per 15 MW by 2019
• Show that an ExaScale machine could be built in 2020 within 30 shipping containers with an edge distance of less than 40m
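A derived check (not from the slides) of what those two performance/power figures imply for energy efficiency:

/* Efficiency implied by the EuroEXA vision figures quoted above. */
#include <stdio.h>

int main(void) {
    double projected = 400e15 / 30e6;   /* 400 PFLOPS in ~30 MW */
    double testbed19 = 250e15 / 15e6;   /* 250 PFLOPS per 15 MW */

    printf("Projected full system : %.1f GFLOPS/W\n", projected / 1e9);  /* ~13.3 */
    printf("2019 compute-centric  : %.1f GFLOPS/W\n", testbed19 / 1e9);  /* ~16.7 */
    return 0;
}

Both figures can be set against the roughly 17–50 GFLOPS/W that the 20–60 MW exascale envelope demands, as calculated earlier.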
EuroEXA Aims
• Provide a homogenized software platform with advanced runtime capabilities supporting novel parallel programming paradigms, dataflow programming, heterogeneous acceleration and scalable shared memory access
• Port and optimize a rich mix of both traditional and next-generation HPC applications
• Deploy the EuroEXA testbed taster within an operational datacenter, with next-generation cooling and power supply technology
Why Data flow? Why FPGA?
• Unlike a von Neumann processor, an FPGA is essentially just a lot of “MACs” and wires
• 1,000s of MACs per cycle vs. 10s in a CPU
• A program is not defined as a sequence of operations on data, but as how data flows through operations
• No need to store intermediate values to RAM (illustrated in the sketch below)
• Since the dataflow form of an application does not need a control unit, its power efficiency can be significantly higher than a CPU’s
• FPGA is currently the best silicon to investigate dataflow paradigm
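A purely illustrative sketch in plain C (not FPGA tooling): the same computation written von Neumann style, with an intermediate array stored to RAM, and dataflow style, where each value streams straight from the multiply into the add, which is the shape an FPGA pipeline implements in hardware.

/* Illustrative only: von Neumann (intermediate in RAM) vs dataflow (value
 * flows operation to operation) for out = a*b + c. */
#include <stdio.h>

#define N 8

static void von_neumann(const double *a, const double *b, const double *c,
                        double *tmp, double *out, int n) {
    for (int i = 0; i < n; i++) tmp[i] = a[i] * b[i];    /* store intermediates to RAM */
    for (int i = 0; i < n; i++) out[i] = tmp[i] + c[i];  /* reload them again          */
}

static void dataflow(const double *a, const double *b, const double *c,
                     double *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = a[i] * b[i] + c[i];   /* multiply feeds the add directly, no buffer */
}

int main(void) {
    double a[N], b[N], c[N], tmp[N], out1[N], out2[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0; c[i] = 1.0; }
    von_neumann(a, b, c, tmp, out1, N);
    dataflow(a, b, c, out2, N);
    printf("out[3] = %g (both forms agree: %g)\n", out1[3], out2[3]);
    return 0;
}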
Increasing Support for Dataflow
• If you accept that a data-flow model, rather than a von Neumann model, is required:
• You can make a GPU execute a graph
• You can map a graph to a FPGA
• You can map a graph onto many very simple processing elements
• It’s then a question of how to generate a graph and how to map it to hardware
• EuroEXA investigates multiple ways to do this
EuroEXA full-scale exascale-class applications
EuroEXA software stack building on earlier projects
[Diagram: EuroEXA software stack; UNIMEM API, OS kernel and firmware; Xilinx tools; BeeGFS storage; OmpSs for clusters and FPGA; GPI; OpenStream; full MPI-3 and co-design MPI; accelerated libraries; SLURM scheduler; Allinea debug; Aftermath monitor; Maxeler MaxJ; OpenCL. Legend: components carried from ExaNoDe to EuroEXA, from ECOSCALE/ExaNeSt to EuroEXA, applications, external components, and components new in EuroEXA]
EuroEXA: co-design with exascale-class applications, porting and optimization
[Diagram: co-design applications mapped against their dominant demands (FLOPS, IOPS, memory bandwidth, memory capacity); NEMO, NEST/DPSNN, SMURFF, InfOli, FRTM, Alya, LFRic, IFS & ESCAPE dwarves, LBM, GADGET, AVU-GSR, Neuromarketing, Quantum Espresso, astronomy image classification, LOFAR]
Why focus on applications? Theory vs. Practice
• e.g. TaihuLight
• 93 PF and 6 GF/W for HPL
• But only ~0.3% of peak performance for the more realistic HPCG (network bytes per flop)
• EuroEXA aims to demonstrate a 10x improvement in realised vs. theoretical performance (worked through below)
[Chart (from ExaNoDe): performance in FLOPS as a share of the 100% theoretical maximum; measured DGEMM sustains around 80%, mini-apps and benchmarks sit below that, and production HPC apps reach 1, 2, 5% if you work very hard]
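To put numbers on the TaihuLight example above, a derived calculation that combines the slide's figures with the machine's published theoretical peak of roughly 125 PFLOPS (a TOP500 figure, not from this slide):

/* Realised fraction of peak for TaihuLight, using the figures quoted above
 * plus an assumed ~125 PFLOPS theoretical peak (Rpeak). */
#include <stdio.h>

int main(void) {
    const double peak_pf = 125.0;            /* approx. theoretical peak, PFLOPS */
    const double hpl_pf  = 93.0;             /* HPL result quoted on the slide   */
    const double hpcg_pf = peak_pf * 0.003;  /* ~0.3% of peak on HPCG            */

    printf("HPL  realises ~%.0f%% of peak\n", 100.0 * hpl_pf / peak_pf);   /* ~74% */
    printf("HPCG realises ~%.2f PFLOPS, i.e. ~0.3%% of peak\n", hpcg_pf);
    printf("A 10x gain on HPCG-like codes would still be only ~3%% of peak\n");
    return 0;
}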
Key Architecture Challenges
Controlling the ALU costs over 70% of the required power
• Need to flow data between ops rather than control which op runs on the data
• Need to stop storing data between ops, to reduce the bandwidth disparity between compute and DRAM
Movement of data costs orders of magnitude more power than the ALU operation itself
• Need to introduce data locality to reduce the movement of data
• Need to increase compute density to further reduce distances
Scaling the sequential execution of the ALU is limited by latency
• Need to pipeline ops while ensuring low latency for the remaining sequential parts
The speed of the network is approaching the speed of memory
• The network needs to attach natively to compute, just as memory does today
Storage devices are crippled by the way they are accessed
• Need to attach them directly to the network while providing improved local access
Memory capacity within a node won’t scale
• Need to extend the current “shared memory” and “device” paradigms with a “remote access” paradigm
Application-specific chips are economically unviable for most markets
• Need to modularise manufacturing using scalable compute units, with use of FPGA prior to ASICs
…all without breaking existing software, while improving on existing performance metrics, and while utilizing best-of-class compute, memory and acceleration technologies.
Driving High-level Concepts
Compress distances, Leverage locality
• multi-chip-modules, in-package memory
• high-thermal/compute density
• hybrid interconnects
• converged and distributed storage
Enhance Computer architecture
• Compute unit scalability model
• Direct network link-layer memory transaction (Unimem model)
• Flatten the compute centric model
• Use of FPGA to investigate non-von-Neumann application acceleration in a move to data-flow
Evolving Computing Architecture
[Diagram: evolution of the compute node]
• 1970s–90s: simple host-centric computer (CPU with its own RAM and I/O)
• 1990s–00s: multicore computing (SMP), with more compute but lower RAM/IO per thread
• 2000s–10s: multi-socket NUMA computer, with more RAM and more compute, reaching physical limits
• 2010–20: offload compute to accelerators, with the host bottlenecking RAM/IO
• 2020s–30s: adopt a control/data-plane architecture for compute, where the data plane owns its RAM/IO
Resulting Node Architecture
Traditional HPC
• Host must move all network and accelerator data through its memory
• Storage is unified (central), but non-local
New EuroEXA HPC
• Accelerator is given native access to the network and to application memory
• Storage is distributed, with locality to each node
[Diagram: a traditional node, with the host, its application memory, a network interface and central storage, alongside the EuroEXA node, with a control host and a data accelerator that has its own application memory and network interface, plus distributed storage]
Memory Architecture: Unimem
• Introduces an additional “global” remote address space that can be addressed natively by hardware-level transactions
• Only the data-owner can cache globally shared memory (data locality)
• Enables nodes to read/write data coherently with the data-owner
• Provides native hardware-level one-sided (read/write/atomic) communication
• Apps can use RDMA to generate both local and remote transactions
• Block-move data between the local address space and the remote address space
• EuroEXA adds support for the CPU to natively generate remote transactions (sketched below)
• Hence apps can also natively access the remote address space
• E.g. a word write directly into a remote memory location
• A node (of any size and complexity) that exposes a Unimem bridge is known as a Compute Unit
[Diagram: a host (operating system, tasks, shared memory, comms, DRAM, storage) issues Local-AXI and Remote-AXI transactions on its local interconnect and local address space; a bridge carries remote (Unimem) transactions and RDMA to and from the remote interconnect and the remote address space]
AXI is a standard hardware protocol for read/write/atomic transactions published by Arm
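A purely illustrative sketch of the addressing idea, NOT the project's actual API or firmware interface: a global address formed from (data-owner node ID, offset) lets any initiator perform a plain word write that lands in the owning node's memory. The field layout and node width are assumptions, and the "nodes" here are simulated with local arrays so the sketch is runnable; on real hardware the store would become a Remote-AXI / Unimem transaction carried by the bridge.

/* Hypothetical Unimem-style global addressing, modelled in software. */
#include <stdint.h>
#include <stdio.h>

#define NODES        4
#define WINDOW_WORDS 1024
#define NODE_SHIFT   48          /* assumption: node ID in the top address bits */

static uint64_t node_mem[NODES][WINDOW_WORDS];   /* each node's exported window */

static uint64_t global_addr(uint16_t owner, uint64_t word_offset) {
    return ((uint64_t)owner << NODE_SHIFT) | word_offset;
}

/* A native one-sided write: the initiator stores directly into the owner's memory. */
static void unimem_write(uint64_t gaddr, uint64_t value) {
    uint16_t owner  = (uint16_t)(gaddr >> NODE_SHIFT);
    uint64_t offset = gaddr & ((1ull << NODE_SHIFT) - 1);
    node_mem[owner][offset] = value;   /* data stays owned (and cacheable) by 'owner' */
}

int main(void) {
    uint64_t g = global_addr(2, 7);    /* word 7 in node 2's exported window */
    unimem_write(g, 42);               /* remote word write, no message passing */
    printf("node 2, word 7 = %llu\n", (unsigned long long)node_mem[2][7]);
    return 0;
}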
How does this compare?
• Cache-coherent shared memory (e.g. SMP, ccNUMA, SGI)
• Requires a coherence protocol between all memories, which limits scalability; Unimem does not
• Software-managed partitioned global address space (PGAS)
• Software creates a model and API for communicating using addresses; Unimem is at the hardware level
• Remote Direct Memory Access (RDMA) communication
• Uses a device outside the CPU to move data directly to and from remote memory
• No correlation between the source and destination address spaces; Unimem has a common global address space
• Communication devices
• Use a device to move data between software-managed buffers; Unimem operates at the hardware level
• Unimem
• Allows native hardware read/write/atomic transactions to work on a global address space
• Enables data-owners to place locally cached regions of memory into the global address space
• Any hardware device / accelerator can be a data-owner / user
Evolving Interconnect Architecture
• 1980s: packets broadcast between all nodes
• 1990s: packets switched point to point between all nodes
• 2000s – 2010s: the node integrates switching, with an increasing radix of topology between nodes
• 2020s: the node continues to integrate switching, with the peak radix of topology between nodes and “overflow” to higher-level (hybrid) switching
[Diagram: generations of fabric, from nodes on a broadcast medium, to nodes behind central switches, to nodes with integrated hybrid switches that overflow into a higher-level switch]
EuroExa Compute Unit
• The Network Interface provides hybrid access: peer access between nodes plus the blade-level topology and switching
• Provides both distributed and local access to the storage device
• The native host can be the Unimem-addressable ASIC (offering shared memory), or a classical CPU host
• (The host also has its own memory)
[Diagram: compute unit with FPGA accelerator, application DRAM, network interface, storage interface and native host, connected to the blade-level switch and the inter-node topology]
Prototype Node: Modular Compute/Acceleration Blade
• VU9 for the accelerator
• ZU9 for NIC/storage
• Currently limited CPU compute
• Later testbeds increase the compute
• Moving to an industry-standard COM Express Type-7 Extended PCB for a node (grouped into 4s)
• Create a high-density 16-slot carrier with integrated switching
• Use fly-over cables to adapt the blade, creating a single compute unit from multiple slots
Blade Level Networking
[Diagram: switched network between all blades in a cabinet (200Gb); interconnect topology between groups of blades (8x100Gb)]
• Half-depth OCP cabinet containing 32x blades
• Liquid cooled to achieve thermal density (~100kW)
• Each group of 8 is peer-interconnected
• 200Gb/s from each blade to the top of rack
• Blade to any blade in 2 hops (maybe 1)
• This is the target for the EuroEXA testbed-2
• Provides “network locality” to limit data movement (1EB/s ~ 5MW per 5mm)
Proposed Cabinet Layout (2020 switches)
[Diagram: cabinet layout with four network groups of eight blades each (3D torus within a group), power/cooling, and a top-of-rack switch at 40x200Gb/s; 32x200Gb blade links, each split as 2x100Gb]
• 2MW per ISO “shipping container” (Approx. 100kW per cabinet)
• Footprint of 10 x 40ft ISOs, stacked 3 high
• EuroEXA testbed-2 will use 1 container, initially with 3 live cabinets
Structural Modularity
[Diagram: structural modularity illustrated as a grid of repeated compute cabinets]
• Longest cable around 30m, e.g. ~125 ns of flight time; IB port to port is ~90 ns
Exascale in ~30 Stacked Containers
HPC for more than HPC applications
• The HPC measure of FLOPS hides various other needs of high-performance applications
• FLOPS can ignore memory bandwidth requirements
• Agnostic to interconnect latency and bandwidth
• Delivering storage over the network is expensive and too slow for big data
• FPGAs have variable precision; the EuroEXA testbed is ~6 PetaOP INT8 for ML/AI (256 nodes)
• Data-flow moves data between operations, not RAM
• Dataflow benefits from locality and higher cross-sectional bandwidth
• A 100Gb network means ~10GB/s per node, while directly attached NVMe across 10k nodes means ~30TB/s in aggregate (see the back-of-envelope below)
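A derived back-of-envelope for that last bullet. The ~3 GB/s per-drive NVMe figure is an assumption chosen to be consistent with the slide's 30 TB/s aggregate; the network figure follows from 100 Gb/s divided by 8.

/* Network vs aggregate node-local NVMe bandwidth, using the slide's figures. */
#include <stdio.h>

int main(void) {
    double net_gbytes    = 100.0 / 8.0;   /* 100 Gb/s -> ~12.5 GB/s raw (~10 GB/s usable) */
    double nvme_per_node = 3.0;           /* GB/s, assumed per-node NVMe rate             */
    double nodes         = 10000.0;

    printf("Network per node : ~%.1f GB/s raw\n", net_gbytes);
    printf("NVMe aggregate   : ~%.0f TB/s across %.0f nodes\n",
           nvme_per_node * nodes / 1000.0, nodes);
    return 0;   /* ~12.5 GB/s and ~30 TB/s */
}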
• Arm processing and dataflow (locality-connected nodes and FPGA)
• Unique Hybrid Geographically-Addressed, Switching and Topology Interconnect
• UNIMEM Architecture with hardware-native global address space
• Distributed Storage on BeeGFS – storage locality
• Memory Compression Technologies – memory bottleneck
System architecture summary
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 754337.
Thank you!