Prof. John Goodacre
The University of Manchester
Department of Computer Science
EuroEXA: Co-designed Innovation and System Architecture for Resilient Exascale Computing
ExaScale: What does it really mean?
© 2017 EuroEXA and Consortia Member Rights Holders Project ID: 754337
Target: 1 billion billion (10^18) floating-point operations per second, or equivalent
Limitation #1: €100M – €500M per system
Limitation #2: 20 MW – 60 MW of power
…and then a few physical limits, such as the speed of light.
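As a derived back-of-envelope check (not part of the original slides), the power limit above directly dictates the sustained energy efficiency an exascale machine must reach:

/* Derived check: the efficiency that a 20-60 MW power budget implies
 * for a 10^18 FLOP/s machine. */
#include <stdio.h>

int main(void) {
    const double exaflops  = 1e18;            /* target: 10^18 FLOP/s  */
    const double budgets[] = { 20e6, 60e6 };  /* power limits in watts */

    for (int i = 0; i < 2; i++)
        printf("At %2.0f MW the machine must average %.1f GFLOPS per watt\n",
               budgets[i] / 1e6, exaflops / budgets[i] / 1e9);
    return 0;   /* prints 50.0 and 16.7 GFLOPS per watt */
}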
The EU route to ExaScale
• 2015 – 2018: First H2020 ExaScale Projects (subject area focus)
• 2017 – 2020: Co-design Projects
• 2017: EuroHPC declaration signed by seven European countries
• 2019: First ExaScale Demonstrators
• 2020: Pre-ExaScale
• 2023 – 2024: ExaScale
Decade of Innovation
• Encore: investigated alignment of HPC requirements and Arm technology
• Mont Blanc: investigated using existing Arm devices and porting HPC S/W
• EuroServer: investigated a modular physical/logical approach to compute
• ExaNoDe: furthered modularity and node-level integration
• ExaNeSt: created the cluster and distributed storage while increasing density
• EcoScale: investigated delivering performance through FPGA
• EuroEXA: unifies the research and co-designs a modular platform with applications
• 10^18…: to pull industry in, together with the HPC market, to benefit from ExaScale
• Maturing research and innovation around the concept of delivering high-performance computing outside of the constraints of commodity hardware
Delivering European Exascale
[Diagram: project lineage, with Encore, Mont Blanc, EuroServer, ExaNoDe, ExaNeSt and EcoScale feeding into EuroEXA, which in turn connects to the HPC CoE, EPI, industry, applications and the 10^18 goal]
EuroEXA: Abstract
To achieve the demands of extreme scale and the delivery of exascale, we embrace the computing platform as a whole, not just component optimization or fault resilience. EuroEXA brings a holistic foundation from multiple European HPC projects and partners, together with the industrial SME focus of MAX for FPGA data-flow, ICE for infrastructure, ALLIN for HPC tooling and ZPT to collapse the memory bottleneck, to co-design a ground-breaking platform capable of scaling peak performance to 400 PFLOP in a peak system power envelope of 30 MW; over four times the performance at four times the energy efficiency of today’s HPC platforms. Further, we target a PUE parity rating of 1.0 through use of renewables and immersion-based cooling.
We co-design a balanced architecture for both compute- and data-intensive applications using a cost-efficient, modular integration approach enabled by novel inter-die links and the tape-out of a resulting EuroEXA processing unit with integration of FPGA for data-flow acceleration. We provide a homogenised software platform offering heterogeneous acceleration with scalable shared-memory access, and create a unique hybrid geographically-addressed, switching and topology interconnect within the rack, while enabling the adoption of low-cost Ethernet switches offering low latency and high switching bandwidth.
Working together with a rich mix of key HPC applications from across the climate/weather, physics/energy and life-science/bioinformatics domains, we will demonstrate the results of the project through the deployment of an integrated and operational petaflop-level prototype hosted at STFC. Supported by run-to-completion platform-wide resilience mechanisms, components will manage local failures while communicating with higher levels of the stack. Monitored and controlled by advanced runtime capabilities, EuroEXA will demonstrate its co-design solution supporting both existing pre-exascale and project-developed exascale applications.
What is EuroEXA?
EuroEXA is a European Union ExaScale Supercomputer Development Project. It aims to provide the template for an upcoming exascale system by co-designing and implementing a petascale-level prototype with ground-breaking characteristics.
• €20m in funding from FET-HPC
• €50m in total funding from the EU
• Collaboration between 16 organisations across 8 countries
• 2019 – First ExaScale demonstrators deployed – open call
• 2021-24 – EU target for the first ExaScale machine to be turned on
Start: Sep. 2017
Duration: 42 months
Funded under: H2020-EU.1.2.2
The EuroEXA Consortium
[Logos: Commercial Partners · Academic/Gov. Partners · Supporters]
EuroEXA: The Vision
• Testbed architecture will be evidenced and projected to be capable of scaling to world-class peak performance in excess of 400 PFLOPS with an estimated system power of around 30 MW peak
• A compute-centric 250 PFLOPS per 15 MW by 2019
• Show that an ExaScale machine could be built in 2020 within 30 shipping containers with an edge distance of less than 40m
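A derived check (not from the slides) of what those two performance/power figures imply for energy efficiency:

/* Efficiency implied by the EuroEXA vision figures quoted above. */
#include <stdio.h>

int main(void) {
    double projected = 400e15 / 30e6;   /* 400 PFLOPS in ~30 MW */
    double testbed19 = 250e15 / 15e6;   /* 250 PFLOPS per 15 MW */

    printf("Projected full system : %.1f GFLOPS/W\n", projected / 1e9);  /* ~13.3 */
    printf("2019 compute-centric  : %.1f GFLOPS/W\n", testbed19 / 1e9);  /* ~16.7 */
    return 0;
}

Both figures can be set against the roughly 17–50 GFLOPS/W that the 20–60 MW exascale envelope demands, as calculated earlier.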
EuroEXA Aims
• Provide a homogenized software platform with advanced runtime capabilities supporting novel parallel programming paradigms, dataflow programming, heterogeneous acceleration and scalable shared memory access
• Port and optimize a rich mix of both traditional and next-generation HPC applications
• Deploy the EuroEXA testbed taster within an operational datacenter, with next-generation cooling and power supply technology
Why Data flow? Why FPGA?
• Unlike a von Neumann processor, an FPGA is essentially just a lot of “MACs” and wires
• 1,000s of MACs per cycle vs. 10s in a CPU
• A program is not defined as a sequence of operations on data, but as how data flows through operations
• No need to store intermediate values to RAM (illustrated in the sketch below)
• Since the dataflow form of an application does not need a control unit, its power efficiency can be significantly higher than a CPU’s
• FPGA is currently the best silicon to investigate dataflow paradigm
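A purely illustrative sketch in plain C (not FPGA tooling): the same computation written von Neumann style, with an intermediate array stored to RAM, and dataflow style, where each value streams straight from the multiply into the add, which is the shape an FPGA pipeline implements in hardware.

/* Illustrative only: von Neumann (intermediate in RAM) vs dataflow (value
 * flows operation to operation) for out = a*b + c. */
#include <stdio.h>

#define N 8

static void von_neumann(const double *a, const double *b, const double *c,
                        double *tmp, double *out, int n) {
    for (int i = 0; i < n; i++) tmp[i] = a[i] * b[i];    /* store intermediates to RAM */
    for (int i = 0; i < n; i++) out[i] = tmp[i] + c[i];  /* reload them again          */
}

static void dataflow(const double *a, const double *b, const double *c,
                     double *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = a[i] * b[i] + c[i];   /* multiply feeds the add directly, no buffer */
}

int main(void) {
    double a[N], b[N], c[N], tmp[N], out1[N], out2[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0; c[i] = 1.0; }
    von_neumann(a, b, c, tmp, out1, N);
    dataflow(a, b, c, out2, N);
    printf("out[3] = %g (both forms agree: %g)\n", out1[3], out2[3]);
    return 0;
}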
Increasing Support for Dataflow
• If you accept that a data-flow model, rather than a von Neumann model, is required:
• You can make a GPU execute a graph
• You can map a graph to a FPGA
• You can map a graph onto many very simple processing elements
• It’s then a question of how to generate a graph and how to map it to hardware
• EuroEXA investigates multiple ways to do this
EuroEXA full-scale exascale-class applications
EuroEXA software stack building on earlier projects
[Diagram: EuroEXA software stack; UNIMEM API, OS kernel and firmware; Xilinx tools; BeeGFS storage; OmpSs for clusters and FPGA; GPI; OpenStream; full MPI-3 and co-design MPI; accelerated libraries; SLURM scheduler; Allinea debug; Aftermath monitor; Maxeler MaxJ; OpenCL. Legend: components carried from ExaNoDe to EuroEXA, from ECOSCALE/ExaNeSt to EuroEXA, applications, external components, and components new in EuroEXA]
EuroEXA: co-design with exascale-class applications, porting and optimization
[Diagram: co-design applications mapped against their dominant demands (FLOPS, IOPS, memory bandwidth, memory capacity); NEMO, NEST/DPSNN, SMURFF, InfOli, FRTM, Alya, LFRic, IFS & ESCAPE dwarves, LBM, GADGET, AVU-GSR, Neuromarketing, Quantum Espresso, astronomy image classification, LOFAR]
Why focus on applications? Theory vs. Practice
• e.g. TaihuLight
• 93 PF and 6 GF/W for HPL
• But only ~0.3% of peak performance for the more realistic HPCG (network bytes per flop)
• EuroEXA aims to demonstrate a 10x improvement in realised vs. theoretical performance (worked through below)
[Chart (from ExaNoDe): performance in FLOPS as a share of the 100% theoretical maximum; measured DGEMM sustains around 80%, mini-apps and benchmarks sit below that, and production HPC apps reach 1, 2, 5% if you work very hard]
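To put numbers on the TaihuLight example above, a derived calculation that combines the slide's figures with the machine's published theoretical peak of roughly 125 PFLOPS (a TOP500 figure, not from this slide):

/* Realised fraction of peak for TaihuLight, using the figures quoted above
 * plus an assumed ~125 PFLOPS theoretical peak (Rpeak). */
#include <stdio.h>

int main(void) {
    const double peak_pf = 125.0;            /* approx. theoretical peak, PFLOPS */
    const double hpl_pf  = 93.0;             /* HPL result quoted on the slide   */
    const double hpcg_pf = peak_pf * 0.003;  /* ~0.3% of peak on HPCG            */

    printf("HPL  realises ~%.0f%% of peak\n", 100.0 * hpl_pf / peak_pf);   /* ~74% */
    printf("HPCG realises ~%.2f PFLOPS, i.e. ~0.3%% of peak\n", hpcg_pf);
    printf("A 10x gain on HPCG-like codes would still be only ~3%% of peak\n");
    return 0;
}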
Key Architecture Challenges
Controlling the ALU costs over 70% of the required power
• Need to flow data between ops rather than control which op runs on the data
• Need to stop storing data between ops, to reduce the bandwidth disparity between compute and DRAM
Movement of data costs orders of magnitude more power than the ALU operation itself
• Need to introduce data locality to reduce the movement of data
• Need to increase compute density to further reduce distances
Scaling the sequential execution of the ALU is limited by latency
• Need to pipeline ops while ensuring low latency for the remaining sequential parts
The speed of the network is approaching the speed of memory
• The network needs to attach natively to compute, just as memory does today
Storage devices are crippled by the way they are accessed
• Need to attach them directly to the network while providing improved local access
Memory capacity within a node won’t scale
• Need to extend the current “shared memory” and “device” paradigms with a “remote access” paradigm
Application-specific chips are economically unviable for most markets
• Need to modularise manufacturing using scalable compute units, with use of FPGA prior to ASICs
…all without breaking existing software, while improving on existing performance metrics, and while utilizing best-of-class compute, memory and acceleration technologies.
Driving High-level Concepts
Compress distances, Leverage locality
• multi-chip-modules, in-package memory
• high-thermal/compute density
• hybrid interconnects
• converged and distributed storage
Enhance Computer architecture
• Compute unit scalability model
• Direct network link-layer memory transaction (Unimem model)
• Flatten the compute centric model
• Use of FPGA to investigate non-von-Neumann application acceleration in a move to data-flow
Evolving Computing Architecture
[Diagram: evolution of the compute node]
• 1970s–90s: simple host-centric computer (CPU with its own RAM and I/O)
• 1990s–00s: multicore computing (SMP), with more compute but lower RAM/IO per thread
• 2000s–10s: multi-socket NUMA computer, with more RAM and more compute, reaching physical limits
• 2010–20: offload compute to accelerators, with the host bottlenecking RAM/IO
• 2020s–30s: adopt a control/data-plane architecture for compute, where the data plane owns its RAM/IO
Resulting Node Architecture
Traditional HPC
• Host must move all network and accelerator data through its memory
• Storage is unified (central), but non-local
New EuroEXA HPC
• Accelerator is given native access to the network and to application memory
• Storage is distributed, with locality to each node
[Diagram: a traditional node, with the host, its application memory, a network interface and central storage, alongside the EuroEXA node, with a control host and a data accelerator that has its own application memory and network interface, plus distributed storage]
Memory Architecture: Unimem
• Introduces an additional “global” remote address space that can be addressed natively by hardware-level transactions
• Only the data-owner can cache globally shared memory (data locality)
• Enables nodes to read/write data coherently with the data-owner
• Provides native hardware-level one-sided (read/write/atomic) communication
• Apps can use RDMA to generate both local and remote transactions
• Block-move data between the local address space and the remote address space
• EuroEXA adds support for the CPU to natively generate remote transactions (sketched below)
• Hence apps can also natively access the remote address space
• E.g. a word write directly into a remote memory location
• A node (of any size and complexity) that exposes a Unimem bridge is known as a Compute Unit
[Diagram: a host (operating system, tasks, shared memory, comms, DRAM, storage) issues Local-AXI and Remote-AXI transactions on its local interconnect and local address space; a bridge carries remote (Unimem) transactions and RDMA to and from the remote interconnect and the remote address space]
AXI is a standard hardware protocol for read/write/atomic transactions published by Arm
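A purely illustrative sketch of the addressing idea, NOT the project's actual API or firmware interface: a global address formed from (data-owner node ID, offset) lets any initiator perform a plain word write that lands in the owning node's memory. The field layout and node width are assumptions, and the "nodes" here are simulated with local arrays so the sketch is runnable; on real hardware the store would become a Remote-AXI / Unimem transaction carried by the bridge.

/* Hypothetical Unimem-style global addressing, modelled in software. */
#include <stdint.h>
#include <stdio.h>

#define NODES        4
#define WINDOW_WORDS 1024
#define NODE_SHIFT   48          /* assumption: node ID in the top address bits */

static uint64_t node_mem[NODES][WINDOW_WORDS];   /* each node's exported window */

static uint64_t global_addr(uint16_t owner, uint64_t word_offset) {
    return ((uint64_t)owner << NODE_SHIFT) | word_offset;
}

/* A native one-sided write: the initiator stores directly into the owner's memory. */
static void unimem_write(uint64_t gaddr, uint64_t value) {
    uint16_t owner  = (uint16_t)(gaddr >> NODE_SHIFT);
    uint64_t offset = gaddr & ((1ull << NODE_SHIFT) - 1);
    node_mem[owner][offset] = value;   /* data stays owned (and cacheable) by 'owner' */
}

int main(void) {
    uint64_t g = global_addr(2, 7);    /* word 7 in node 2's exported window */
    unimem_write(g, 42);               /* remote word write, no message passing */
    printf("node 2, word 7 = %llu\n", (unsigned long long)node_mem[2][7]);
    return 0;
}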
How does this compare?
• Cache-coherent shared memory (e.g. SMP, ccNUMA, SGI)
• Requires a coherence protocol between all memories, which limits scalability; Unimem does not
• Software-managed partitioned global address space (PGAS)
• Software creates a model and API for communicating using addresses; Unimem is at the hardware level
• Remote Direct Memory Access (RDMA) communication
• Uses a device outside the CPU to move data directly to and from remote memory
• No correlation between the source and destination address spaces; Unimem has a common global address space
• Communication devices
• Use a device to move data between software-managed buffers; Unimem operates at the hardware level
• Unimem
• Allows native hardware read/write/atomic transactions to work on a global address space
• Enables data-owners to place locally cached regions of memory into the global address space
• Any hardware device / accelerator can be a data-owner / user
Evolving Interconnect Architecture
• 1980s: packets broadcast between all nodes
• 1990s: packets switched point to point between all nodes
• 2000s – 2010s: the node integrates switching, with an increasing radix of topology between nodes
• 2020s: the node continues to integrate switching, with the peak radix of topology between nodes and “overflow” to higher-level (hybrid) switching
[Diagram: generations of fabric, from nodes on a broadcast medium, to nodes behind central switches, to nodes with integrated hybrid switches that overflow into a higher-level switch]
EuroExa Compute Unit
• The Network Interface provides hybrid access: peer access between nodes plus the blade-level topology and switching
• Provides both distributed and local access to the storage device
• The native host can be the Unimem-addressable ASIC (offering shared memory), or a classical CPU host
• (The host also has its own memory)
[Diagram: compute unit with FPGA accelerator, application DRAM, network interface, storage interface and native host, connected to the blade-level switch and the inter-node topology]
Prototype Node: Modular Compute/Acceleration Blade
• VU9 for the accelerator
• ZU9 for NIC/storage
• Currently limited CPU compute
• Later testbeds increase the compute
• Moving to an industry-standard COM Express Type-7 Extended PCB for a node (grouped into 4s)
• Create a high-density 16-slot carrier with integrated switching
• Use fly-over cables to adapt the blade, creating a single compute unit from multiple slots
Blade Level Networking
[Diagram: switched network between all blades in a cabinet (200Gb); interconnect topology between groups of blades (8x100Gb)]
• Half-depth OCP cabinet containing 32x blades
• Liquid cooled to achieve thermal density (~100kW)
• Each group of 8 is peer-interconnected
• 200Gb/s from each blade to the top of rack
• Blade to any blade in 2 hops (maybe 1)
• This is the target for the EuroEXA testbed-2
• Provides “network locality” to limit data movement (1EB/s ~ 5MW per 5mm)
Proposed Cabinet Layout (2020 switches)
[Diagram: cabinet layout with four network groups of eight blades each (3D torus within a group), power/cooling, and a top-of-rack switch at 40x200Gb/s; 32x200Gb blade links, each split as 2x100Gb]
• 2MW per ISO “shipping container” (Approx. 100kW per cabinet)
• Footprint of 10 x 40ft ISOs, stacked 3 high
• EuroEXA testbed-2 will use 1 container, initially with 3 live cabinets
Structural Modularity
[Diagram: structural modularity illustrated as a grid of repeated compute cabinets]
• Longest cable around 30m, e.g. ~125 ns of flight time; IB port to port is ~90 ns
Exascale in ~30 Stacked Containers
HPC for more than HPC applications
• The HPC measure of FLOPS hides various other needs of high-performance applications
• FLOPS can ignore memory bandwidth requirements
• Agnostic to interconnect latency and bandwidth
• Delivering storage over the network is expensive and too slow for big data
• FPGAs have variable precision; the EuroEXA testbed is ~6 PetaOP INT8 for ML/AI (256 nodes)
• Data-flow moves data between operations, not RAM
• Dataflow benefits from locality and higher cross-sectional bandwidth
• A 100Gb network means ~10GB/s per node, while directly attached NVMe across 10k nodes means ~30TB/s in aggregate (see the back-of-envelope below)
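A derived back-of-envelope for that last bullet. The ~3 GB/s per-drive NVMe figure is an assumption chosen to be consistent with the slide's 30 TB/s aggregate; the network figure follows from 100 Gb/s divided by 8.

/* Network vs aggregate node-local NVMe bandwidth, using the slide's figures. */
#include <stdio.h>

int main(void) {
    double net_gbytes    = 100.0 / 8.0;   /* 100 Gb/s -> ~12.5 GB/s raw (~10 GB/s usable) */
    double nvme_per_node = 3.0;           /* GB/s, assumed per-node NVMe rate             */
    double nodes         = 10000.0;

    printf("Network per node : ~%.1f GB/s raw\n", net_gbytes);
    printf("NVMe aggregate   : ~%.0f TB/s across %.0f nodes\n",
           nvme_per_node * nodes / 1000.0, nodes);
    return 0;   /* ~12.5 GB/s and ~30 TB/s */
}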
• Arm processing and dataflow (locality-connected nodes and FPGA)
• Unique Hybrid Geographically-Addressed, Switching and Topology Interconnect
• UNIMEM Architecture with hardware-native global address space
• Distributed Storage on BeeGFS – storage locality
• Memory Compression Technologies – memory bottleneck
System architecture summary
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 754337.
Thank you!