
Echelon: NVIDIA & Team’s UHPC Project


Page 1: Echelon: NVIDIA & Team’s UHPC Project

4/6/2011

Echelon: NVIDIA & Team’s UHPC Project

STEVE KECKLER, DIRECTOR OF ARCHITECTURE RESEARCH, NVIDIA

Page 2: Echelon: NVIDIA & Team’s UHPC Project


GPU Supercomputing

Tianhe-1A: 7168 GPUs, Linpack 2.5 PFlops

Dawning Nebulae: 4640 GPUs, Linpack 1.3 PFlops

Tsubame 2.0: 4224 GPUs, Linpack 1.3 PFlops

8 more GPU accelerated machines in the November Top500

Many (corporate) machines not listed

NVIDIA GPU Module

Page 3: Echelon: NVIDIA & Team’s UHPC Project


Top 5 Performance and Power

[Chart: Linpack performance (Gigaflops axis, 0–2500) and power (Megawatts axis, 0–8) for Tianhe-1A, Jaguar, Nebulae, Tsubame, and Hopper II]

Page 4: Echelon: NVIDIA & Team’s UHPC Project


Existing GPU Application Areas

Page 5: Echelon: NVIDIA & Team’s UHPC Project


Key Challenges

Energy to Solution is too large

Programming parallel machines is too difficult

Programs are not scalable to billion-fold parallelism

Resilience (application mean time to interrupt, AMTTI) is too low

Machines are vulnerable to attacks/undetected program errors

Page 6: Echelon: NVIDIA & Team’s UHPC Project


Echelon Team

Page 7: Echelon: NVIDIA & Team’s UHPC Project


System Sketch

Self-Aware OS

Self-Aware Runtime

Locality-Aware Compiler & Autotuner

Echelon System

Cabinet 0 (C0): 2 PF, 205 TB/s, 32 TB

Module 0 (M0) … M15: 128 TF, 12.8 TB/s, 2 TB each

Node 0 (N0) … N7: 16 TF, 1.6 TB/s, 256 GB each

[Node diagram: Processor Chip (PC) with SM0 … SM127 (each SM containing cores C0 … C7 with L0 storage), a NoC, 1024 L2 banks (L2 0 … L2 1023), memory controllers (MC), and a NIC; DRAM cubes and NV RAM attach to the chip; nodes connect through a High-Radix Router Module (RM) with channels LC0 … LC7 to the Dragonfly interconnect (optical fiber)]

Page 8: Echelon: NVIDIA & Team’s UHPC Project


Execution Model

[Execution model diagram: a Global Address Space with an Abstract Memory Hierarchy; threads and objects interact via load/store, bulk transfer between locations A and B, and active messages sent from A to B]
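
As a rough illustration of the active-message element of this execution model (not Echelon's actual API), the host-side C++ mock below bundles a handler with its arguments, delivers it to a destination's queue, and runs it where the data lives instead of pulling the data back to the sender. Every name in it is hypothetical.

#include <cstdio>
#include <functional>
#include <queue>

// Hypothetical mock of an active message: a handler plus captured arguments,
// delivered to a destination's inbox and executed there.
struct ActiveMessage {
    std::function<void()> handler;   // work to run at the destination
};

struct Node {
    std::queue<ActiveMessage> inbox; // per-destination message queue
    double data[4] = {1, 2, 3, 4};   // state that lives at this node

    void send(ActiveMessage m) { inbox.push(std::move(m)); }

    // Drain the inbox, executing each handler where the data resides.
    void run_pending() {
        while (!inbox.empty()) {
            inbox.front().handler();
            inbox.pop();
        }
    }
};

int main() {
    Node b;                                    // destination "B"
    double scale = 2.0;                        // argument carried by the message
    b.send({[&b, scale] {                      // "A" sends work to "B"
        for (double& x : b.data) x *= scale;   // compute at B's data
    }});
    b.run_pending();
    std::printf("b.data[3] = %g\n", b.data[3]); // prints 8
    return 0;
}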

Page 9: Echelon: NVIDIA & Team’s UHPC Project


Two (of many) Fundamental Challenges

Page 10: Echelon: NVIDIA & Team’s UHPC Project


The High Cost of Data Movement

Fetching operands costs more than computing on them

[Diagram: approximate energy per operation in a 28 nm process on a 20 mm die]

64-bit DP operation: 20 pJ
256-bit access to 8 kB SRAM: 50 pJ
256-bit on-chip bus: 26 pJ locally, 256 pJ across the 20 mm chip
Efficient off-chip link: 500 pJ
Off-chip data movement: ~1 nJ
DRAM read/write: 16 nJ
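
Taking ratios of the energies listed above makes the imbalance explicit (the numbers are the slide's; the ratios are the point, not the exact process figures):

\[
\frac{E_{\text{DRAM rd/wr}}}{E_{\text{DP op}}} = \frac{16\,\text{nJ}}{20\,\text{pJ}} = 800,
\qquad
\frac{E_{\text{cross-chip bus}}}{E_{\text{DP op}}} = \frac{256\,\text{pJ}}{20\,\text{pJ}} \approx 13 .
\]

Fetching an operand from DRAM therefore costs on the order of a thousand floating-point operations' worth of energy.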

Page 11: Echelon: NVIDIA & Team’s UHPC Project


Magnitude of Thread Count

Billion-fold parallel fine-grained threads for Exascale

                    2010: 4640 GPUs    2018: 90K GPUs

Threads/SM          1.5 K              ~10^3
Threads/GPU         21 K               ~10^5
Threads/Cabinet     672 K              ~10^7
Threads/Machine     97 M               ~10^9 - 10^10
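
The machine-level counts follow directly from the per-GPU figures, as a rough check using the slide's own numbers:

\[
2010:\ 4640 \times 21\,\text{K} \approx 97\,\text{M threads},
\qquad
2018:\ 9\times10^{4} \times 10^{5} \approx 10^{10}\ \text{threads}.
\]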

Page 12: Echelon: NVIDIA & Team’s UHPC Project


Echelon Disruptive Technologies

1. Locality, Locality, Locality

2. Fine-grained concurrency

3. APIs for resilience, memory safety

4. Order-of-magnitude improvement in efficiency

Page 13: Echelon: NVIDIA & Team’s UHPC Project


Data Locality (central to performance and efficiency)

Programming System
Abstract expression of spatial, temporal, and producer-consumer locality
Programmer expresses locality; the programming system maps threads and objects to locations to exploit it

Architecture
Hierarchical global address space
Configurable memory hierarchy
Energy-provisioned bandwidth
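
Echelon's locality machinery is not a public API, but the underlying idea of keeping reused data in close, cheap storage is the same one today's CUDA code expresses with shared memory. The kernel below is a minimal, assumed-for-illustration sketch that stages a tile on chip once and reuses it, rather than re-reading DRAM for every operand.

#include <cstdio>
#include <cuda_runtime.h>

// Contemporary analog of expressing locality: each block stages a tile of its
// input in on-chip shared memory so the three reads per output hit SRAM
// instead of DRAM. Assumes blockDim.x == 256.
__global__ void stencil3(const float* in, float* out, int n) {
    __shared__ float tile[256 + 2];               // block tile plus halo
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;                    // offset past left halo

    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)                         // left halo element
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)            // right halo element
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();

    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    stencil3<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    std::printf("out[100] = %g\n", out[100]);     // prints 1
    cudaFree(in); cudaFree(out);
    return 0;
}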

Page 14: Echelon: NVIDIA & Team’s UHPC Project


Fine-Grained Concurrency (how we get to 10^10 threads)

Programming System
Programmer expresses ALL of the concurrency
Programming system decides how much to exploit in space and how much to iterate over in time

Architecture
Fast, low-overhead thread-array creation and management
Fast, low-overhead communication and synchronization
Message-driven computing (active messages)
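
The space-versus-time split already has a familiar shape in CUDA: the programmer writes one logical unit of work per element, and a grid-stride loop lets whatever thread array was actually launched cover the remaining elements by iterating. A small sketch (my example, not Echelon code):

#include <cstdio>
#include <cuda_runtime.h>

// The caller expresses all n units of work; the launch configuration decides
// how many run "in space" (as live threads), and the grid-stride loop covers
// the rest "in time" by iterating.
__global__ void scale(float* x, float a, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)   // stride by the whole thread array
        x[i] *= a;
}

int main() {
    const int n = 1 << 24;              // 16M logical work items
    float* x;
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    scale<<<1024, 256>>>(x, 3.0f, n);   // 256K threads; each iterates ~64 times
    cudaDeviceSynchronize();
    std::printf("x[12345] = %g\n", x[12345]);  // prints 3
    cudaFree(x);
    return 0;
}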

Page 15: Echelon: NVIDIA & Team’s UHPC Project


Dependability (how we get to an AMTTI of one day)

Programming System
API to express:
State to preserve
When to preserve it
Computations to check
Assertions

Responsibilities:
Preserves state
Generates recovery code
Generates redundant computation where appropriate

Architecture
Error checking on all memories and communication paths
Hardware configurable to run duplex computations
Hardware support for error containment and recovery
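
The resilience API is only outlined on this slide, so the snippet below is a hypothetical host-side shape for "state to preserve / when to preserve it": the application registers its critical state, snapshots it before a risky step, and rolls back on a detected fault. All names are invented for illustration.

#include <cstdio>
#include <cstring>
#include <vector>

// Hypothetical resilience API sketch: the application declares which state
// matters and when to preserve it; recovery rolls back to the last snapshot.
struct Checkpointer {
    void* state; size_t bytes;
    std::vector<char> snapshot;

    Checkpointer(void* s, size_t n) : state(s), bytes(n), snapshot(n) {}
    void checkpoint() { std::memcpy(snapshot.data(), state, bytes); }  // "when to preserve it"
    void restore()    { std::memcpy(state, snapshot.data(), bytes); }  // recovery path
};

int main() {
    double field[1024] = {0};
    Checkpointer cp(field, sizeof(field));   // "state to preserve"

    for (int step = 0; step < 100; ++step) {
        cp.checkpoint();                     // preserve before the risky step
        for (double& x : field) x += 1.0;    // the computation to protect
        bool fault_detected = false;         // stand-in for an error check / assertion
        if (fault_detected) cp.restore();    // roll back and retry the step
    }
    std::printf("field[0] = %g\n", field[0]);  // prints 100
    return 0;
}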

Page 16: Echelon: NVIDIA & Team’s UHPC Project


Security (from attackers and ourselves)

Key challenge: memory safety
Malicious attacks
Programmer memory bugs (bad pointer dereferencing, etc.)

Programming System
Express partitioning of subsystems
Express privileges on data structures

Architecture: guarded-pointers primitive
Hardware can check all memory references/address computations
Fast, low-overhead subsystem entry
Errors reported through error containment features
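
To make the guarded-pointer idea concrete, here is a software analog (a fat pointer that carries bounds and permissions and checks every access); the slide's point is that Echelon's hardware would perform these checks in the memory system rather than in code like this.

#include <cassert>
#include <cstdio>

// Software analog of a guarded pointer: base, length, and permissions travel
// with the pointer, and every dereference is checked.
template <typename T>
struct GuardedPtr {
    T* base; size_t len; bool writable;

    T read(size_t i) const {
        assert(i < len && "out-of-bounds read");
        return base[i];
    }
    void write(size_t i, T v) const {
        assert(i < len && "out-of-bounds write");
        assert(writable && "write through read-only pointer");
        base[i] = v;
    }
};

int main() {
    int buf[8] = {0};
    GuardedPtr<int> p{buf, 8, /*writable=*/true};
    p.write(3, 42);
    std::printf("%d\n", p.read(3));   // prints 42
    // p.write(8, 1);                 // would trap: out of bounds
    return 0;
}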

Page 17: Echelon: NVIDIA & Team’s UHPC Project


> 10x Energy Efficiency Gain (GFlops/Watt)

Contemporary GPU: ~300 pJ/Flop
Future parallel systems: ~20 pJ/Flop

In order to get anywhere near Exascale in 2018:
~4x can come from process scaling to 10nm
Remainder from architecture/programming system

Locality – both horizontal and vertical
Reduce data movement, migrate fine-grain tasks to data

Extremely energy-efficient throughput cores
Efficient instruction/data supply
Simple hardware: static instruction scheduling, simple instruction control
Multithreading and hardware support for thread arrays
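
The headline numbers can be cross-checked from the slide's own figures: at the target efficiency an exaflop machine lands at roughly the often-quoted 20 MW envelope, and the overall 15x gain splits between process and architecture/programming system as stated.

\[
10^{18}\ \tfrac{\text{flops}}{\text{s}} \times 20\ \tfrac{\text{pJ}}{\text{flop}} = 20\ \text{MW},
\qquad
\frac{300\ \text{pJ/flop}}{20\ \text{pJ/flop}} = 15\times \approx 4\times_{\text{process}} \times {\sim}3.75\times_{\text{architecture + programming system}} .
\]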

Page 18: Echelon: NVIDIA & Team’s UHPC Project


An NVIDIA ExaScale Machine

Page 19: Echelon: NVIDIA & Team’s UHPC Project


Lane – 4 DFMAs, 16 GFLOPS

[Lane diagram: 4 DFMA units fed from Operand Registers and Main Registers, 2 load/store units, L0 I$ and L0 D$]
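
The lane's rating is consistent with counting a double-precision fused multiply-add as two flops, which implies a clock in the neighborhood of 2 GHz (an inference from the slide's numbers, not a stated specification):

\[
4\ \text{DFMA} \times 2\ \tfrac{\text{flops}}{\text{DFMA}} \times 2\ \text{GHz} = 16\ \text{GFLOPS}.
\]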

Page 20: Echelon: NVIDIA & Team’s UHPC Project


Streaming Multiprocessor 8 lanes – 128 GFLOPS

[SM diagram: 8 processor lanes (P) connected through a switch to an L1$]

Page 21: Echelon: NVIDIA & Team’s UHPC Project


Echelon Chip – 128 SMs + 8 Latency Cores, 16 TFLOPS

128 SMs, 128 GF each
1024 SRAM banks, 256 KB each

[Chip diagram: SMs and latency cores (LC) connected over a NoC to the SRAM banks, memory controllers (MC), and network interface (NI)]
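
The chip totals follow from the per-SM and per-bank figures; the SRAM total also matches the 256 MB quoted on the node slide that follows.

\[
128 \times 128\ \text{GF} \approx 16\ \text{TFLOPS},
\qquad
1024 \times 256\ \text{KB} = 256\ \text{MB on-chip SRAM}.
\]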

Page 22: Echelon: NVIDIA & Team’s UHPC Project


Node MCM – 16 TF + 256GB

GPU Chip: 16 TF DP, 256 MB

1.6 TB/s DRAM BW

160 GB/s network BW

DRAM stacks, NV memory
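
Expressed as bytes per flop, the node's provisioning works out to (a derived ratio, not a figure stated on the slide):

\[
\frac{1.6\ \text{TB/s}}{16\ \text{TF}} = 0.1\ \tfrac{\text{B}}{\text{flop}}\ \text{DRAM},
\qquad
\frac{160\ \text{GB/s}}{16\ \text{TF}} = 0.01\ \tfrac{\text{B}}{\text{flop}}\ \text{network}.
\]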

Page 23: Echelon: NVIDIA & Team’s UHPC Project


Cabinet – 128 Nodes – 2 PF – 38 kW

32 modules, 4 nodes/module, central router module(s), Dragonfly interconnect

[Cabinet diagram: modules of four nodes each arranged around central router modules]
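
The cabinet numbers line up with the node figures and with the ~20 pJ/flop efficiency target:

\[
32 \times 4 \times 16\ \text{TF} \approx 2\ \text{PF},
\qquad
\frac{38\ \text{kW}}{2\ \text{PF}} = 19\ \tfrac{\text{pJ}}{\text{flop}}.
\]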

Page 24: Echelon: NVIDIA & Team’s UHPC Project


System – to ExaScale and Beyond

Dragonfly interconnect

500 cabinets is ~1 EF and ~19 MW
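
Scaling the cabinet figures to the full system gives the quoted totals:

\[
500 \times 2\ \text{PF} = 1\ \text{EF},
\qquad
500 \times 38\ \text{kW} = 19\ \text{MW}.
\]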

Page 25: Echelon: NVIDIA & Team’s UHPC Project

GPU Technology Conference 2011
Oct. 11-14 | San Jose, CA
The one event you can’t afford to miss

Learn about leading-edge advances in GPU computing

Explore the research as well as the commercial applications

Discover advances in computational visualization

Take a deep dive into parallel programming

Ways to participate

Speak – share your work and gain exposure as a thought leader

Register – learn from the experts and network with your peers

Exhibit/Sponsor – promote your company as a key player in the GPU ecosystem

www.gputechconf.com

Page 26: Echelon: NVIDIA & Team’s UHPC Project


Questions

Page 27: Echelon: NVIDIA & Team’s UHPC Project


NVIDIA Parallel Developer Program
All GPGPU developers should become NVIDIA Registered Developers

Benefits include:

Early access to pre-release software: beta software and libraries

Submit & Track Issues and Bugs

Announcing new benefits:
Exclusive Q&A webinars with NVIDIA Engineering

Exclusive deep dive CUDA training webinars

In-depth engineering presentations on beta software

Sign up Now: www.nvidia.com/paralleldeveloper