
Echelon: NVIDIA & Team’s UHPC Project


Page 1: Echelon: NVIDIA & Team’s UHPC Project

4/6/2011

Echelon: NVIDIA & Team’s UHPC Project

STEVE KECKLER, DIRECTOR OF ARCHITECTURE RESEARCH, NVIDIA

Page 2: Echelon: NVIDIA & Team’s UHPC Project


GPU Supercomputing

Tianhe-1A: 7168 GPUs, Linpack 2.5 PFlops

Dawning Nebulae: 4640 GPUs, Linpack 1.3 PFlops

Tsubame 2.0: 4224 GPUs, Linpack 1.3 PFlops

8 more GPU accelerated machines in the November Top500

Many (corporate) machines not listed

NVIDIA GPU Module

Page 3: Echelon: NVIDIA & Team’s UHPC Project


Top 5 Performance and Power

[Chart: Linpack performance (Gigaflops axis, 0–2500) and power (Megawatts axis, 0–8) for Tianhe-1A, Jaguar, Nebulae, Tsubame, and Hopper II]

Page 4: Echelon: NVIDIA & Team’s UHPC Project


Existing GPU Application Areas

Page 5: Echelon: NVIDIA & Team’s UHPC Project


Key Challenges

Energy to Solution is too large

Programming parallel machines is too difficult

Programs are not scalable to billion-fold parallelism

Resilience (application mean time to interrupt, AMTTI) is too low

Machines are vulnerable to attacks/undetected program errors

Page 6: Echelon: NVIDIA & Team’s UHPC Project


Echelon Team

Page 7: Echelon: NVIDIA & Team’s UHPC Project


System Sketch

Self-Aware OS

Self-Aware Runtime

Locality-Aware Compiler & Autotuner

Echelon System

Cabinet 0 (C0): 2 PF, 205 TB/s, 32 TB

Module 0 (M0) … M15: 128 TF, 12.8 TB/s, 2 TB each

Node 0 (N0) … N7: 16 TF, 1.6 TB/s, 256 GB each

[Node diagram: Processor Chip (PC) with SM0 … SM127 (each SM containing cores C0 … C7 with L0 storage), a NoC, 1024 L2 banks (L2 0 … L2 1023), memory controllers (MC), and a NIC; DRAM cubes and NV RAM attach to the chip; nodes connect through a High-Radix Router Module (RM) with channels LC0 … LC7 to the Dragonfly interconnect (optical fiber)]

Page 8: Echelon: NVIDIA & Team’s UHPC Project


Execution Model

[Execution model diagram: a Global Address Space with an Abstract Memory Hierarchy; threads and objects interact via load/store, bulk transfer between locations A and B, and active messages sent from A to B]
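
As a rough illustration of the active-message element of this execution model (not Echelon's actual API), the host-side C++ mock below bundles a handler with its arguments, delivers it to a destination's queue, and runs it where the data lives instead of pulling the data back to the sender. Every name in it is hypothetical.

#include <cstdio>
#include <functional>
#include <queue>

// Hypothetical mock of an active message: a handler plus captured arguments,
// delivered to a destination's inbox and executed there.
struct ActiveMessage {
    std::function<void()> handler;   // work to run at the destination
};

struct Node {
    std::queue<ActiveMessage> inbox; // per-destination message queue
    double data[4] = {1, 2, 3, 4};   // state that lives at this node

    void send(ActiveMessage m) { inbox.push(std::move(m)); }

    // Drain the inbox, executing each handler where the data resides.
    void run_pending() {
        while (!inbox.empty()) {
            inbox.front().handler();
            inbox.pop();
        }
    }
};

int main() {
    Node b;                                    // destination "B"
    double scale = 2.0;                        // argument carried by the message
    b.send({[&b, scale] {                      // "A" sends work to "B"
        for (double& x : b.data) x *= scale;   // compute at B's data
    }});
    b.run_pending();
    std::printf("b.data[3] = %g\n", b.data[3]); // prints 8
    return 0;
}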

Page 9: Echelon: NVIDIA & Team’s UHPC Project


Two (of many) Fundamental Challenges

Page 10: Echelon: NVIDIA & Team’s UHPC Project


The High Cost of Data Movement

Fetching operands costs more than computing on them

[Diagram: approximate energy per operation in a 28 nm process on a 20 mm die]

64-bit DP operation: 20 pJ
256-bit access to 8 kB SRAM: 50 pJ
256-bit on-chip bus: 26 pJ locally, 256 pJ across the 20 mm chip
Efficient off-chip link: 500 pJ
Off-chip data movement: ~1 nJ
DRAM read/write: 16 nJ
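
Taking ratios of the energies listed above makes the imbalance explicit (the numbers are the slide's; the ratios are the point, not the exact process figures):

\[
\frac{E_{\text{DRAM rd/wr}}}{E_{\text{DP op}}} = \frac{16\,\text{nJ}}{20\,\text{pJ}} = 800,
\qquad
\frac{E_{\text{cross-chip bus}}}{E_{\text{DP op}}} = \frac{256\,\text{pJ}}{20\,\text{pJ}} \approx 13 .
\]

Fetching an operand from DRAM therefore costs on the order of a thousand floating-point operations' worth of energy.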

Page 11: Echelon: NVIDIA & Team’s UHPC Project


Magnitude of Thread Count

Billion-fold parallel fine-grained threads for Exascale

                    2010: 4640 GPUs    2018: 90K GPUs

Threads/SM          1.5 K              ~10^3
Threads/GPU         21 K               ~10^5
Threads/Cabinet     672 K              ~10^7
Threads/Machine     97 M               ~10^9 - 10^10
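
The machine-level counts follow directly from the per-GPU figures, as a rough check using the slide's own numbers:

\[
2010:\ 4640 \times 21\,\text{K} \approx 97\,\text{M threads},
\qquad
2018:\ 9\times10^{4} \times 10^{5} \approx 10^{10}\ \text{threads}.
\]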

Page 12: Echelon: NVIDIA & Team’s UHPC Project


Echelon Disruptive Technologies

1. Locality, Locality, Locality

2. Fine-grained concurrency

3. APIs for resilience, memory safety

4. Order-of-magnitude improvement in efficiency

Page 13: Echelon: NVIDIA & Team’s UHPC Project


Data Locality (central to performance and efficiency)

Programming System
Abstract expression of spatial, temporal, and producer-consumer locality
Programmer expresses locality; the programming system maps threads and objects to locations to exploit it

Architecture
Hierarchical global address space
Configurable memory hierarchy
Energy-provisioned bandwidth
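
Echelon's locality machinery is not a public API, but the underlying idea of keeping reused data in close, cheap storage is the same one today's CUDA code expresses with shared memory. The kernel below is a minimal, assumed-for-illustration sketch that stages a tile on chip once and reuses it, rather than re-reading DRAM for every operand.

#include <cstdio>
#include <cuda_runtime.h>

// Contemporary analog of expressing locality: each block stages a tile of its
// input in on-chip shared memory so the three reads per output hit SRAM
// instead of DRAM. Assumes blockDim.x == 256.
__global__ void stencil3(const float* in, float* out, int n) {
    __shared__ float tile[256 + 2];               // block tile plus halo
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;                    // offset past left halo

    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)                         // left halo element
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)            // right halo element
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();

    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    stencil3<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    std::printf("out[100] = %g\n", out[100]);     // prints 1
    cudaFree(in); cudaFree(out);
    return 0;
}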

Page 14: Echelon: NVIDIA & Team’s UHPC Project


Fine-Grained Concurrency (how we get to 10^10 threads)

Programming System
Programmer expresses ALL of the concurrency
Programming system decides how much to exploit in space and how much to iterate over in time

Architecture
Fast, low-overhead thread-array creation and management
Fast, low-overhead communication and synchronization
Message-driven computing (active messages)
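
The space-versus-time split already has a familiar shape in CUDA: the programmer writes one logical unit of work per element, and a grid-stride loop lets whatever thread array was actually launched cover the remaining elements by iterating. A small sketch (my example, not Echelon code):

#include <cstdio>
#include <cuda_runtime.h>

// The caller expresses all n units of work; the launch configuration decides
// how many run "in space" (as live threads), and the grid-stride loop covers
// the rest "in time" by iterating.
__global__ void scale(float* x, float a, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)   // stride by the whole thread array
        x[i] *= a;
}

int main() {
    const int n = 1 << 24;              // 16M logical work items
    float* x;
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    scale<<<1024, 256>>>(x, 3.0f, n);   // 256K threads; each iterates ~64 times
    cudaDeviceSynchronize();
    std::printf("x[12345] = %g\n", x[12345]);  // prints 3
    cudaFree(x);
    return 0;
}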

Page 15: Echelon: NVIDIA & Team’s UHPC Project


Dependability (how we get to an AMTTI of one day)

Programming System
API to express:
State to preserve
When to preserve it
Computations to check
Assertions

Responsibilities:
Preserves state
Generates recovery code
Generates redundant computation where appropriate

Architecture
Error checking on all memories and communication paths
Hardware configurable to run duplex computations
Hardware support for error containment and recovery
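
The resilience API is only outlined on this slide, so the snippet below is a hypothetical host-side shape for "state to preserve / when to preserve it": the application registers its critical state, snapshots it before a risky step, and rolls back on a detected fault. All names are invented for illustration.

#include <cstdio>
#include <cstring>
#include <vector>

// Hypothetical resilience API sketch: the application declares which state
// matters and when to preserve it; recovery rolls back to the last snapshot.
struct Checkpointer {
    void* state; size_t bytes;
    std::vector<char> snapshot;

    Checkpointer(void* s, size_t n) : state(s), bytes(n), snapshot(n) {}
    void checkpoint() { std::memcpy(snapshot.data(), state, bytes); }  // "when to preserve it"
    void restore()    { std::memcpy(state, snapshot.data(), bytes); }  // recovery path
};

int main() {
    double field[1024] = {0};
    Checkpointer cp(field, sizeof(field));   // "state to preserve"

    for (int step = 0; step < 100; ++step) {
        cp.checkpoint();                     // preserve before the risky step
        for (double& x : field) x += 1.0;    // the computation to protect
        bool fault_detected = false;         // stand-in for an error check / assertion
        if (fault_detected) cp.restore();    // roll back and retry the step
    }
    std::printf("field[0] = %g\n", field[0]);  // prints 100
    return 0;
}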

Page 16: Echelon: NVIDIA & Team’s UHPC Project


Security (from attackers and ourselves)

Key challenge: memory safety
Malicious attacks
Programmer memory bugs (bad pointer dereferencing, etc.)

Programming System
Express partitioning of subsystems
Express privileges on data structures

Architecture: guarded-pointers primitive
Hardware can check all memory references/address computations
Fast, low-overhead subsystem entry
Errors reported through error containment features
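
To make the guarded-pointer idea concrete, here is a software analog (a fat pointer that carries bounds and permissions and checks every access); the slide's point is that Echelon's hardware would perform these checks in the memory system rather than in code like this.

#include <cassert>
#include <cstdio>

// Software analog of a guarded pointer: base, length, and permissions travel
// with the pointer, and every dereference is checked.
template <typename T>
struct GuardedPtr {
    T* base; size_t len; bool writable;

    T read(size_t i) const {
        assert(i < len && "out-of-bounds read");
        return base[i];
    }
    void write(size_t i, T v) const {
        assert(i < len && "out-of-bounds write");
        assert(writable && "write through read-only pointer");
        base[i] = v;
    }
};

int main() {
    int buf[8] = {0};
    GuardedPtr<int> p{buf, 8, /*writable=*/true};
    p.write(3, 42);
    std::printf("%d\n", p.read(3));   // prints 42
    // p.write(8, 1);                 // would trap: out of bounds
    return 0;
}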

Page 17: Echelon: NVIDIA & Team’s UHPC Project


> 10x Energy Efficiency Gain (GFlops/Watt)

Contemporary GPU: ~300 pJ/Flop
Future parallel systems: ~20 pJ/Flop

In order to get anywhere near Exascale in 2018:
~4x can come from process scaling to 10nm
Remainder from architecture/programming system

Locality – both horizontal and vertical
Reduce data movement, migrate fine-grain tasks to data

Extremely energy-efficient throughput cores
Efficient instruction/data supply
Simple hardware: static instruction scheduling, simple instruction control
Multithreading and hardware support for thread arrays
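
The headline numbers can be cross-checked from the slide's own figures: at the target efficiency an exaflop machine lands at roughly the often-quoted 20 MW envelope, and the overall 15x gain splits between process and architecture/programming system as stated.

\[
10^{18}\ \tfrac{\text{flops}}{\text{s}} \times 20\ \tfrac{\text{pJ}}{\text{flop}} = 20\ \text{MW},
\qquad
\frac{300\ \text{pJ/flop}}{20\ \text{pJ/flop}} = 15\times \approx 4\times_{\text{process}} \times {\sim}3.75\times_{\text{architecture + programming system}} .
\]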

Page 18: Echelon: NVIDIA & Team’s UHPC Project


An NVIDIA ExaScale Machine

Page 19: Echelon: NVIDIA & Team’s UHPC Project


Lane – 4 DFMAs, 16 GFLOPS

[Lane diagram: 4 DFMA units fed from Operand Registers and Main Registers, 2 load/store units, L0 I$ and L0 D$]
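
The lane's rating is consistent with counting a double-precision fused multiply-add as two flops, which implies a clock in the neighborhood of 2 GHz (an inference from the slide's numbers, not a stated specification):

\[
4\ \text{DFMA} \times 2\ \tfrac{\text{flops}}{\text{DFMA}} \times 2\ \text{GHz} = 16\ \text{GFLOPS}.
\]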

Page 20: Echelon: NVIDIA & Team’s UHPC Project


Streaming Multiprocessor 8 lanes – 128 GFLOPS

[SM diagram: 8 processor lanes (P) connected through a switch to an L1$]

Page 21: Echelon: NVIDIA & Team’s UHPC Project


Echelon Chip – 128 SMs + 8 Latency Cores, 16 TFLOPS

128 SMs, 128 GF each
1024 SRAM banks, 256 KB each

[Chip diagram: SMs and latency cores (LC) connected over a NoC to the SRAM banks, memory controllers (MC), and network interface (NI)]
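
The chip totals follow from the per-SM and per-bank figures; the SRAM total also matches the 256 MB quoted on the node slide that follows.

\[
128 \times 128\ \text{GF} \approx 16\ \text{TFLOPS},
\qquad
1024 \times 256\ \text{KB} = 256\ \text{MB on-chip SRAM}.
\]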

Page 22: Echelon: NVIDIA & Team’s UHPC Project


Node MCM – 16 TF + 256GB

GPU Chip: 16 TF DP, 256 MB

1.6 TB/s DRAM BW

160 GB/s network BW

DRAM stacks, NV memory
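
Expressed as bytes per flop, the node's provisioning works out to (a derived ratio, not a figure stated on the slide):

\[
\frac{1.6\ \text{TB/s}}{16\ \text{TF}} = 0.1\ \tfrac{\text{B}}{\text{flop}}\ \text{DRAM},
\qquad
\frac{160\ \text{GB/s}}{16\ \text{TF}} = 0.01\ \tfrac{\text{B}}{\text{flop}}\ \text{network}.
\]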

Page 23: Echelon: NVIDIA & Team’s UHPC Project


Cabinet – 128 Nodes – 2 PF – 38 kW

32 modules, 4 nodes/module, central router module(s), Dragonfly interconnect

[Cabinet diagram: modules of four nodes each arranged around central router modules]
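
The cabinet numbers line up with the node figures and with the ~20 pJ/flop efficiency target:

\[
32 \times 4 \times 16\ \text{TF} \approx 2\ \text{PF},
\qquad
\frac{38\ \text{kW}}{2\ \text{PF}} = 19\ \tfrac{\text{pJ}}{\text{flop}}.
\]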

Page 24: Echelon: NVIDIA & Team’s UHPC Project


System – to ExaScale and Beyond

Dragonfly interconnect

500 cabinets is ~1 EF and ~19 MW
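
Scaling the cabinet figures to the full system gives the quoted totals:

\[
500 \times 2\ \text{PF} = 1\ \text{EF},
\qquad
500 \times 38\ \text{kW} = 19\ \text{MW}.
\]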

Page 25: Echelon: NVIDIA & Team’s UHPC Project

GPU Technology Conference 2011
Oct. 11-14 | San Jose, CA
The one event you can’t afford to miss

Learn about leading-edge advances in GPU computing

Explore the research as well as the commercial applications

Discover advances in computational visualization

Take a deep dive into parallel programming

Ways to participate

Speak – share your work and gain exposure as a thought leader

Register – learn from the experts and network with your peers

Exhibit/Sponsor – promote your company as a key player in the GPU ecosystem

www.gputechconf.com

Page 26: Echelon: NVIDIA & Team’s UHPC Project


Questions

Page 27: Echelon: NVIDIA & Team’s UHPC Project


NVIDIA Parallel Developer Program
All GPGPU developers should become NVIDIA Registered Developers

Benefits include:

Early access to pre-release software: beta software and libraries

Submit & Track Issues and Bugs

Announcing new benefits:
Exclusive Q&A webinars with NVIDIA Engineering

Exclusive deep dive CUDA training webinars

In-depth engineering presentations on beta software

Sign up Now: www.nvidia.com/paralleldeveloper