4/6/2011
Echelon: NVIDIA & Team’s UHPC Project
Steve Keckler, Director of Architecture Research, NVIDIA
GPU Supercomputing
Tianhe-1A: 7168 GPUs, Linpack 2.5 PFlops
Dawning Nebulae: 4640 GPUs, Linpack 1.3 PFlops
Tsubame 2.0: 4224 GPUs, Linpack 1.3 PFlops
8 more GPU-accelerated machines in the November Top500
Many (corporate) machines not listed
NVIDIA GPU Module
Top 5 Performance and Power
[Bar chart: performance (Gigaflops) and power (Megawatts) for Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper II]
Existing GPU Application Areas
Key Challenges
Energy to Solution is too large
Programming parallel machines is too difficult
Programs are not scalable to billion-fold parallelism
Resilience is too low (application mean time to interrupt, AMTTI)
Machines are vulnerable to attacks/undetected program errors
Echelon Team
System Sketch
Software: Self-Aware OS, Self-Aware Runtime, Locality-Aware Compiler & Autotuner
[Diagram: Echelon system hierarchy]
Cabinet 0 (C0): 2 PF, 205 TB/s, 32 TB
Module 0 (M0) ... M15: 128 TF, 12.8 TB/s, 2 TB
Node 0 (N0) ... N7: 16 TF, 1.6 TB/s, 256 GB
Processor Chip (PC): SM0 ... SM127, each with cores C0-C7 and L0 caches; latency cores LC0-LC7; NoC; L2 banks 0-1023; memory controllers (MC); NIC; DRAM cubes; NV RAM
High-radix router module (RM); Dragonfly interconnect (optical fiber)
Execution Model
[Diagram: a global address space with an abstract memory hierarchy; threads and objects at locations A and B; local load/store access, active messages from A to B, and bulk transfers]
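The execution model's active messages move the computation to the data rather than the data to the computation. A minimal sketch of the idea in Python (the class and method names here are invented for illustration, not Echelon's actual API):

```python
from queue import Queue

# Toy active-message execution: a message carries a function to the node
# that owns the data, so only the small result crosses the network.
class Node:
    def __init__(self, name):
        self.name = name
        self.memory = {}      # node-local data
        self.inbox = Queue()  # incoming active messages

    def send(self, dest, fn, *args):
        """Enqueue an active message: run fn(node, *args) at dest."""
        dest.inbox.put((fn, args))

    def run(self):
        """Drain the inbox, executing each message against local state."""
        while not self.inbox.empty():
            fn, args = self.inbox.get()
            fn(self, *args)

# Node B owns a large array; A sends the reduction to B instead of
# pulling the whole array across the interconnect.
a, b = Node("A"), Node("B")
b.memory["x"] = list(range(100))

def reduce_sum(node, key, reply_to):
    node.send(reply_to, lambda n, v: n.memory.update(result=v),
              sum(node.memory[key]))

a.send(b, reduce_sum, "x", a)
b.run()   # B executes the reduction locally
a.run()   # A receives only the scalar result
print(a.memory["result"])  # 4950
```

Only one scalar travels back to A; the hundred-element array never moves, which is exactly the data-movement saving the next slide quantifies.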
Two (of many) Fundamental Challenges
The High Cost of Data Movement
Fetching operands costs more than computing on them. Approximate energies at 28 nm (20 mm die, 256-bit buses):
64-bit DP flop: 20 pJ
256-bit access to an 8 kB SRAM: 50 pJ
256-bit bus, short on-chip hop: 26 pJ
256-bit bus across the 20 mm chip: 256 pJ
Off-chip link: 1 nJ (500 pJ for an efficient link)
DRAM read/write: 16 nJ
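The slide's point can be checked with a few lines of arithmetic using its own numbers:

```python
# Energy figures from the slide (28 nm process), in picojoules.
DP_FLOP     = 20       # 64-bit double-precision op
SRAM_256B   = 50       # 256-bit access to an 8 kB SRAM
WIRE_SHORT  = 26       # 256-bit bus, short on-chip hop
WIRE_20MM   = 256      # 256-bit bus across the 20 mm die
OFFCHIP     = 1_000    # off-chip link (500 pJ for an efficient one)
DRAM_RW     = 16_000   # DRAM read/write (16 nJ)

# One DRAM access costs as much energy as hundreds of flops, so
# data movement, not arithmetic, dominates the energy budget.
print(DRAM_RW / DP_FLOP)       # 800.0
print(WIRE_20MM / WIRE_SHORT)  # ~10x penalty for crossing the die
```

That 800:1 ratio is why the rest of the deck keeps returning to locality.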
Magnitude of Thread Count
Billion-fold parallel fine-grained threads for Exascale:

                  2010 (4640 GPUs)   2018 (~90K GPUs)
Threads/SM        1.5 K              ~10^3
Threads/GPU       21 K               ~10^5
Threads/Cabinet   672 K              ~10^7
Threads/Machine   97 M               ~10^9-10^10
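The 2018 column follows from the per-level counts given elsewhere in the deck (128 SMs per chip, 128 nodes per cabinet, ~500 cabinets); the per-SM thread count is the slide's order-of-magnitude estimate:

```python
# Rough reconstruction of the slide's 2018 thread-count projection.
threads_per_sm = 1_000   # ~10^3 per SM (slide's estimate)
sms_per_gpu    = 128     # Echelon chip: 128 SMs
gpus_per_cab   = 128     # cabinet: 128 nodes, one GPU chip each
cabinets       = 500     # ~1 EF system

threads_per_gpu = threads_per_sm * sms_per_gpu              # ~10^5
threads_machine = threads_per_gpu * gpus_per_cab * cabinets
print(f"{threads_machine:.1e}")  # ~8e9, i.e. the 10^9-10^10 range
```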
Echelon Disruptive Technologies
1. Locality, locality, locality
2. Fine-grained concurrency
3. APIs for resilience, memory safety
4. Order-of-magnitude improvement in efficiency
Data Locality (central to performance and efficiency)
Programming system: abstract expression of spatial, temporal, and producer-consumer locality. The programmer expresses locality; the programming system maps threads and objects to locations to exploit it.
Architecture: hierarchical global address space; configurable memory hierarchy; energy-provisioned bandwidth.
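The division of labor above, where the programmer declares what each task touches and the system chooses where it runs, can be sketched with a toy mapper (everything here is hypothetical; Echelon's real programming system is far richer):

```python
# Toy locality-aware placement: co-locate each task with the node that
# owns most of its operands, so references stay local.
data_home = {"x": 0, "y": 0, "z": 3}   # object -> owning node (declared)
tasks = [("t0", ["x", "y"]), ("t1", ["z"]), ("t2", ["x"])]

def place(tasks, data_home):
    """Map each task to the node owning the majority of its operands."""
    placement = {}
    for name, refs in tasks:
        votes = [data_home[r] for r in refs]
        placement[name] = max(set(votes), key=votes.count)
    return placement

def remote_accesses(tasks, data_home, placement):
    """Count operand references that would cross node boundaries."""
    return sum(1 for name, refs in tasks
               for r in refs if data_home[r] != placement[name])

placement = place(tasks, data_home)
print(placement)                                     # {'t0': 0, 't1': 3, 't2': 0}
print(remote_accesses(tasks, data_home, placement))  # 0: every access is local
```

The programmer only stated ownership; the mapping decision, and therefore the energy cost, was the system's.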
Fine-Grained Concurrency (how we get to 10^10 threads)
Programming system: the programmer expresses ALL of the concurrency; the programming system decides how much to exploit in space and how much to iterate over in time.
Architecture: fast, low-overhead thread-array creation and management; fast, low-overhead communication and synchronization; message-driven computing (active messages).
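A minimal sketch of "express all the concurrency, let the system split it between space and time" (the function name and structure are invented for illustration):

```python
def parallel_for(n, body, width=8):
    """Logically n independent threads; the runtime runs `width` of them
    'in space' per wave and iterates over the n/width waves 'in time'."""
    results = [None] * n
    for wave_start in range(0, n, width):
        # On real hardware this wave would execute as a concurrent
        # thread array; here each wave is just run sequentially.
        for i in range(wave_start, min(wave_start + width, n)):
            results[i] = body(i)
    return results

# One logical thread per element: a billion on the real machine,
# a hundred thousand here. The same program works at any width.
out = parallel_for(100_000, lambda i: i * i)
print(out[999])  # 998001
```

Because the program names all the concurrency up front, changing `width` changes the schedule but never the result.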
Dependability (how we get to an AMTTI of one day)
Programming system: an API to express what state to preserve, when to preserve it, which computations to check, and assertions. Its responsibilities: preserve state, generate recovery code, and generate redundant computation where appropriate.
Architecture: error checking on all memories and communication paths; hardware configurable to run duplex computations; hardware support for error containment and recovery.
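A hypothetical sketch of such a resilience API, preserving state on request and rolling back when a programmer-supplied check fails (class and method names are invented here, not Echelon's):

```python
import copy

class Resilient:
    """Toy resilience runtime: snapshot state, check results, recover."""
    def __init__(self):
        self._checkpoint = None

    def preserve(self, state):
        """Programmer: this state matters; snapshot it now."""
        self._checkpoint = copy.deepcopy(state)

    def run_checked(self, state, step, check):
        """Run one step; if the check (assertion) fails, roll back to the
        checkpoint and retry once -- the generated recovery code."""
        self.preserve(state)
        out = step(state)
        if not check(out):
            out = step(copy.deepcopy(self._checkpoint))
        return out

r = Resilient()
state = {"x": [1, 2, 3]}
calls = {"n": 0}

def flaky_step(s):
    calls["n"] += 1
    if calls["n"] == 1:
        return -1           # simulated transient error on first attempt
    return sum(s["x"])

result = r.run_checked(state, step=flaky_step, check=lambda v: v == 6)
print(result)  # 6: the failed first attempt was caught and re-run
```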
Security (from attackers and ourselves)
Key challenge: memory safety, against both malicious attacks and programmer memory bugs (bad pointer dereferencing, etc.).
Programming system: express partitioning of subsystems; express privileges on data structures.
Architecture: guarded pointers as a primitive. Hardware can check all memory references and address computations; fast, low-overhead subsystem entry; errors reported through the error-containment features.
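The guarded-pointer idea, a pointer that carries its own base and bounds so every address computation can be checked, can be sketched in software (conceptual only; in Echelon this checking is done by hardware):

```python
class GuardedPointer:
    """A pointer bundled with base and length; arithmetic and
    dereferences are both checked against the bounds."""
    def __init__(self, memory, base, length, offset=0):
        self.memory, self.base = memory, base
        self.length, self.offset = length, offset

    def add(self, k):
        """Pointer arithmetic, checked at computation time, not just use."""
        if not 0 <= self.offset + k < self.length:
            raise MemoryError("guarded pointer: out-of-bounds address")
        return GuardedPointer(self.memory, self.base,
                              self.length, self.offset + k)

    def load(self):
        return self.memory[self.base + self.offset]

mem = list(range(100))
p = GuardedPointer(mem, base=10, length=4)  # guards mem[10:14] only
print(p.add(3).load())  # 13: in bounds
try:
    p.add(4)            # one element past the guarded segment
except MemoryError as e:
    print("trapped:", e)
```

Catching the bad address at arithmetic time is what lets errors be contained before a stray store corrupts another subsystem.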
> 10x Energy Efficiency Gain (GFlops/Watt)
Contemporary GPU: ~300 pJ/flop. Future parallel systems: ~20 pJ/flop.
To get anywhere near Exascale in 2018, ~4x can come from process scaling to 10 nm; the remainder must come from the architecture and programming system:
Locality, both horizontal and vertical: reduce data movement, migrate fine-grained tasks to data
Extremely energy-efficient throughput cores: efficient instruction/data supply; simple hardware (static instruction scheduling, simple instruction control); multithreading and hardware support for thread arrays
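Putting the slide's numbers together shows how the gain factors out:

```python
# Where the >10x efficiency gain comes from (figures from the slide).
today_pj_per_flop  = 300   # contemporary GPU
target_pj_per_flop = 20    # Echelon target
process_gain       = 4     # from scaling to ~10 nm

total_gain = today_pj_per_flop / target_pj_per_flop  # 15x overall
arch_gain  = total_gain / process_gain               # ~3.75x left for
print(total_gain, arch_gain)                         # architecture + software
```

Process scaling covers barely a quarter of the required improvement; the remaining ~3.75x is what the locality and simple-core techniques above must deliver.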
An NVIDIA ExaScale Machine
Lane – 4 DFMAs, 16 GFLOPS
[Diagram: 4 DFMA units fed by operand registers, backed by main registers, 2 LSIs, an L0 I$, and an L0 D$]
Streaming Multiprocessor – 8 lanes, 128 GFLOPS
[Diagram: 8 lanes (P) connected through a switch to the L1$]
Echelon Chip – 128 SMs + 8 Latency Cores, 16 TFLOPS
128 SMs at 128 GFLOPS each; 8 latency cores (LC); 1024 SRAM banks of 256 KB each; NoC; memory controllers (MC); network interface (NI)
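The per-level peak numbers compose bottom-up. Note the slides never state a clock rate; the ~2 GHz below is inferred from 16 GFLOPS per lane with 4 DFMA units (one fused multiply-add counting as 2 flops), so treat it as an assumption:

```python
# Sanity-checking the deck's peak numbers, lane -> SM -> chip.
ghz           = 2.0   # ASSUMED clock, implied by the lane figure
flops_per_fma = 2     # one fused multiply-add = 2 flops
dfma_per_lane = 4
lanes_per_sm  = 8
sms_per_chip  = 128

lane_gf = dfma_per_lane * flops_per_fma * ghz  # 16 GFLOPS per lane
sm_gf   = lane_gf * lanes_per_sm               # 128 GFLOPS per SM
chip_tf = sm_gf * sms_per_chip / 1000          # ~16 TFLOPS per chip

sram_mb = 1024 * 256 / 1024                    # 256 MB on-chip SRAM
print(lane_gf, sm_gf, chip_tf, sram_mb)
```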
Node MCM – 16 TF + 256 GB
GPU chip: 16 TF DP, 256 MB on-chip SRAM; 1.6 TB/s DRAM bandwidth to DRAM stacks; 160 GB/s network bandwidth; NV memory
Cabinet – 128 Nodes – 2 PF – 38 kW
32 modules, 4 nodes per module, central router module(s), Dragonfly interconnect
[Diagram: node modules and router modules within the cabinet]
System – to ExaScale and Beyond
Dragonfly interconnect: 500 cabinets is ~1 EF and ~19 MW
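The system totals follow directly from the cabinet figures:

```python
# System-level totals implied by the cabinet slide.
node_tf           = 16
nodes_per_cabinet = 128
cabinet_kw        = 38
cabinets          = 500

cabinet_pf = node_tf * nodes_per_cabinet / 1000       # 2.048 PF/cabinet
system_ef  = cabinet_pf * cabinets / 1000             # ~1 EF
system_mw  = cabinet_kw * cabinets / 1000             # 19 MW
watts_per_node = cabinet_kw * 1000 / nodes_per_cabinet  # ~297 W/node
print(cabinet_pf, system_ef, system_mw, round(watts_per_node))
```

At ~1 EF and 19 MW the system lands near 50 GFlops/W, consistent with the ~20 pJ/flop target stated earlier.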
GPU Technology Conference 2011
Oct. 11-14 | San Jose, CA
The one event you can't afford to miss:
Learn about leading-edge advances in GPU computing
Explore the research as well as the commercial applications
Discover advances in computational visualization
Take a deep dive into parallel programming
Ways to participate
Speak – share your work and gain exposure as a thought leader
Register – learn from the experts and network with your peers
Exhibit/Sponsor – promote your company as a key player in the GPU ecosystem
www.gputechconf.com
Questions
NVIDIA Parallel Developer Program
All GPGPU developers should become NVIDIA Registered Developers.
Benefits include:
Early access to pre-release software: beta software and libraries
Submit and track issues and bugs
Announcing new benefits:
Exclusive Q&A webinars with NVIDIA Engineering
Exclusive deep-dive CUDA training webinars
In-depth engineering presentations on beta software
Sign up now: www.nvidia.com/paralleldeveloper