Roadmap and Change: How Much and How Fast

Thomas Sterling
California Institute of Technology and NASA Jet Propulsion Laboratory

Presentation to the 2004 Workshop on Extreme Supercomputing Panel
October 12, 2004
29 Years Ago Today
Linpack Zettaflops in 2032
[Chart: TOP500 Linpack performance trend and extrapolation, 1993-2043, log scale from 100 Mflops to 10 Zflops; curves for SUM, N=1, and N=500. The extrapolation crosses 1 Zflops around 2032. Courtesy of Thomas Sterling.]
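The 2032 date is a straight-line extrapolation of the TOP500 trend. A minimal sketch of that arithmetic, assuming (my numbers, not the slide's) the 2004 Linpack leader, the Earth Simulator at ~35.9 Tflops, and the historical ~1.9x annual growth of the N=1 score:

```python
import math

# Assumptions for the extrapolation (not stated on the slide):
start_year = 2004
start_flops = 35.9e12        # Earth Simulator, ~35.9 Tflops Linpack
growth_per_year = 1.9        # long-run TOP500 N=1 growth rate
target_flops = 1e21          # 1 Zettaflops

# Solve start_flops * growth_per_year**years = target_flops for years.
years = math.log(target_flops / start_flops) / math.log(growth_per_year)
print(f"1 Zflops around {start_year + years:.0f}")   # -> roughly 2031
```

Any plausible growth rate between 1.8x and 1.9x per year lands in the early 2030s, which is where the chart's crossing comes from.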
The Way We Were: 1974
- IBM 370: the market mainstream, approx. 1 Mflops
- DEC PDP-11: the geek's delight
- Seymour Cray started working on the Cray-1, approx. 100 Mflops
- 2nd-generation microprocessors, e.g. the Intel 8008
- Core memory; 1103 1Kx1 DRAM chips
- Punch cards, paper tape, teletypes, Selectrics
What Will Be Different
- Moore's Law will have flatlined
- Nano-scale, atomic-level devices (assuming we solve the lithography problem)
- Local clock rates ~100 GHz (the fastest today is > 700 GHz)
- Local actions strongly preferential to global actions
- Non-conventional technologies may be employed: optical, quantum dots, Rapid Single Flux Quantum (RSFQ) gates

[Diagram: RSFQ gate schematic with Josephson junctions JJ1 and JJ2 and inductor L1]
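A quick check (my arithmetic, not on the slide) of why locality dominates at these clock rates: light itself covers only a few millimeters per cycle.

```python
C = 3.0e8           # speed of light in m/s
CLOCK_HZ = 100e9    # assumed ~100 GHz local clock from this slide

# At 100 GHz a signal travels at most c / f per cycle: about 3 mm.
# Crossing a machine meters across therefore costs thousands of
# cycles, so actions must be kept local whenever possible.
reach_mm = C / CLOCK_HZ * 1000
print(f"one-cycle reach: {reach_mm:.1f} mm")   # -> 3.0 mm
```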
What We Will Need

- 1 nanowatt per Megaflops (the energy received from Tau Ceti, per m²)
- Approximately 1 square meter of ALUs for 1 Zettaflops
- 10 billion execution sites
- > 10 billion-way parallelism
- Including memory and communications: 2000 m²
- 3-D packaging: (4 m)³
- Global latency of ~10,000 cycles
- Including average latency => 1 trillion-way parallelism
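These budget figures are mutually consistent. A small sketch of the arithmetic (the 100 GHz clock is carried over from the previous slide; the ~100-cycle average exposed latency is the factor the trillion-way figure implies):

```python
TARGET_FLOPS = 1e21      # 1 Zettaflops
WATTS_PER_MFLOPS = 1e-9  # the slide's power target: 1 nW per Megaflops
CLOCK_HZ = 100e9         # assumed local clock rate (previous slide)

# Power: 1e15 Mflops at 1 nW each -> 1 MW for the whole machine.
power_mw = TARGET_FLOPS / 1e6 * WATTS_PER_MFLOPS / 1e6
print(f"power: {power_mw:.0f} MW")

# Execution sites at one flop per cycle -> 10 billion sites, hence
# > 10 billion-way parallelism just to keep every ALU busy.
sites = TARGET_FLOPS / CLOCK_HZ
print(f"execution sites: {sites:.0e}")

# Hiding an average exposed latency of ~100 cycles multiplies the
# required concurrency by ~100x -> ~1 trillion-way parallelism.
print(f"with latency hiding: {sites * 100:.0e}")
```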
Parcel Simulation Latency Hiding Experiment
[Diagram: two node designs attached to a flat network of nodes. Control experiment: a process-driven node (local memory plus ALU) issuing blocking remote memory requests. Test experiment: a parcel-driven node (local memory plus ALU) exchanging input and output parcels in place of blocking remote requests.]
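To make the setup concrete, here is a minimal sketch (my own construction, not Sterling's simulator) of the two node styles: a process-driven node that stalls on every remote access, and a parcel-driven node that keeps executing pending work while requests are in flight. The `latency`, `remote_fraction`, and `max_pending` parameters mirror the knobs varied on the next two slides.

```python
import random

def process_driven(ops, latency, remote_fraction, rng):
    """Blocking node: each remote access stalls the ALU for `latency` cycles."""
    cycles = 0
    for _ in range(ops):
        cycles += 1                        # one cycle of local work
        if rng.random() < remote_fraction:
            cycles += latency              # stall until the reply arrives
    return cycles

def parcel_driven(ops, latency, remote_fraction, rng, max_pending):
    """Parcel node: remote accesses become parcels, and the ALU switches
    to other pending work instead of stalling, up to `max_pending` in flight."""
    cycles, done, in_flight = 0, 0, []     # in_flight holds reply times
    while done < ops:
        in_flight = [t for t in in_flight if t > cycles]  # retire replies
        if len(in_flight) < max_pending:
            cycles += 1                    # execute one operation locally
            done += 1
            if rng.random() < remote_fraction:
                in_flight.append(cycles + latency)
        else:
            cycles += 1                    # every slot pending: idle cycle
    return max([cycles] + in_flight)       # drain outstanding parcels

rng = random.Random(42)
args = dict(ops=100_000, latency=1_000, remote_fraction=0.01, rng=rng)
print("process-driven cycles:", process_driven(**args))
print("parcel-driven cycles: ", parcel_driven(**args, max_pending=64))
```

With these parameters the blocking node spends roughly ten cycles stalled for every cycle of work, while the parcel node stays near full utilization; the charts that follow quantify that ratio.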
Latency Hiding with Parcels with respect to System Diameter in Cycles

[Chart: sensitivity to remote latency and remote access fraction, 16 nodes. X-axis: remote memory latency (cycles). Y-axis: total transactional work done / total process work done, 0.1 to 1000. One curve per remote access fraction (1/4%, 1/2%, 1%, 2%, 4%); degree of parallelism (pending parcels per node at t=0) annotated in red at 1, 2, 4, 16, 64, and 256.]
Latency Hiding with Parcels: Idle Time with respect to Degree of Parallelism

[Chart: X-axis: parallelism level (parcels per node at time=0), 1 to 256. Y-axis: idle time per node (cycles), 0 to 8e5. Curves for process and transaction (parcel) execution, with the number of nodes (1, 2, 4, 8, 16, 32, 64, 128, 256) annotated in black.]
Architecture Innovation
The needs: extreme memory bandwidth, active latency hiding, extreme parallelism, and message-driven split-transaction computation (parcels). Two existing architecture families each deliver part of this:

- PIM (e.g. Kogge, Draper, Sterling, ...):
  - Very high memory bandwidth
  - Lower memory latency (on chip)
  - Higher execution parallelism (banks and row-wide operation)
- Streaming (Dally, Keckler, ...):
  - Very high functional parallelism
  - Low latency (between functional units)
  - Higher execution parallelism (high ALU density)
Continuum Computer Architecture
- Merges state, logic, and communication in a single building block
- Parcel-driven computation:
  - Fine-grain split-transaction computing
  - Moves data through vectors of instructions in the store, or moves the instruction stream through a vector of data
  - Gather-scatter as an intrinsic
  - Very efficient futures for producer/multiple-consumer computing (sketched below)
- Combines the strengths of PIM and Streaming:
  - All-register architecture (fully associative)
  - Functional units within a cycle of their neighbors
  - Extreme parallelism
  - Intrinsic latency hiding
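A future here is a placeholder for a value still being produced that any number of consumers may wait on. A minimal sketch of the producer/multiple-consumer pattern (plain Python threads standing in for parcel-driven hardware; all names are illustrative):

```python
import threading
from concurrent.futures import Future

def producer(fut: Future) -> None:
    # The single producer resolves the future exactly once.
    fut.set_result(sum(range(1_000_000)))

def consumer(fut: Future, name: str) -> None:
    # Each consumer blocks only until the value exists; all then proceed.
    print(name, "got", fut.result())

fut: Future = Future()
consumers = [threading.Thread(target=consumer, args=(fut, f"c{i}"))
             for i in range(3)]
for t in consumers:
    t.start()        # consumers start waiting on the unresolved future
threading.Thread(target=producer, args=(fut,)).start()
for t in consumers:
    t.join()
```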
[Diagram: the Continuum Computer Architecture building block - an instruction register, control logic, and associative memory feeding a dense array of ALUs.]
Conclusions
- Zettaflops at nano-scale technology is possible
  - Size requirements are tolerable, though packaging is a challenge
  - The latency challenge does not sink the idea
- Major obstacles: power, latency, parallelism, reliability, programming
- Architecture can address many of these
  - The Continuum Computer Architecture combines the advantages of PIM and streaming
  - A strong candidate for a future Zettaflops computer