Roadmap and Change: How Much and How Fast

Thomas Sterling
California Institute of Technology and NASA Jet Propulsion Laboratory

Presentation to the 2004 Workshop on Extreme Supercomputing Panel
October 12, 2004
29 Years Ago Today
Linpack Zettaflops in 2032
[Chart: TOP500 Linpack performance trend and extrapolation, 1993-2043, log scale from 100 Mflops to 10 Zflops; curves for SUM, N=1, and N=500. The extrapolation crosses 1 Zflops around 2032. Courtesy of Thomas Sterling.]
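The 2032 date is a straight-line extrapolation of the TOP500 trend. A minimal sketch of that arithmetic, assuming (my numbers, not the slide's) the 2004 Linpack leader, the Earth Simulator at ~35.9 Tflops, and the historical ~1.9x annual growth of the N=1 score:

```python
import math

# Assumptions for the extrapolation (not stated on the slide):
start_year = 2004
start_flops = 35.9e12        # Earth Simulator, ~35.9 Tflops Linpack
growth_per_year = 1.9        # long-run TOP500 N=1 growth rate
target_flops = 1e21          # 1 Zettaflops

# Solve start_flops * growth_per_year**years = target_flops for years.
years = math.log(target_flops / start_flops) / math.log(growth_per_year)
print(f"1 Zflops around {start_year + years:.0f}")   # -> roughly 2031
```

Any plausible growth rate between 1.8x and 1.9x per year lands in the early 2030s, which is where the chart's crossing comes from.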
The Way We Were: 1974
- IBM 370: the market mainstream, approx. 1 Mflops
- DEC PDP-11: the geek's delight
- Seymour Cray started working on the Cray-1, approx. 100 Mflops
- 2nd-generation microprocessors, e.g. the Intel 8008
- Core memory; 1103 1Kx1 DRAM chips
- Punch cards, paper tape, teletypes, Selectrics
What Will Be Different
- Moore's Law will have flatlined
- Nano-scale, atomic-level devices (assuming we solve the lithography problem)
- Local clock rates ~100 GHz (the fastest today is > 700 GHz)
- Local actions strongly preferential to global actions
- Non-conventional technologies may be employed: optical, quantum dots, Rapid Single Flux Quantum (RSFQ) gates

[Diagram: RSFQ gate schematic with Josephson junctions JJ1 and JJ2 and inductor L1]
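A quick check (my arithmetic, not on the slide) of why locality dominates at these clock rates: light itself covers only a few millimeters per cycle.

```python
C = 3.0e8           # speed of light in m/s
CLOCK_HZ = 100e9    # assumed ~100 GHz local clock from this slide

# At 100 GHz a signal travels at most c / f per cycle: about 3 mm.
# Crossing a machine meters across therefore costs thousands of
# cycles, so actions must be kept local whenever possible.
reach_mm = C / CLOCK_HZ * 1000
print(f"one-cycle reach: {reach_mm:.1f} mm")   # -> 3.0 mm
```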
What We Will Need

- 1 nanowatt per Megaflops (the energy received from Tau Ceti, per m²)
- Approximately 1 square meter of ALUs for 1 Zettaflops
- 10 billion execution sites
- > 10 billion-way parallelism
- Including memory and communications: 2000 m²
- 3-D packaging: (4 m)³
- Global latency of ~10,000 cycles
- Including average latency => 1 trillion-way parallelism
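These budget figures are mutually consistent. A small sketch of the arithmetic (the 100 GHz clock is carried over from the previous slide; the ~100-cycle average exposed latency is the factor the trillion-way figure implies):

```python
TARGET_FLOPS = 1e21      # 1 Zettaflops
WATTS_PER_MFLOPS = 1e-9  # the slide's power target: 1 nW per Megaflops
CLOCK_HZ = 100e9         # assumed local clock rate (previous slide)

# Power: 1e15 Mflops at 1 nW each -> 1 MW for the whole machine.
power_mw = TARGET_FLOPS / 1e6 * WATTS_PER_MFLOPS / 1e6
print(f"power: {power_mw:.0f} MW")

# Execution sites at one flop per cycle -> 10 billion sites, hence
# > 10 billion-way parallelism just to keep every ALU busy.
sites = TARGET_FLOPS / CLOCK_HZ
print(f"execution sites: {sites:.0e}")

# Hiding an average exposed latency of ~100 cycles multiplies the
# required concurrency by ~100x -> ~1 trillion-way parallelism.
print(f"with latency hiding: {sites * 100:.0e}")
```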
Parcel Simulation Latency Hiding Experiment
[Diagram: two node designs attached to a flat network of nodes. Control experiment: a process-driven node (local memory plus ALU) issuing blocking remote memory requests. Test experiment: a parcel-driven node (local memory plus ALU) exchanging input and output parcels in place of blocking remote requests.]
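To make the setup concrete, here is a minimal sketch (my own construction, not Sterling's simulator) of the two node styles: a process-driven node that stalls on every remote access, and a parcel-driven node that keeps executing pending work while requests are in flight. The `latency`, `remote_fraction`, and `max_pending` parameters mirror the knobs varied on the next two slides.

```python
import random

def process_driven(ops, latency, remote_fraction, rng):
    """Blocking node: each remote access stalls the ALU for `latency` cycles."""
    cycles = 0
    for _ in range(ops):
        cycles += 1                        # one cycle of local work
        if rng.random() < remote_fraction:
            cycles += latency              # stall until the reply arrives
    return cycles

def parcel_driven(ops, latency, remote_fraction, rng, max_pending):
    """Parcel node: remote accesses become parcels, and the ALU switches
    to other pending work instead of stalling, up to `max_pending` in flight."""
    cycles, done, in_flight = 0, 0, []     # in_flight holds reply times
    while done < ops:
        in_flight = [t for t in in_flight if t > cycles]  # retire replies
        if len(in_flight) < max_pending:
            cycles += 1                    # execute one operation locally
            done += 1
            if rng.random() < remote_fraction:
                in_flight.append(cycles + latency)
        else:
            cycles += 1                    # every slot pending: idle cycle
    return max([cycles] + in_flight)       # drain outstanding parcels

rng = random.Random(42)
args = dict(ops=100_000, latency=1_000, remote_fraction=0.01, rng=rng)
print("process-driven cycles:", process_driven(**args))
print("parcel-driven cycles: ", parcel_driven(**args, max_pending=64))
```

With these parameters the blocking node spends roughly ten cycles stalled for every cycle of work, while the parcel node stays near full utilization; the charts that follow quantify that ratio.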
Latency Hiding with Parcels with respect to System Diameter in Cycles

[Chart: sensitivity to remote latency and remote access fraction, 16 nodes. X-axis: remote memory latency (cycles). Y-axis: total transactional work done / total process work done, 0.1 to 1000. One curve per remote access fraction (1/4%, 1/2%, 1%, 2%, 4%); degree of parallelism (pending parcels per node at t=0) annotated in red at 1, 2, 4, 16, 64, and 256.]
Latency Hiding with Parcels: Idle Time with respect to Degree of Parallelism

[Chart: X-axis: parallelism level (parcels per node at time=0), 1 to 256. Y-axis: idle time per node (cycles), 0 to 8e5. Curves for process and transaction (parcel) execution, with the number of nodes (1, 2, 4, 8, 16, 32, 64, 128, 256) annotated in black.]
Architecture Innovation
The needs: extreme memory bandwidth, active latency hiding, extreme parallelism, and message-driven split-transaction computation (parcels). Two existing architecture families each deliver part of this:

- PIM (e.g. Kogge, Draper, Sterling, ...):
  - Very high memory bandwidth
  - Lower memory latency (on chip)
  - Higher execution parallelism (banks and row-wide operation)
- Streaming (Dally, Keckler, ...):
  - Very high functional parallelism
  - Low latency (between functional units)
  - Higher execution parallelism (high ALU density)
Continuum Computer Architecture
- Merges state, logic, and communication in a single building block
- Parcel-driven computation:
  - Fine-grain split-transaction computing
  - Moves data through vectors of instructions in the store, or moves the instruction stream through a vector of data
  - Gather-scatter as an intrinsic
  - Very efficient futures for producer/multiple-consumer computing (sketched below)
- Combines the strengths of PIM and Streaming:
  - All-register architecture (fully associative)
  - Functional units within a cycle of their neighbors
  - Extreme parallelism
  - Intrinsic latency hiding
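A future here is a placeholder for a value still being produced that any number of consumers may wait on. A minimal sketch of the producer/multiple-consumer pattern (plain Python threads standing in for parcel-driven hardware; all names are illustrative):

```python
import threading
from concurrent.futures import Future

def producer(fut: Future) -> None:
    # The single producer resolves the future exactly once.
    fut.set_result(sum(range(1_000_000)))

def consumer(fut: Future, name: str) -> None:
    # Each consumer blocks only until the value exists; all then proceed.
    print(name, "got", fut.result())

fut: Future = Future()
consumers = [threading.Thread(target=consumer, args=(fut, f"c{i}"))
             for i in range(3)]
for t in consumers:
    t.start()        # consumers start waiting on the unresolved future
threading.Thread(target=producer, args=(fut,)).start()
for t in consumers:
    t.join()
```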
[Diagram: the Continuum Computer Architecture building block - an instruction register, control logic, and associative memory feeding a dense array of ALUs.]
Conclusions
- Zettaflops at nano-scale technology is possible
  - Size requirements are tolerable, though packaging is a challenge
  - The latency challenge does not sink the idea
- Major obstacles: power, latency, parallelism, reliability, programming
- Architecture can address many of these
  - The Continuum Computer Architecture combines the advantages of PIM and streaming
  - A strong candidate for a future Zettaflops computer