Computer Architecture for the Next Millennium
November 1, 1999

William J. Dally
Computer Systems Laboratory
Stanford University
[email protected]
Computer Architecture for the Next Millennium - WJD - November 1, 1999
Outline

• The Stanford Concurrent VLSI Architecture Group
• Forces acting on computer architecture
  – applications (media)
  – technology (wire-limited)
  – techniques (explicit parallelism)
• Example: register organization
  – distributed register files
• Imagine: a stream processor
  – 20 GFLOPS on a 0.5cm² chip
• Tremendous opportunities and challenges for computer architecture in the next millennium
  – it's not a mature field yet
The Concurrent VLSI Architecture Group

• Architecture and design technology for VLSI
• Routing chips
  – Torus Routing Chip, Network Design Frame, Reliable Router
  – Basis for Intel, Cray/SGI, Mercury, Avici network chips
Parallel computer systems

• J-Machine (MDP) led to Cray T3D/T3E
• M-Machine (MAP)
  – Fast messaging, scalable processing nodes, scalable memory architecture

[Figure: MDP chip, J-Machine, Cray T3D, and MAP chip]
Design technology

• Off-chip I/O
  – Simultaneous bidirectional signaling, 1989
    • now used by Intel and Hitachi
  – High-speed signaling
    • 4Gb/s in 0.6µm CMOS, equalization, 1995
• On-chip signaling
  – Low-voltage on-chip signaling
  – Low-skew clock distribution
• Synchronization
  – Mesochronous, plesiochronous
  – Self-timed design

[Figure: eye diagram of 4Gb/s CMOS I/O, 250ps/division]
What is Computer Architecture?

[Figure: the computer architect mediates between Technology and Applications, producing Interfaces (ISA, API, Link, I/O Channel), Machine Organization (Regs, IR), and Measurement & Evaluation]
Forces Acting on Architecture

• Applications - shifting towards media applications dealing with streams of low-precision samples
  – video, graphics, audio, DSL modems, cellular base stations
• Technology - becoming wire-limited
  – power and delay dominated by communication, not arithmetic
  – global structures: register files and instruction issue don't scale
• Technique - microarchitecture - ILP has been mined out
  – to the point of diminishing returns on squeezing performance from sequential code
  – explicit parallelism (data parallelism and thread-level parallelism) required to continue scaling performance
Applications

• Little locality of reference
  – read each pixel once
  – often non-unit stride
  – but there is producer-consumer locality
• Very high arithmetic intensity
  – 100s of arithmetic operations per memory reference
• Dominated by low-precision (16-bit) integer operations
Wires Are Becoming Like Wet Noodles

[Figure: signal waveforms along a minimum-width wire in a 0.35µm process at lengths of 0.0mm, 2.5mm, 5.0mm, 7.5mm, and 10.0mm]
Technology scaling makes communication the scarce resource

                 1999                     2008
Process          0.18µm                   0.07µm
DRAM             256Mb                    4Gb
Processors       16 64b FP @ 500MHz       256 64b FP @ 2.5GHz
Chip edge        18mm                     25mm
Wire tracks      30,000                   120,000
Chip crossing    1 clock                  16 clocks
Repeaters        every 3mm                every 0.5mm
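The numbers above imply how far a signal can travel per clock in each generation; a small sketch, using only the figures from this slide, makes the trend concrete:

```python
# Distance a global wire can cover per clock, from the slide's numbers:
# 1999: an 18mm chip crossed in 1 clock; 2008: a 25mm chip in 16 clocks.
generations = {
    "1999 (0.18um, 500MHz)": (18.0, 1),    # (chip edge in mm, clocks to cross)
    "2008 (0.07um, 2.5GHz)": (25.0, 16),
}

reach = {}
for gen, (chip_mm, clocks) in generations.items():
    reach[gen] = chip_mm / clocks          # mm reachable in one clock
    print(f"{gen}: {reach[gen]:.2f} mm per clock")
```

The reachable distance per clock drops by more than an order of magnitude across the two generations, which is why architectures built around global communication stop scaling.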
Care and Feeding of ALUs

[Figure: the 'feeding' structure dwarfs the ALU - instruction bandwidth (instruction cache, IR, IP) and data bandwidth (registers) surround a comparatively small ALU]
What Does This Say About Architecture?

• Tremendous opportunities
  – Media problems have lots of parallelism and locality
  – VLSI technology enables 100s of ALUs per chip (1000s soon)
    • (in 0.18µm: 0.1mm² per integer adder, 0.5mm² per FP adder)
• Challenging problems
  – Locality - global structures won't work
  – Explicit parallelism - ILP won't keep 100 ALUs busy
  – Memory - streaming applications don't cache well
• It's time to try some new approaches
Example: Register File Organization

• Register files serve two functions:
  – Short-term storage for intermediate results
  – Communication between multiple function units
• Global register files don't scale well as N, the number of ALUs, increases
  – Need more registers to hold more results (grows with N)
  – Need more ports to connect all of the units (grows with N²)
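A common first-order model makes the scaling explicit (this is a generic sketch, not the area model used in the talk): each ALU needs a fixed number of ports into a shared file, register-cell area grows with the square of the port count, and the register count grows with N, so a central file's area grows roughly as N³ while per-ALU distributed files grow only linearly in N:

```python
def central_rf_area(n_alus, regs_per_alu=16):
    """First-order area of one shared register file: cell area ~ ports^2."""
    ports = 3 * n_alus               # e.g., 2 reads + 1 write per ALU
    regs = regs_per_alu * n_alus     # storage grows with N
    return regs * ports ** 2         # area in arbitrary wire-grid units

def distributed_rf_area(n_alus, regs_per_alu=16, ports_per_file=4):
    """N small per-ALU files, each with a fixed port count."""
    return n_alus * regs_per_alu * ports_per_file ** 2

for n in (4, 16, 64):
    ratio = central_rf_area(n) / distributed_rf_area(n)
    print(f"N={n:3d}: central/distributed area ratio = {ratio:.0f}x")
```

With ports proportional to N, the central file's disadvantage grows quadratically in N, which is the motivation for the distributed and clustered organizations on the following slides.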
Register Cells are Mostly Switch

[Figure: register-cell layouts - bit lines, word lines, Vdd, and Gnd wiring dominate the cell area, leaving only about one wire grid for the storage itself]
Register Architecture for 'wide' Processors

[Figure: four register organizations - (A) a central register file shared by N arithmetic units, (B) distributed register files among N arithmetic units, (C) C SIMD clusters of N/C arithmetic units each, and (D) C SIMD clusters with distributed register files]
Area of Register Organizations
Central
SIMD
DRF
SIMD/DRF
0.1
1
10
100
1000
1 10 100 1000
Number of Arithmetic Units
Delay of Register Organizations

[Figure: log-log plot of register file delay (0.1 to 1000) versus number of arithmetic units (1 to 1000) for the Central, DRF, SIMD/DRF, and SIMD organizations]
Performance of Register Organizations

[Figure: bar charts of normalized performance (0.0 to 1.2) for the Central, SIMD, SIMD/DRF, Hierarchical, and Stream organizations - (A) raw performance and (B) performance with latency]
Stubs Abstract the Communication Between Operations

[Figure: Op 1 executes on FU(Op 1) and Op 2 on FU(Op 2), each with its own register file; a write stub, a data transfer, and a read stub carry the result from one to the other]
A Communication Example

[Figure: three alternatives (a), (b), and (c) for communicating the intermediate value t1 between an adder (+ k), a multiplier (t1 * t2 → x), and a load/store (L/S) unit across instructions 5 and 6]
The Imagine Stream Processor

[Figure: Imagine block diagram - a host processor attaches through a host interface and a network through a network interface; a stream register file feeds eight ALU clusters (0-7) under a microcontroller; a streaming memory system connects to four SDRAMs]
Data Bandwidth Hierarchy

[Figure: four SDRAMs at 3.2GB/s feed the stream register file at 64GB/s, which feeds the ALU clusters at 544GB/s]
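The three levels are deliberately spaced roughly an order of magnitude apart; the ratios follow directly from the slide's numbers:

```python
# Bandwidth at each level of Imagine's hierarchy (GB/s), from the slide.
levels = {"SDRAM": 3.2, "stream register file": 64.0, "ALU clusters": 544.0}

names = list(levels)
for lower, upper in zip(names, names[1:]):
    ratio = levels[upper] / levels[lower]
    print(f"{upper} / {lower}: {ratio:.1f}x")
```

Each level supplies far more bandwidth than the one below it, so kernels that keep their traffic at the local and stream levels rarely wait on memory.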
Cluster Architecture

• VLIW organization with shared control
• Local register files provide high data bandwidth

[Figure: one ALU cluster - three adders, two multipliers, and a divider, each fed by local register files through a cross-point switch, connected to the stream buffers and the intercluster network under a control unit (CU)]
Imagine is a Stream Processor

• Instructions are Load, Store, and Operate
  – operands are streams
  – also Send and Receive for multiple-Imagine systems
• Operate performs a compound stream operation
  – read elements from input streams
  – perform a local computation
  – append elements to output streams
  – repeat until the input stream is consumed
  – (e.g., triangle transform)
• Order of magnitude less global register bandwidth than a vector processor
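In software terms, a compound stream operation behaves like a function mapped over whole streams: all intermediates stay local to the kernel, and only the input and output streams cross the stream register file. A minimal Python sketch of the pattern (the kernel body here is a made-up placeholder, not Imagine's actual triangle transform):

```python
def stream_op(kernel, *input_streams):
    """Apply a compound kernel element-by-element until input is consumed.

    All intermediate values live inside `kernel` (the 'local' level);
    only the input and output streams touch the global (stream) level.
    """
    output = []
    for elements in zip(*input_streams):
        output.append(kernel(*elements))   # local computation per element
    return output

# Hypothetical kernel: scale-and-bias each sample, entirely locally.
def scale_bias(x):
    return 2 * x + 1

result = stream_op(scale_bias, [1, 2, 3, 4])
print(result)   # [3, 5, 7, 9]
```

Because the kernel touches the stream level only once per input and output element, global register traffic scales with stream length rather than with the number of operations, which is the source of the order-of-magnitude saving claimed above.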
Triangle Rendering

[Figure: a triangle rendering pipeline mapped onto Imagine - kernels (Transform, Project/Cull, Shade, Span Setup, Process Span, Sort, Compact, Z-Composite) run on the arithmetic clusters, passing streams of triangle records, shaded triangle records, projected triangle records, span records, fragment records, and pixel depth & color through the stream register file (register bandwidth), with the input data and image buffer in memory (memory bandwidth)]
Transform Kernel

Bandwidth demands, in references per ∆:

             Stream    Scalar        Vector
Memory       5.5       117 (21.3)    48 (8.7)
Global RF    48        624 (13.0)    261 (5.4)
Local RF     372       N/A           N/A
Data Parallelism is Easier than ILP

Kernel             1-to-8 Cluster Speedup
FFT (1024)         6.4
DCT (8x8)          7.8
Blockwarp (8x8)    7.2
Transform (∆)      8.0
Harmonic Mean      7.3
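The table's summary row can be checked directly; the harmonic mean is the conventional way to average speedups:

```python
# Reproduce the harmonic-mean speedup from the per-kernel values above.
speedups = {"FFT (1024)": 6.4, "DCT (8x8)": 7.8,
            "Blockwarp (8x8)": 7.2, "Transform": 8.0}

hmean = len(speedups) / sum(1.0 / s for s in speedups.values())
print(f"harmonic mean = {hmean:.1f}")   # matches the table's 7.3
```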
Conventional Approaches to Data-Dependent Conditional Execution

[Figure: three approaches to a data-dependent branch on x>0 selecting between blocks B and C, between blocks A and J/K - (a) branching control flow, (b) speculation, where a mispredicted branch ("whoops") costs a speculative loss of D x W, ~100s of issue slots, and (c) predication with y=(x>0), executing both the "if y" and "if ~y" paths with exponentially decreasing duty factor]
Zero-Cost Conditionals

• Most approaches to conditional operations are costly
  – Branching control flow - dead issue slots on mispredicted branches
  – Predication (SIMD select, masked vectors) - a large fraction of execution 'opportunities' go idle
• Conditional streams
  – append an element to an output stream depending on a case variable

[Figure: a value stream and a {0,1} case stream combine to produce a compacted output stream]
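Functionally, a conditional stream is a compaction: elements whose case bit is set are appended to the output, so downstream kernels see a dense, homogeneous stream instead of predicated-off lanes. A small sketch of the semantics (not Imagine's hardware interface):

```python
def conditional_append(value_stream, case_stream):
    """Append each value to the output only where its case bit is 1."""
    return [v for v, c in zip(value_stream, case_stream) if c == 1]

values = [10, 20, 30, 40, 50]
cases  = [1, 0, 1, 1, 0]      # one {0,1} case bit per element
print(conditional_append(values, cases))   # [10, 30, 40]
```

Every element of the compacted output then does useful work, avoiding both the mispredicted-branch penalty of (a)/(b) and the idle predicated slots of (c) on the previous slide.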
Sustainable Performance

[Figure: sustained GOPS (0 to 30) for the Transform, FFT, FIR, Blockwarp, Sort, Merge, DCT, and FIR kernels, with per-kernel totals ranging from roughly 0.8 to 24.4 GOPS, each bar broken down into communication, 16-bit arithmetic, 32-bit arithmetic, and FP arithmetic]
Power Comparison

[Figure: GOPS/W (0 to 6) for Imagine, AD 21160, TI C67x, and StrongARM SA-1100 on a 1024-point floating-point FFT, communications, and Dhrystone workloads; reported values include 3.7, 1.5, 1.2, 0.28, and 0.22 GOPS/W]

- Source: web pages of Intel, TI, and Analog Devices
Power and Performance
34%
45%
9%
9% 3%
00.5
11.5
22.5
33.5
W
FIR
FFT
BWAR
P
XFOR
M
MER
GE DCT
OtherMem SysSRFClustersClock
11.5 5.2 5.2 3.8 7.1 14.1GOPs/W:
A Look Inside an Application: Stereo Depth Extraction

• 320x240 8-bit grayscale images
• 30-disparity search
• 220 frames/second
• 12.7 GOPS
• 5.7 GOPS/W
[Figure: execution traces of the clusters and two memory units - the convolution phase interleaves UNPACK, CONV 7x7, and CONV 3x3 kernels on the clusters with row loads and stores in memory; the disparity-search phase interleaves BlockSAD kernels with loads and stores]

Stereo depth extractor:
• Convolutions: load original packed row → unpack (8-bit → 16-bit) → 7x7 convolve → 3x3 convolve → store convolved row
• Disparity search: load convolved rows → calculate block SADs at different disparities → store best disparity values
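The disparity-search step can be sketched in a few lines: for each candidate disparity, shift one image against the other and sum absolute differences (SAD) over a block, keeping the disparity with the smallest SAD. This is a hypothetical scalar rendition of the BlockSAD kernels, not the Imagine code:

```python
def best_disparity(left_row, right_row, x, block, max_disp):
    """Return the disparity with minimum block SAD at column x."""
    best_d, best_sad = 0, float("inf")
    for d in range(max_disp + 1):
        # Sum of absolute differences over the block, skipping
        # right-image columns that fall outside the row.
        sad = sum(abs(left_row[x + i] - right_row[x + i - d])
                  for i in range(block)
                  if 0 <= x + i - d < len(right_row))
        if sad < best_sad:
            best_d, best_sad = d, sad
    return best_d

# Tiny example: the right row is the left row shifted by 2 pixels.
left  = [0, 0, 9, 8, 7, 6, 0, 0, 0, 0]
right = [9, 8, 7, 6, 0, 0, 0, 0, 0, 0]
print(best_disparity(left, right, x=2, block=4, max_disp=3))   # 2
```

On Imagine, this inner loop maps naturally onto the clusters: each cluster evaluates SADs for different columns in parallel, with the convolved rows streaming in from the stream register file.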
7x7 Convolve Kernel

[Figure: VLIW schedule of the 7x7 convolve kernel - each row is a cycle, with operations (IMULRND16, IADDS16, SHUFFLE, SELECT, PASS, etc.) issued across the cluster's units (ADD0-ADD2, MUL0-MUL1, DIV0, input/output ports, scratchpad, communication unit, and microcontroller); nearly every multiplier and adder slot is filled]
Imagine Summary

• Imagine operates on streams of records
  – simplifies programming
  – exposes locality and concurrency
• Compound stream operations
  – perform a subroutine on each stream element
  – reduce global register bandwidth
• Bandwidth hierarchy
  – use bandwidth where it's inexpensive
  – distributed and hierarchical register organization
• Conditional stream operations
  – sort elements into homogeneous streams
  – avoid predication and speculation
Computer Architecture for the Next Millennium

• Applications and technology are changing
  – media applications process streams of low-precision samples
  – wires dominate gates
• ILP is at the point of diminishing returns
• Tremendous opportunities for new architectures
  – new applications have lots of parallelism and locality
  – modern technology can build chips with 100s of ALUs (32b FP), 1000s in the near future
• The challenge is to develop architectures
  – that can harness this potential performance
  – in a way that can be easily programmed
• Stream processing is one approach; there are many others. We need to start exploring them.