37
Computer Architecture for the Next Millenium November 1, 1999 William J. Dally Computer Systems Laboratory Stanford University [email protected]

Computer Architecture for the Next Millenium

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Computer Architecture for the Next Millenium

Computer Architecture for theNext Millenium

November 1, 1999

William J. DallyComputer Systems Laboratory

Stanford [email protected]

Page 2: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 2

Outline

• The Stanford Concurrent VLSI Architecture Group• Forces acting on computer architecture

– applications (media)– technology (wire-limited)– techniques (explicit parallelism)

• Example: register organization– distributed register files

• Imagine a stream processor– 20GFLOPS on a 0.5cm2 chip

• Tremendous opportunities and challenges forcomputer architecture in the next millenium– its not a mature field yet

Page 3: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 3

The Concurrent VLSI Architecture Group

• Architecture and design technology for VLSI• Routing chips

– Torus Routing Chip, Network Design Frame, Reliable Router– Basis for Intel, Cray/SGI, Mercury, Avici network chips

Page 4: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 4

Parallel computer systems

• J-Machine (MDP) led to Cray T3D/T3E• M-Machine (MAP)

– Fast messaging, scalable processing nodes, scalablememory architecture

MDP Chip J-Machine Cray T3D MAP Chip

Page 5: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 5

Design technology

• Off-chip I/O– Simultaneous bidirectional signaling, 1989

• now used by Intel and Hitachi– High-speed signalling

• 4Gb/s in 0.6µm CMOS, Equalization, 1995

• On-Chip Signalling– Low-voltage on-chip signalling– Low-skew clock distribution

• Synchronization– Mesochronous, Plesiochronous– Self-Timed Design

4Gb/s CMOS I/O

250ps/division

Page 6: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 6

What is Computer Architecture?

Technology

Applications ComputerArchitect

Interfaces

Machine Organization

Measurement &Evaluation

ISA

AP

I

Link

I/O C

han

Regs

IR

Page 7: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 7

Forces Acting on Architecture

• Applications - shifting towards media applications dealingwith streams of low-precision samples– video, graphics, audio, DSL modems, cellular base stations

• Technology - becoming wire-limited– power and delay dominated by communication, not arithmetic– global structures: register files and instruction issue don’t scale

• Technique - Micro-architecture - ILP has been mined out– to the point of diminishing returns on squeezing performance

from sequential code– explicit parallelism (data parallelism and thread-level parallelism)

required to continue scaling performance

Page 8: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 8

Applications

• Little locality of reference– read each pixel once– often non-unit stride– but there is producer-consumer

locality• Very high arithmetic intensity

– 100s of arithmetic operationsper memory reference

• Dominated by low-precision(16-bit) integer operations

Page 9: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 9

Wires Are Becoming Like Wet Noodles

0.0mm

2.5mm

5.0mm

7.5mm

10.0mm

Minimum widthwire in an 0.35µmprocess

Page 10: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 10

Technology scaling makes communication thescarce resource

0.18µm256Mb DRAM16 64b FP Proc

500MHz

0.07µm4Gb DRAM

256 64b FP Proc2.5GHz

P

1999 2008

18mm30,000 tracks

1 clockrepeaters every 3mm

25mm120,000 tracks

16 clocksrepeaters every 0.5mm

P

Page 11: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 11

Care and Feeding of ALUs

DataBandwidth

Instruction Bandwidth

Regs

Instr.Cache

IR

IP‘Feeding’ Structure Dwarfs ALU

Page 12: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 12

What Does This Say About Architecture?

• Tremendous opportunities– Media problems have lots of parallelism and locality– VLSI technology enables 100s of ALUs per chip (1000s soon)

• (in 0.18um 0.1mm2 per integer adder, 0.5mm2 per FP adder)

• Challenging problems– Locality - global structures won’t work– Explicit parallelism - ILP won’t keep 100 ALUs busy– Memory - streaming applications don’t cache well

• Its time to try some new approaches

Page 13: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 13

ExampleRegister File Organization

• Register files serve two functions:– Short term storage for intermediate results– Communication between multiple function units

• Global register files don’t scale well as N, number ofALUs increases– Need more registers to hold more results (grows with N)– Need more ports to connect all of the units (grows with N2)

Page 14: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 14

Register Cells are Mostly Switch

Bit Lines

Wor

d Li

nes

Vdd

Gnd

Vdd

p w

p

h

Bit Lines

Wor

d Li

nes

...

1 wiregrid

...

p

p w

h

Page 15: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 15

Register Architecture for ‘wide’ Processors

N Arithmetic Units N/CArithmetic

Units

C SIMD Clusters

N/CArithmetic

Units

C SIMD Clusters

(A) (B)

(C) (D)

N Arithmetic UnitsN/C Arithmetic

UnitsN/C Arithmetic

Units

Page 16: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 16

Area of Register Organizations

Central

SIMD

DRF

SIMD/DRF

0.1

1

10

100

1000

1 10 100 1000

Number of Arithmetic Units

Page 17: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 17

Delay of Register Organizations

Central

DRF

SIMD/DRF

SIMD

0.1

1

10

100

1000

1 10 100 1000

Number of Arithmetic Units

Page 18: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 18

Performance of Register Organizations

0.00

0.20

0.40

0.60

0.80

1.00

1.20

Central SIMD SIMD/DRF HIERARCHICAL STREAM

(B) Performance with Latency

0.00

0.20

0.40

0.60

0.80

1.00

1.20

Central SIMD SIMD/DRF HIERARCHICAL STREAM

(A) Raw Performance

Page 19: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 19

Stubs Abstract the Communication BetweenOperations

FU(Op 1)

RF

FU(Op 2)

RF

Op 1

Op 2W

rite

stu

bR

ead

stubD

ata

tran

sfer

Page 20: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 20

A Communication Example

+

t1 t1

t1 * t2 L/S

+ k * x L/S

t1 t1

+ * L/S

+ t1 * t2 L/S

+ k * x L/S

t1

Pass t1 * L/S

+

t1

t1 * t2 L/S

Inst

ruct

ion

5In

stru

ctio

n 6

(a) (b) (c)

Page 21: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 21

The Imagine Stream Processor

Stream Register FileNetworkInterface

HostInterface

Imagine Stream Processor

HostProcessor

Net

wor

k

AL

U C

lust

er 0

AL

U C

lust

er 1

AL

U C

lust

er 2

AL

U C

lust

er 3

AL

U C

lust

er 4

AL

U C

lust

er 5

AL

U C

lust

er 6

AL

U C

lust

er 7

SDRAMSDRAM SDRAMSDRAM

Streaming Memory SystemM

icro

cont

rolle

r

Page 22: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 22

Data Bandwidth Hierarchy

Imagine Stream Processor

3.2GB/s 64GB/s

SDRAM

SDRAM

SDRAM

SDRAM

Str

eam

R

egis

ter

File

ALU Cluster

ALU Cluster

ALU Cluster

544GB/s

Page 23: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 23

Cluster Architecture

• VLIW organization with shared control• Local register files provide high data bandwidth

CU

Inte

rclu

ster

Net

wor

k

+

From Stream Buffers

To Stream Buffers

+ + * * /

Cross Point

Local Register File

Page 24: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 24

Imagine is a Stream Processor

• Instructions are Load, Store, and Operate– operands are streams– also Send and Receive for multiple-imagine systems

• Operate performs a compound stream operation– read elements from input streams– perform a local computation– append elements to output streams– repeat until input stream is consumed– (e.g., triangle transform)

• Order of magnitude less global register bandwidththan a vector processor

Page 25: Computer Architecture for the Next Millenium

Pixe

l Dep

th &

Col

or

Mem

ory

Stre

am R

egis

ter F

ile

Inpu

tD

ata

wor

d

reco

rdTr

iang

le R

ecor

ds

Shad

ed T

riang

le R

ecor

ds

Proj

ecte

d Tr

iang

le R

ecor

ds

Span

Rec

ords

Frag

men

t Rec

ords

Pixe

l Dep

th &

Col

or

Pixe

l Dep

th &

Col

orZ-

Com

posi

te

Frag

men

t Rec

ords

Imag

eD

epth

&C

olor

Imag

e B

uffe

r Ind

ices

Mem

ory

Ban

dwid

thR

egis

ter

Ban

dwid

th

Com

pact

Sort

Proc

ess

Span

Span

Set

up

Proj

ect/

Cul

l

Tran

sfor

m

Arit

hmet

icC

lust

ers

Tria

ngle

Rec

ords

Shad

e

Tria

ngle

Ren

derin

g

Page 26: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 26

Transform Kernel

References (per ∆) Stream Scalar Vector

Memory 5.5 117 (21.3) 48 (8.7)

Global RF 48 624 (13.0) 261 (5.4)

Local RF 372 N/A N/A

Bandwidth Demands

Page 27: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 27

Data Parallelism is easier than ILP

Kernel 1 to 8 Cluster Speedup

FFT (1024) 6.4

DCT (8x8) 7.8

Blockwarp (8x8) 7.2

Transform (∆) 8.0

Harmonic Mean 7.3

Page 28: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 28

Conventional Approaches toData-Dependent Conditional Execution

A

x>0

B

C

J

K

Data-DependentBranch

Y N

A

x>0

YB

C

Whoops

J

K

SpeculativeLoss D x W~100s

A

B

J

y=(x>0)

if y

if ~y

C

K

if y

if ~y

ExponentiallyDecreasing Duty Factor

Page 29: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 29

Zero-Cost Conditionals

• Most Approaches to Conditional Operations are Costly– Branching control flow - dead issue slots on mispredicted branches– Predication (SIMD select, masked vectors) - large fraction of

execution ‘opportunities’ go idle.• Conditional Streams

– append an element to an output stream depending on a casevariable.

Value Stream

Case Stream {0,1}

Output Stream

Page 30: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 30

Sustainable Performance

12.88.0 8.9 8.1 7.0

4.0

24.4

14.63.2 1.8

3.62.8

0.8

2.6

0

5

10

15

20

25

30Tr

ansf

orm

FFT

FIR

Blo

ckw

arp

Sort

Mer

ge

DC

T

FIR

GO

PS

Communication16-bit Arithmetic32-bit ArithmeticFP Arithmetic

Page 31: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 31

Power Comparison

3.7 1.5

0.28

0.22

1.2

0 1 2 3 4 5 6GOPs/W

Imagine

AD 21160

TI C67x

StrongARM SA-1100

1024-point Floating-Point FFTCommunicationsDhrystone

-Source: Web Pages of Intel, TI, and Analog Devices

Page 32: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 32

Power and Performance

34%

45%

9%

9% 3%

00.5

11.5

22.5

33.5

W

FIR

FFT

BWAR

P

XFOR

M

MER

GE DCT

OtherMem SysSRFClustersClock

11.5 5.2 5.2 3.8 7.1 14.1GOPs/W:

Page 33: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 33

A Look Inside an ApplicationStereo Depth Extraction

• 320x240 8-bit grayscaleimages

• 30 disparity search• 220 frames/second• 12.7 GOPS• 5.7 GOPS/W

Page 34: Computer Architecture for the Next Millenium

Clusters Mem_0 Mem_1

21400

21500

21600

21700

21800

21900

22000

22100

22200

22300

22400

22500

22600

22700

22800

22900

23000

23100

23200

23300

23400

23500

23600

STOREUNPACK LOADCONV 7x7

CONV 3x3

STOREUNPACK LOADCONV 7x7

CONV 3x3

STOREUNPACK LOADCONV 7x7

Clust Mem0 Mem1501400

501500

501600

501700

501800

501900

502000

502100

502200

502300

502400

502500

502600

502700

502800

502900

503000

503100

503200

503300

BlockSADLoad Load

BlockSAD

StoreBlockSAD

BlockSAD

BlockSADLoad Load

BlockSAD

StoreBlockSAD

Load originalpacked row

Unpack (8bit -> 16 bit)

7x7 Convolve

3x3 Convolve

Store convolved row

Load ConvolvedRows

CalculateBlockSADs atdifferent disparities

Store bestdisparity values

Stereo Depth ExtractorConvolutions Disparity Search

Page 35: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 35

ADD0 ADD1 ADD2 MUL0 MUL1 DIV0 INP0 INP1 INP2 INP3 OUT0 OUT1 SP_0 SP_0 COM0 MC_0 JUK0 VAL0

G E N _ C I S T A T E

C O N D _ I N _ D

G E N _ C C E N D

S P C R E A D _ W T S P C W R I T E

C O M M U C D A T A

C H K _ A N Y

S E L E C T

S H I F T A 1 6

C O M M U C P E R M

C O M M U C P E R M

C O M M U C P E R M

S E L E C T

S E L E C T C O M M U C P E R M

I M U L R N D 1 6 S E L E C T

I M U L R N D 1 6

I M U L R N D 1 6I M U L R N D 1 6 N S E L E C T

I M U L R N D 1 6I M U L R N D 1 6

I M U L R N D 1 6I M U L R N D 1 6 P A S S

I M U L R N D 1 6 I M U L R N D 1 6 P A S S

I M U L R N D 1 6 I M U L R N D 1 6 P A S S

I M U L R N D 1 6 I M U L R N D 1 6

I M U L R N D 1 6 I M U L R N D 1 6

I M U L R N D 1 6 I M U L R N D 1 6

I M U L R N D 1 6 I M U L R N D 1 6

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6I A D D S 1 6

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6 N S E L E C T

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6 I A D D S 1 6 S E L E C T P A S S

I M U L R N D 1 6 I M U L R N D 1 6 P A S SI A D D S 1 6 N S E L E C TP A S S

I M U L R N D 1 6 I M U L R N D 1 6 P A S SI A D D S 1 6 I A D D S 1 6 N S E L E C T

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6 I A D D S 1 6

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6I A D D S 1 6S H U F F L E

I M U L R N D 1 6 I M U L R N D 1 6S H U F F L E P A S S

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6 I A D D S 1 6 P A S SI A D D S 1 6

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6 I A D D S 1 6

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6 I A D D S 1 6 P A S SS H U F F L E

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6 S H U F F L E

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6 S H U F F L ES H U F F L E

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6 I A D D S 1 6I A D D S 1 6

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6I A D D S 1 6I A D D S 1 6

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6I A D D S 1 6I A D D S 1 6

I M U L R N D 1 6 I M U L R N D 1 6I A D D S 1 6 I A D D S 1 6 P A S S P A S S

P A S SS H U F F L E

S H U F F L ES H U F F L E S H U F F L E

I A D D S 1 6 I A D D S 1 6 I A D D S 1 6 P A S S

I A D D S 1 6I A D D S 1 6I A D D S 1 6 P A S S P A S SP A S S

I A D D S 1 6 I A D D S 1 6I A D D S 1 6

I A D D S 1 6I A D D S 1 6 I A D D S 1 6

S H U F F L E S H U F F L ES H U F F L E D A T A _ I N

I A D D S 1 6I A D D S 1 6 S H U F F L E P A S SD A T A _ I N

I A D D S 1 6 I A D D S 1 6I A D D S 1 6 P A S SP A S S D A T A _ I N

S E L E C TI A D D S 1 6I A D D S 1 6 D A T A _ I N

I A D D S 1 6 S E L E C TI A D D S 1 6I A D D S 1 6 N S E L E C T D A T A _ O U T

I A D D S 1 6 S E L E C TI A D D S 1 6 I A D D S 1 6 D A T A _ I N D A T A _ O U T

I A D D S 1 6 N S E L E C T D A T A _ I N D A T A _ O U T

I A D D S 1 6 N S E L E C TN S E L E C T D A T A _ O U T

L O O PI A D D S 1 6 I A D D S 1 6 D A T A _ O U T

D A T A _ O U T D A T A _ O U T

ADD0 ADD1 ADD2 MUL0 MUL1 DIV0 INP0 INP1 INP2 INP3 OUT0 OUT1 SP_0 SP_0 COM0 MC_0 JUK0 VAL0

IMULRND16 IMULRND16 PASSIADDS16 NSELECTPASS

IMULRND16 IMULRND16 PASSIADDS16 IADDS16 NSELECTSHIFTA16

IMULRND16 IMULRND16IADDS16 IADDS16

IMULRND16 IMULRND16IADDS16IADDS16SHUFFLE

IMULRND16 IMULRND16SHUFFLE PASS

IMULRND16 IMULRND16IADDS16 IADDS16 PASSIADDS16

IMULRND16 IMULRND16IADDS16 IADDS16

IMULRND16 IMULRND16IADDS16 IADDS16 PASSSHUFFLE

IMULRND16 IMULRND16IADDS16 SHUFFLE

IMULRND16 IMULRND16IADDS16 SHUFFLESHUFFLE

IMULRND16 IMULRND16IADDS16 IADDS16IADDS16 COMMUCPERM

IMULRND16 IMULRND16IADDS16IADDS16IADDS16 COMMUCPERM

IMULRND16 IMULRND16IADDS16IADDS16IADDS16 COMMUCPERM

IMULRND16 IMULRND16IADDS16 IADDS16 PASS PASS

PASSSHUFFLE SELECT

SHUFFLESHUFFLE SHUFFLE SELECT COMMUCPERM

IADDS16 IADDS16 IADDS16 PASSIMULRND16 SELECT

IADDS16IADDS16IADDS16 PASS PASSPASS IMULRND16

IADDS16 IADDS16IADDS16 IMULRND16IMULRND16 NSELECT

IADDS16IADDS16 IADDS16 IMULRND16IMULRND16

IMULRND16IMULRND16 PASS

SHUFFLE SHUFFLESHUFFLE DATA_INIMULRND16 IMULRND16 PASS

IADDS16IADDS16 SHUFFLE PASSDATA_INIMULRND16 IMULRND16 PASS GEN_CISTATE

IADDS16 IADDS16IADDS16 PASSPASS DATA_INIMULRND16 IMULRND16 COND_IN_D

SELECTIADDS16IADDS16 DATA_INIMULRND16 IMULRND16 GEN_CCEND

IADDS16 SELECTIADDS16IADDS16 NSELECTIMULRND16 IMULRND16 SPCREAD_WT SPCWRITEDATA_OUT

IADDS16 SELECTIADDS16 IADDS16 DATA_INIMULRND16 IMULRND16 COMMUCDATADATA_OUT

IADDS16 NSELECT DATA_INIMULRND16 IMULRND16IADDS16 CHK_ANYDATA_OUT

IADDS16 NSELECTNSELECTIMULRND16 IMULRND16IADDS16IADDS16 DATA_OUT

LOOPIADDS16 IADDS16 IMULRND16 IMULRND16IADDS16 NSELECTSELECT DATA_OUT

IMULRND16 IMULRND16IADDS16 IADDS16 SELECT PASSDATA_OUT DATA_OUT

7x7 Convolve Kernel

Page 36: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 36

Imagine Summary

• Imagine operates on streams of records– simplifies programming– exposes locality and concurrency

• Compound stream operations– perform a subroutine on each stream element– reduces global register bandwidth

• Bandwidth hierarchy– use bandwidth where its inexpensive– distributed and hierarchical register organization

• Conditional stream operations– sort elements into homogeneous streams– avoid predication or speculation

Page 37: Computer Architecture for the Next Millenium

Computer Architecture for the Next MilleniumWJD November 1, 1999 37

Computer Architecture for the Next Millenium

• Applications and technology are changing– media applications process streams of low-precision samples– wires dominate gates

• ILP is at the point of diminishing returns• Tremendous opportunities for new architectures

– new applications have lots of parallelism and locality– modern technology can build chips with 100s of ALUs (32b

FP) 1000s in the near future• The challenge is to develop architectures

– that can harness this potential performance– in a way that can be easily programmed

• Stream processing is one approach, there are manyothers. We need to start exploring them