
The Imagine Stream Processor

Concurrent VLSI Architecture Group

Stanford University Computer Systems Laboratory

Stanford, CA 94305

Scott Rixner

February 23, 2000


Outline

The Stanford Concurrent VLSI Architecture Group
Media processing
Technology
– Distributed register files
– Stream organization
The Imagine Stream Processor
– 20 GFLOPS on a 0.7 cm² chip
– Conditional streams
– Memory access scheduling
– Programming tools


The Concurrent VLSI Architecture Group

Architecture and design technology for VLSI
Routing chips

– Torus Routing Chip, Network Design Frame, Reliable Router

– Basis for Intel, Cray/SGI, Mercury, Avici network chips


Parallel computer systems

J-Machine (MDP) – led to Cray T3D/T3E
M-Machine (MAP)
– Fast messaging, scalable processing nodes, scalable memory architecture

[Photos: MDP chip, J-Machine, Cray T3D, MAP chip]


Design technology

Off-chip I/O
– Simultaneous bidirectional signaling, 1989
  • now used by Intel and Hitachi
– High-speed signaling
  • 4 Gb/s in 0.6 µm CMOS, equalization, 1995
On-chip signaling
– Low-voltage on-chip signaling
– Low-skew clock distribution
Synchronization
– Mesochronous, plesiochronous
– Self-timed design

[Eye diagram: 4 Gb/s CMOS I/O, 250 ps/division]


Forces Acting on Media Processor Architecture

Applications - streams of low-precision samples
– video, graphics, audio, DSL modems, cellular base stations
Technology - becoming wire-limited
– power/delay dominated by communication, not arithmetic
– global structures (register files, instruction issue) don't scale
Microarchitecture - ILP has been mined out
– to the point of diminishing returns on squeezing performance from sequential code
– explicit parallelism (data parallelism and thread-level parallelism) required to continue scaling performance


Stereo Depth Extraction

Requirements (640x480 @ 30 fps)
– 11 GOPS
Imagine stream processor
– 12.1 GOPS, 4.6 GOPS/W

[Images: left camera image, right camera image, and the resulting depth map]


Applications

Little locality of reference
– read each pixel once
– often non-unit stride
– but there is producer-consumer locality
Very high arithmetic intensity

– 100s of arithmetic operations per memory reference

Dominated by low-precision (16-bit) integer operations


Stream Processing

[Dataflow diagram: Image 0 and Image 1 each pass through two convolve kernels; the convolved streams feed a SAD kernel that produces the depth map. Legend: kernel, stream, input data, output data.]

Little data reuse (pixels never revisited)
Highly data parallel (output pixels not dependent on other output pixels)
Compute intensive (60 operations per memory reference)


Locality and Concurrency

[Dataflow diagram, repeated: two convolve kernels per image feeding a SAD kernel that produces the depth map.]

Operations within a kernel operate on local data

Streams expose data parallelism

Kernels can be partitioned across chips to exploit control parallelism


These Applications Demand High Performance Density

GOPs to TOPs of sustained performance
Space, weight, and power are limited
Historically achieved using special-purpose hardware
Building a special processor for each application is

– expensive

– time consuming

– inflexible


Performance of Special-Purpose Processors

Lots (100s) of ALUs
Fed by dedicated wires/memories


Care and Feeding of ALUs

[Diagram: a single ALU fed by data bandwidth (registers) and instruction bandwidth (instruction cache, IR, IP).]

'Feeding' structure dwarfs the ALU


Technology scaling makes communication the scarce resource

                           1999 (0.18 µm)    2008 (0.07 µm)
DRAM per chip              256 Mb            4 Gb
64-bit FP processors       16                256
Clock rate                 500 MHz           2.5 GHz
Chip edge                  18 mm             25 mm
Wire tracks across chip    30,000            120,000
Time to cross chip         1 clock           16 clocks
Repeater spacing           every 3 mm        every 0.5 mm


What Does This Say About Architecture?

Tremendous opportunities
– Media problems have lots of parallelism and locality
– VLSI technology enables 100s of ALUs/chip (1000s soon)
  • in 0.18 µm: 0.1 mm² per integer adder, 0.5 mm² per FP adder
Challenging problems
– Locality - global structures won't work
– Explicit parallelism - ILP won't keep 100 ALUs busy
– Memory - streaming applications don't cache well
It's time to try some new approaches


Example: Register File Organization

Register files serve two functions:
– Short-term storage for intermediate results
– Communication between multiple functional units
Global register files don't scale well as N, the number of ALUs, increases:
– Need more registers to hold more results (grows with N)
– Need more ports to connect all of the units (grows with N²)


Register Files Dwarf ALUs

[Scale drawing on a 1 cm die: the area of a central register file needed to support 4, 16, and 32 ALUs, compared with the area of a single ALU. The register file grows far faster than the ALUs it feeds.]


Register File Area

Each cell requires:
– 1 word line per port
– 1 bit line per port
Each cell grows as p²
R registers in the file
Area grows as p²R, i.e., as N³, since both the port count p and the register count R grow with the number of ALUs N

[Diagram: a register bit cell of width w and height h, roughly p wire grids on a side - one bit line and one word line per port.]
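Written out, the area argument is as follows (a restatement of the slide's claim, under its assumption that both the port count p and the register count R grow linearly with the number of ALUs N):

    A_{\mathrm{cell}} \propto p^{2}, \qquad
    A_{\mathrm{RF}} = R \cdot A_{\mathrm{cell}} \propto p^{2} R, \qquad
    p \propto N,\ R \propto N \;\Longrightarrow\; A_{\mathrm{RF}} \propto N^{3}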


SIMD and Distributed Register Files

[Diagram: register file organizations for N arithmetic units, arranged scalar vs. SIMD (C SIMD clusters of N/C arithmetic units each) and central register file vs. distributed register files (DRF).]


Hierarchical Organizations

[Diagram: hierarchical versions of the same four organizations - scalar vs. SIMD clustering, central vs. distributed register files - with a register hierarchy between the arithmetic units and backing storage.]


Stream Organizations

[Diagram: stream versions of the same four organizations - scalar vs. SIMD clustering (C clusters of N/C arithmetic units), central vs. distributed register files.]


Organizations

[Log-log charts: area and delay per ALU versus number of arithmetic units (1 to 1000) for the Central, SIMD, SIMD/DRF, Hier/SIMD/DRF, and Stream/SIMD/DRF organizations, with the 48-ALU design point marked.]

For 48 32-bit ALUs at 500 MHz, the stream organization improves on the central organization by:
Area: 195x, Delay: 20x, Power: 430x


Performance

[Charts: speedup (normalized, 0 to 1.2) and performance/area (0 to 250) for the Central, SIMD, SIMD/DRF, Hierarchical, and Stream organizations on the Convolve, DCT, Transform, Shader, FIR, and FFT kernels and their mean.]

Speedup: 16% performance drop (8% with latency constraints)
Performance/area: 180x improvement


Scheduling Distributed Register Files

Distributed register files mean:
– Not all functional units can access all data
– Each functional unit input/output no longer has a dedicated route from/to all register files

[Diagram: ADD0, L/S, and ADD1 units sharing two buses - a unit can write to either or both buses and can read from either bus.]


Communication Scheduling

Schedule operation on instruction
Assign operation to functional unit
Schedule communication for operation

[Diagram: scheduling tables (instruction vs. functional unit, with ADD0, L/S, ADD1 columns) showing "load x" placed on the L/S unit and a later "... = x + ..." whose route for x is still unknown, while other operations fill the intervening slots.]

When to schedule the route from operation A to operation B?
– Can't schedule the route when A is scheduled, since B is not yet assigned to a functional unit
– Can't wait to schedule the route when B is scheduled, since other routes may block it


Stubs: Communication Placeholders

[Diagram: Op 1 on one functional unit writes its result through a write stub into a register file; Op 2 on another functional unit reads it through a read stub; the data transfer connects the two.]

Stubs
– ensure that other communications will not block the write of Op 1's result
Copy-connected architecture
– ensures that any functional unit that performs Op 2 can read Op 1's result
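A minimal sketch of the stub idea in plain C++ (this is not the iscd scheduler; the data structures and the two-bus machine here are invented for illustration): a producer reserves a write slot on some bus when it is scheduled, and the slot is bound to a concrete destination register file only once the consumer has a functional unit.

#include <cstdio>
#include <map>
#include <optional>
#include <set>
#include <string>
#include <utility>

struct Stub {                       // placeholder for a value in flight
    int cycle = 0;                  // cycle in which the producer writes its result
    int bus = 0;                    // bus reserved for that write
    std::optional<int> dest_rf;     // bound only once the consumer is placed
};

struct Scheduler {
    int num_buses = 0;
    std::set<std::pair<int, int>> busy;   // (cycle, bus) slots already taken
    std::map<std::string, Stub> stubs;    // value name -> reserved stub

    // Producer side: reserve any free bus slot in the producer's write cycle,
    // so later routes cannot block the write of the producer's result.
    bool reserve_write_stub(const std::string& value, int cycle) {
        for (int b = 0; b < num_buses; ++b) {
            if (!busy.count({cycle, b})) {
                busy.insert({cycle, b});
                stubs[value] = Stub{cycle, b, std::nullopt};
                return true;
            }
        }
        return false;               // no free slot: the producer must move
    }

    // Consumer side: once the consumer has a functional unit, bind the stub to
    // a register file that unit can read (copy-connectedness guarantees such a
    // register file exists for every candidate unit).
    void bind_stub(const std::string& value, int consumer_rf) {
        stubs.at(value).dest_rf = consumer_rf;
    }
};

int main() {
    Scheduler s;
    s.num_buses = 2;
    s.reserve_write_stub("x", 5);   // producer of x scheduled in cycle 5
    s.bind_stub("x", 1);            // later, the consumer lands next to RF 1
    const Stub& x = s.stubs.at("x");
    std::printf("x: cycle %d, bus %d, RF %d\n", x.cycle, x.bus, *x.dest_rf);
    return 0;
}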


Scheduling With Stubs

[Diagram: the same scheduling tables, now with a stub reserved when "load x" is scheduled on the L/S unit; when "... = x + ..." is later placed on an adder, the stub is turned into the actual route for x.]


The Imagine Stream Processor

[Block diagram of the Imagine Stream Processor: a host processor issues stream instructions to the stream controller; a microcontroller sequences eight ALU clusters (0-7); all clusters share the stream register file, which connects to the streaming memory system (four SDRAMs) and to the network interface and network.]


Arithmetic Clusters

[Diagram: one arithmetic cluster - three adders, two multipliers, and a divider (+ + + * * /), each fed from local register files through a crosspoint switch, plus a communication unit (CU) onto the intercluster network; streams arrive from and return to the SRF.]


Bandwidth Hierarchy

VLIW clusters with shared control
41.2 32-bit operations per word of memory bandwidth

[Diagram: bandwidth hierarchy - SDRAM (2 GB/s) feeds the stream register file (32 GB/s), which feeds the ALU clusters (544 GB/s).]
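As a rough consistency check (assuming 4-byte words), 41.2 operations per word of memory bandwidth lines up with the bandwidths in the figure and the roughly 20 GFLOPS target from the outline:

    \frac{2\ \mathrm{GB/s}}{4\ \mathrm{B/word}} = 0.5\ \mathrm{Gwords/s}, \qquad
    0.5\ \mathrm{Gwords/s} \times 41.2\ \mathrm{ops/word} \approx 20.6\ \mathrm{Gops/s}

The three levels of the hierarchy stand in the ratio 2 : 32 : 544 GB/s, i.e., 1 : 16 : 272.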


Imagine is a Stream Processor

Instructions are Load, Store, and Operate
– operands are streams
– Send and Receive for multiple-Imagine systems
Operate performs a compound stream operation (see the sketch below):
– read elements from input streams
– perform a local computation
– append elements to output streams
– repeat until the input stream is consumed
– (e.g., triangle transform)
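A sketch of that shape in plain C++, not Imagine kernel code: the stream is modeled as a std::vector and a 3x3 matrix stands in for the per-element transform.

#include <vector>

struct Vertex { float x, y, z; };

// Read one element per loop iteration, do a purely local computation, append
// the result to the output stream, and repeat until the input is consumed.
std::vector<Vertex> transform_stream(const std::vector<Vertex>& in,
                                     const float m[3][3]) {
    std::vector<Vertex> out;
    out.reserve(in.size());
    for (const Vertex& v : in) {                        // consume input stream
        Vertex t{m[0][0] * v.x + m[0][1] * v.y + m[0][2] * v.z,
                 m[1][0] * v.x + m[1][1] * v.y + m[1][2] * v.z,
                 m[2][0] * v.x + m[2][1] * v.y + m[2][2] * v.z};
        out.push_back(t);                               // append to output stream
    }
    return out;
}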


Bandwidth Demands of Transformation

References (per ):

             Stream    Scalar         Vector
Memory       5.5       117 (21.3x)    48 (8.7x)
Global RF    48        624 (13.0x)    261 (5.4x)
Local RF     372       N/A            N/A
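The parenthesized multipliers are the scalar and vector reference counts normalized to the stream column:

    \frac{117}{5.5} \approx 21.3, \qquad \frac{48}{5.5} \approx 8.7, \qquad
    \frac{624}{48} = 13.0, \qquad \frac{261}{48} \approx 5.4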


Bandwidth Usage

                      Memory BW    Global RF BW    Local RF BW
Depth Extractor       0.80 GB/s    18.45 GB/s      210.85 GB/s
MPEG Encoder          0.47 GB/s     2.46 GB/s      121.05 GB/s
Polygon Rendering     0.78 GB/s     4.06 GB/s      102.46 GB/s
QR Decomposition      0.46 GB/s     3.67 GB/s      234.57 GB/s


Data Parallelism is easier than ILP

Kernel 1 to 8 Cluster Speedup

FFT (1024) 6.4

DCT (8x8) 7.8

Blockwarp (8x8) 7.2

Transform () 8.0

Harmonic Mean 7.3


Conventional Approaches to Data-Dependent Conditional Execution

[Diagrams: three ways of handling a data-dependent condition x>0 that selects between blocks B/J and C/K after block A. (1) A data-dependent branch. (2) Speculation - a mispredicted branch ("whoops") wastes a speculative loss of D x W, on the order of 100s of operations. (3) Predication - "if y" / "if ~y" guards on both paths give an exponentially decreasing duty factor.]


Zero-Cost Conditionals

Most approaches to conditional operations are costly:
– Branching control flow - dead issue slots on mispredicted branches
– Predication (SIMD select, masked vectors) - a large fraction of execution 'opportunities' goes idle
Conditional streams
– append an element to an output stream depending on a case variable

[Diagram: a value stream and a {0,1} case stream combine to produce the output stream.]
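In plain C++ terms (streams modeled as vectors; on Imagine the append is performed by the clusters and the conditional-stream hardware), a conditional output stream behaves like this:

#include <cstddef>
#include <vector>

// Append values[i] to the output only when cases[i] is 1; the result is a
// dense stream with no idle execution "opportunities" for later kernels.
std::vector<int> conditional_output(const std::vector<int>& values,
                                    const std::vector<int>& cases) {
    std::vector<int> out;
    for (std::size_t i = 0; i < values.size(); ++i)
        if (cases[i] != 0)
            out.push_back(values[i]);
    return out;
}

Downstream kernels then run at full duty factor on the compacted stream instead of idling on predicated-off elements.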


Modern DRAM

Synchronous
Three-dimensional
– Bank
– Row
– Column
Shared address lines
Shared data lines
Not truly random access

[Diagram: banks 0 through N, each a memory array with a row decoder, sense amplifiers (row buffer), and column decoder, sharing the address and data lines; banks allow concurrent access, and commands can be pipelined.]
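For illustration, a word address splits into the three DRAM coordinates like this (plain C++; the field widths are assumptions for the sketch, not the parts Imagine uses):

#include <cstdint>
#include <cstdio>

struct DramAddr { unsigned bank, row, col; };

// Assumed field widths for the sketch: 9 column bits, 2 bank bits,
// and the remaining bits selecting the row.
DramAddr split(std::uint32_t word_addr) {
    const unsigned kColBits = 9, kBankBits = 2;
    DramAddr a;
    a.col  =  word_addr                        & ((1u << kColBits) - 1);
    a.bank = (word_addr >> kColBits)           & ((1u << kBankBits) - 1);
    a.row  =  word_addr >> (kColBits + kBankBits);
    return a;
}

int main() {
    DramAddr a = split(0x123456u);
    std::printf("bank=%u row=%u col=%u\n", a.bank, a.row, a.col);
    return 0;
}

Two references hit the same row buffer exactly when their bank and row fields match, which is what the access scheduling on the next slide exploits.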


DRAM Throughput and Latency

P: Bank Precharge
A: Row Activation
C: Column Access

[Timing diagrams for the same eight references, labeled (bank, row, column): (0,0,0), (1,1,2), (1,0,1), (1,1,1), (1,0,0), (0,1,3), (0,0,1), (0,1,0). Serviced in order, each reference pays its own precharge-activate-access sequence and the set completes in 56 DRAM cycles. With access scheduling, commands to different banks overlap and open rows are reused, so the same set completes in 19 DRAM cycles.]
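To make the reordering idea concrete, here is a minimal sketch in plain C++ of one simple policy: serve references that hit the currently open row of their bank before those that need a precharge and activation. Imagine's memory access scheduler is more sophisticated (it pipelines the P/A/C commands per bank under the DRAM timing constraints), so this is only an illustration of the idea.

#include <algorithm>
#include <vector>

struct Ref { unsigned bank, row, col; };

// open_row[b] holds the row currently latched in bank b's sense amplifiers.
// Serve row hits first; among equals, keep the original order.
std::vector<Ref> reorder(std::vector<Ref> pending,
                         const std::vector<unsigned>& open_row) {
    std::stable_sort(pending.begin(), pending.end(),
                     [&](const Ref& a, const Ref& b) {
                         bool hit_a = open_row[a.bank] == a.row;
                         bool hit_b = open_row[b.bank] == b.row;
                         return hit_a && !hit_b;   // row hits come first
                     });
    return pending;
}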


Streaming Memory System

Stream load/store architecture
Memory access scheduling
– Out-of-order access
– High stream throughput

[Diagram: address generators (taking index and data streams) feed per-bank buffers in front of DRAM banks 0 through n; returning data passes through a reorder buffer.]


Streaming Memory Performance

[Chart: bank buffer effectiveness - cycles per access (0 to 2.0) versus buffer size (1, 2, 4, 8, 16, 32, 64, infinite).]


Performance

[Charts: memory bandwidth (MB/s, 0-2000) for FFT, Depth, MPEG, Tex, and the weighted mean under the scheduling policies in-order, first-ready, col/open, col/closed, row/open, and row/closed.]

Application trace bandwidth: 80-90% improvement
Application bandwidth: 35% improvement


Evaluation Tools

Simulators
– Imagine RTL model (Verilog)
– isim: cycle-accurate simulator (C++)
– idebug: stream-level simulator (C++)
Compiler tools

– iscd: kernel scheduler (communication scheduling)

– istream: stream scheduler (SRF/memory management)

– schedviz: visualizer


Stream Programming

Stream-level
– Host Processor

– C++

– istream

load (datin1, mem1); // Memory to SRF

load (datin2, mem2);

op (ADDSUB, datin1, datin2, datout1, datout2);

send (datout1, P2, chan0); // SRF to Network

send (datout2, P3, chan1);

Kernel-level
– Clusters

– C-like Syntax

– iscd

loop(input_stream0) {

input_stream0 >> a;

input_stream1 >> b;

c = a + b;

d = a - b;

output_stream0 << c;

output_stream1 << d;

}
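For readers unfamiliar with the stream syntax, the kernel above is equivalent to the following plain C++ (streams modeled as vectors; the real kernel runs the loop body across all eight clusters in SIMD):

#include <cstddef>
#include <vector>

void addsub(const std::vector<int>& in0, const std::vector<int>& in1,
            std::vector<int>& out0, std::vector<int>& out1) {
    for (std::size_t i = 0; i < in0.size(); ++i) {  // loop(input_stream0)
        int a = in0[i];                             // input_stream0 >> a;
        int b = in1[i];                             // input_stream1 >> b;
        out0.push_back(a + b);                      // output_stream0 << c;
        out1.push_back(a - b);                      // output_stream1 << d;
    }
}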


Stereo Depth Extractor

[Execution profiles: cluster and memory-unit occupancy over time for the two phases of the application. In the convolution phase the clusters alternate UNPACK, 7x7 CONV, and 3x3 CONV kernels while the two memory units overlap row loads and stores; in the disparity-search phase the clusters run back-to-back BlockSAD kernels while loads and stores of convolved rows proceed in parallel.]

Convolutions:
– Load original packed row
– Unpack (8 bit to 16 bit)
– 7x7 convolve
– 3x3 convolve
– Store convolved row

Disparity search:
– Load convolved rows
– Calculate block SADs at different disparities
– Store best disparity values
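As a reference for the disparity-search step, here is a scalar sketch in plain C++ of picking the best disparity per window by block SAD; the window width W and the disparity range are parameters of the sketch, not the values used by the application, and on Imagine this work is blocked into SIMD kernels across the clusters.

#include <cstdlib>
#include <limits>
#include <vector>

// For each window start x in a row, pick the disparity d whose W-wide window
// has the smallest sum of absolute differences between the two rows.
std::vector<int> best_disparity(const std::vector<int>& left,
                                const std::vector<int>& right,
                                int max_disp, int W) {
    std::vector<int> disp(left.size(), 0);
    for (int x = 0; x + W <= static_cast<int>(left.size()); ++x) {
        int best = std::numeric_limits<int>::max();
        for (int d = 0; d <= max_disp &&
                        x + d + W <= static_cast<int>(right.size()); ++d) {
            int sad = 0;
            for (int k = 0; k < W; ++k)
                sad += std::abs(left[x + k] - right[x + d + k]);
            if (sad < best) { best = sad; disp[x] = d; }
        }
    }
    return disp;
}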


7x7 Convolution Kernel

[VLIW schedule for the 7x7 convolution kernel: one column per unit (ADD0-ADD2, MUL0, MUL1, DIV0, INP0-INP3, OUT0, OUT1, SP_0, COM0, MC_0, JUK0, VAL0 - ALUs, communication/scratchpad, and streams), one row per instruction. The inner loop is software-pipelined into three stages (pipeline stage 0, 1, 2) and is dominated by IMULRND16, IADDS16, SHUFFLE, SELECT, and PASS operations, with stream DATA_IN/DATA_OUT, communication (COMMUCPERM, COMMUCDATA), scratchpad, and conditional-stream operations overlapped.]


Performance

[Chart: sustained performance in GOPS.]

depth (16-bit application)           12.1
mpeg (16-bit application)            10.2
qrd (floating-point application)     11.0
dct (16-bit kernel)                  23.9
convolve (16-bit kernel)             25.6
fft (floating-point kernel)           7.0


Power

GOPS/W: depth 4.6, mpeg 6.9, qrd 4.1, dct 10.2, convolve 9.6, fft 2.4, average 6.3

[Chart: power in watts (0-3.0) per application, broken down into clock, clusters, SRF, pins, memory system, and other; the accompanying pie chart shows shares of 63%, 24%, 5%, 5%, 2%, and 1%.]


Implementation

Goals
– < 0.7 cm² chip in 0.18 µm CMOS
– Operates at 500 MHz
Status
– Cycle-accurate C++ simulator
– Behavioral Verilog
– Detailed floorplan with all critical signals planned
– Schematics and layout for key cells
– Timing analysis of key units and communication paths


The Imagine Stream Processor

[Block diagram, repeated from earlier: host processor, stream controller, microcontroller, eight ALU clusters (0-7), stream register file, network interface, and the streaming memory system with four SDRAMs.]


Imagine Floorplan

[Floorplan: clusters 0-7 with the microcontroller (Clust/UC) in the core, the eight SRF banks and the stream buffers with SRF/SB control and address generators alongside, and the memory system/host interface/network interface (MS/HI/NI) with the stream and reorder buffers, four memory banks (MB0-MB3), the BitCoder, and routing channels around the edges. Marked dimensions: 5.9 mm, 5 mm, 2 mm, 1 mm, 1 mm, 0.6 mm.]


Imagine Pipeline

[Pipeline diagram: microcontroller stages F1, F2, D, and SEL, followed by the cluster stages R and X1 through X4 with writeback (W) at depths that vary by operation; the stream buffers connect the clusters to the SRF, which connects to memory (MEM).]


Communication paths

[Diagram: the floorplan blocks (clusters, SRF and stream buffers, SRF/SB control and address generators, host interface, memory system, BitCoder, network interface) and the pipeline stages (F1, F2, D, SEL, R, X1-X4, W, MEM), with the control routing and communication paths between them marked.]


Adder Layout

[Layout plot: an integer adder with its local register file, 460 µm by 146.7 µm.]


Imagine Team

Professor William J. Dally

– Scott Rixner

– Ghazi Ben Amor

– Ujval Kapasi

– Brucek Khailany

– Peter Mattson

– Jinyung Namkoong

– John Owens

– Brian Towles

– Don Alpert (Intel)

– Chris Buehler (MIT)

– JP Grossman (MIT)

– Brad Johanson

– Dr. Abelardo Lopez-Lagunas

– Ben Mowery

– Manman Ren


Imagine Summary

Imagine operates on streams of records
– simplifies programming
– exposes locality and concurrency
Compound stream operations
– perform a subroutine on each stream element
– reduce global register bandwidth
Bandwidth hierarchy
– use bandwidth where it's inexpensive
– distributed and hierarchical register organization
Conditional stream operations
– sort elements into homogeneous streams
– avoid predication or speculation


Imagine gives high performance with low power and flexible programming

Matches capabilities of communication-limited technology to demands of signal and image processing applications

Performance
– compound stream operations realize >10 GOPS on key applications
– can be extended by partitioning an application across several Imagines (TFLOPS on a circuit board)
Power
– three-level register hierarchy gives 2-10 GOPS/W
Flexibility
– programmed in "C"
– streaming model
– conditional stream operations enable applications like sort


MPEG-2 Encoder

[Dataflow diagram of the MPEG-2 encoder: BlockSAD against past and future reference frames, pick the best motion vector, color conversion, then DCT/Q, RLE, and VLC; an IQ/IDCT and correlate path reconstructs the frame and saves it to memory. Paths are labeled for I and P frames ("all" vs. "P"), with rate-control feedback closing the loop.]


Advantages of Implementing MPEG-2 Encode on Imagine

Block matching well suited to Imagine
– Clusters provide high arithmetic bandwidth
– SRF provides necessary data locality and bandwidth
– Good performance
Programmable solution
– Simplifies system design
– Allows tradeoff between complexity of the block matcher vs. bitrate and picture quality
Real time
Power efficient


MPEG-2 Challenges

VLC (Huffman coding)
– Difficult and inefficient to implement on SIMD clusters
– A separate module providing LUT and shift-register capabilities is used instead
Rate control
– Difficult because multiple macroblocks are encoded in parallel
– Must be performed at a coarser granularity (impact on picture quality?)