
The Imagine Stream Processor

Concurrent VLSI Architecture Group

Stanford University Computer Systems Laboratory

Stanford, CA 94305

Scott Rixner

February 23, 2000


Outline

The Stanford Concurrent VLSI Architecture Group
Media processing
Technology
– Distributed register files
– Stream organization
The Imagine Stream Processor
– 20 GFLOPS on a 0.7 cm² chip
– Conditional streams
– Memory access scheduling
– Programming tools


The Concurrent VLSI Architecture Group

Architecture and design technology for VLSI
Routing chips

– Torus Routing Chip, Network Design Frame, Reliable Router

– Basis for Intel, Cray/SGI, Mercury, Avici network chips


Parallel computer systems

J-Machine (MDP) – led to Cray T3D/T3E
M-Machine (MAP)
– Fast messaging, scalable processing nodes, scalable memory architecture

[Photos: MDP chip, J-Machine, Cray T3D, MAP chip]


Design technology

Off-chip I/O
– Simultaneous bidirectional signaling, 1989
  • now used by Intel and Hitachi
– High-speed signaling
  • 4 Gb/s in 0.6 µm CMOS, equalization, 1995
On-chip signaling
– Low-voltage on-chip signaling
– Low-skew clock distribution
Synchronization
– Mesochronous, plesiochronous
– Self-timed design

[Eye diagram: 4 Gb/s CMOS I/O, 250 ps/division]


Forces Acting on Media Processor Architecture

Applications - streams of low-precision samples
– video, graphics, audio, DSL modems, cellular base stations
Technology - becoming wire-limited
– power/delay dominated by communication, not arithmetic
– global structures (register files, instruction issue) don't scale
Microarchitecture - ILP has been mined out
– to the point of diminishing returns on squeezing performance from sequential code
– explicit parallelism (data parallelism and thread-level parallelism) required to continue scaling performance


Stereo Depth Extraction

Requirements (640x480 @ 30 fps)
– 11 GOPS
Imagine stream processor
– 12.1 GOPS, 4.6 GOPS/W

[Images: left camera image, right camera image, and the resulting depth map]


Applications

Little locality of reference
– read each pixel once
– often non-unit stride
– but there is producer-consumer locality
Very high arithmetic intensity

– 100s of arithmetic operations per memory reference

Dominated by low-precision (16-bit) integer operations


Stream Processing

[Dataflow diagram: Image 0 and Image 1 each pass through two convolve kernels; the convolved streams feed a SAD kernel that produces the depth map. Legend: kernel, stream, input data, output data.]

Little data reuse (pixels never revisited)
Highly data parallel (output pixels not dependent on other output pixels)
Compute intensive (60 operations per memory reference)


Locality and Concurrency

[Dataflow diagram, repeated: two convolve kernels per image feeding a SAD kernel that produces the depth map.]

Operations within a kernel operate on local data

Streams expose data parallelism

Kernels can be partitioned across chips to exploit control parallelism


These Applications Demand High Performance Density

GOPs to TOPs of sustained performance
Space, weight, and power are limited
Historically achieved using special-purpose hardware
Building a special processor for each application is

– expensive

– time consuming

– inflexible


Performance of Special-Purpose Processors

Lots (100s) of ALUs
Fed by dedicated wires/memories


Care and Feeding of ALUs

[Diagram: a single ALU fed by data bandwidth (registers) and instruction bandwidth (instruction cache, IR, IP).]

'Feeding' structure dwarfs the ALU


Technology scaling makes communication the scarce resource

                           1999 (0.18 µm)    2008 (0.07 µm)
DRAM per chip              256 Mb            4 Gb
64-bit FP processors       16                256
Clock rate                 500 MHz           2.5 GHz
Chip edge                  18 mm             25 mm
Wire tracks across chip    30,000            120,000
Time to cross chip         1 clock           16 clocks
Repeater spacing           every 3 mm        every 0.5 mm


What Does This Say About Architecture?

Tremendous opportunities
– Media problems have lots of parallelism and locality
– VLSI technology enables 100s of ALUs/chip (1000s soon)
  • in 0.18 µm: 0.1 mm² per integer adder, 0.5 mm² per FP adder
Challenging problems
– Locality - global structures won't work
– Explicit parallelism - ILP won't keep 100 ALUs busy
– Memory - streaming applications don't cache well
It's time to try some new approaches


Example: Register File Organization

Register files serve two functions:
– Short-term storage for intermediate results
– Communication between multiple functional units
Global register files don't scale well as N, the number of ALUs, increases:
– Need more registers to hold more results (grows with N)
– Need more ports to connect all of the units (grows with N²)


Register Files Dwarf ALUs

[Scale drawing on a 1 cm die: the area of a central register file needed to support 4, 16, and 32 ALUs, compared with the area of a single ALU. The register file grows far faster than the ALUs it feeds.]


Register File Area

Each cell requires:
– 1 word line per port
– 1 bit line per port
Each cell grows as p²
R registers in the file
Area grows as p²R, i.e., as N³, since both the port count p and the register count R grow with the number of ALUs N

[Diagram: a register bit cell of width w and height h, roughly p wire grids on a side - one bit line and one word line per port.]
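Written out, the area argument is as follows (a restatement of the slide's claim, under its assumption that both the port count p and the register count R grow linearly with the number of ALUs N):

    A_{\mathrm{cell}} \propto p^{2}, \qquad
    A_{\mathrm{RF}} = R \cdot A_{\mathrm{cell}} \propto p^{2} R, \qquad
    p \propto N,\ R \propto N \;\Longrightarrow\; A_{\mathrm{RF}} \propto N^{3}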


SIMD and Distributed Register Files

[Diagram: register file organizations for N arithmetic units, arranged scalar vs. SIMD (C SIMD clusters of N/C arithmetic units each) and central register file vs. distributed register files (DRF).]


Hierarchical Organizations

[Diagram: hierarchical versions of the same four organizations - scalar vs. SIMD clustering, central vs. distributed register files - with a register hierarchy between the arithmetic units and backing storage.]


Stream Organizations

[Diagram: stream versions of the same four organizations - scalar vs. SIMD clustering (C clusters of N/C arithmetic units), central vs. distributed register files.]


Organizations

[Log-log charts: area and delay per ALU versus number of arithmetic units (1 to 1000) for the Central, SIMD, SIMD/DRF, Hier/SIMD/DRF, and Stream/SIMD/DRF organizations, with the 48-ALU design point marked.]

For 48 32-bit ALUs at 500 MHz, the stream organization improves on the central organization by:
Area: 195x, Delay: 20x, Power: 430x


Performance

[Charts: speedup (normalized, 0 to 1.2) and performance/area (0 to 250) for the Central, SIMD, SIMD/DRF, Hierarchical, and Stream organizations on the Convolve, DCT, Transform, Shader, FIR, and FFT kernels and their mean.]

Speedup: 16% performance drop (8% with latency constraints)
Performance/area: 180x improvement


Scheduling Distributed Register Files

Distributed register files mean:
– Not all functional units can access all data
– Each functional unit input/output no longer has a dedicated route from/to all register files

[Diagram: ADD0, L/S, and ADD1 units sharing two buses - a unit can write to either or both buses and can read from either bus.]


Communication Scheduling

Schedule operation on instruction
Assign operation to functional unit
Schedule communication for operation

[Diagram: scheduling tables (instruction vs. functional unit, with ADD0, L/S, ADD1 columns) showing "load x" placed on the L/S unit and a later "... = x + ..." whose route for x is still unknown, while other operations fill the intervening slots.]

When to schedule the route from operation A to operation B?
– Can't schedule the route when A is scheduled, since B is not yet assigned to a functional unit
– Can't wait to schedule the route when B is scheduled, since other routes may block it


Stubs: Communication Placeholders

[Diagram: Op 1 on one functional unit writes its result through a write stub into a register file; Op 2 on another functional unit reads it through a read stub; the data transfer connects the two.]

Stubs
– ensure that other communications will not block the write of Op 1's result
Copy-connected architecture
– ensures that any functional unit that performs Op 2 can read Op 1's result
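A minimal sketch of the stub idea in plain C++ (this is not the iscd scheduler; the data structures and the two-bus machine here are invented for illustration): a producer reserves a write slot on some bus when it is scheduled, and the slot is bound to a concrete destination register file only once the consumer has a functional unit.

#include <cstdio>
#include <map>
#include <optional>
#include <set>
#include <string>
#include <utility>

struct Stub {                       // placeholder for a value in flight
    int cycle = 0;                  // cycle in which the producer writes its result
    int bus = 0;                    // bus reserved for that write
    std::optional<int> dest_rf;     // bound only once the consumer is placed
};

struct Scheduler {
    int num_buses = 0;
    std::set<std::pair<int, int>> busy;   // (cycle, bus) slots already taken
    std::map<std::string, Stub> stubs;    // value name -> reserved stub

    // Producer side: reserve any free bus slot in the producer's write cycle,
    // so later routes cannot block the write of the producer's result.
    bool reserve_write_stub(const std::string& value, int cycle) {
        for (int b = 0; b < num_buses; ++b) {
            if (!busy.count({cycle, b})) {
                busy.insert({cycle, b});
                stubs[value] = Stub{cycle, b, std::nullopt};
                return true;
            }
        }
        return false;               // no free slot: the producer must move
    }

    // Consumer side: once the consumer has a functional unit, bind the stub to
    // a register file that unit can read (copy-connectedness guarantees such a
    // register file exists for every candidate unit).
    void bind_stub(const std::string& value, int consumer_rf) {
        stubs.at(value).dest_rf = consumer_rf;
    }
};

int main() {
    Scheduler s;
    s.num_buses = 2;
    s.reserve_write_stub("x", 5);   // producer of x scheduled in cycle 5
    s.bind_stub("x", 1);            // later, the consumer lands next to RF 1
    const Stub& x = s.stubs.at("x");
    std::printf("x: cycle %d, bus %d, RF %d\n", x.cycle, x.bus, *x.dest_rf);
    return 0;
}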


Scheduling With Stubs

[Diagram: the same scheduling tables, now with a stub reserved when "load x" is scheduled on the L/S unit; when "... = x + ..." is later placed on an adder, the stub is turned into the actual route for x.]


The Imagine Stream Processor

[Block diagram of the Imagine Stream Processor: a host processor issues stream instructions to the stream controller; a microcontroller sequences eight ALU clusters (0-7); all clusters share the stream register file, which connects to the streaming memory system (four SDRAMs) and to the network interface and network.]


Arithmetic Clusters

[Diagram: one arithmetic cluster - three adders, two multipliers, and a divider (+ + + * * /), each fed from local register files through a crosspoint switch, plus a communication unit (CU) onto the intercluster network; streams arrive from and return to the SRF.]


Bandwidth Hierarchy

VLIW clusters with shared control
41.2 32-bit operations per word of memory bandwidth

[Diagram: bandwidth hierarchy - SDRAM (2 GB/s) feeds the stream register file (32 GB/s), which feeds the ALU clusters (544 GB/s).]
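As a rough consistency check (assuming 4-byte words), 41.2 operations per word of memory bandwidth lines up with the bandwidths in the figure and the roughly 20 GFLOPS target from the outline:

    \frac{2\ \mathrm{GB/s}}{4\ \mathrm{B/word}} = 0.5\ \mathrm{Gwords/s}, \qquad
    0.5\ \mathrm{Gwords/s} \times 41.2\ \mathrm{ops/word} \approx 20.6\ \mathrm{Gops/s}

The three levels of the hierarchy stand in the ratio 2 : 32 : 544 GB/s, i.e., 1 : 16 : 272.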


Imagine is a Stream Processor

Instructions are Load, Store, and Operate
– operands are streams
– Send and Receive for multiple-Imagine systems
Operate performs a compound stream operation (see the sketch below):
– read elements from input streams
– perform a local computation
– append elements to output streams
– repeat until the input stream is consumed
– (e.g., triangle transform)
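A sketch of that shape in plain C++, not Imagine kernel code: the stream is modeled as a std::vector and a 3x3 matrix stands in for the per-element transform.

#include <vector>

struct Vertex { float x, y, z; };

// Read one element per loop iteration, do a purely local computation, append
// the result to the output stream, and repeat until the input is consumed.
std::vector<Vertex> transform_stream(const std::vector<Vertex>& in,
                                     const float m[3][3]) {
    std::vector<Vertex> out;
    out.reserve(in.size());
    for (const Vertex& v : in) {                        // consume input stream
        Vertex t{m[0][0] * v.x + m[0][1] * v.y + m[0][2] * v.z,
                 m[1][0] * v.x + m[1][1] * v.y + m[1][2] * v.z,
                 m[2][0] * v.x + m[2][1] * v.y + m[2][2] * v.z};
        out.push_back(t);                               // append to output stream
    }
    return out;
}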


Bandwidth Demands of Transformation

References (per ):

             Stream    Scalar         Vector
Memory       5.5       117 (21.3x)    48 (8.7x)
Global RF    48        624 (13.0x)    261 (5.4x)
Local RF     372       N/A            N/A
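The parenthesized multipliers are the scalar and vector reference counts normalized to the stream column:

    \frac{117}{5.5} \approx 21.3, \qquad \frac{48}{5.5} \approx 8.7, \qquad
    \frac{624}{48} = 13.0, \qquad \frac{261}{48} \approx 5.4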


Bandwidth Usage

                      Memory BW    Global RF BW    Local RF BW
Depth Extractor       0.80 GB/s    18.45 GB/s      210.85 GB/s
MPEG Encoder          0.47 GB/s     2.46 GB/s      121.05 GB/s
Polygon Rendering     0.78 GB/s     4.06 GB/s      102.46 GB/s
QR Decomposition      0.46 GB/s     3.67 GB/s      234.57 GB/s


Data Parallelism is easier than ILP

Kernel 1 to 8 Cluster Speedup

FFT (1024) 6.4

DCT (8x8) 7.8

Blockwarp (8x8) 7.2

Transform () 8.0

Harmonic Mean 7.3


Conventional Approaches to Data-Dependent Conditional Execution

[Diagrams: three ways of handling a data-dependent condition x>0 that selects between blocks B/J and C/K after block A. (1) A data-dependent branch. (2) Speculation - a mispredicted branch ("whoops") wastes a speculative loss of D x W, on the order of 100s of operations. (3) Predication - "if y" / "if ~y" guards on both paths give an exponentially decreasing duty factor.]


Zero-Cost Conditionals

Most approaches to conditional operations are costly:
– Branching control flow - dead issue slots on mispredicted branches
– Predication (SIMD select, masked vectors) - a large fraction of execution 'opportunities' goes idle
Conditional streams
– append an element to an output stream depending on a case variable

[Diagram: a value stream and a {0,1} case stream combine to produce the output stream.]
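In plain C++ terms (streams modeled as vectors; on Imagine the append is performed by the clusters and the conditional-stream hardware), a conditional output stream behaves like this:

#include <cstddef>
#include <vector>

// Append values[i] to the output only when cases[i] is 1; the result is a
// dense stream with no idle execution "opportunities" for later kernels.
std::vector<int> conditional_output(const std::vector<int>& values,
                                    const std::vector<int>& cases) {
    std::vector<int> out;
    for (std::size_t i = 0; i < values.size(); ++i)
        if (cases[i] != 0)
            out.push_back(values[i]);
    return out;
}

Downstream kernels then run at full duty factor on the compacted stream instead of idling on predicated-off elements.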


Modern DRAM

Synchronous
Three-dimensional
– Bank
– Row
– Column
Shared address lines
Shared data lines
Not truly random access

[Diagram: banks 0 through N, each a memory array with a row decoder, sense amplifiers (row buffer), and column decoder, sharing the address and data lines; banks allow concurrent access, and commands can be pipelined.]
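For illustration, a word address splits into the three DRAM coordinates like this (plain C++; the field widths are assumptions for the sketch, not the parts Imagine uses):

#include <cstdint>
#include <cstdio>

struct DramAddr { unsigned bank, row, col; };

// Assumed field widths for the sketch: 9 column bits, 2 bank bits,
// and the remaining bits selecting the row.
DramAddr split(std::uint32_t word_addr) {
    const unsigned kColBits = 9, kBankBits = 2;
    DramAddr a;
    a.col  =  word_addr                        & ((1u << kColBits) - 1);
    a.bank = (word_addr >> kColBits)           & ((1u << kBankBits) - 1);
    a.row  =  word_addr >> (kColBits + kBankBits);
    return a;
}

int main() {
    DramAddr a = split(0x123456u);
    std::printf("bank=%u row=%u col=%u\n", a.bank, a.row, a.col);
    return 0;
}

Two references hit the same row buffer exactly when their bank and row fields match, which is what the access scheduling on the next slide exploits.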


DRAM Throughput and Latency

P: Bank Precharge
A: Row Activation
C: Column Access

[Timing diagrams for the same eight references, labeled (bank, row, column): (0,0,0), (1,1,2), (1,0,1), (1,1,1), (1,0,0), (0,1,3), (0,0,1), (0,1,0). Serviced in order, each reference pays its own precharge-activate-access sequence and the set completes in 56 DRAM cycles. With access scheduling, commands to different banks overlap and open rows are reused, so the same set completes in 19 DRAM cycles.]
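To make the reordering idea concrete, here is a minimal sketch in plain C++ of one simple policy: serve references that hit the currently open row of their bank before those that need a precharge and activation. Imagine's memory access scheduler is more sophisticated (it pipelines the P/A/C commands per bank under the DRAM timing constraints), so this is only an illustration of the idea.

#include <algorithm>
#include <vector>

struct Ref { unsigned bank, row, col; };

// open_row[b] holds the row currently latched in bank b's sense amplifiers.
// Serve row hits first; among equals, keep the original order.
std::vector<Ref> reorder(std::vector<Ref> pending,
                         const std::vector<unsigned>& open_row) {
    std::stable_sort(pending.begin(), pending.end(),
                     [&](const Ref& a, const Ref& b) {
                         bool hit_a = open_row[a.bank] == a.row;
                         bool hit_b = open_row[b.bank] == b.row;
                         return hit_a && !hit_b;   // row hits come first
                     });
    return pending;
}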


Streaming Memory System

Stream load/store architecture
Memory access scheduling
– Out-of-order access
– High stream throughput

[Diagram: address generators (taking index and data streams) feed per-bank buffers in front of DRAM banks 0 through n; returning data passes through a reorder buffer.]


Streaming Memory Performance

[Chart: bank buffer effectiveness - cycles per access (0 to 2.0) versus buffer size (1, 2, 4, 8, 16, 32, 64, infinite).]


Performance

[Charts: memory bandwidth (MB/s, 0-2000) for FFT, Depth, MPEG, Tex, and the weighted mean under the scheduling policies in-order, first-ready, col/open, col/closed, row/open, and row/closed.]

Application trace bandwidth: 80-90% improvement
Application bandwidth: 35% improvement


Evaluation Tools

Simulators
– Imagine RTL model (Verilog)
– isim: cycle-accurate simulator (C++)
– idebug: stream-level simulator (C++)
Compiler tools

– iscd: kernel scheduler (communication scheduling)

– istream: stream scheduler (SRF/memory management)

– schedviz: visualizer


Stream Programming

Stream-level
– Host Processor

– C++

– istream

load (datin1, mem1); // Memory to SRF

load (datin2, mem2);

op (ADDSUB, datin1, datin2, datout1, datout2);

send (datout1, P2, chan0); // SRF to Network

send (datout2, P3, chan1);

Kernel-level
– Clusters

– C-like Syntax

– iscd

loop(input_stream0) {

input_stream0 >> a;

input_stream1 >> b;

c = a + b;

d = a - b;

output_stream0 << c;

output_stream1 << d;

}
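For readers unfamiliar with the stream syntax, the kernel above is equivalent to the following plain C++ (streams modeled as vectors; the real kernel runs the loop body across all eight clusters in SIMD):

#include <cstddef>
#include <vector>

void addsub(const std::vector<int>& in0, const std::vector<int>& in1,
            std::vector<int>& out0, std::vector<int>& out1) {
    for (std::size_t i = 0; i < in0.size(); ++i) {  // loop(input_stream0)
        int a = in0[i];                             // input_stream0 >> a;
        int b = in1[i];                             // input_stream1 >> b;
        out0.push_back(a + b);                      // output_stream0 << c;
        out1.push_back(a - b);                      // output_stream1 << d;
    }
}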


Stereo Depth Extractor

[Execution profiles: cluster and memory-unit occupancy over time for the two phases of the application. In the convolution phase the clusters alternate UNPACK, 7x7 CONV, and 3x3 CONV kernels while the two memory units overlap row loads and stores; in the disparity-search phase the clusters run back-to-back BlockSAD kernels while loads and stores of convolved rows proceed in parallel.]

Convolutions:
– Load original packed row
– Unpack (8 bit to 16 bit)
– 7x7 convolve
– 3x3 convolve
– Store convolved row

Disparity search:
– Load convolved rows
– Calculate block SADs at different disparities
– Store best disparity values
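As a reference for the disparity-search step, here is a scalar sketch in plain C++ of picking the best disparity per window by block SAD; the window width W and the disparity range are parameters of the sketch, not the values used by the application, and on Imagine this work is blocked into SIMD kernels across the clusters.

#include <cstdlib>
#include <limits>
#include <vector>

// For each window start x in a row, pick the disparity d whose W-wide window
// has the smallest sum of absolute differences between the two rows.
std::vector<int> best_disparity(const std::vector<int>& left,
                                const std::vector<int>& right,
                                int max_disp, int W) {
    std::vector<int> disp(left.size(), 0);
    for (int x = 0; x + W <= static_cast<int>(left.size()); ++x) {
        int best = std::numeric_limits<int>::max();
        for (int d = 0; d <= max_disp &&
                        x + d + W <= static_cast<int>(right.size()); ++d) {
            int sad = 0;
            for (int k = 0; k < W; ++k)
                sad += std::abs(left[x + k] - right[x + d + k]);
            if (sad < best) { best = sad; disp[x] = d; }
        }
    }
    return disp;
}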


7x7 Convolution Kernel

[VLIW schedule for the 7x7 convolution kernel: one column per unit (ADD0-ADD2, MUL0, MUL1, DIV0, INP0-INP3, OUT0, OUT1, SP_0, COM0, MC_0, JUK0, VAL0 - ALUs, communication/scratchpad, and streams), one row per instruction. The inner loop is software-pipelined into three stages (pipeline stage 0, 1, 2) and is dominated by IMULRND16, IADDS16, SHUFFLE, SELECT, and PASS operations, with stream DATA_IN/DATA_OUT, communication (COMMUCPERM, COMMUCDATA), scratchpad, and conditional-stream operations overlapped.]


Performance

[Chart: sustained performance in GOPS.]

depth (16-bit application)           12.1
mpeg (16-bit application)            10.2
qrd (floating-point application)     11.0
dct (16-bit kernel)                  23.9
convolve (16-bit kernel)             25.6
fft (floating-point kernel)           7.0


Power

GOPS/W: depth 4.6, mpeg 6.9, qrd 4.1, dct 10.2, convolve 9.6, fft 2.4, average 6.3

[Chart: power in watts (0-3.0) per application, broken down into clock, clusters, SRF, pins, memory system, and other; the accompanying pie chart shows shares of 63%, 24%, 5%, 5%, 2%, and 1%.]


Implementation

Goals
– < 0.7 cm² chip in 0.18 µm CMOS
– Operates at 500 MHz
Status
– Cycle-accurate C++ simulator
– Behavioral Verilog
– Detailed floorplan with all critical signals planned
– Schematics and layout for key cells
– Timing analysis of key units and communication paths


The Imagine Stream Processor

[Block diagram, repeated from earlier: host processor, stream controller, microcontroller, eight ALU clusters (0-7), stream register file, network interface, and the streaming memory system with four SDRAMs.]


Imagine Floorplan

[Floorplan: clusters 0-7 with the microcontroller (Clust/UC) in the core, the eight SRF banks and the stream buffers with SRF/SB control and address generators alongside, and the memory system/host interface/network interface (MS/HI/NI) with the stream and reorder buffers, four memory banks (MB0-MB3), the BitCoder, and routing channels around the edges. Marked dimensions: 5.9 mm, 5 mm, 2 mm, 1 mm, 1 mm, 0.6 mm.]


Imagine Pipeline

[Pipeline diagram: microcontroller stages F1, F2, D, and SEL, followed by the cluster stages R and X1 through X4 with writeback (W) at depths that vary by operation; the stream buffers connect the clusters to the SRF, which connects to memory (MEM).]


Communication paths

[Diagram: the floorplan blocks (clusters, SRF and stream buffers, SRF/SB control and address generators, host interface, memory system, BitCoder, network interface) and the pipeline stages (F1, F2, D, SEL, R, X1-X4, W, MEM), with the control routing and communication paths between them marked.]


Adder Layout

[Layout plot: an integer adder with its local register file, 460 µm by 146.7 µm.]


Imagine Team

Professor William J. Dally

– Scott Rixner

– Ghazi Ben Amor

– Ujval Kapasi

– Brucek Khailany

– Peter Mattson

– Jinyung Namkoong

– John Owens

– Brian Towles

– Don Alpert (Intel)

– Chris Buehler (MIT)

– JP Grossman (MIT)

– Brad Johanson

– Dr. Abelardo Lopez-Lagunas

– Ben Mowery

– Manman Ren


Imagine Summary

Imagine operates on streams of records
– simplifies programming
– exposes locality and concurrency
Compound stream operations
– perform a subroutine on each stream element
– reduce global register bandwidth
Bandwidth hierarchy
– use bandwidth where it's inexpensive
– distributed and hierarchical register organization
Conditional stream operations
– sort elements into homogeneous streams
– avoid predication or speculation


Imagine gives high performance with low power and flexible programming

Matches capabilities of communication-limited technology to demands of signal and image processing applications

Performance
– compound stream operations realize >10 GOPS on key applications
– can be extended by partitioning an application across several Imagines (TFLOPS on a circuit board)
Power
– three-level register hierarchy gives 2-10 GOPS/W
Flexibility
– programmed in "C"
– streaming model
– conditional stream operations enable applications like sort


MPEG-2 Encoder

[Dataflow diagram of the MPEG-2 encoder: BlockSAD against past and future reference frames, pick the best motion vector, color conversion, then DCT/Q, RLE, and VLC; an IQ/IDCT and correlate path reconstructs the frame and saves it to memory. Paths are labeled for I and P frames ("all" vs. "P"), with rate-control feedback closing the loop.]


Advantages of Implementing MPEG-2 Encode on Imagine

Block matching well suited to Imagine
– Clusters provide high arithmetic bandwidth
– SRF provides necessary data locality and bandwidth
– Good performance
Programmable solution
– Simplifies system design
– Allows tradeoff between complexity of the block matcher vs. bitrate and picture quality
Real time
Power efficient


MPEG-2 Challenges

VLC (Huffman coding)
– Difficult and inefficient to implement on SIMD clusters
– A separate module providing LUT and shift-register capabilities is used instead
Rate control
– Difficult because multiple macroblocks are encoded in parallel
– Must be performed at a coarser granularity (impact on picture quality?)