Stream Architecture: Rethinking Media Processor Design

Rice University

Computer Systems Laboratory

Scott Rixner

April 9, 2001

Media Processing

Video/image compression & decompression
– MPEG, JPEG, ...
Signal processing
– DSL modems, cellular base stations, ...
Image synthesis
– Polygon rendering, image-based rendering, ...
Image understanding
– Face recognition, depth extraction, ...

Stereo Depth Extraction

640x480 @ 30 fps
Requirements
– 11 GOPS
Imagine stream processor
– 12.1 GOPS, 4.6 GOPS/W

[Figure: left camera image, right camera image, and the resulting depth map]
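As a rough sanity check on the figures quoted for stereo depth extraction, the sketch below divides the 11 GOPS requirement by the pixel rate and derives the power implied by 12.1 GOPS at 4.6 GOPS/W. It assumes the depth map is produced at the full 640x480 resolution, 30 times per second; the variable names are illustrative and do not come from the talk.

```python
# Back-of-the-envelope check of the stereo depth extraction numbers.
# Assumption: one 640x480 depth map is produced per frame at 30 fps.
WIDTH, HEIGHT, FPS = 640, 480, 30
REQUIRED_OPS_PER_SEC = 11e9      # 11 GOPS requirement
IMAGINE_OPS_PER_SEC = 12.1e9     # 12.1 GOPS quoted for Imagine
IMAGINE_GOPS_PER_WATT = 4.6      # quoted power efficiency

pixels_per_sec = WIDTH * HEIGHT * FPS
ops_per_pixel = REQUIRED_OPS_PER_SEC / pixels_per_sec
implied_power_watts = (IMAGINE_OPS_PER_SEC / 1e9) / IMAGINE_GOPS_PER_WATT

print(f"{pixels_per_sec:,} output pixels per second")
print(f"about {ops_per_pixel:.0f} operations per output pixel")
print(f"about {implied_power_watts:.1f} W implied power")
```

Under these assumptions the requirement works out to roughly 1,200 operations per output pixel, which is consistent with the compute-intensive character of media processing described in the talk.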

Outline

Stream Processing
VLSI Constraints
Register Organization
Imagine
Conclusions

Media Processing Characteristics

Low-precision data
– 24% 8-bit integer operations
– 29% 16-bit integer operations
Abundant data-parallelism
Little global data reuse
– Average of 1.5 references per global data word
Numerous computations per global reference
– 50-500 operations per global data reference

Stream Processing

[Figure: stream diagram for depth extraction. Legend: Kernel, Stream, Input Data, Output Data. Visible labels: Image 1, convolve, convolve, Depth Map]

Little data reuse (pixels never revisited)
Highly data parallel (output pixels not dependent on other output pixels)
Compute intensive (>60 operations per memory reference)

Locality and Concurrency

[Figure: depth-extraction stream diagram ending in the depth map]

Operations within a kernel operate on local data

Streams expose data parallelism

Kernels can be partitioned across chips to exploit control parallelism
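The stream model can be illustrated with a small sketch in Python (not the language used to program a stream processor). Each kernel is a generator that consumes its input streams element by element and keeps only a small local window, so all intermediate data stays local to the kernel. The kernel names `convolve` and `abs_diff`, the filter taps, and the sample data are assumptions for illustration; a real depth extractor operates on 2-D images.

```python
from typing import Iterable, Iterator, Sequence


def convolve(stream: Iterable[float], taps: Sequence[float]) -> Iterator[float]:
    """Kernel: 1-D convolution over a stream using a small local window."""
    window = [0.0] * len(taps)          # local state, never leaves the kernel
    for x in stream:
        window = [x] + window[:-1]      # shift the newest element in
        yield sum(w * t for w, t in zip(window, taps))


def abs_diff(a: Iterable[float], b: Iterable[float]) -> Iterator[float]:
    """Kernel: element-wise absolute difference of two streams."""
    for x, y in zip(a, b):
        yield abs(x - y)


# Hypothetical one-row "images" from the left and right cameras.
left_row = [1.0, 2.0, 3.0, 4.0]
right_row = [1.0, 3.0, 3.0, 5.0]
taps = [0.5, 0.5]

# Kernels are chained by passing streams between them; each output element
# depends only on a few input elements, never on other outputs.
difference = list(abs_diff(convolve(left_row, taps), convolve(right_row, taps)))
print(difference)
```

Because each output element is independent of the others, the same kernels could be applied to many stream elements at once, which is the data parallelism that streams expose.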

Sony PlayStation2

[Figure: block diagram with labels Emotion Engine, MIPS Core, Graphics Synthesizer, RDRAM, I/O, DMAC, etc., and Display]

Special vs. General Purpose

Special Purpose
– Fixed function
– High performance
General Purpose
– Programmable
– Insufficient performance

Instruction Cache

Register Files Dwarf ALUs

[Figure, with a 1 cm scale reference: a register file serving N arithmetic units; the size of 1 ALU compared with the size of the register file needed to support 1, 4, 16, and 32 ALUs]

Register File Area

Each cell requires:
– 1 word line per port
– 1 bit line per port
Each cell grows as $p^2$
R registers in the file
Area: $p^2 R \propto N^3$

[Figure: register bit cell on a 1-wire grid, showing the bit lines]

Register File Access Delay

Signal must traverse:
– Word line to access cell
– Bit line to transfer data
Wire capacitance dominates
Delay: $p R^{1/2} \propto N^{3/2}$

[Figure: register file array showing a word line and a bit line]

Register File Power Dissipation

100% utilization requires driving all $p R^{1/2}$ bit lines
Wire capacitance dominates
Power: $p^2 R \propto N^3$

[Figure: register file array showing its $p R^{1/2}$ bit lines]

[Figure: plot versus number of arithmetic units (axis from 1 to 1000), with curves for T=1 and T=40]

Centralized Register Organization

– Area, Power $\propto N^3$, Delay $\propto N^{3/2}$
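To make the scaling of a central register file concrete, the sketch below evaluates the area, delay, and power expressions from the preceding slides. It assumes that the number of ports p and the number of registers R both grow linearly with the number of arithmetic units N. The per-unit constants (3 ports and 8 registers per unit) are placeholder assumptions, not values from the talk, so only the ratios between results are meaningful.

```python
def central_rf_cost(n_alus, ports_per_alu=3, regs_per_alu=8):
    """Relative area, delay, and power of a central register file.

    ports_per_alu and regs_per_alu are illustrative assumptions; only
    ratios between results are meaningful.
    """
    p = ports_per_alu * n_alus      # ports grow linearly with N
    r = regs_per_alu * n_alus       # registers grow linearly with N
    area = p ** 2 * r               # each cell grows as p^2, R cells in total
    delay = p * r ** 0.5            # word line and bit line length
    power = p ** 2 * r              # p*sqrt(R) bit lines, each p*sqrt(R) long
    return area, delay, power


base = central_rf_cost(1)
for n in (1, 4, 16, 32):
    area, delay, power = central_rf_cost(n)
    print(f"N={n:2d}  area x{area / base[0]:.0f}  "
          f"delay x{delay / base[1]:.1f}  power x{power / base[2]:.0f}")
```

Quadrupling the number of arithmetic units multiplies area and power by 64 and delay by 8 in this model, which is why a single central register file quickly dwarfs the units it serves.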

Partitioned Organizations

SIMD
– Data-parallel axis
Distributed Register Files (DRF)
– Instruction-level parallel axis
Hierarchical
– Memory hierarchy axis
Stream
– Optimizing for streams

[Figure: C SIMD clusters, each containing N/C arithmetic units]

SIMD Register Organization

– Area, Power $\propto N^3/C^2$, Delay $\propto (N/C)^{3/2}$

[Figure: C SIMD clusters, each containing N/C arithmetic units; plot legend: SIMD (8 clusters), Central, SIMD/DRF]

Distributed Register Organization

– Area, Power $\propto N^2$, Delay $\propto N$

Combining SIMD and DRF

[Figure: scalar organizations with N arithmetic units beside SIMD organizations with C SIMD clusters of N/C arithmetic units each; labels: Scalar, SIMD, Central]

Hierarchical Register Organization

– Area, Power $\propto N^3$, Delay $\propto N^{3/2}$

[Figure: plot with curves labeled T=40 and Central]

Hierarchical Organizations

[Figure: hierarchical organizations, scalar (N arithmetic units) and SIMD (C SIMD clusters of N/C arithmetic units each); labels: Scalar, SIMD, Central]

Stream Register Organization

– Area, Power $\propto N^2/C$, Delay $\propto N/C$

[Figure: C SIMD clusters; plot legend: Stream, Hierarchical, Central]

Stream Organizations

[Figure: stream organizations, scalar (N arithmetic units) and SIMD (C SIMD clusters of N/C arithmetic units each); labels: Scalar, SIMD, Central]

Comparison of Organizations

[Figure: plots comparing the Central, SIMD, SIMD/DRF, Hier/SIMD/DRF, and Stream/SIMD/DRF organizations]

48 ALUs (32-bit), 500 MHz
Stream organization improves on the central organization by
– Area: 195x, Delay: 20x, Power: 430x
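The asymptotic expressions quoted for each register organization can be evaluated side by side. The sketch below uses N = 48 arithmetic units, as on the slide, and assumes C = 8 SIMD clusters for the example. Constant factors are ignored, so the ratios it prints are not expected to reproduce the measured 195x, 20x, and 430x figures; they only show how the organizations rank.

```python
N = 48   # arithmetic units (from the slide)
C = 8    # SIMD clusters (assumption for this example)

# (area and power, delay) proportionalities with constant factors ignored
organizations = {
    "central": (N ** 3,          N ** 1.5),
    "simd":    (N ** 3 / C ** 2, (N / C) ** 1.5),
    "drf":     (N ** 2,          N),
    "stream":  (N ** 2 / C,      N / C),
}

central_area, central_delay = organizations["central"]
ratios = {
    name: (central_area / area, central_delay / delay)
    for name, (area, delay) in organizations.items()
}
for name, (area_gain, delay_gain) in ratios.items():
    print(f"{name:8s} area/power gain x{area_gain:6.1f}  "
          f"delay gain x{delay_gain:5.1f}")
```

In this simplified model the stream organization gains a factor of N times C in area and power over the central organization, since it combines partitioning along both the SIMD and the distributed-register-file axes.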

Performance

[Figure: performance of the CENTRAL, SIMD, SIMD/DRF, HIER., and STREAM organizations on Convolve, DCT, Transform, Shader, FIR, FFT, and their mean]

16% performance drop (8% with latency constraints)
180x improvement

Stream Architecture

Stream Processing
– Matched to media processing
– Exposes locality and concurrency
Stream Register Organization
– Efficiency of special-purpose hardware
– Optimized for streaming applications
Data bandwidth
– Bandwidth hierarchy
– Memory access scheduling
– Conditional streams

The Imagine Stream Processor

[Figure: block diagram of the Imagine stream processor with labels Host Processor, Stream Controller, Stream Register File, Network Interface, Streaming Memory System, and four SDRAMs]

Arithmetic Clusters

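[Figure: arithmetic cluster with functional units (+, +, *, *, /), Local Register File, Scratch-pad Register File, Communication Unit, and Cross Point; streams arrive from the SRF and return to the SRF]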

Bandwidth Hierarchy

41.2 32-bit operations per word of memory bandwidth

[Figure: bandwidth hierarchy with 2 GB/s, 32 GB/s, and 544 GB/s (ALU cluster) levels]
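The three bandwidth figures can be expressed relative to memory bandwidth to show how steeply the hierarchy widens. The sketch below assumes that 2 GB/s is the memory level and 32 GB/s is the stream register file level, which is inferred from the order of the numbers rather than stated in the text; the dictionary keys are illustrative names.

```python
# Peak bandwidth at each level of the hierarchy, in GB/s.
# Assumption: the first two values belong to memory and the stream
# register file respectively (inferred from their order on the slide).
bandwidth_gb_per_s = {
    "memory": 2,
    "stream register file": 32,
    "ALU clusters": 544,
}

memory_bw = bandwidth_gb_per_s["memory"]
relative = {level: bw / memory_bw for level, bw in bandwidth_gb_per_s.items()}

for level, ratio in relative.items():
    print(f"{level:22s} {bandwidth_gb_per_s[level]:4d} GB/s  "
          f"({ratio:.0f}x memory)")
```

Under this reading, each step toward the arithmetic units provides roughly an order of magnitude more bandwidth, so an application runs well only if most of its references are satisfied close to the ALUs.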

Stream Recirculation

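[Figure: stream recirculation for video encoding across Memory (or I/O), the Stream Register File, and the Arithmetic Clusters. Kernels shown: Color Convert, Run-Level Encoding, Variable Length Coding. Streams shown: Input Image, RGB Pixels, Luminance Pixels, Chrominance Pixels, Transformed Luminance, Transformed Chrominance, Luminance Reference, Chrominance Reference, Reference Luminance, Reference Chrominance, RLE Stream, Bitstream, Encoded Bitstream]

Data Referenced: 835KB (memory or I/O), 4.8MB (stream register file), 154.4MB (arithmetic clusters)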

Bandwidth Demands of FIR Filter

References (bytes)   Stream    DSP               MMX
Memory               ≤ 4.03    36.0 (8.9x)       49.9 (12.4x)
Global RF            4.03      664.1 (164.8x)    296.7 (73.6x)
Local RF             420.02    N/A               N/A

Bandwidth Utilization of FIR Filter

                     Stream    DSP     MMX
Memory (GB/s)        ≤ 2.62
Global RF (GB/s)     2.62
Local RF (GB/s)      273.25    N/A     N/A
Performance (GOPS)   17.57     1.01    1.47

Performance

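[Figure: performance on depth, mpeg, qrd, dct, convolve, and fft, grouped as 16-bit applications, floating-point application, 16-bit kernels, and floating-point kernel; visible bar labels: 23.9, 25.6]

Power efficiency (GOPS/W)
depth     4.6
mpeg      6.9
qrd       4.1
dct       10.2
convolve  9.6
fft       2.4
average   6.3

[Figure: power breakdown by component: Clock, Clust, SRF, Pins, Mem Sys, Other]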

Relative Performance and Power Efficiency

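[Figure: relative performance and power efficiency. Dhrystone comparison: Imagine, AD 21160, TI 'C6701, SA-1100; visible bar labels: 1830, 412. FFT comparison: Jaguar II, Imagine, DSP-224, PULSAR, 'C67 DSP. Legend: Programmable, Special-Purpose, Imagine. Axis labels: Performance, Power Efficiency]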

Imagine Floorplan

Tapeout ~Q2 ’01
21 million transistors
– 6M SRF SRAM
– 6M UC SRAM
– 6M Clusters
– 3M Other
Target: 32 FO4
– 300 MHz at SSSS
– 500 MHz at TTSS
TI GS30KA:
– 0.15 µm Ldrawn
457 Signal Pins

[Figure: chip floorplan with labels Micro-Controller, ALU Cluster 0 through ALU Cluster 7, Host Int, Network Interface, Mem Bank, Addr Gen, JTAG/BIST, and Stream Ctrl]

Imagine Team

William J. Dally

Ujval Kapasi

Brucek Khailany

Peter Mattson

Jinyung Namkoong

John Owens

Ben Serebrin

Brian Towles

Scott Rixner

Don Alpert (Intel)

Ghazi Ben Amor

Chris Buehler (MIT)

JP Grossman (MIT)

Brad Johanson

Abelardo Lopez-Lagunas

Ben Mowery

Manman Ren

Conclusions

Media Processing
– Little data reuse
– Highly data parallel
– Compute intensive
VLSI
– Stream register organization
– Bandwidth hierarchy
Imagine
– Stream architecture
– 10 GOPS sustained application performance
– 5 GOPS/W application power efficiency
