Stream Architecture: Rethinking Media Processor Design
Rice University
Computer Systems Laboratory
Scott Rixner
April 9, 2001
Scott Rixner Stream Architecture 2
Media Processing
Video/image compression & decompression
– MPEG, JPEG, ...
Signal Processing
– DSL modems, cellular base stations, ...
Image synthesis
– Polygon rendering, image-based rendering, ...
Image understanding
– Face recognition, depth extraction, ...
Stereo Depth Extraction
640x480 @ 30 fps
Requirements
– 11 GOPS
Imagine stream processor
– 12.1 GOPS, 4.6 GOPS/W
[Figure: left camera image and right camera image, with the resulting depth map]
Outline
Stream Processing
VLSI Constraints
Register Organization
Imagine
Conclusions
Media Processing Characteristics
Low-precision data
– 24% 8-bit integer operations
– 29% 16-bit integer operations
Abundant data-parallelism
Little global data reuse
– Average of 1.5 references per global data word
Numerous computations per global reference
– 50-500 operations per global data reference
Stream Processing
[Figure: stream program in which Image 0 and Image 1 each pass through two convolve kernels and then a SAD kernel that produces the depth map. Legend: kernel, stream, input data, output data]
Little data reuse (pixels never revisited)
Highly data parallel (output pixels not dependent on other output pixels)
Compute intensive (>60 operations per memory reference)
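The kernel-and-stream structure described here can be sketched in ordinary software. The following is a minimal Python sketch, not Imagine's programming model: the function names (`convolve_rows`, `sad`, `depth_map`), the row-list image representation, the 3-tap filter, the disparity convention (a left pixel at x matches a right pixel at x - d), and the search range are all illustrative assumptions.

```python
def convolve_rows(stream, taps):
    """Kernel: filter each row with a 1-D FIR; edge pixels are clamped."""
    half = len(taps) // 2
    out = []
    for row in stream:
        n = len(row)
        out.append([
            sum(t * row[min(max(x + k - half, 0), n - 1)]
                for k, t in enumerate(taps))
            for x in range(n)
        ])
    return out


def sad(left, right, max_disparity, window=1):
    """Kernel: per pixel, pick the disparity with the smallest sum of
    absolute differences over a small horizontal window."""
    out = []
    for lrow, rrow in zip(left, right):
        n = len(lrow)
        best_row = []
        for x in range(n):
            best_d, best_cost = 0, float("inf")
            for d in range(max_disparity + 1):
                cost = 0
                for w in range(-window, window + 1):
                    xl = min(max(x + w, 0), n - 1)
                    xr = min(max(x + w - d, 0), n - 1)
                    cost += abs(lrow[xl] - rrow[xr])
                if cost < best_cost:
                    best_d, best_cost = d, cost
            best_row.append(best_d)
        out.append(best_row)
    return out


def depth_map(image0, image1, taps=(1, 2, 1), max_disparity=3):
    """Stream program: two convolve kernels per image, then SAD."""
    f0 = convolve_rows(convolve_rows(image0, taps), taps)
    f1 = convolve_rows(convolve_rows(image1, taps), taps)
    return sad(f0, f1, max_disparity)
```

Each output pixel depends only on a small neighborhood of input pixels, so rows (and pixels within a row) could be processed independently, and each input pixel is consumed once by the first kernel and never fetched from memory again.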
Locality and Concurrency
[Figure: stream program in which Image 0 and Image 1 each pass through two convolve kernels and then a SAD kernel that produces the depth map]
Operations within a kernel operate on local data
Streams expose data parallelism
Kernels can be partitioned across chips to exploit control parallelism
Sony PlayStation2
[Figure: block diagram of the Emotion Engine (MIPS core, FPU, VPU0, VPU1, IPU; RDRAM, I/O, DMAC, etc.), the Graphics Synthesizer, and the display]
Special vs. General Purpose
Special Purpose
– Fixed function
– High performance
General Purpose
– Programmable
– Insufficient performance
[Figure: processor diagram labeled with instruction cache, IR, IP, and registers]
Register Files Dwarf ALUs
[Figure: area comparison against a 1 cm scale for N arithmetic units: the size of 1 ALU, the size of the register file needed to support 1 ALU, and the size of the register file needed to support 32 ALUs; cases labeled 4, 16, and 32 ALUs]
Register File Area
Each cell requires:
– 1 word line per port
– 1 bit line per port
Each cell grows as $p^2$
R registers in the file
Area: $p^2 R \propto N^3$
[Figure: register bit cell on a 1-wire grid, crossed by p word lines and p bit lines; labeled dimensions w, h, and p]
Register File Access Delay
Signal must traverse:
– Word line to access cell
– Bit line to transfer data
Wire capacitance dominates
Delay: $p R^{1/2} \propto N^{3/2}$
[Figure: register file array with one word line and one bit line highlighted; each side is labeled $p\sqrt{R}$]
Register File Power Dissipation
100% utilization requires driving all $p R^{1/2}$ bit lines
Wire capacitance dominates
Power: $p^2 R \propto N^3$
[Figure: register file array with $p\sqrt{R}$ bit lines; each side is labeled $p\sqrt{R}$]
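The three scaling laws for a single shared register file (area and power proportional to $p^2 R$, delay proportional to $p R^{1/2}$) turn into powers of N once the port count p and the register count R are both taken to grow linearly with the number of arithmetic units. A minimal Python sketch follows. The per-ALU constants (3 ports, 16 registers) and the function names are illustrative assumptions, not values from the slides, and all technology constants are dropped.

```python
import math

PORTS_PER_ALU = 3       # assumption: two reads and one write per ALU
REGISTERS_PER_ALU = 16  # assumption: illustrative only


def central_rf_costs(n_alus):
    """Return unitless (area, delay, power) for one shared register file."""
    p = PORTS_PER_ALU * n_alus
    r = REGISTERS_PER_ALU * n_alus
    area = p ** 2 * r          # each cell grows as p^2, and there are R cells
    delay = p * math.sqrt(r)   # wire length across an array p*sqrt(R) on a side
    power = p ** 2 * r         # p*sqrt(R) bit lines, each about p*sqrt(R) long
    return area, delay, power


def growth(n_from, n_to):
    """Ratio of each cost when going from n_from to n_to arithmetic units."""
    a0, d0, w0 = central_rf_costs(n_from)
    a1, d1, w1 = central_rf_costs(n_to)
    return a1 / a0, d1 / d0, w1 / w0
```

Under these assumptions, `growth(4, 8)` shows that doubling the number of arithmetic units multiplies area and power by 8 and delay by about 2.8, independent of the per-ALU constants chosen.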
Centralized Register Organization
– Area, Power $\propto N^3$, Delay $\propto N^{3/2}$
[Figure: log-log plot against number of arithmetic units (1 to 1000), vertical axis 0.1 to 1000, with curves labeled T=1 and T=40. Diagram: N arithmetic units]
Partitioned Organizations
SIMD
– Data-parallel axis
Distributed Register Files (DRF)
– Instruction-level parallel axis
Hierarchical
– Memory hierarchy axis
Stream
– Optimizing for streams
[Figure: four organization diagrams, drawn either with N arithmetic units or with C SIMD clusters of N/C arithmetic units each]
SIMD Register Organization
– Area, Power $\propto N^3/C^2$, Delay $\propto (N/C)^{3/2}$
[Figure: log-log plot against number of arithmetic units (1 to 1000), vertical axis 0.1 to 1000, with curves labeled SIMD (8 Clusters) and Central. Diagram: C SIMD clusters of N/C arithmetic units each]
Distributed Register Organization
– Area, Power $\propto N^2$, Delay $\propto N$
[Figure: log-log plot against number of arithmetic units (1 to 1000), vertical axis 0.1 to 1000, with curves labeled Central, SIMD/DRF, and DRF. Diagram: N arithmetic units]
Combining SIMD and DRF
[Figure: 2x2 grid of organizations. Columns: Scalar (N arithmetic units) and SIMD (C SIMD clusters of N/C arithmetic units each). Rows: Central and DRF]
Hierarchical Register Organization
– Area, Power $\propto N^3$, Delay $\propto N^{3/2}$
[Figure: log-log plot against number of arithmetic units (1 to 1000), vertical axis 0.1 to 1000; curve labels include Central, T=1, T=40, and Hierarchical T=40. Diagram: N arithmetic units]
Hierarchical Organizations
[Figure: 2x2 grid of hierarchical organizations. Columns: Scalar (N arithmetic units) and SIMD (C SIMD clusters of N/C arithmetic units each). Rows: Central and DRF]
Stream Register Organization
– Area, Power $\propto N^2/C$, Delay $\propto N/C$
[Figure: log-log plot against number of arithmetic units (1 to 1000), vertical axis 0.1 to 1000, with curves labeled Stream, Hierarchical, and Central. Diagram: C SIMD clusters of N/C arithmetic units each]
Stream Organizations
[Figure: 2x2 grid of stream organizations. Columns: Scalar (N arithmetic units) and SIMD (C SIMD clusters of N/C arithmetic units each). Rows: Central and DRF]
Comparison of Organizations
[Figure: two log-log plots against number of arithmetic units (1 to 1000), vertical axes 0.1 to 1000, each with a marker at 48 arithmetic units. First plot: curves labeled Central, SIMD, SIMD/DRF, Hier/SIMD/DRF, and Stream/SIMD/DRF. Second plot: curves labeled Central, SIMD, SIMD/DRF, and a shared curve for Hier/SIMD/DRF & Stream/SIMD/DRF]
48 ALUs (32-bit), 500 MHz
Stream organization improves central organization by
– Area: 195x, Delay: 20x, Power: 430x
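The asymptotic scaling laws quoted for the central, SIMD, DRF, and stream organizations can be compared side by side. The Python sketch below is an illustration only: it drops every constant factor, so its ratios show the trend and are not expected to reproduce the 195x, 20x, and 430x figures, which depend on implementation constants that these formulas leave out. The dictionary layout and function names are assumptions; C = 8 matches the 8-cluster SIMD curve.

```python
def scaling(n, c):
    """Unitless (area_or_power, delay) per organization, constants dropped."""
    return {
        "central": (n ** 3, n ** 1.5),
        "simd": (n ** 3 / c ** 2, (n / c) ** 1.5),
        "drf": (n ** 2, n),
        "stream": (n ** 2 / c, n / c),
    }


def improvement_over_central(n, c, organization):
    """Return (area_or_power ratio, delay ratio) relative to central."""
    table = scaling(n, c)
    central_cost, central_delay = table["central"]
    cost, delay = table[organization]
    return central_cost / cost, central_delay / delay
```

For N = 48 and C = 8, `improvement_over_central(48, 8, "stream")` gives an area and power ratio of N*C = 384 and a delay ratio of C*sqrt(N), about 55. The SIMD organization alone gives C^2 = 64 and C^1.5, about 23.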
Performance
[Figure: bar chart of speedup (axis 0.0 to 1.2) for the CENTRAL, SIMD, SIMD/DRF, HIER., and STREAM organizations]
16% Performance Drop (8% with latency constraints)
[Figure: bar chart of performance/area (axis 0 to 250) for the CENTRAL, SIMD, SIMD/DRF, HIER., and STREAM organizations]
180x Improvement
Benchmarks: Convolve, DCT, Transform, Shader, FIR, FFT, Mean
Stream Architecture
Stream Processing
– Matched to media processing
– Exposes locality and concurrency
Stream Register Organization
– Efficiency of special-purpose hardware
– Optimized for streaming applications
Data bandwidth
– Bandwidth hierarchy
– Memory access scheduling
– Conditional streams
[Figure: C SIMD clusters of N/C arithmetic units each]
The Imagine Stream Processor
[Figure: block diagram of the Imagine stream processor. Blocks: host processor, stream controller, network interface (to the network), stream register file, microcontroller, ALU clusters 0 through 7, and a streaming memory system connected to four SDRAMs]
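Arithmetic Clusters
[Figure: one arithmetic cluster. Functional units labeled + + + * * / and CU (communication unit), each fed by local register files; a scratch-pad register file; a cross point connecting the units; connections from the SRF, to the SRF, and to the intercluster network]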
Bandwidth Hierarchy
41.2 32-bit operations per word of memory bandwidth
[Figure: four SDRAMs feed the stream register file at 2 GB/s; the stream register file feeds the ALU clusters at 32 GB/s; bandwidth within the ALU clusters is 544 GB/s]
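The three bandwidth levels can be related with simple arithmetic. A small Python sketch using the figures from the slide; the level names and the function name are assumptions made for illustration.

```python
BANDWIDTH_GBPS = {
    "memory": 2.0,                  # SDRAM to stream register file
    "stream_register_file": 32.0,   # stream register file to ALU clusters
    "alu_clusters": 544.0,          # within the ALU clusters
}


def relative_bandwidth(levels):
    """Express each level's bandwidth as a multiple of memory bandwidth."""
    base = levels["memory"]
    return {name: gbps / base for name, gbps in levels.items()}
```

`relative_bandwidth(BANDWIDTH_GBPS)` gives 1 : 16 : 272, so each step toward the arithmetic units supplies roughly 16 to 17 times more bandwidth than the level before it.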
Stream Recirculation
[Figure: stream program laid out across three columns: memory (or I/O), stream register file, and arithmetic clusters. Kernels: Color Convert, DCT (x2), IDCT (x2), Run-Level Encoding, Variable Length Coding. Streams: Input Image, RGB Pixels, Luminance Pixels, Chrominance Pixels, Transformed Luminance, Transformed Chrominance, Luminance Reference, Chrominance Reference, Reference Luminance Image, Reference Chrominance Image, RLE Stream, Bitstream, Encoded Bitstream]
Data Referenced: 835KB (memory or I/O), 4.8MB (stream register file), 154.4MB (arithmetic clusters)
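Why far more data is referenced on chip than in memory can be illustrated with a toy traffic count: when kernels run back to back, intermediate streams are written to and read from the stream register file (SRF) and never travel to memory. The Python sketch below is a simplified accounting model under that assumption. The function name, the one-word-per-element cost, and the per-element kernels are all hypothetical, and it does not model Imagine's actual traffic.

```python
def run_pipeline(input_stream, kernels):
    """Run kernels back to back, keeping intermediate streams on chip.

    Returns the final stream and word counts for memory and SRF traffic.
    """
    # Load: the input is read from memory once and written into the SRF.
    traffic = {"memory": len(input_stream), "srf": len(input_stream)}
    stream = list(input_stream)
    for kernel in kernels:
        traffic["srf"] += len(stream)      # kernel reads its input from the SRF
        stream = [kernel(x) for x in stream]
        traffic["srf"] += len(stream)      # kernel writes its output to the SRF
    # Store: only the final stream is read from the SRF and written to memory.
    traffic["srf"] += len(stream)
    traffic["memory"] += len(stream)
    return stream, traffic
```

With a 4-element stream and three kernels, memory sees 8 words while the SRF sees 32; the gap widens with every kernel added to the chain.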
Bandwidth Demands of FIR Filter
References (bytes)   Stream    DSP               MMX
Memory               ≤ 4.03    36.0 (8.9x)       49.9 (12.4x)
Global RF            4.03      664.1 (164.8x)    296.7 (73.6x)
Local RF             420.02    N/A               N/A
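The multipliers shown in parentheses are each entry divided by the stream processor's 4.03 bytes. A short Python check of that arithmetic; the dictionary keys and the function name are assumptions made for illustration.

```python
STREAM_BYTES = 4.03  # stream processor value in the memory and global RF rows

REFERENCES = {
    # (processor, level): bytes referenced, taken from the table
    ("DSP", "memory"): 36.0,
    ("MMX", "memory"): 49.9,
    ("DSP", "global_rf"): 664.1,
    ("MMX", "global_rf"): 296.7,
}


def ratios_vs_stream(references, stream_bytes):
    """Return each entry divided by the stream value, to one decimal place."""
    return {key: round(value / stream_bytes, 1)
            for key, value in references.items()}
```

Because the stream processor's memory entry is an upper bound (at most 4.03 bytes), the memory-row multipliers are lower bounds on the actual advantage.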
Bandwidth Utilization of FIR Filter
                     Stream    DSP      MMX
Memory (GB/s)        ≤ 2.62    1.42     2.73
Global RF (GB/s)     2.62      24.88    16.20
Local RF (GB/s)      273.25    N/A      N/A
Performance (GOPS)   17.57     1.01     1.47
Performance
[Figure: bar chart of performance in GOPS (axis 0 to 30), grouped as 16-bit applications, floating-point application, 16-bit kernels, and floating-point kernel]
depth: 12.1
mpeg: 17.9
qrd: 12.5
dct: 23.9
convolve: 25.6
fft: 7.0
Power
[Figure: stacked bar chart of power in watts (axis 0.0 to 3.0) per benchmark. Legend: Other, Mem Sys, Pins, SRF, Clust, Clock. Shares shown: 24%, 63%, 5%, 5%, 1%, 2%]
GOPS/W
depth: 4.6
mpeg: 6.9
qrd: 4.1
dct: 10.2
convolve: 9.6
fft: 2.4
average: 6.3
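The power-efficiency figures can be cross-checked against the performance figures, since watts = GOPS / (GOPS/W). A small Python sketch; it assumes the values on the Performance and Power slides pair with the benchmark names in the order listed, and the function names are illustrative.

```python
GOPS = {"depth": 12.1, "mpeg": 17.9, "qrd": 12.5,
        "dct": 23.9, "convolve": 25.6, "fft": 7.0}
GOPS_PER_WATT = {"depth": 4.6, "mpeg": 6.9, "qrd": 4.1,
                 "dct": 10.2, "convolve": 9.6, "fft": 2.4}


def implied_watts(gops, gops_per_watt):
    """Power implied by dividing performance by power efficiency."""
    return {name: gops[name] / gops_per_watt[name] for name in gops}


def mean_efficiency(gops_per_watt):
    """Arithmetic mean of the per-benchmark GOPS/W values."""
    return sum(gops_per_watt.values()) / len(gops_per_watt)
```

Under that pairing, every benchmark implies between roughly 2.3 W and 3.0 W, which fits the 0 to 3.0 W axis of the chart, and the mean of the six efficiency values rounds to the 6.3 GOPS/W shown as the average.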
Relative Performance and Power Efficiency
FFT Performance
[Figure: bar chart, axis 0 to 8000, labeled GOPS on the slide; Imagine's 7000 corresponds to its 7.0 GOPS FFT result, so the values read as MOPS. Legend: Programmable, Special-Purpose, Imagine]
Jaguar II: 7640
Imagine: 7000
DSP-224: 5120
PULSAR: 1830
'C67 DSP: 412
Power Efficiency
[Figure: bar chart in GOPS/W (axis 0.0 to 2.5). Legend: Dhrystone, FFT]
Imagine: 2.4
AD 21160: 0.28
TI 'C6701: 0.22
SA-1100: 1.2
Imagine Floorplan
Tapeout ~Q2 ’01
21 million transistors
– 6M SRF SRAM
– 6M UC SRAM
– 6M Clusters
– 3M Other
Target: 32 FO4
– 300 MHz at SSSS
– 500 MHz at TTSS
TI GS30KA:
– 0.15 µm Ldrawn
457 Signal Pins
[Figure: 12 mm x 12 mm floorplan. Blocks: micro-controller, ALU clusters 0 through 7, stream register file, memory banks 0 through 3, address generators, stream controller, host interface, network interface, JTAG/BIST]
Imagine Team
William J. Dally
Ujval Kapasi
Brucek Khailany
Peter Mattson
Jinyung Namkoong
John Owens
Ben Serebrin
Brian Towles
Scott Rixner
Don Alpert (Intel)
Ghazi Ben Amor
Chris Buehler (MIT)
JP Grossman (MIT)
Brad Johanson
Abelardo Lopez-Lagunas
Ben Mowery
Manman Ren
Conclusions
Media Processing
– Little data reuse
– Highly data parallel
– Compute intensive
VLSI
– Stream register organization
– Bandwidth hierarchy
Imagine
– Stream architecture
– 10 GOPS sustained application performance
– 5 GOPS/W application power efficiency
[Figure: C SIMD clusters of N/C arithmetic units each]