
Vector Architectures Vs. Superscalar and VLIW

for Embedded Media Benchmarks

Christos Kozyrakis David Patterson

Stanford University U.C. Berkeley

http://csl.stanford.edu/~christos


Motivation

• Ideal processor for embedded media processing
  – High performance for media tasks
  – Low cost: small code size, low power consumption, highly integrated
  – Low power consumption (for portable applications)
  – Low design complexity
  – Easy to program with HLLs
  – Scalable
• This work
  – How efficient is a simple vector processor for embedded media processing?
    • No cache, no wide issue, no out-of-order execution
  – How does it compare to superscalar and VLIW embedded designs?


Outline

• Motivation

• Overview of VIRAM architecture

– Multimedia instruction set, processor organization, vectorizing compiler

• EEMBC benchmarks & alternative architectures

• Evaluation

– Instruction set analysis & code size comparison

– Performance comparison

– VIRAM scalability study

• Conclusions


VIRAM Instruction Set

• Vector load-store instruction set for media processing

– Coprocessor extension to MIPS architecture

• Architecture state

– 32 general-purpose vector registers

– 16 flag registers

– Scalar registers for control, addresses, strides, etc.

• Vector instructions

– Arithmetic: integer, floating-point, logical

– Load-store: unit-stride, strided, indexed (see the C sketch below)

– Misc: vector & flag processing (pop count, insert/extract)

– 90 unique instructions
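To make the three load-store patterns above concrete, here is a small illustrative C sketch (mine, not from the original slides); each function corresponds to one access pattern that a vector load-store ISA can express with a single instruction per strip of elements.

    #include <stddef.h>

    /* Unit-stride access: consecutive elements, one vector load/store per strip. */
    void scale(float *x, float s, size_t n) {
        for (size_t i = 0; i < n; i++)
            x[i] = s * x[i];
    }

    /* Strided access: every stride-th element, e.g. one column of a row-major image. */
    void scale_column(float *img, size_t stride, float s, size_t rows) {
        for (size_t i = 0; i < rows; i++)
            img[i * stride] = s * img[i * stride];
    }

    /* Indexed access (gather): addresses come from an index vector,
       e.g. a lookup table used in color conversion. */
    void gather_lut(const float *table, const int *idx, float *out, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = table[idx[i]];
    }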


VIRAM ISA Enhancements

• Multimedia processing

– Support for multiple data-types (64b/32b/16b)

• Element & operation width specified with control register

– Saturated and fixed-point arithmetic

• Flexible multiply-add model without accumulators

– Simple element permutations for reductions and FFTs

– Conditional execution using the flag registers (see the C sketch below)

• General-purpose systems

– TLB-based virtual memory

• Separate TLB for vector loads & stores

– Hardware support for reduced context switch overhead

• Valid/dirty bits for vector registers

• Support for “lazy” save/restore of vector state
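As an illustration of the source-level patterns these features target (a minimal C sketch of mine, not from the slides, with made-up names): the clamp below expresses 16-bit saturating arithmetic, and the data-dependent conditional is the kind of branch that flag-register (masked) execution keeps inside vector code.

    #include <stdint.h>
    #include <stddef.h>

    /* Saturate a 32-bit intermediate result to the 16-bit range. */
    static inline int16_t sat16(int32_t x) {
        if (x > INT16_MAX) return INT16_MAX;
        if (x < INT16_MIN) return INT16_MIN;
        return (int16_t)x;
    }

    /* Hypothetical kernel: mix two 16-bit sample streams, but only where
       the first stream exceeds a threshold (a data-dependent conditional). */
    void mix_with_gate(const int16_t *a, const int16_t *b, int16_t *out,
                       size_t n, int16_t threshold) {
        for (size_t i = 0; i < n; i++) {
            int32_t sum = (int32_t)a[i] + (int32_t)b[i];   /* widen, then add */
            out[i] = (a[i] > threshold) ? sat16(sum) : b[i];
        }
    }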

VIRAM Processor Microarchitecture

VIRAM Chip

[Die floorplan: MIPS scalar core with I/O FPU, vector & DRAM control, four vector lanes, and eight embedded DRAM banks]

• 0.18µm CMOS (IBM)
• Features
  – 8KB vector register file
  – 2 × 256-bit integer ALUs
  – 1 × 256-bit FPU
  – 13MB DRAM
• 325mm² die area
• 125M transistors
• 200 MHz, 2 Watts
• Peak vector performance (see the arithmetic check below)
  – Integer: 1.6/3.2/6.4 Gop/s (64b/32b/16b)
  – Fixed-point: 2.4/4.8/9.6 Gop/s (64b/32b/16b)
  – FP: 1.6 Gflop/s (32b)
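A quick arithmetic check on the peak figures (my calculation, not from the original slides), assuming the two 256-bit integer ALUs and the single 256-bit FPU are all busy every cycle at 200 MHz:

    Integer, 64b: 2 ALUs × (256/64) elements × 0.2 GHz = 1.6 Gop/s
    Integer, 32b: 2 ALUs × (256/32) elements × 0.2 GHz = 3.2 Gop/s
    Integer, 16b: 2 ALUs × (256/16) elements × 0.2 GHz = 6.4 Gop/s
    FP, 32b:      1 FPU  × (256/32) elements × 0.2 GHz = 1.6 Gflop/s

These match the quoted peaks; the fixed-point peaks are 1.5× the corresponding integer figures.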


Vectorizing Compiler

• Based on Cray PDGCS compiler

– Used with all vector and MPP Cray machines

• Extensive vectorization capabilities

– Including outer-loop vectorization

• Vectorization of narrow operations and reductions (see the C sketch after the diagram)

[Compiler flow: C, C++, and Fortran95 front ends feed Cray's PDGCS optimizer, whose code generators target the T3D/T3E, the C90/T90/SV1, and the X1 (SV2) and VIRAM]
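The C sketch below (illustrative only, not taken from the EEMBC sources) shows the two loop shapes mentioned above: an inner reduction over narrow 16-bit data and an outer loop over lags, as in an autocorrelation-style kernel; a compiler with reduction and outer-loop vectorization support can vectorize code like this without source changes.

    #include <stdint.h>
    #include <stddef.h>

    /* Autocorrelation-style kernel (simplified, hypothetical).
       Inner loop: a sum-of-products reduction over 16-bit samples.
       Outer loop over lags: a candidate for outer-loop vectorization. */
    void autocorr(const int16_t *x, size_t n, int32_t *r, size_t nlags) {
        for (size_t lag = 0; lag < nlags; lag++) {
            int32_t acc = 0;
            for (size_t i = 0; i + lag < n; i++)
                acc += (int32_t)x[i] * (int32_t)x[i + lag];
            r[lag] = acc;
        }
    }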


EEMBC Benchmarks

• The de facto industry standard for embedded CPUs
• Used the consumer & telecommunications categories
  – Representative of the workload of multimedia devices with wireless/broadband capabilities
  – C code, EEMBC reference input data
• Consumer category
  – Image processing tasks for digital camera devices
  – Rgb2cmyk & rgb2yiq conversions, convolutional filter, JPEG encode & decode
• Telecommunications category
  – Encoding/decoding tasks for DSL/wireless
  – Autocorrelation, convolutional encoder, DSL bit allocation, FFT, Viterbi decoding


EEMBC Metrics

• Performance: repeats/second (throughput)

– Use geometric means to summarize scores (see the sketch below)

• Code size and static data size in bytes

– Data sizes are the same for the processors we discuss

• Pitfall with caching behavior

– Fundamentally, no temporal locality in most benchmarks

– Repeating the kernel on the same (small) data creates locality

– Unfair advantage for cache-based architectures

• VIRAM has no data cache for vector loads/stores
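For reference, the summary score is the geometric mean of the per-benchmark throughputs; a minimal C sketch of that calculation (my own illustration, not EEMBC code) is:

    #include <math.h>
    #include <stddef.h>

    /* Geometric mean of n positive scores, computed as exp(mean(log(x)))
       to avoid overflowing the raw product. */
    double geometric_mean(const double *scores, size_t n) {
        double log_sum = 0.0;
        for (size_t i = 0; i < n; i++)
            log_sum += log(scores[i]);
        return exp(log_sum / (double)n);
    }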


Embedded Processors

Architecture            Chip          Issue Width   OOO   Cache L1I/L1D/L2 (KB)   Clock (MHz)   Power (W)
VelociTI (VLIW+DSP)     TMS320C6203   8             –     96 / 512 / –            300           1.7
Trimedia (VLIW+SIMD)    TM1300        5             –     32 / 16 / –             166           2.7
MIPS (RISC)             VR5000        2             –     32 / 32 / –             250           5.0
PowerPC (RISC)          MPC7455       4             √     32 / 32 / 256           1000          21.3
x86 (CISC)              K6-IIIE+      3             √     32 / 32 / 256           550           21.6
VIRAM (Vector)          VIRAM         1             –     8 / – / –               200           2.0


Degree of Vectorization

[Bar chart: breakdown of dynamic operations into vector vs. scalar operations (0-100%) for Rgb2cmyk, Rgb2yiq, Filter, Cjpeg, Djpeg, Autocor, Convenc, Bital, Fft, and Viterbi]

• Typical degree of vectorization is 90%
  – Even for Viterbi decoding, which is partially vectorizable
  – ISA & compiler can capture the data parallelism in EEMBC
  – Great potential for vector hardware


Vector Length

[Bar chart: average and maximum vector length (in elements, 0-120) per benchmark, against the maximum length supported by the hardware; 16-bit operations for Rgb2cmyk, Filter, Cjpeg, Djpeg, Convenc, and Viterbi; 32-bit operations for Rgb2yiq, Cjpeg, Djpeg, Autocor, Bital, and Fft]

• Short vectors
  – Unavoidable with Viterbi; an artifact of the code with Cjpeg/Djpeg
• Even the shortest-length operations have ~20 16-bit elements
  – More parallelism than 128-bit SSE/Altivec can capture


Code Size

• VIRAM's high code density is due to
  – No loop unrolling or software pipelining, compact expression of strided/indexed/permutation patterns, no small loops
• Vectors: a “code compression” technique for RISC (illustrated below)
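To illustrate why vector code is denser (my example, not from the slides): for a kernel like the one below, a vector compiler emits a short strip-mined loop with roughly one vector load, one multiply-add, and one store, whereas a scalar or VLIW compiler that unrolls and software-pipelines the loop replicates the body many times in the binary.

    /* saxpy-style kernel (hypothetical example).
       Vector code: a handful of static instructions in a strip-mined loop.
       Unrolled/software-pipelined scalar or VLIW code: the body is duplicated
       per unrolled iteration, inflating static code size. */
    void saxpy(float a, const float *x, float *y, int n) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }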

[Bar chart: normalized code size for the consumer and telecommunications categories, comparing VIRAM, x86, PowerPC, MIPS, VLIW (cc), and VLIW (opt)]


Comparison: Performance

[Bar chart: EEMBC performance of VIRAM, PowerPC, VLIW (cc), and VLIW (opt) for each benchmark (Rgb2cmyk, Rgb2yiq, Filter, Cjpeg, Djpeg, Autocor, Convenc, Bital, Fft, Viterbi) and their geometric mean; y-axis 0-10, with a few off-scale bars labeled with their values]

• 200MHz, cache-less, single-issue VIRAM is

– 100% faster than 1-GHz, 4-way OOO processor

– 40% faster than manually-optimized VLIW with SIMD/DSP

• VIRAM can sustain high computation throughput

– Up to 16 (32-bit) to 32 (16-bit) arithmetic ops/cycle

• VIRAM can hide latency of accesses to DRAM


Comparison: Performance/MHz

[Bar chart: EEMBC performance per MHz of VIRAM, PowerPC, VLIW (cc), and VLIW (opt) for each benchmark and their geometric mean; y-axis 0-25, with a few off-scale bars labeled with their values]

• VIRAM vs. 4-way OOO & VLIW with compiler-generated code

– 10x-50x with long vectors, 2x-5x with short vectors

• VIRAM vs. VLIW with manually optimized code

– VLIW within 50% of VIRAM

– VIRAM would benefit from the same optimizations

• Similar results if normalized by power or complexity


VIRAM Scalability

• Same executable, frequency, and memory system

• Decreased efficiency for 8 lanes

– Short vector lengths, conflicts in the memory system

– Difficult to hide overhead with short execution times

• Overall: 2.5x with 4 lanes, 3.5x with 8 lanes

– Can a superscalar processor be scaled similarly?

[Bar chart: performance with 1, 2, 4, and 8 vector lanes for each benchmark (Rgbcmyk, Rgbyiq, Filter, Cjpeg, Djpeg, Autocor, Convenc, Bital, Fft, Viterbi); geometric-mean speedups of 1.0x, 1.7x, 2.5x, and 3.5x respectively]


Conclusions

• Vector architectures are a great match for embedded multimedia processing
  – Combine high performance, low power, and low complexity
  – Add a vector unit to your media processor!
• VIRAM code density
  – Similar to x86, 5-10 times better than optimized VLIW
• VIRAM performance (with compiler vectorization and no hand-tuning)
  – 2x the performance of a 4-way OOO superscalar
    • Even if the OOO design runs at 5x the clock frequency
  – 50% faster than manually-optimized 5- to 8-way VLIW
    • Even if the VLIW has hand-inserted SIMD and DSP support