46
A qualitative analysis of the benefits of LUTs, Processors, embedded memory and interconnect in MPSoC platforms Kees Vissers Xilinx Research

mpsoc2004 vissers final - MPSoC 2018 · MPsoC, July 5 - 9, 2004 Kees Vissers 25 ... [bytestream] To Off-Chip Comp. Buffer 32 1 5 2 1 5 2 BRAMs 2 x 1024 x 4b ... Simulink and MATLAB

Embed Size (px)

Citation preview

A qualitative analysis of the benefits of LUTs, Processors, embedded memory and interconnect in MPSoCplatforms

Kees VissersXilinx Research

Kees VissersMPsoC, July 5 - 9, 2004 2

OUTLINE

• Historical Perspective• Conventional FPGAs• Applications and Programming• Future Directions

Kees VissersMPsoC, July 5 - 9, 2004 3

FPGA, the last 10 years

1/91 1/92 1/93 1/94 1/95 1/96 1/97 1/98 1/99 1/00 1/01 1/02 1/03 1/04

Year

Price: 300x

Speed: 20x

Capacity: 200x

1

10

100

1000

CapacitySpeedPrice Virtex-II Pro

10x

1x

Spartan-3

Virtex-II Pro

MemoryBandwidth

4000x- 6000xcustomer value!

Kees VissersMPsoC, July 5 - 9, 2004 4

The market

Num

ber o

f Des

igns

(tho

usan

ds)

25

50

75

100

1999 2000 2001 2002 2003 2004 2005

FPGA

ASIC

• Design Starts$

Mill

ions

010,

20,

40,

50,

60,

70,

2001 2002 2003 2004 2005 2006 2007

ASICASSP

→ 2007 ASIC/ASSP = 80 B$→ NRE = 30M$→ R&D = 20% revenue→ Revenue = 150M$→ ~500 successful ASSP/ASIC

30,

Kees VissersMPsoC, July 5 - 9, 2004 5

The ultimate solution

• Glueless interfaces• Can handle all external memory, e.g DDR RAM,

QDR RAM, SRAM• Can do all the processing• Is fully programmable/reconfigurable in all

aspects: I/O, function, memory architecture, signal level, etc

ONLY Platform FPGAs can do this.

Kees VissersMPsoC, July 5 - 9, 2004 6

Processors: Itanium 2 6M

• 130 W (107 W typical)• 3 Levels of Cache: 16k + 16k, 256K, 6M• 130nm• 1.5 GHz• 95 percent of the 7,877 pins are for power • 1,322 Specint base2000 for a single processor!• Next generation: 1.7GHz and 9MByte cache!Ref: IEEE Micro April 2004, vol. 24, Issue 2, pp. 10-18: ITANIUM 2 PROCESSOR 6M: HIGHER FREQUENCY AND

LARGER L3 CACHE, Rusu et. al, Intel.

Kees VissersMPsoC, July 5 - 9, 2004 7

Processors: DSP

• TI C6414: 2 levels of cache• Philips PNX1500 (Trimedia core): 2 levels of

cache• Often specific DMA, Bus architecture and

optimized main memory architecture: does not easily fit the C programming paradigm.

Conclusion: the memory model is a problem

Kees VissersMPsoC, July 5 - 9, 2004 8

ALU

A Von Neumann-style Computation Example

256 Tap FIR Filter Example

256 Tap FIR Filter yn = Σ ckxn-kK=0

255

RegisterFile

r1 =

x0

r0 =

r1*

#c0

r1 =

x1

r0 =

r1*

#c1

...

...

...

...

...

y0 =

r1

...

r0 =

#0

x0

x1

x2

x3

...

x253

x2

54

x255

y0

InstructionDecode

Bottle-necks!

Kees VissersMPsoC, July 5 - 9, 2004 9

• For highest speed, use as many compute elements as problem allows.

• Capable of producing an output sample in one “instruction cycle”.

• Call this “Spatial computing”. (see DeHon)

FIR Implementation in RC

....

c255

xn

ynReg Reg

c254 c1Reg

c0

....

Kees VissersMPsoC, July 5 - 9, 2004 10

Full Parallel

Semi-Parallel

Serial

• Tailor a solution to optimize throughput, precision and cost

Reg Reg

Customizing Architecture in RC

Kees VissersMPsoC, July 5 - 9, 2004 11

Architectural EfficiencySpatial vs. Temporal Computing

FPGA (c=w=1) “Processor” (c=1024, w=64)

Figures courtesy of André DeHon, California Institute of Technology

Kees VissersMPsoC, July 5 - 9, 2004 12

OUTLINE

• Historical Perspective• Conventional FPGAs• Applications and Programming• Technology Impact• Financial Impact• Future Directions

Kees VissersMPsoC, July 5 - 9, 2004 13

4

16 words x 1 bit m

emory

[A,B,C,D]

F

• A 4-input lookup table(LUT) can implement any function of 4 inputs.

• For example, a 1-bit adder needs 2 LUTs:

A⊕B⊕Ci

A.B.Ci

AB

Ci

Co

S

Building an FPGA: Logic First

Kees VissersMPsoC, July 5 - 9, 2004 14

Out

In4

FF

CE RST

M16 words x 1 bit m

emory

M

M

ClkCERst

Add FF to make a Logic Cell

Kees VissersMPsoC, July 5 - 9, 2004 15

4

FF

CE RST

M16 words x 1 bit m

emory

Carry M

M

M

DinWECin

Cout

• Make LUT RAM a user resource.

• Fast carry ripple to neighbor.

Arithmetic, Distributed RAM

Kees VissersMPsoC, July 5 - 9, 2004 16

4

4

4

4

4

4

4

4

40

• Group logic cells to reduce overhead.

• Add H, V routing channels with switchboxes.

• Add input, output MUXing between logic and routing.

Add Interconnect

Kees VissersMPsoC, July 5 - 9, 2004 17

4

4

4

4

4

4

4

4

40

4

4

4

4

4

4

4

4

40

4

4

4

4

4

4

4

4

40

4

4

4

4

4

4

4

4

40

Build an Array

Kees VissersMPsoC, July 5 - 9, 2004 18

Putting the ‘R’ in RC • Fine-grained FPGAs are the platform of choice for

Reconfigurable Computing.

State

Configuration

User Logic Configuration RAM

Kees VissersMPsoC, July 5 - 9, 2004 19

Add Bells & Whistles HardProcessor

I/O

BRAM

Gigabit Serial

Multiplier

ProgrammableTermination

Z

VCCIO

Z

Z

ImpedanceControl Clock

Mgmt

18 Bit

18 Bit36 Bit

Kees VissersMPsoC, July 5 - 9, 2004 20

Problem definition

• Perform processing on a stream of data• Sampling rates in the order of 100KHz to 100MHz• per sample 1000 – 100,000 operationsPerform 1010-1011 operations/sec, like add, multiply

etc.

Note: performance is a required part of the solution!

Kees VissersMPsoC, July 5 - 9, 2004 21

So when do I need what?

• Processor + Cache• LUT• Embedded Memory• Bus• Reconfigurable interconect• Internal Memory• External Memory• I/O

Kees VissersMPsoC, July 5 - 9, 2004 22

OUTLINE

• Historical perspective• Conventional FPGAs• Applications and Programming• Technology and Financial Impact• Future Directions

Kees VissersMPsoC, July 5 - 9, 2004 23

JPEG2000 Encoder: Functional Block Diagram

2-D DWT

BitModele

r

Inputdata

Compresseddata“MQ”

Entropy

Coder

Header/Stream Creatio

n

Tier-2CodingTier-1 Coding

Quantizer

DC-Shift

&Comp.Trans.

⇑MQ Coder:

Create codewordsusing arithmetic

coding

Kees VissersMPsoC, July 5 - 9, 2004 24

87.8%

6.6%4.0%1.7%

Interface2-D DWTTier-1 CodingTier-2 Coding

JPEG2000: Run-Time Profiling

* Run on XRL’s JPEG2000 ANSI C code

Kees VissersMPsoC, July 5 - 9, 2004 25

Tier 1 Coder

JPEG2000 Encoder on Multimedia Board

ZBT Memor

y

Multi-Memory / Multi-Port ZBT Interface

Frame Capture

Multi-Pass DWT

Buffer 1 Buffer 2 Buffer 3

RGB

DMA 32 bit

MicroBlaze/ OPB

Bus

Processor

Memory INSTRU

CT

32 32

Data OPB Bus

Interrupt request (IR)IR IR

32Arbite

r/ Bridge

Tier 1 LogicTier 1 CoderTier-1

CoderYCrCb

ZBT Memor

y

ZBT Memor

y

Processor

Memory DATA32

3232

Kees VissersMPsoC, July 5 - 9, 2004 26

Tier-1 Coder: Block Diagram[Original Design]

BRAMs4096 x

Nb

[sign-magnitud

e]

FIFO2048 x

8b

From Off-Chip DWT Buffer

“MQ”Arithmetic Coder

Bit Modeling

FIFO1024 x

32b

[bytestream]

To Off-Chip Comp. Buffer

32

1

5

2

1

5

2

BRAMs2 x

1024 x 4b

BRAM1024 x

16b

Tier-1 Coding Control

Dist. ROM

47 x 29b

Probability/Next State Table

Significance/ Process Flags

Magnitude Bit Planes

D

CX

Flags

Dist. RAM

19 x 7b

Index/MPS Table

MemoryFPGA Fabric

BRAM1024 x

4b

Sign-Mag

Convert&

Sticky Bits

N-1

NN N

Sign Bit Plane

Kees VissersMPsoC, July 5 - 9, 2004 27

Technology ApplicationPerformance

Input Data Rate(Msamples/sec)

CompressionRatio

SystemClock Rate

(MHz)FPGA1 640 x 480

@ 30 fps18.4 10:1 54

DSPProcessor2 352 x 240

@ 8 fps

1.35 10:1 600

ASIC3 720 x 480@ 30 fps

20.7 N/A 27

Sources: 1Schumacher 2Aware Inc. 3Yamauchi

2-D DWT Arithmetic Encoder

Bit Modeler

JPEG2000 Example

Kees VissersMPsoC, July 5 - 9, 2004 28

Power Dissipation (in Watts)

Technology Device ExternalMemory Total

Relative PowerEfficiency(Relative

MOPS/mW)FPGA 3.5 3.5 7.0 1DSP

Processor 1.7 2.4 4.1 0.13

ASIC 2.0 0.2 2.2 3.6

• FPGA has 8x efficiency of processor– 14 x throughput– 1/10 x system clock

• More on-board memory would help. Look for evolving memory hierarchies.

Power Efficiency Results

Kees VissersMPsoC, July 5 - 9, 2004 29

Job Farm w/ Eight MicroBlazes

MicroBlaze

0

UARTOPBTimer

BlockRam

HWSemaphore

MicroBlaze

1

BlockRam

MicroBlaze

2

BlockRam

MicroBlaze

3

BlockRam

MicroBlaze

4

BlockRam

MicroBlaze

5

BlockRam

MicroBlaze

6

BlockRam

MicroBlaze

7

BlockRam

ZBT Memory

Job Farm InformationJob 1 Job 2 Job 3 Job 4Job 5 Job 6 • • • Job N

ImageData

Kees VissersMPsoC, July 5 - 9, 2004 30

Multiple MicroBlaze DWT Results

1

1.5

2

2.5

3

3.5

4

1 2 3 4 5 6 7 8

Processors

Spee

dup

Kees VissersMPsoC, July 5 - 9, 2004 31

Job Farm

• Make sure the ‘bus’ or the internal network can handle the required bandwidth

• Make sure that the latency is under control: drive towards multi-threading/multi context

• The granularity of the communication matters

Kees VissersMPsoC, July 5 - 9, 2004 32

OUTLINE

• Historical perspective• Conventional FPGAs• Applications and Programming• Future Directions

Kees VissersMPsoC, July 5 - 9, 2004 33

System Generator for DSP• Library-based, visual data flow• Polymorphic operators• Arbitrary precision fixed-point• Bit and cycle true modeling

• Seamlessly integrated with Simulink and MATLAB

– Test bench and data analysis

• Automatic code generation– Synthesizable VHDL– IP cores– HDL test bench– Project and constraint files

Kees VissersMPsoC, July 5 - 9, 2004 34

System Generator for DSP

System Generator for DSP v3.1 Allows Systems Designers to target external simulation engines- Hardware in the loop- HDL co-simulation

System Generator for DSP v3.1 Allows Systems Designers to target external simulation engines- Hardware in the loop- HDL co-simulation

HDL co-simulation using ModelSimAllows the user automatically invoke ModelSim and simulate his Verilog/VHDLdirectly from Simulink.

HDL co-simulation using ModelSimAllows the user automatically invoke ModelSim and simulate his Verilog/VHDLdirectly from Simulink.

Hardware in the loopco-simulationAllows the user to simulate part of his system on actual hardware. This means acceleration & verification

Hardware in the loopco-simulationAllows the user to simulate part of his system on actual hardware. This means acceleration & verification

Kees VissersMPsoC, July 5 - 9, 2004 35

Documented Reference Designs

• 16-QAM receiver, including LMS based equalizer and carrier recovery loop

• A/D and delta-sigma D/A conversion• Concatenated FEC codec for DVB• Custom FIR filter reference library• Digital down converter for GSM• 2D discrete wavelet transform (DWT)

filter• 2D filtering using a 5x5 operator• Color space conversion• CORDIC reference design• Polyphase MAC-based FIR• Streaming FFT/IFFT• BER Tester using AWGN model

Kees VissersMPsoC, July 5 - 9, 2004 36

OUTLINE

• Historical Perspective• Conventional FPGAs• Applications and Programming• Technology and Financial Impact• Future Directions

Kees VissersMPsoC, July 5 - 9, 2004 37

The mix revisited

• Processor + Cache• LUT• Embedded Memory• Bus• Reconfigurable interconect• Internal Memory• External Memory• I/O

Kees VissersMPsoC, July 5 - 9, 2004 38

Virtex-4 LXLogic Platform

Virtex-4 SXDSP Platform

Virtex-4 FXConnectivity

Platform

The Future : Domain Optimized Programmable Platforms

• Programmable – mass market of one

• Parallelism – performance and power

• Scalable – ride Moore’s Law

• Distributed memory – data transfer bottleneck

• Regular – manufacturable

• Domain optimized – cost

• Hyper-programming – more degrees of freedom, – spatial computing

Processing DomainCompute and IO

performance

Logic DomainLogic density

DSP DomainDSP performance

Kees VissersMPsoC, July 5 - 9, 2004 39

The Characteristics

500MHz DCM DigitalClock Management

500MHz FlexibleLogic Architecture

500MHz multi-portDistributed RAM

500MHzProgrammable DSP

Execution Units

450MHz PowerPC™Processor

WithAuxiliary Processor Unit

1Gbps DifferentialI/O

0.6-11.1GbpsSerial Transceivers

Kees VissersMPsoC, July 5 - 9, 2004 40

The tool we need

Decode H

eaders(V

OL, V

OP)

DM

UX

Motion Decoder

TextureVL

DecoderVLD

1

1

1

8FIFO

FIFO

ObjectFIFO

Motion Vectors

InverseScan Inverse

Quantisation / IDCT

8

Inverse AC DC

Prediction

Coeffs and DC

MotionCompensation

ObjectFIFOMotion

Vectors8×8

Block

Not Coded

8x8 Block

8×8 Block

Current FrameMemory

Controller

External SRAM

Current Frame

8×8 Block

8×8 Block

Display Controller

MPEG-4 Decoder

InputInputstream stream

Output Image Output Image Blocks Blocks

ObjectFIFO

Software Orchestrator

DCT Coeff

FIFO

Kees VissersMPsoC, July 5 - 9, 2004 41

The author gratefully acknowledges the contributions of material, insight and feedback from:

David Parlour, Steve Trimberger, Robert Turney, Paul Schumacher, Rick Ballantyne, Mark Paluszkiewicz-Xilinx Research Labs

André DeHon - California Institute of Technology

Kristof Denolf, Adrian Chirila-Rus, Bart Vanhoof, Jan Bormans - IMEC

Acknowledgements

Kees VissersMPsoC, July 5 - 9, 2004 42

University donations

•http://www.xilinx.com/univ/xup/ubroch/qform.htm

•http://www.xilinx.com/univ/ML310/ml310_mainpage.html

Kees VissersMPsoC, July 5 - 9, 2004 43

What is it?

Kees VissersMPsoC, July 5 - 9, 2004 44

Box

Kees VissersMPsoC, July 5 - 9, 2004 45

Board

Kees VissersMPsoC, July 5 - 9, 2004 46

Block Diagram