Swedish INTELECT Summer School on Multiprocessor Systems on Chip Distributed Memory and Datastream-based Reconfigurable Computing Reiner Hartenstein Kaiserslautern University of Technology Örebro, Aug. 25-27, 2003 11.00 – 13.00 hrs


Page 1: Distributed Memory and Datastream-based Reconfigurable Computing

Swedish INTELECT Summer School on Multiprocessor Systems on Chip

Distributed Memory and Datastream-based Reconfigurable Computing

Reiner Hartenstein

Kaiserslautern University of Technology

Örebro, Aug. 25-27, 2003, 11.00 – 13.00 hrs

Page 2: Distributed Memory and Datastream-based Reconfigurable Computing

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

Semiconductor Revolutions

“Mainstream Silicon Application is switching every 10 Years”

“The Programmable System-on-a-Chip is the next wave“

[Figure: Makimoto’s Wave, swinging between "standard" and "custom" in ten-year steps from 1957 to 2007. Wave labels: TTL; LSI, MSI; µproc., memory; ASICs, accel’s; reconfigurable. Annotations: Published in 1989; vN machine paradigm; anti machine paradigm.]

Page 3: Distributed Memory and Datastream-based Reconfigurable Computing

What’s the next Wave ?

[Figure: Makimoto’s Wave (standard vs. custom, 1957 to 2007, with FPGAs at 2007) followed by a "?". Annotations: vN machine paradigm; anti machine paradigm.]

Tredennick’s Paradigm Shifts:
• hardwired: algorithm fixed, resources fixed
• procedural programming: algorithm variable, resources fixed
• structural programming: algorithm variable, resources variable

Hartenstein’s Curve: Coarse grain RAs; "? 4th wave ?"; "no further wave !"

Page 4: Distributed Memory and Datastream-based Reconfigurable Computing

Mainstream Markets

[Figure: Makimoto’s Wave (standard vs. custom, 1957 to 2007; TTL; LSI, MSI; µproc., memory; ASICs, accel’s; reconfigurable) overlaid with the mainstream markets: mainframes, PC, "?". Annotations: data streams ...; morphware; free riders (Trittbrettfahrer); technology issue and business model; "here?".]

Page 5: Distributed Memory and Datastream-based Reconfigurable Computing

The Impact of Makimoto’s Paradigm Shifts

[Figure: Makimoto’s Wave (standard vs. custom, 1957 to 2007; TTL; LSI, MSI; µproc., memory; ASICs, accel’s; reconfigurable), annotated as follows.]

• Personalization (CAD) before fabrication
• Procedural personalization via RAM-based Machine Paradigm
• structural personalization: RAM-based, before run time

Software Industry’s Secret of Success

Repeat Success Story by new Machine Paradigm !

Dr. Makimoto: FPL 2000 keynote

Page 6: Distributed Memory and Datastream-based Reconfigurable Computing

Reconfigurable Computing: a second programming domain

Migration of programming to the structural domain

The structural domain has become RAM-based

The opportunity to introduce the structural domain to programmers ...

... to bridge the gap by clever abstraction mechanisms using a simple new machine paradigm

Page 7: Distributed Memory and Datastream-based Reconfigurable Computing


Ubiquitous embedded systems

Embedded System Engineering (ESE) requires:

• Hardware (HW) / (E)Software (ESW) co-design

• Configware (CW) / ESW co-design

• HW / CW / ESW co-design

ESE becomes the main focus in system design:

ESW becomes main vehicle to product differentiation

Page 8: Distributed Memory and Datastream-based Reconfigurable Computing

Coarse grain vs. Fine grain

Reconfigurability:
• fine grain (FPGAs, rGAs)
• multi grain (e. g. by slice bundling)
• coarse grain (PACT AG, Munich)

Page 9: Distributed Memory and Datastream-based Reconfigurable Computing

Makimoto’s 3rd Wave

• Fine Grain Subsystems (FPGAs):
  – 1st half of 3rd wave
  – universal (but less efficient)

• Coarse Grain Subsystems:
  – 2nd half of 3rd wave
  – domain-specific
  – much more flexible than 2nd half of 2nd wave

Page 10: Distributed Memory and Datastream-based Reconfigurable Computing

Principle of a Typical FPGA

[Figure: array of CLBs with routing channels. Labels: CLB; Connection-Point; Tap; FF (each FF is a flipflop of the hidden RAM).]

Page 11: Distributed Memory and Datastream-based Reconfigurable Computing

Routing Overhead in FPGAs

[Figure: FPGA routing fabric; each FF shown is part of the hidden RAM.]

• > 1000 transistors at each cross bar
• > 40 transistors at each switching point
• > 15 transistors at each tap
• most FPGA vendors’ gate count: 1 flipflop of configuration RAM = 4 gates
• Routing Congestion [DeHon]: often 50% or less of CLBs used

Page 12: Distributed Memory and Datastream-based Reconfigurable Computing

Reconfigurability Overhead

[Figure: chip area made up of cells labelled S and L. Labels: resources needed for reconfigurability (partly for configuration code storage); area used by application; “hidden RAM” not shown.]

Page 13: Distributed Memory and Datastream-based Reconfigurable Computing

Reconfigurability Overhead

• Fine Grain morphware platforms:
  – about 1 of 100 transistors serves the application
  – the rest serves reconfigurability

• Coarse Grain platforms:
  – if laid out well by structured VLSI design:
  – area efficiency almost like hardwired designs

Page 14: Distributed Memory and Datastream-based Reconfigurable Computing

Why Coarse Grain instead of FPGA ?

[Figure: Transistors / chip (log scale, 1000 to 100 000 000 000) over the years 1980 to 2010. Curve labels: memory; microprocessor; FPGA physical; FPGA logical; FPGA routed; super systolic (physical, logical). Gap annotations: ~ 10; ~ 10 000. Sources: Proc ISSCC, ICSPAT, DAC, DSPWorld.]

• reduced reconfigurability overhead by up to ~ 1000
• drastically smaller configuration memory
• much faster loading
• a lot more benefits

Page 15: Distributed Memory and Datastream-based Reconfigurable Computing

Throughput vs. Efficiency

[Figure: MOPS / mW (log scale, 0.001 to 1000) over µ feature size (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07). Curve labels: hardwired; FPGAs (reconfigurable logic); rDPAs (reconfigurable computing)*; instruction set processors; standard microprocessor; DSP. Source: T. Claasen et al.: ISSCC 1999. *) R. Hartenstein: ISIS 1997.]

[Inset: 1 Bit CLB with cells S and L (resources needed for reconfigurability vs. area used by application), compared with wiring by abutment: 32 Bit example.]

Page 16: Distributed Memory and Datastream-based Reconfigurable Computing

Throughput vs. Flexibility

[Figure: MOPS / mW (log scale, 0.001 to 1000) over µ feature size (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07). Curve labels: hardwired; FPGAs (reconfigurable logic); rDPAs (reconfigurable computing)*; instruction set processors; standard microprocessor; DSP. Source: T. Claasen et al.: ISSCC 1999. *) R. Hartenstein: ISIS 1997. Inset: wiring by abutment: 32 Bit example.]

[Figure: throughput over flexibility. Labels: hard-wired; von Neumann; FPGAs; coarse grain.]

coarse grain goes far beyond bridging the gap

Page 17: Distributed Memory and Datastream-based Reconfigurable Computing

>> outline <<

• embedded System Design Crisis
• the CS crisis
• datastream-based Computing
• the Anti Machine Paradigm
• application-specific distributed memory
• anti machine architectural resources
• final remarks

http://www.uni-kl.de

Page 18: Distributed Memory and Datastream-based Reconfigurable Computing

Embedded System Design Crisis

[Figure: design cost over year, set against the product life cycle.]

Page 19: Distributed Memory and Datastream-based Reconfigurable Computing

What are the Challenges ? (5)
[ST microelectronics, MorphICs, Dataquest, eASIC]

[Figure: growth factor (1 to 2) over months (0, 10, 12, 18). Doubling-time labels on the curves: 2y, 3y, 4y, 5y, 10y, 30y.]

• Embedded software [DTI* law]
• Communication bandwidth [Hansen’s law]
• Integration density (1.4/year) [Moore’s law]
• design complexity (1.4/year)
• Mask and NRE cost (1.25/year)
• µprocessor integration density (1.2/year)
• designer productivity (1.15/year)
• Memory bandwidth [Patterson’s law] (1.07/year)
• Battery capacity (1.03/year)

*) Department of Trade and Industry, London

new compilation techniques needed ! supported by a new machine paradigm
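The per-year factors quoted on this slide can be turned into doubling times with one line of arithmetic. The sketch below is only an illustration: the function names are invented here, and the dictionary simply repeats the factors printed on the slide. For the factors 1.4, 1.25, 1.2, 1.15 and 1.07 it gives roughly 2, 3, 4, 5 and 10 years.

```python
import math

# Annual growth factors as quoted on the slide (factor per year).
GROWTH_PER_YEAR = {
    "integration density (Moore's law)": 1.4,
    "design complexity": 1.4,
    "Mask and NRE cost": 1.25,
    "µprocessor integration density": 1.2,
    "designer productivity": 1.15,
    "memory bandwidth (Patterson's law)": 1.07,
    "battery capacity": 1.03,
}


def doubling_time_years(growth_per_year: float) -> float:
    """Years until a quantity growing by `growth_per_year` has doubled."""
    return math.log(2.0) / math.log(growth_per_year)


def gap_after(years: float, fast: float, slow: float) -> float:
    """Ratio fast/slow after `years`, with both quantities starting at 1."""
    return (fast / slow) ** years


if __name__ == "__main__":
    for name, factor in GROWTH_PER_YEAR.items():
        print(f"{name:36s} x{factor}/year -> doubles every "
              f"{doubling_time_years(factor):4.1f} years")
    # Design complexity (1.4/year) against designer productivity (1.15/year).
    print("complexity/productivity gap after 10 years:",
          round(gap_after(10, 1.4, 1.15), 1))
```

The second helper shows why the slide calls for new compilation techniques: a quantity growing at 1.4/year pulls away from one growing at 1.15/year by a factor that itself compounds every year.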

Page 20: Distributed Memory and Datastream-based Reconfigurable Computing

The microelectronics spare part problem

[Figure: IC physical life expectancy / years and demand / years of availability, over µ feature size (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07); additional curve: IC market volume. [Hartenstein 2002]]

key problem in many application areas: medical, aerospace, automotive, other transportation, military, industrial equipment controllers, et al.

Page 21: Distributed Memory and Datastream-based Reconfigurable Computing

The microelectronics spare part problem

• Original fab line no longer exists
• ICs do not survive storage time
• Demand: several decades of availability
• e. g. car price: ~25% electronics

[Figure: IC physical life expectancy / years and demand / years of availability, over µ feature size (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07); additional curve: IC market volume. [Hartenstein 2002]]

Page 22: Distributed Memory and Datastream-based Reconfigurable Computing

Mask & NRE cost [ST microelectronics]

Page 23: Distributed Memory and Datastream-based Reconfigurable Computing


Shannon‘s Law

•In a number of application areas throughput requirements are growing faster than Moore's law

•Fundamental flaws in software processor solutions

•32 soft ARM cores fit onto contemporary FPGA

•Data-stream-based distributed processing is the way to go

Page 24: Distributed Memory and Datastream-based Reconfigurable Computing

Foundries: Adoption Rate By Process [Nick Tredennick]

Page 25: Distributed Memory and Datastream-based Reconfigurable Computing

SoC System level Design: Embedded SW (ESW)

• ESE becomes the main focus in system design
• ESW becomes main vehicle to product differentiation
• new design automation from high level descriptions
• HW-(E)SW codesign onto highly programmable platforms (SoC)
• formal verification for (E)SW
• HW-(E)SW co-verification
• SW synthesis included (SoC)

[Slide overlay: annotations extending these items with configware: "CW-", "CW and", "and CW", "(ECW)", "ECW".]

Page 26: Distributed Memory and Datastream-based Reconfigurable Computing

ITRS SoC design cost model [ITRS 2001]

[Figure: SoC design cost, "RTL methodology only" vs. "w. future improvements". Improvement steps marked on the curve: tall thin engineer; small block reuse; large block reuse; IC implementation tools; intelligent test bench; ES level methodology. Annotation: mostly system level issues.]

http://public.itrs.net/Files/2001ITRS/Design.pdf

Page 27: Distributed Memory and Datastream-based Reconfigurable Computing

>> CS crisis <<

Page 28: Distributed Memory and Datastream-based Reconfigurable Computing


„EDA industry shifts into CS mentality“

[Wojciech Maly]

•patches instead of engineering

•innovation stalled many years ago

•netlist-based: do not care about efficiency, ...

•... do not care about transistor density

•85% users hate their tools

Page 29: Distributed Memory and Datastream-based Reconfigurable Computing

Where are we heading ?

[Figure: growth factor (1 to 2) over months (0, 10, 12, 18): Embedded software [DTI* law] compared with (1.4/year) [Moore’s law].]

*) Department of Trade and Industry, London

90% by 2010

10 times more programmers will write embedded applications than computer software by 2010

CS is not prepared: heading toward disaster

Page 30: Distributed Memory and Datastream-based Reconfigurable Computing


Crusty Computing Sciences

[David Padua, John Hennessy]

shrinking supercomputing conferences

more and more efforts yield only marginal improvements

dataflow machines dead

98.5% vN-only

this monopoly is the problem

areas fade away

Page 31: Distributed Memory and Datastream-based Reconfigurable Computing

Dead Supercomputer Society

[Gordon Bell, keynote at ISCA 2000]

•ACRI •Alliant •American Supercomputer •Ametek •Applied Dynamics •Astronautics •BBN •CDC •Convex •Cray Computer •Cray Research •Culler-Harris •Culler Scientific •Cydrome •Dana/Ardent/Stellar/Stardent

•DAPP •Denelcor •Elexsi •ETA Systems •Evans and Sutherland Computer •Floating Point Systems •Galaxy YH-1 •Goodyear Aerospace MPP •Gould NPL •Guiltech •ICL •Intel Scientific Computers •International Parallel Machines •Kendall Square Research •Key Computer Laboratories

•MasPar •Meiko •Multiflow •Myrias •Numerix •Prisma •Tera •Thinking Machines •Saxpy •Scientific Computer Systems (SCS) •Soviet Supercomputers •Supertek •Supercomputer Systems •Suprenum •Vitesse Electronics

Page 32: Distributed Memory and Datastream-based Reconfigurable Computing

CS: young ? dynamic?

... but the von Neumann Paradigm is still the dominant doctrine ...

... still pushing the basic models from the times of mainframe dinosaurs

Microelectronics is ignored (except falling cost of computational effort)

after >10 technology generations ...

• 1st 4004
• 2nd 8008
• 3rd 8086
• 4th 80286
• 5th 80386
• 6th 80486
• 7th P5 (Pentium)
• 8th P6 (Pentium Pro / Pentium II)
• 9th Pentium III
• 10th ....
• 11th .......

... the vN Microprocessor is a Methuselah, the steam engine of the silicon age.

computing sciences are ultra conservative … to avoid saying: senile

A re-orientation is overdue

Page 33: Distributed Memory and Datastream-based Reconfigurable Computing

MPU designs more complex

greatly complicates the verification process

chip-level multiprocessing + simultaneous multithreading

many bugs relate to concurrency issues

new kinds of concurrency are becoming important

Page 34: Distributed Memory and Datastream-based Reconfigurable Computing

MPU performance stalled

Moore’s law will stall soon for MPUs

Bill Gates’ law: relative computation time needed doubles every 2 years

had been compensated by Moore’s law

Page 35: Distributed Memory and Datastream-based Reconfigurable Computing

CS: Lacking Sense of Direction ?

„we are o.k. !“ (no new direction)

blinders: for ignoring the impact of RC

Page 36: Distributed Memory and Datastream-based Reconfigurable Computing


Stealthy CS Crisis

progress in CS stalled by qualification problems in industry and academia

communication barriers between disciplines

severe software quality problems

often hardware people needed to solve CS problems

Page 37: Distributed Memory and Datastream-based Reconfigurable Computing

What’s the problem ?

Traditional CS: programming is (control-)procedural, instruction-stream-based – sources: software

The typical programmer has trouble understanding function evaluation without machine mechanisms ....

.... by signals rippling through a network of transistors.

It’s the gap between the procedural and the structural mind set

[Figure: µprocessor on one side, accelerators on the other.]

Crossing the Hardware / Software Chasm [Mike Butts]

Page 38: Distributed Memory and Datastream-based Reconfigurable Computing

What’s the problem ? (2)

The brain hurts on paradigm shift ?

no, it can’t ...

Brain usage: procedural-only

structural hemisphere missing

[Figure: µprocessor on one side, accelerators on the other.]

Crossing the Hardware / Software Chasm [Mike Butts]

Page 39: Distributed Memory and Datastream-based Reconfigurable Computing

Changing Models of Computing

[Figure: three models of computing.
 “von Neumann”: software design; (procedural) Software is downloaded to RAM; data path, instruction sequencer, I / O.
 host + hardwired accelerator(s): hardware/software co-design; Software is downloaded to RAM; Hardware comes from CAD.
 host + reconf. accelerator(s) (Morphware): configware/software co-design and hardware/configware/software co-design; Software and (structural) Configware are both downloaded to RAM.]

Page 40: Distributed Memory and Datastream-based Reconfigurable Computing

“Programming” Domains

Platform | Personalization (“Programs”) by | Programming Domain | Communication Paths Setup Time
Hardware | CAD | Space | Fabrication Time
Systolic Array | CAD | Time and Space | Fabrication Time
procedural (e.g. “von Neumann”) | Software | Time | Run Time
Morphware | Configware | Space | Compile Time
Embedded Morphware | Configware / Software Co-Compilation | Time and Space | Compile Time and Run Time

Page 41: Distributed Memory and Datastream-based Reconfigurable Computing

Terminology: Digital System Platforms clearly distinguished

platform | source running on it | machine paradigm
hardware | (not running on it) | none
morphware, fine grain: rGA (FPGA) | configware | none
morphware, coarse grain: rDPU, rDPA (reconfigurable data stream processor) | flowware & configware | anti machine
data stream processor (hardwired) | flowware | anti machine
instruction stream processor | software | von Neumann machine

Page 42: Distributed Memory and Datastream-based Reconfigurable Computing

There are more Levels of Parallelism

Loop Level (data-stream-based, pipe nets, etc.)

Instruction Level (VLIW etc.)

Logic Level (FPGAs)

RT Level (special architectures etc.)

Process level

ignored by typical CS people & ignored by CS curricula

Page 43: Distributed Memory and Datastream-based Reconfigurable Computing

Complexity: System Level Design Challenge

“abstraction levels must be raised above present-day RT-level

from HW + (processor-dependent embedded) C code level

language infrastructures for complex models (SystemC etc.)

must be leveraged by industry consensus on use-methodology and abstraction levels”

[ITRS 2001]

Page 44: Distributed Memory and Datastream-based Reconfigurable Computing

>> datastream-based computing <<

Page 45: Distributed Memory and Datastream-based Reconfigurable Computing

Computing in space and time

[Figure: a 3-by-3 coefficient array (a11 ... a33) fed by data streams x1, x2, x3 and producing y1, y2, y3 from the initial values y10, y20, y30. Labels: computing in time; computing in space (systolic arrays etc.); placement; migration by re-timing and other transformations.]

this dichotomy is completely ignored by our CS curricula
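A small sketch may help to make the time/space dichotomy concrete. It assumes that the figure computes y_i = y_i0 + sum_j a_ij * x_j for the 3-by-3 coefficients shown; that reading, and both function names, are assumptions made for illustration only. The first function reuses a single multiply-accumulate unit step by step (computing in time). The second gives every output its own cell and lets the x items stream past (computing in space), so all cells fire in the same step.

```python
from typing import List, Sequence, Tuple


def mac_in_time(a: Sequence[Sequence[int]], x: Sequence[int],
                y0: Sequence[int]) -> Tuple[List[int], int]:
    """One data path unit, reused: every multiply-accumulate is its own step."""
    y = list(y0)
    steps = 0
    for i in range(len(a)):
        for j in range(len(x)):
            y[i] += a[i][j] * x[j]
            steps += 1
    return y, steps


def mac_in_space(a: Sequence[Sequence[int]], x: Sequence[int],
                 y0: Sequence[int]) -> Tuple[List[int], int]:
    """One cell per output: x streams past, all cells work in the same step."""
    y = list(y0)
    steps = 0
    for j, xj in enumerate(x):  # one stream item enters per time step
        # The list comprehension models cells that operate concurrently.
        y = [yi + a[i][j] * xj for i, yi in enumerate(y)]
        steps += 1
    return y, steps


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    x = [1, 0, 2]
    y0 = [10, 20, 30]
    print(mac_in_time(a, x, y0))
    print(mac_in_space(a, x, y0))
```

Both variants return the same result vector; only the step count differs (one step per coefficient versus one step per stream item), which is the trade of time against space that the slide refers to.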

Page 46: Distributed Memory and Datastream-based Reconfigurable Computing

General Stream-based Computing System

heterogeneous Array of rDPUs (reconf. data path units)

[Figure: design flow in four numbered steps (1 to 4), spread over time and space. Labels: expression tree (operands x, a, y; operators +, *); DPU architectures; Mapper (simultaneous placement & routing, simulated annealing); Scheduler; free form pipe network of rDPUs (+, *, sh, xf); data streams.]

The same mapper for both: reconfigurable, or hardwired

Kress DPSS [1995]

Page 47: Distributed Memory and Datastream-based Reconfigurable Computing

flowware defines ....

... which data item at which time at which port

[Figure: a DPA with input data streams entering and output data streams leaving; every stream is drawn as data items (x) and idle slots (-) over the axes time and port #.]

flowware history:

1980: data streams (Kung, Leiserson)

1995: super systolic rDPA (Kress)

1996+: SCCC (LANL), SCORE, ASPRC, Bee (UCB), ...

(tutorials and courses available on all this)
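The statement "which data item at which time at which port" can be written down as a plain schedule. The sketch below is a hypothetical illustration, not the notation of any of the systems listed: a schedule maps (time step, port #) to a data item, and the one-step skew per port is an assumption borrowed from the way systolic arrays are usually fed.

```python
from typing import Dict, List, Tuple

Schedule = Dict[Tuple[int, int], str]  # (time step, port #) -> data item


def skewed_flowware(streams: List[List[str]]) -> Schedule:
    """Port p receives its k-th item at time k + p (one step of skew per port)."""
    return {(k + p, p): item
            for p, stream in enumerate(streams)
            for k, item in enumerate(stream)}


def render(schedule: Schedule) -> List[str]:
    """One text row per time step, '-' where a port is idle."""
    ports = 1 + max(p for _, p in schedule)
    steps = 1 + max(t for t, _ in schedule)
    return [" ".join(schedule.get((t, p), "-") for p in range(ports))
            for t in range(steps)]


if __name__ == "__main__":
    streams = [["x1", "x2", "x3"], ["y1", "y2", "y3"], ["z1", "z2", "z3"]]
    for row in render(skewed_flowware(streams)):
        print(row)
```

Each printed row is one time step and each column one port, which mirrors the time / port # axes of the figure.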

Page 48: Distributed Memory and Datastream-based Reconfigurable Computing

control-procedural vs. data-procedural

Flowware

The structural domain is primarily data-stream-based:

..... mostly not yet modelled that way: most flowware is hidden by its indirect instruction-stream-based implementation

Flowware provides a (data-)procedural abstraction from the (data-stream-based) structural domain

Flowware converts „procedural vs. structural“ into „control-procedural vs. data-procedural“ ...

... a Trojan horse to introduce the structural domain to the procedural mind set of programmers

Page 49: Distributed Memory and Datastream-based Reconfigurable Computing

Configware / Flowware Compilation

[Figure: compilation flow. A high level source program passes through a wrapper into an intermediate form. The mapper produces configware for the rDPA (r. Data Path Array). The scheduler produces flowware for the data sequencer (address generator) of the asM memory banks (M), which exchange data streams with the rDPA.]

students should know that also P & R is a compilation technique

Page 50: Distributed Memory and Datastream-based Reconfigurable Computing

>> the anti machine paradigm <<

Page 51: Distributed Memory and Datastream-based Reconfigurable Computing

Why a dichotomy of machine paradigms?

vN: unbalanced

[Figure (stolen from Bob Colwell): CPU; caches, ...; vN bottleneck.]

data stream machine:

• bad message: caches do not help

• good message: no vN bottleneck

• caches not needed

The anti machine has no von Neumann bottleneck

Page 52: Distributed Memory and Datastream-based Reconfigurable Computing

Terminology: DPU versus CPU ...

• DPU: data path unit
• DPA: DPU array
• GA: gate array
• rDPU: reconfigurable DPU
• rDPA: reconfigurable DPA
• rGA: reconfigurable GA

• DPU is no CPU: there is nothing central - like in a DPA

[Figure: CPU = DPU + instruction sequencer; DPA = array of DPUs; prefix "r" = reconfigurable.]

Page 53: Distributed Memory and Datastream-based Reconfigurable Computing

Machine paradigms

[Figure: two machine paradigms side by side.
 von Neumann instruction stream machine: memory M, I/O, CPU (instruction sequencer + DPU), driven by an instruction stream; programmed by Software.
 data-stream machine: DPU or rDPU, or (r)DPA, surrounded by memory banks M and I/O; every memory has a data address generator (data sequencer), i.e. asM**; driven by data streams; programmed by Flowware (plus Configware if reconf.); distributed memory architecture*.
 Summary row labels: instruction stream; data stream; CPU; DPU; memory; marked "+" and "-".]

*) the new discipline came just in time: see Herz et al.: Proc. IEEE ICECS 2002
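A toy model of the data-stream machine may clarify where the sequencing sits. All names below are hypothetical and the model is a deliberately reduced sketch: each memory bank gets its own address generator (data sequencer), and the DPU is configured once and then driven only by arriving data, without any instruction fetch.

```python
import operator
from typing import Callable, Iterable, Iterator, List


def address_generator(base: int, step: int, count: int) -> Iterator[int]:
    """Data sequencer: emits addresses on its own, no instruction fetch."""
    for k in range(count):
        yield base + k * step


def auto_sequencing_memory(bank: List[int],
                           addresses: Iterable[int]) -> Iterator[int]:
    """A memory bank plus its sequencer turns stored data into a data stream."""
    for addr in addresses:
        yield bank[addr]


def dpu(op: Callable[[int, int], int], left: Iterable[int],
        right: Iterable[int]) -> Iterator[int]:
    """Data path unit: configured once with `op`, then driven by the data."""
    for a, b in zip(left, right):
        yield op(a, b)


if __name__ == "__main__":
    bank_a = list(range(10))                # 0, 1, 2, ...
    bank_b = [10 * v for v in range(10)]    # 0, 10, 20, ...
    stream_a = auto_sequencing_memory(bank_a, address_generator(0, 2, 4))
    stream_b = auto_sequencing_memory(bank_b, address_generator(1, 1, 4))
    print(list(dpu(operator.add, stream_a, stream_b)))
```

In this picture the choice of `op` plays the role of configware, and the address generator parameters play the role of flowware.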

Page 54: Distributed Memory and Datastream-based Reconfigurable Computing

>> distributed memory <<

Page 55: Distributed Memory and Datastream-based Reconfigurable Computing

Processor Memory Performance Gap

[Figure: Performance (log scale, 1 to 1000) over the years 1980 to 2000. Curves: CPU (µProc), 60%/yr.; DRAM, 7%/yr.]

Processor-Memory Performance Gap: (grows 50% / year)
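The 50% / year figure follows from the two growth rates on the slide: 1.60 / 1.07 is about 1.495. The sketch below only repeats this arithmetic; the constant and function names are chosen here for illustration.

```python
CPU_GROWTH = 1.60   # µProc: +60 % per year (from the slide)
DRAM_GROWTH = 1.07  # DRAM:  +7 % per year (from the slide)


def gap_growth_per_year(cpu: float = CPU_GROWTH,
                        dram: float = DRAM_GROWTH) -> float:
    """Factor by which the processor/memory gap widens in one year."""
    return cpu / dram


def gap_after(years: float, cpu: float = CPU_GROWTH,
              dram: float = DRAM_GROWTH) -> float:
    """Accumulated gap after `years`, starting from parity (gap = 1)."""
    return gap_growth_per_year(cpu, dram) ** years


if __name__ == "__main__":
    print(f"gap grows by {100 * (gap_growth_per_year() - 1):.1f} % per year")
    for years in (5, 10, 20):
        print(f"after {years:2d} years: x{gap_after(years):.0f}")
```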

Page 56: Distributed Memory and Datastream-based Reconfigurable Computing

Just in time

The new distributed memory discipline:

just in time to implement the anti machine.

M. Herz et al. (invited): Memory Organization for Data-Stream-based Reconfigurable Computing; Proc. ICECS 2002

key issues: power and performance optimization

Page 57: Distributed Memory and Datastream-based Reconfigurable Computing

address generators for Flowware execution

[Figure: an rDPA (r. Data Path Array) surrounded by asM memory banks (M); every bank has an address generator and exchanges data streams with the array.]

Page 58: Distributed Memory and Datastream-based Reconfigurable Computing


Distributed Memory

SA: scrambling and descrambling the data ?

Just in time: a new research area:

Application-specific distributed memory:

e. g. book by F. Catthoor et al. ...

Data address generators - 20 years research:

Page 59: Distributed Memory and Datastream-based Reconfigurable Computing


Significance of Address Generators

• Address generators have the potential to reduce computation time significantly.

• In a grid-based design rule check a speed-up of more than 2000 has been achieved, compared to a VAX-11/750

• Dedicated address generators contributed a factor of 10 - avoiding memory cycles for address computation overhead

Page 60: Distributed Memory and Datastream-based Reconfigurable Computing


Smart Address Generators

1983 The Structured Memory Access (SMA) Machine

1984 The GAG (generic address generator)

1989 Application-specific Address Generator (ASAG)

1990 The slider method: GAG of the MoM-2 machine

1991 The AGU

1994 The GAG of the MoM-3 machine

1997 The Texas Instruments TMS320C54x DSP

1997 Intersil HSP45240 Address Sequencer

1999 Adopt (IMEC)

Page 61: Distributed Memory and Datastream-based Reconfigurable Computing


Adopt (from IMEC)

•cMMU synthesis environment:

•application-specific ACUs for array index reference

•ACU as a counter modified by multi-level logic filter

•ACU with ASUs from a Cathedral-3 library

•distributed ACU alleviates interconnect overhead (delay, power, area)

•nested loop minimization by algebraic transformations

•AE splitting/clustering

•AE multiplexing to obtain interleaved ASs

•other features

Abbreviations:
•cMMU: customized MMU
•AE: address expression
•AS: address sequence
•ACU: address calculation unit
•ASU: application-specific unit

Page 62: Distributed Memory and Datastream-based Reconfigurable Computing

Synthesizable distributed memory architecture...

[Figure: asMemory (data memory) made of several memory banks, each with a Sequencer (data stream generator): these are the address generators for the anti machine. The Compiler and the Scheduler supply the Sequencers and the rDPA “instructions”.]

Page 63: Distributed Memory and Datastream-based Reconfigurable Computing

>> architectural resources <<

Page 64: Distributed Memory and Datastream-based Reconfigurable Computing

GAG generic address generator Scheme

[Figure: a GAU built from three units: Base Slider (B0, output B, the floor "["), Limit Slider (L0, output L, the limit "]") and Address Stepper (ΔA, output A), the stepper circuit.]

all 3 are copies of the same BSU*

*) Basic Slider Unit
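The slides give the block structure of the GAG but not its exact stepping rules, so the following is only a sketch under stated assumptions. It assumes that the address stepper sweeps from the base B to the limit L in steps of ΔA, and that after every sweep the base and limit sliders move by their own step widths (called `db` and `dl` here; these two names and the class `BasicSlider` are hypothetical). Following the note that all three units are copies of the same BSU, one class is instantiated three times.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class BasicSlider:
    """Hypothetical BSU: a register that moves by a fixed step."""
    value: int
    step: int = 0

    def advance(self) -> int:
        self.value += self.step
        return self.value


def gag_addresses(b0: int, db: int, l0: int, dl: int,
                  da: int, sweeps: int) -> Iterator[int]:
    """Yield the addresses of `sweeps` sweeps over the window [B, L]."""
    if da <= 0:
        raise ValueError("this sketch only handles a positive address step")
    base = BasicSlider(b0, db)    # base slider, starts at B0
    limit = BasicSlider(l0, dl)   # limit slider, starts at L0
    for _ in range(sweeps):
        stepper = BasicSlider(base.value, da)  # address stepper starts at B
        while stepper.value <= limit.value:
            yield stepper.value
            stepper.advance()
        base.advance()            # slide the window for the next sweep
        limit.advance()


if __name__ == "__main__":
    # Constant sliders: the same window is scanned again and again.
    print(list(gag_addresses(b0=0, db=0, l0=3, dl=0, da=1, sweeps=2)))
    # Sliding sliders: a 3-wide window jumps by 8, e.g. one row segment per line.
    print(list(gag_addresses(b0=0, db=8, l0=2, dl=8, da=1, sweeps=3)))
```

With both slider steps at zero the window stays put, and with non-zero steps it moves or grows between sweeps. This is one possible reading of the terms "constant sliders" and "sliding sliders" that appear on later slides.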

Page 65: Distributed Memory and Datastream-based Reconfigurable Computing

GAG Slider Model

[Figure: GAU (Generic Address Generator Unit) with Limit Stepper, Base Stepper and Address Stepper (B0, ΔA, L0, A). The sliders move the floor "[" (base B, starting at B0) and the ceiling "]" (limit L, starting at L0); the address A steps by ΔA between floor and ceiling.]

Page 66: Distributed Memory and Datastream-based Reconfigurable Computing

GAG: Address Stepper

GAG = Generic Address Generator

[Figure: Address Stepper circuit. Inputs: Limit (L), Base (B), stepVector (ΔA), maxStepCount, init, tag. Blocks: + / –; Escape Clause; End Detect; Step Counter (= 0). Outputs: Address (A); endExec.]

Page 67: Distributed Memory and Datastream-based Reconfigurable Computing

Generic Sequence Examples

[Figure: example address sequences a) to g), generated by one GAG consisting of Limit Slider, Base Slider and Address Stepper with parameters B0, ΔA, L0 and address output A.]

Page 68: Distributed Memory and Datastream-based Reconfigurable Computing

GAG Slider Operation Demo Example

[Figure labels: axes x and y; base B (B0) and limit L (L0); address A; floor F; ceiling C.]

Page 69: Distributed Memory and Datastream-based Reconfigurable Computing

3-by-3 tile JPEG zigzag scan pattern example

Page 70: Distributed Memory and Datastream-based Reconfigurable Computing

Implementation of a JPEG zigzag tile

[Figure label: constant sliders]

Page 71: Distributed Memory and Datastream-based Reconfigurable Computing

zigzag tile rotated 45°

Page 72: Distributed Memory and Datastream-based Reconfigurable Computing

rotated zigzag tile scan pattern implementation

[Figure label: sliding sliders]

Page 73: Distributed Memory and Datastream-based Reconfigurable Computing

3-by-3 tile rotated JPEG zigzag scan pattern example

[Figure label: higher-level slider]

Page 74: Distributed Memory and Datastream-based Reconfigurable Computing

GAG Complex Sequencer Implementation

[Figure: several GAUs, each with Limit Slider, Base Slider and Address Stepper (parameters B0, ΔA, L0, address output A), combined into one GAG (Generic Address Generator). Further labels: SDS, VLIW stack, controller.]

Page 75: Distributed Memory and Datastream-based Reconfigurable Computing

GAG Complex Sequencer Implementation (2)

[Figure: six GAUs, each with Limit Slider, Base Slider and Address Stepper (parameters B0, ΔA, L0, address output A).]

Page 76: Distributed Memory and Datastream-based Reconfigurable Computing

instruction stream-based Compilation Principles

scheduler

parser

source text

library

link/load instruction call placement

1-D memory space

execution order by location

Page 77: Distributed Memory and Datastream-based Reconfigurable Computing

Anti machine: MoM architecture

[Figure: 2-D memory space with axes x and y, showing handle positions, the scan window, an example scan pattern (high-level sequencing) and the intra-scan-window accesses (low-level sequencing). Block labels: Handle Position Generator (y-GAG, x-GAG), handle position, Scan Window Generator, memory accesses, banks 0, 1, ..., n.]
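The two sequencing levels named on this slide can be illustrated in a few lines. This is a hedged sketch, not the MoM hardware: all function names are hypothetical, the high-level scan pattern is assumed to be a plain video scan, the scan window is assumed to be square, and the row-interleaved bank assignment in `bank_of` is an assumption chosen only to show that the rows of one window can fall into different banks.

```python
from typing import Iterator, List, Tuple

Point = Tuple[int, int]


def handle_positions(width: int, height: int, win: int) -> Iterator[Point]:
    """High-level sequencing: move the window handle in a video scan."""
    for y in range(height - win + 1):
        for x in range(width - win + 1):
            yield (x, y)


def window_accesses(handle: Point, win: int) -> Iterator[Point]:
    """Low-level sequencing: visit every cell inside the scan window."""
    hx, hy = handle
    for dy in range(win):
        for dx in range(win):
            yield (hx + dx, hy + dy)


def bank_of(point: Point, n_banks: int) -> int:
    """Assumed storage scheme: memory rows are interleaved over the banks."""
    return point[1] % n_banks


def box_filter(image: List[List[int]], win: int = 3) -> List[List[int]]:
    """A simple linear filter: integer mean of each scan window."""
    height, width = len(image), len(image[0])
    out_width = width - win + 1
    result: List[List[int]] = []
    for i, handle in enumerate(handle_positions(width, height, win)):
        total = sum(image[y][x] for x, y in window_accesses(handle, win))
        if i % out_width == 0:      # a new output row starts
            result.append([])
        result[-1].append(total // (win * win))
    return result
```

With three banks and a 3-by-3 window, `bank_of` places each window row in its own bank, so under this assumed scheme the three rows could be fetched from separate banks at the same time.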

Page 78: Distributed Memory and Datastream-based Reconfigurable Computing

simple MoM* anti machine architecture

[Figure labels: RAM, Scan Window, smart memory interface, rDPA.]

*) map-oriented machine

Page 79: Distributed Memory and Datastream-based Reconfigurable Computing

MoM anti machine architecture

[Figure labels: distributed memory banks, scan windows, smart memory interface, rDPA.]

Page 80: Distributed Memory and Datastream-based Reconfigurable Computing

Linear Filter Application

[Figure b): scan windows with read (r), write (w) and read/write (r/w) accesses for successive scan steps, with the accessed cells assigned to Bank a and Bank b.]

Page 81: Distributed Memory and Datastream-based Reconfigurable Computing

Scanline unrolling

[Figure: scan window access patterns with read (r) and read/write (r/w) accesses.]

Page 82: Distributed Memory and Datastream-based Reconfigurable Computing

90° Rotation of Scan Pattern

[Figure: read (r), write (w) and read/write (r/w) access patterns on Bank a and Bank b for the rotated scan pattern, with the scan window overlap area marked.]

Page 83: Distributed Memory and Datastream-based Reconfigurable Computing

Linear Filter Application

Parallelized Merged Buffer Linear Filter Application with example image of x=22 by y=11 pixels

[Figure panels: initial design; after scan line unrolling; after inner scan line loop unrolling; hardware-level access optimization; final design.]

Page 84: Distributed Memory and Datastream-based Reconfigurable Computing

Storage scheme manipulation by scan pattern transformations

[Figure c): labels a, b and a', b'; memory bank no. 0, memory bank no. 1, memory bank no. 2, memory bank no. 3.]

Page 85: Distributed Memory and Datastream-based Reconfigurable Computing

CGFFT: Nested and Parallel Scan Pattern

outer loop scan pattern:
HLScan is 3 steps [2, 0]

inner loop compound scan patterns (3 in parallel):
SP1 is 7 steps [0, 2]
SP23 is 7 steps [0, 1]
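A nested scan of this kind can be sketched as an outer linear scan whose positions seed several inner linear scans that advance in lockstep. This is an illustration only: `linear_scan` and `nested_scan` are hypothetical names, "N steps [dx, dy]" is read as N steps after the starting position, and the starting offsets of the parallel inner patterns are not given on the slide, so the sketch simply starts every inner pattern at the current outer position.

```python
from typing import Iterator, List, Tuple

Point = Tuple[int, int]
Pattern = Tuple[int, Point]  # (number of steps, step vector)


def linear_scan(start: Point, steps: int, vector: Point) -> Iterator[Point]:
    """Yield the start position, then one position per step."""
    x, y = start
    dx, dy = vector
    yield (x, y)
    for _ in range(steps):
        x, y = x + dx, y + dy
        yield (x, y)


def nested_scan(start: Point, outer: Pattern,
                inner: List[Pattern]) -> Iterator[Tuple[Point, ...]]:
    """For each outer position, advance all inner patterns in lockstep."""
    for origin in linear_scan(start, *outer):
        scans = [linear_scan(origin, steps, vec) for steps, vec in inner]
        for positions in zip(*scans):
            yield positions


# Step counts and vectors taken from the slide; the start [0, 0] is assumed.
HL_SCAN: Pattern = (3, (2, 0))
SP1: Pattern = (7, (0, 2))
SP23: Pattern = (7, (0, 1))
trace = list(nested_scan((0, 0), HL_SCAN, [SP1, SP23]))
```

Each element of `trace` holds the positions addressed in the same scan step by the inner patterns, which is the sense in which they run "in parallel" in this sketch.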

Page 86: Distributed Memory and Datastream-based Reconfigurable Computing

CGFFT: Parallel Scan Pattern Animation

Page 87: Distributed Memory and Datastream-based Reconfigurable Computing

Scan window in real-time image processing (e.g. automotive)

Page 88: Distributed Memory and Datastream-based Reconfigurable Computing

>>> final remarks

Page 89: Distributed Memory and Datastream-based Reconfigurable Computing

Antimatter Search?

In EE & CS we do not need to search

Page 90: Distributed Memory and Datastream-based Reconfigurable Computing

What is the trend?

• vN is needed for embedded systems, OS, compilers, Sauerkraut software, non-performance-critical applications, and others.
• vN is obsolete for massive parallelism, except in some special application areas.
• The anti machine is the way to go for massive parallelism and for data-intensive applications.
• Morphware is the way to high performance under short product life cycles and unstable standards.
• Data-stream-based computing is heading for the mainstream:
  – 1979 “data streams” (Kung / Leiserson)
  – 1997 SCCC (LANL): Streams-C Configurable Computing
  – SCORE (UCB): Stream Computations Organized for Reconfigurable Execution
  – ASPRC (UCB): Adapting Software Pipelining for Reconfigurable Computing
  – 2000 Bee (UCB), ...
  – Most stream-based multimedia systems, etc.
  – Many other areas ...

Page 91: Distributed Memory and Datastream-based Reconfigurable Computing

>> final remarks <<

• embedded System Design Crisis
• the CS crisis
• datastream-based Computing
• the Anti Machine Paradigm
• application-specific distributed memory
• anti machine architectural resources
• final remarks

http://www.uni-kl.de

Page 92: Distributed Memory and Datastream-based Reconfigurable Computing

The Situation in Computing Sciences

• Computing Sciences are in a severe crisis

• New fundamentals and R&D directions are inevitable

• All knowledge needed is readily available ...

• ... even from Computing Sciences

• But curricula are obsolete and have to be upgraded

• Silicon application and EDA provide useful concepts

• Reconfigurable Computing has the remedy

Page 93: Distributed Memory and Datastream-based Reconfigurable Computing

roadmap

old CS lab course philosophy:
given an application: implement it by a program

new CS freshman lab course environment:
given an application:
a) implement it by writing a program
b) implement it as a morphware prototype
c) partition it into P and Q
   c.1) implement P by software
   c.2) implement Q by morphware
   c.3) implement the P / Q communication interface

Page 94: Distributed Memory and Datastream-based Reconfigurable Computing

Algorithms and Data Structures

... have to go beyond pointers, queues, and stacks

Extend by including:
• algorithmic issues in software / morphware / hardware migration
• additional levels of parallelism: chaining, pipelining, systolic, super-systolic, wavefront arrays
• additional data structures and storage organization: the new distributed memory discipline

Page 95: Distributed Memory and Datastream-based Reconfigurable Computing

Computer Organization / Architecture

... have to go beyond von Neumann

Extend by including:
• nested machines, address generators
• the anti machine paradigm
• extended taxonomy of platforms: procedural, structural, hardwired, reconfigurable, hybrid systems

Page 96: Distributed Memory and Datastream-based Reconfigurable Computing

Languages and Compilers

... have to go beyond von Neumann

Extend by including:
• configware / flowware compilers
• procedural / structural co-compilers
• (data-procedural) flowware languages

Page 97: Distributed Memory and Datastream-based Reconfigurable Computing

Conclusion: all knowledge needed is available

•machine paradigm

•anti machine architectural resources

•sequencing methodology: hw & sw

•parallel memory IP core and module generator vendors

•anything else needed

•compilation techniques

•hw / sw partitioning methodology

•languages

Page 98: Distributed Memory and Datastream-based Reconfigurable Computing

>>> thank you <<<

Thank you for your patience

Page 99: Distributed Memory and Datastream-based Reconfigurable Computing

>>> END <<<

Page 100: Distributed Memory and Datastream-based Reconfigurable Computing

JPEG zigzag scan pattern

*> Declarations

EastScan is
  step by [1,0]
end EastScan;

SouthScan is
  step by [0,1]
end SouthScan;

NorthEastScan is
  loop 8 times until [*,1]
    step by [1,-1]
  endloop
end NorthEastScan;

SouthWestScan is
  loop 8 times until [1,*]
    step by [-1,1]
  endloop
end SouthWestScan;

HalfZigZag is
  EastScan
  loop 3 times
    SouthWestScan
    SouthScan
    NorthEastScan
    EastScan
  endloop
end HalfZigZag;

goto PixMap[1,1]
HalfZigZag;
SouthWestScan
uturn (HalfZigZag)

[Figure: zigzag path over the x/y pixel map with segments numbered 1 to 4; labels: HalfZigZag, data counter.]
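The scan pattern program on this slide can be traced with a small interpreter. This is a sketch under assumptions rather than an implementation of the original language: the class `Scanner` and all function names are hypothetical, the `until` condition is assumed to be tested before each step, and `uturn (HalfZigZag)` is assumed to mean "replay the steps of HalfZigZag in reverse order", which is the reading under which the program covers all 64 cells of an 8-by-8 block.

```python
from typing import List, Optional, Tuple

Vec = Tuple[int, int]
EAST, SOUTH, NORTH_EAST, SOUTH_WEST = (1, 0), (0, 1), (1, -1), (-1, 1)


class Scanner:
    """Tracks the current [x, y] position and logs every step taken."""

    def __init__(self, x: int, y: int) -> None:
        self.pos: Vec = (x, y)          # goto PixMap[x, y]
        self.path: List[Vec] = [self.pos]
        self.steps: List[Vec] = []

    def step(self, vec: Vec) -> None:   # step by [dx, dy]
        self.pos = (self.pos[0] + vec[0], self.pos[1] + vec[1])
        self.path.append(self.pos)
        self.steps.append(vec)

    def loop_until(self, vec: Vec, times: int,
                   stop_x: Optional[int] = None,
                   stop_y: Optional[int] = None) -> None:
        """loop <times> times until [stop_x, stop_y]; None stands for '*'."""
        for _ in range(times):
            if self.pos[0] == stop_x or self.pos[1] == stop_y:
                break
            self.step(vec)


def south_west_scan(s: Scanner) -> None:
    s.loop_until(SOUTH_WEST, 8, stop_x=1)


def north_east_scan(s: Scanner) -> None:
    s.loop_until(NORTH_EAST, 8, stop_y=1)


def half_zigzag(s: Scanner) -> None:
    s.step(EAST)
    for _ in range(3):
        south_west_scan(s)
        s.step(SOUTH)
        north_east_scan(s)
        s.step(EAST)


def jpeg_zigzag() -> List[Vec]:
    s = Scanner(1, 1)
    half_zigzag(s)
    half = list(s.steps)            # steps taken by HalfZigZag
    south_west_scan(s)              # the long middle diagonal
    for vec in reversed(half):      # assumed meaning of uturn (HalfZigZag)
        s.step(vec)
    return s.path
```

`jpeg_zigzag()` returns the visited positions in order, starting at `(1, 1)` and ending at `(8, 8)`; the first six are `(1, 1), (2, 1), (1, 2), (1, 3), (2, 2), (3, 1)`.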

Page 101: Distributed Memory and Datastream-based Reconfigurable Computing

[Figure montage repeating material from earlier slides. Panel titles: Storage scheme optimization: scanline unrolling; MoM anti machine architecture (handle positions, scan window, scan window generator, high-level and low-level sequencing); Linear Filter Application (initial design, after scan line unrolling, after inner scan line loop unrolling, hardware-level access optimization, final design); Scan line unrolling; 90° rotated scan pattern; scan pattern overlap.]