Distributed Memory and Datastream-based Reconfigurable Computing

Swedish INTELECT Summer School on Multiprocessor Systems on Chip


Reiner Hartenstein

Kaiserslautern University of Technology

Örebro, Aug. 25-27, 2003, 11.00 – 13.00 hrs

© 2003, http://hartenstein.de

Semiconductor Revolutions

“Mainstream Silicon Application is switching every 10 Years”

[Figure: Makimoto’s Wave (published in 1989), swinging between "standard" and "custom" over the decades 1957, 1967, 1977, 1987, 1997, 2007. Wave labels: TTL; LSI, MSI; µproc., memory; ASICs, accel’s; reconfigurable. Annotations: vN machine paradigm, anti machine paradigm.]

“The Programmable System-on-a-Chip is the next wave”

How’s next Wave ?

[Figure: Makimoto’s Wave between "standard" and "custom", 1957 to 2007, overlaid with Tredennick’s Paradigm Shifts and Hartenstein’s Curve. Annotations: vN machine paradigm, anti machine paradigm.]

Tredennick’s Paradigm Shifts:
• hardwired: algorithm fixed, resources fixed
• procedural programming: algorithm variable, resources fixed
• structural programming: algorithm variable, resources variable

Hartenstein’s Curve: FPGAs, then coarse grain RAs. A 4th wave after 2007 ? No further wave !

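Mainstream Markets

[Figure: mainstream markets over Makimoto’s Wave, 1957 to 2007: mainframes, PC, then "?". Wave labels: TTL; LSI, MSI; µproc., memory; ASICs, accel’s; reconfigurable, alternating between "standard" and "custom". Annotations: "technology issue and business model", "free rider", "morphware", "data streams ...", "here?".]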



The Impact of Makimoto’s Paradigm Shifts

[Figure: Makimoto’s Wave between "standard" and "custom", 1957 to 2007. Wave labels: TTL; LSI, MSI; µproc., memory; ASICs, accel’s; reconfigurable.]

• Personalization (CAD) before fabrication
• Procedural personalization via RAM-based Machine Paradigm
• Structural personalization: RAM-based, before run time

Software Industry’s Secret of Success

Repeat Success Story by new Machine Paradigm !

Dr. Makimoto: FPL 2000 keynote



Reconfigurable Computing: a second programming domain

Migration of programming to the structural domain

The structural domain has become RAM-based

The opportunity to introduce the structural domain to programmers ...

... to bridge the gap by clever abstraction mechanisms using a simple new machine paradigm



Ubiquitous embedded systems

Embedded System Engineering (ESE) requires:

• Hardware (HW) / (E)Software (ESW) co-design

• Configware (CW) / ESW co-design

• HW / CW / ESW co-design

ESE becomes the main focus in system design:

ESW becomes main vehicle to product differentiation



Coarse grain vs. Fine grain

Reconfigurability:

• coarse grain (PACT AG, Munich)
• multi grain (e. g. by slice bundling)
• fine grain (FPGAs, rGAs)


Makimoto’s 3rd Wave

• Fine Grain Subsystems (FPGAs):
  – 1st half of 3rd wave
  – universal (but less efficient)
• Coarse Grain Subsystems:
  – 2nd half of 3rd wave
  – domain-specific
  – much more flexible than 2nd half of 2nd wave



Principle of a Typical FPGA

[Figure: array of CLBs with routing resources. Labels: CLB, Connection-Point, Tap, FF (FF of hidden RAM).]


Routing Overhead in FPGAs

[Figure: FPGA routing resources, each controlled by flip-flops (FF) that are part of the hidden RAM.]

• >1000 transistors at each cross bar
• >40 transistors at each switching point
• >15 transistors at each tap
• most FPGA vendors’ gate count: 1 flipflop of configuration RAM = 4 gates
• Routing Congestion [DeHon]: often 50% or less of CLBs used


Reconfigurability Overhead

[Figure: chip area split into cells labelled L (area used by application) and S (resources needed for reconfigurability, partly for configuration code storage); “hidden RAM” not shown.]


Reconfigurability Overhead

• Fine Grain morphware platforms:
  – about 1 of 100 transistors serves the application
  – the rest serve reconfigurability
• Coarse Grain platforms:
  – if laid out well by structured VLSI design
  – area efficiency almost like hardwired designs



Why Coarse Grain instead of FPGA ?

[Figure: transistors / chip (logarithmic axis, 1000 to 100 000 000 000) over the years 1980 to 2010. Curves: memory, microprocessor, FPGA physical, FPGA routed, FPGA logical, super systolic physical and logical. Annotated ratios: ~ 10 and ~ 10 000. Sources: Proc ISSCC, ICSPAT, DAC, DSPWorld]

• reduced reconfigurability overhead by up to ~ 1000
• drastically smaller configuration memory
• much faster loading
• a lot more benefits



Throughput vs. Efficiency

[Figure: MOPS / mW (logarithmic axis, 0.001 to 1000) over feature size (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07 µ). Curves: hardwired, FPGAs (reconfigurable logic), instruction set processors (standard microprocessor, DSP), rDPAs (reconfigurable computing)*. Source: T. Claasen et al.: ISSCC 1999]

[Inset: 1 Bit CLB with cells labelled L (area used by application) and S (resources needed for reconfigurability), compared with wiring by abutment: 32 Bit example.]

*) R. Hartenstein: ISIS 1997



Throughput vs. Flexibility

[Figure: MOPS / mW (logarithmic axis, 0.001 to 1000) over feature size (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07 µ). Curves: hardwired, FPGAs (reconfigurable logic), instruction set processors (standard microprocessor, DSP), rDPAs (reconfigurable computing)*. Source: T. Claasen et al.: ISSCC 1999. Wiring by abutment: 32 Bit example.]

[Inset: throughput over flexibility for hard-wired, von Neumann, FPGAs, and coarse grain.]

coarse grain goes far beyond bridging the gap

*) R. Hartenstein: ISIS 1997


>> outline <<

• embedded System Design Crisis
• the CS crisis
• datastream-based Computing
• the Anti Machine Paradigm
• application-specific distributed memory
• anti machine architectural resources
• final remarks

http://www.uni-kl.de


Embedded System Design Crisis

[Figure: design cost and product life cycle, plotted over the year.]



What are the Challenges ? (5)

[ST microelectronics, MorphICs, Dataquest, eASIC]

[Figure: growth factor (1 to 2) over months (0, 10, 12, 18), with one growth curve each for:]
• Embedded software [DTI* law]
• Communication bandwidth [Hansen’s law]
• Integration density (1.4/year) [Moore’s law]
• design complexity (1.4/year)
• Mask and NRE cost (1.25/year)
• µprocessor integration density (1.2/year)
• designer productivity (1.15/year)
• Memory bandwidth [Patterson’s law] (1.07/year)
• Battery capacity (1.03/year)

Time labels on the curves: 2y, 3y, 4y, 5y, 10y, 30y

*) Department of Trade and Industry, London

new compilation techniques needed ! supported by a new machine paradigm
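The annual growth factors quoted on this slide are easier to compare as doubling times. The following is a minimal sketch, assuming each factor is a constant multiplier per year; the dictionary keys and the helper name `doubling_time_years` are illustrative and not taken from the slides.

```python
import math

# Annual growth factors as quoted on the slide (assumed constant over time).
growth_per_year = {
    "integration density (Moore's law)": 1.4,
    "design complexity": 1.4,
    "mask and NRE cost": 1.25,
    "microprocessor integration density": 1.2,
    "designer productivity": 1.15,
    "memory bandwidth (Patterson's law)": 1.07,
    "battery capacity": 1.03,
}


def doubling_time_years(factor):
    """Years needed to double at a constant annual growth factor."""
    return math.log(2) / math.log(factor)


if __name__ == "__main__":
    for name, factor in growth_per_year.items():
        years = doubling_time_years(factor)
        print(f"{name:38s} x{factor:<5} doubles in {years:5.1f} years")
```

Under this assumption, design complexity (1.4/year) doubles about every two years while designer productivity (1.15/year) needs about five, which is one way to read the widening gap the slide points at.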



The microelectronics spare part problem

• Original fab line no longer exists
• ICs do not survive storage time
• Demand: several decades of availability
• e. g. car price: ~25% electronics

[Figure: IC physical life expectancy / years, demand / years of availability, and IC market volume over feature size (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07 µ) [Hartenstein 2002]]

key problem in many application areas: medical, aerospace, automotive, other transportation, military, industrial equipment controllers, et al.



Mask & NRE cost [ST microelectronics]



Shannon‘s Law

•In a number of application areas throughput requirements are growing faster than Moore's law

•Fundamental flaws in software processor solutions

•32 soft ARM cores fit onto contemporary FPGA

•Data-stream-based distributed processing is the way to go



Foundries: Adoption Rate By Process [Nick Tredennick]



SoC System level Design: Embedded SW (ESW)

• new design automation from high level descriptions
• ESE becomes the main focus in system design
• HW-(E)SW codesign onto highly programmable platforms (SoC)
• ESW becomes main vehicle to product differentiation
• formal verification for (E)SW
• HW-(E)SW co-verification
• SW synthesis included (SoC)

[Slide overlays add configware to the items above: "CW-", "CW and", "and CW", "(ECW)", "ECW".]



ITRS SoC design cost model [ITRS 2001]

[Figure: SoC design cost, "RTL methodology only" vs. "w. future improvements". Improvement steps labelled on the chart: tall thin engineer, small block reuse, large block reuse, IC implementation tools, intelligent test bench, ES level methodology. Annotation: mostly system level issues.]

http://public.itrs.net/Files/2001ITRS/Design.pdf


>> CS crisis <<





„EDA industry shifts into CS mentality“

[Wojciech Maly]

•patches instead of engineering

•innovation stalled many years ago

•netlist-based: do not care about efficiency, ...

•... do not care about transistor density

•85% users hate their tools



Where are we heading ?

[Figure: growth factor (1 to 2) over months (0, 10, 12, 18): Embedded software [DTI* law] compared with integration density (1.4/year) [Moore’s law].]

*) Department of Trade and Industry, London

• 90% by 2010
• 10 times more programmers will write embedded applications than computer software by 2010

CS is not prepared: heading toward disaster



Crusty Computing Sciences

[David Padua, John Hennessy]

shrinking supercomputing conferences

more and more efforts yield only marginal improvements

dataflow machines dead

98.5% vN-only

this monopoly is the problem

areas fade away



Dead Supercomputer Society

[Gordon Bell, keynote at ISCA 2000]

•ACRI •Alliant •American Supercomputer •Ametek •Applied Dynamics •Astronautics •BBN •CDC •Convex •Cray Computer •Cray Research •Culler-Harris •Culler Scientific •Cydrome •Dana/Ardent/Stellar/Stardent •DAPP •Denelcor •Elexsi •ETA Systems •Evans and Sutherland Computer •Floating Point Systems •Galaxy YH-1 •Goodyear Aerospace MPP •Gould NPL •Guiltech •ICL •Intel Scientific Computers •International Parallel Machines •Kendall Square Research •Key Computer Laboratories •MasPar •Meiko •Multiflow •Myrias •Numerix •Prisma •Tera •Thinking Machines •Saxpy •Scientific Computer Systems (SCS) •Soviet Supercomputers •Supertek •Supercomputer Systems •Suprenum •Vitesse Electronics



CS: young ? dynamic?

... but the von Neumann Paradigm is still the dominant doctrine ...

Microelectronics is ignored (except falling cost of computational effort)

... still pushing the basic models from the times of mainframe dinosaurs

after >10 technology generations ...

• 1st 4004
• 2nd 8008
• 3rd 8086
• 4th 80286
• 5th 80386
• 6th 80486
• 7th P5 (Pentium)
• 8th P6 (Pentium Pro / Pentium II)
• 9th Pentium III
• 10th ....
• 11th .......

... the vN Microprocessor is a Methuselah, the steam engine of the silicon age.

computing sciences are ultra conservative … to avoid saying: senile

A re-orientation is overdue


MPU designs more complex

greatly complicates the verification process

chip-level multiprocessing + simultaneous multithreading

many bugs relate to concurrency issues

new kinds of concurrency are becoming important


MPU performance stalled

Moore’s law will stall soon for MPUs

Bill Gates’ law: relative computation time needed doubles every 2 years

had been compensated by Moore’s law



CS: Lacking Sense of Direction ?

blinders for ignoring the impact of RC: „we are o.k. !“ (no new direction)



Stealthy CS Crisis

progress in CS stalled by qualification problems in industry and academia

communication barriers between disciplines

severe software quality problems

often hardware people needed to solve CS problems



What‘s the problem ?

Traditional CS: programming is (control-)procedural, instruction-stream-based – sources: software

The typical programmer has problems understanding function evaluation without machine mechanisms ... by signals rippling through a network of transistors.

It‘s the gap between the procedural and the structural mind set

[Figure: µprocessor and accelerators]

Crossing the Hardware / Software Chasm [Mike Butts]



What‘s the problem ? (2)

The brain hurts on paradigm shift ? No, it can‘t ...

Brain usage: procedural-only, structural hemisphere missing

[Figure: µprocessor and accelerators]

Crossing the Hardware / Software Chasm [Mike Butts]


Changing Models of Computing

[Figure: three models of computing]
• “von Neumann”: RAM, data path, instruction sequencer, I / O; (procedural) Software is downloaded to the RAM; software design
• host plus hardwired accelerator(s): Software is downloaded to the host RAM, the accelerator Hardware comes from CAD; hardware/software co-design
• host plus reconf. accelerator(s) (Morphware): Software is downloaded to the host RAM, (structural) Configware is downloaded to the accelerator RAM; configware/software co-design, hardware/configware/software co-design


“Programming” Domains

Platform                          Personalization (“Programs”) by        Programming Domain   Communication Paths Setup Time
Hardware                          CAD                                    Space                Fabrication Time
Systolic Array                    CAD                                    Time and Space       Fabrication Time
procedural (e.g. “von Neumann”)   Software                               Time                 Run Time
Morphware                         Configware                             Space                Compile Time
Embedded Morphware                Configware / Software Co-Compilation   Time and Space       Compile Time and Run Time



Terminology: Digital System Platforms clearly distinguished

platform: source running on it; machine paradigm
• hardware: (not running on it); none
• morphware, fine grain (rGA (FPGA)) and coarse grain (rDPU, rDPA): configware
• reconfigurable data stream processor: flowware & configware; anti machine
• data stream processor (hardwired): flowware; anti machine
• instruction stream processor: software; von Neumann machine



There are more Levels of Parallelism

Loop Level (data-stream-based, pipe nets, etc.)

Instruction Level (VLIW etc.)

Logic Level (FPGAs)

RT Level (special architectures etc.)

Process level

ignored by typical CS people & ignored by CS curricula



Complexity: System Level Design Challenge

“abstraction levels must be raised above present-day RT-level from HW + (processor-dependent embedded) C code level

language infrastructures for complex models (SystemC etc.) must be leveraged by industry consensus on use-methodology and abstraction levels”

[ITRS 2001]


>> datastream-based computing <<




Computing in space and time

[Figure: the coefficients a11 ... a33 are placed onto a systolic array; the datastreams x1, x2, x3 enter it and the datastreams y1, y2, y3 (initial values y10, y20, y30) leave it. Labels: computing in time, computing in space, placement, migration by re-timing and other transformations, systolic arrays etc.]

this dichotomy is completely ignored by our CS curricula
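The dichotomy between computing in time and computing in space can be illustrated with a matrix-vector product y = A·x. The sketch below is an assumption-laden illustration, not the array shown on the slide: the function names, the linear arrangement of cells, and the timing rule "x[j] meets a[i][j] in cell i at tick i + j" are all chosen here for clarity.

```python
def matvec_in_time(a, x):
    """One multiply-accumulate unit: all m*n operations run one after another."""
    y = [0] * len(a)
    steps = 0
    for i, row in enumerate(a):
        for j, a_ij in enumerate(row):
            y[i] += a_ij * x[j]
            steps += 1
    return y, steps


def matvec_in_space(a, x):
    """Linear array of m multiply-accumulate cells; x streams through the array.

    Cell i keeps the accumulator for y[i]. Stream item x[j] reaches cell i at
    clock tick t = i + j, where it meets the coefficient a[i][j].
    """
    m, n = len(a), len(x)
    y = [0] * m
    ticks = m + n - 1
    for t in range(ticks):
        # In hardware all cells would work simultaneously within one tick;
        # the inner loop only simulates that.
        for i in range(m):
            j = t - i
            if 0 <= j < n:
                y[i] += a[i][j] * x[j]
    return y, ticks


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    x = [1, 0, -1]
    print(matvec_in_time(a, x))
    print(matvec_in_space(a, x))
```

Both functions return the same result; the sequential version needs m·n steps, while the simulated array finishes after m + n − 1 ticks because the work has been spread over m cells.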



General Stream-based Computing System

heterogeneous Array of rDPUs (reconf. data path units)

[Figure, panels 1 to 4: expression tree (operators + and *, operands x, a, y) and DPU architectures; Mapper and Scheduler; simultaneous placement & routing by simulated annealing; resulting free form pipe network of +, *, sh and xf units with datastreams, drawn in time and space.]

The same mapper for both: reconfigurable or hardwired

Kress DPSS [1995]



flowware defines ....

... which data item at which time at which port

[Figure: a DPA with input data streams and output data streams, each drawn as data items (x) over the axes time and port #.]

flowware history:
• 1980: data streams (Kung, Leiserson)
• 1995: super systolic rDPA (Kress)
• 1996+: SCCC (LANL), SCORE, ASPRC, Bee (UCB), ...

(tutorials and courses available on all this)
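The statement that flowware defines which data item arrives at which time at which port can be sketched as a small schedule table. This is an illustration only: the dictionary representation, the function names, and the one-step skew per port (a common arrangement for systolic arrays) are assumptions, not a flowware format taken from the slides.

```python
def skewed_schedule(rows):
    """Return {(time, port): item}; row p enters at port p, delayed by p steps."""
    schedule = {}
    for port, row in enumerate(rows):
        for k, item in enumerate(row):
            schedule[(k + port, port)] = item
    return schedule


def stream_at(schedule, time, n_ports, idle="-"):
    """Data items present at all ports at one time step ('-' marks an idle port)."""
    return [schedule.get((time, port), idle) for port in range(n_ports)]


if __name__ == "__main__":
    rows = [["x1", "x2", "x3"], ["y1", "y2", "y3"]]
    schedule = skewed_schedule(rows)
    for t in range(4):
        print(t, stream_at(schedule, t, n_ports=2))
```

A data sequencer that executes such a schedule needs no instruction stream: the table alone fixes the order in which items are fetched and delivered.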



control-procedural vs. data-procedural

The structural domain is primarily data-stream-based, but mostly not yet modelled that way: most flowware is hidden by its indirect instruction-stream-based implementation

Flowware provides a (data-)procedural abstraction from the (data-stream-based) structural domain

Flowware converts „procedural vs. structural“ into „control-procedural vs. data-procedural“ ...

... a Trojan horse to introduce the structural domain to the procedural mind set of programmers



Configware / Flowware Compilation

[Figure: compilation flow from a high level source program via wrapper and intermediate form; the mapper generates configware for the rDPA (r. DataPath Array), the scheduler generates flowware for the data sequencers (address generators) of the asM memory banks (M), which exchange data streams with the rDPA.]

students should know that P & R is also a compilation technique



>> the anti machine paradigm <<





Why a dichotomy of machine paradigms?

[Figure, stolen from Bob Colwell: vN: unbalanced. CPU, caches, ..., vN bottleneck]

data stream machine:

• bad message: caches do not help
• good message: no vN bottleneck
• caches not needed

The anti machine has no von Neumann bottleneck



Terminology: DPU versus CPU ...

• DPU: data path unit
• DPA: DPU array
• GA: gate array
• rDPU: reconfigurable DPU
• rDPA: reconfigurable DPA
• rGA: reconfigurable GA

• DPU is no CPU: there is nothing central - like in a DPA

[Figure: CPU = DPU + instruction sequencer; (r)DPA = array of (r)DPUs]



Machine paradigms

[Figure: the two machine paradigms side by side]
• von Neumann instruction stream machine: CPU (DPU plus instruction sequencer), memory M, I/O; driven by an instruction stream (Software)
• data-stream machine: DPU or rDPU, (r)DPA; several memory banks M (asM), each with a data address generator (data sequencer), form a distributed memory architecture*; I/O; driven by data streams (Flowware, plus Configware if reconf.)

*) the new discipline came just in time: see Herz et al.: Proc. IEEE ICECS 2002


>> distributed memory <<





Processor Memory Performance Gap

[Figure: performance (logarithmic axis, 1 to 1000) over the years 1980 to 2000. CPU (µProc): 60%/yr. DRAM: 7%/yr. Processor-Memory Performance Gap: grows 50% / year.]
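The 50% per year figure follows directly from the two quoted improvement rates. A minimal sketch of the arithmetic, assuming both rates stay constant; the function name `relative_gap` is illustrative.

```python
def relative_gap(years, cpu_rate=0.60, dram_rate=0.07):
    """Ratio of processor to DRAM performance after `years`, starting from 1."""
    return ((1 + cpu_rate) / (1 + dram_rate)) ** years


if __name__ == "__main__":
    annual = relative_gap(1)
    print(f"gap grows by {100 * (annual - 1):.1f}% per year")
    print(f"gap after 20 years: factor {relative_gap(20):.0f}")
```

The ratio 1.60 / 1.07 is about 1.495, so the gap grows by roughly 50% per year and compounds to a factor of several thousand over two decades.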



Just in time

The new distributed memory discipline:

just in time to implement the anti machine.

M. Herz et al. (invited): Memory Organization for Data-Stream-based Reconfigurable Computing; Proc. ICECS 2002

key issues: power and performance optimization



address generators for Flowware execution

[Figure: rDPA (r. DataPath Array) surrounded by asM memory banks (M), each with an address generator; data streams run between the memory banks and the rDPA.]



Distributed Memory

SA: scrambling and descrambling the data ?

Just in time: a new research area:

Application-specific distributed memory:

e. g. book by F. Catthoor et al. ...

Data address generators - 20 years research:



Significance of Address Generators

• Address generators have the potential to reduce computation time significantly.

• In a grid-based design rule check a speed-up of more than 2000 has been achieved, compared to a VAX-11/750

• Dedicated address generators contributed a factor of 10 - avoiding memory cycles for address computation overhead



Smart Address Generators

1983 The Structured Memory Access (SMA) Machine

1984 The GAG (generic address generator)

1989 Application-specific Address Generator (ASAG)

1990 The slider method: GAG of the MoM-2 machine

1991 The AGU

1994 The GAG of the MoM-3 machine

1997 The Texas Instruments TMS320C54x DSP

1997 Intersil HSP45240 Address Sequencer

1999 Adopt (IMEC)



Adopt (from IMEC)

•cMMU synthesis environment:

•application-specific ACUs for array index reference

•ACU as a counter modified by multi-level logic filter

•ACU with ASUs from a Cathedral-3 library

•distributed ACU alleviates interconnect overhead (delay, power, area)

•nested loop minimization by algebraic transformations

•AE splitting/clustering

•AE multiplexing to obtain interleaved ASs

•other features

•customized MMU (cMMU)

•address expression (AE)

•Address Sequence (AS)

•Address Calculation Unit (ACU)

•Application-Specific Unit (ASU)



Synthesizable distributed memory architecture...

[Figure: Compiler and Scheduler produce rDPA “instructions” and drive the Sequencers (data stream generator), which are the address generators for the anti machine; the as Memory (data memory) consists of several memory banks.]


>> architectural resources <<





GAG generic address generator Scheme

[Figure: a GAU consisting of Base Slider (B0, B), Limit Slider (L0, L) and Address Stepper (ΔA, A); the address A moves in steps of ΔA between base [ and limit ].]

all 3 are copies of the same BSU* stepper circuit

*) Basic Slider Unit



GAG Slider Model

[Figure: GAU (Generic Address Generator Unit) with Limit Stepper, Base Stepper and Address Stepper (B0, ΔA, L0, A); the sliders move base B and limit L between floor and ceiling, starting from B0 and L0.]


GAG: Address Stepper

GAG = Generic Address Generator

[Figure: address stepper circuit. Inputs: Limit L, Base B0, stepVector ΔA, maxStepCount, init tag. Blocks: + / – unit, Escape Clause, End Detect (=0), Step Counter. Outputs: Address A, endExec.]
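The behaviour suggested by the stepper's labels (base, limit, step vector, maximum step count) can be sketched in software. This is a simplified one-dimensional model under stated assumptions, not the GAG circuit: the function names and parameters are hypothetical, and the escape clause is reduced to a step-count limit.

```python
def address_stepper(base, limit, step, max_steps=None):
    """Yield base, base + step, ... while the address stays between base and limit.

    A positive step runs upwards to `limit`, a negative step runs downwards.
    `max_steps` models a step counter that ends the sequence early.
    """
    if step == 0:
        raise ValueError("step must be non-zero")
    address, count = base, 0
    while address <= limit if step > 0 else address >= limit:
        if max_steps is not None and count >= max_steps:
            break
        yield address
        address += step
        count += 1


def sliding_scans(base0, limit0, step, base_step, limit_step, sweeps):
    """Repeat the address sweep while base and limit 'slide' between sweeps."""
    for k in range(sweeps):
        yield list(address_stepper(base0 + k * base_step,
                                   limit0 + k * limit_step, step))


if __name__ == "__main__":
    print(list(address_stepper(0, 10, 3)))
    print(list(sliding_scans(0, 2, 1, base_step=1, limit_step=2, sweeps=3)))
```

With constant sliders (`base_step` and `limit_step` both 0) every sweep covers the same interval; letting them slide produces sweeps whose start and length change from sweep to sweep without any address arithmetic in the data path.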



Generic Sequence Examples

[Figure: example address sequences a) to g), generated by a GAG with Limit Slider, Base Slider and Address Stepper (B0, ΔA, L0, A).]



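GAG Slider Operation Demo Example

[Figure: address space with x and y axes; base B and limit L slide between floor F and ceiling C, starting from B0 and L0; A is the current address.]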



3-by-3 tile JPEG zigzag scan pattern example

implementation of a JPEG zigzag tile

[Figure annotation: constant sliders]

zigzag tile rotated 45°

rotated zigzag tile scan pattern implementation

[Figure annotation: sliding sliders]

3-by-3 tile rotated JPEG zigzag scan pattern example

[Figure annotation: higher level slider]



GAG Complex Sequencer Implementation

[Figure: a GAG (Generic Address Generator) composed of several GAUs, each with Limit Slider, Base Slider and Address Stepper (B0, ΔA, L0, A), together with SDS, VLIW stack and controller.]

GAG Complex Sequencer Implementation (2)

[Figure: six GAUs, each with Limit Slider, Base Slider and Address Stepper (B0, ΔA, L0, A).]



instruction stream-based Compilation Principles

scheduler

parser

source text

library

link/load instruction call placement

1-D memory space

execution order by location



Antimachine: MoM architecture

x

y

handle positions

scan window

scan pattern (high level sequencing)

example

intra scan window accesses (low level sequencing)

Handle Position Generator

Scan Window Generator

handle position

bank 0 1 • • • n

y-GAG x-GAG

memory accesses
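The two sequencing levels named on this slide, a scan pattern that moves the handle position and the accesses inside the scan window at each handle, can be sketched as two nested generators. The sketch assumes a simple row-by-row scan pattern, a square window, and row-major addressing in a single memory; all function names are illustrative and not part of the MoM architecture description.

```python
def handle_positions(width, height, win, step=1):
    """High level sequencing: row-by-row handle positions of a win x win window."""
    for y in range(0, height - win + 1, step):
        for x in range(0, width - win + 1, step):
            yield (x, y)


def window_accesses(handle, win):
    """Low level sequencing: pixel coordinates inside the window at `handle`."""
    hx, hy = handle
    for dy in range(win):
        for dx in range(win):
            yield (hx + dx, hy + dy)


def linear_address(x, y, width):
    """Row-major address of pixel (x, y); an assumed storage scheme."""
    return y * width + x


def memory_accesses(width, height, win):
    """Address list per scan step: one list of window addresses per handle."""
    for handle in handle_positions(width, height, win):
        yield [linear_address(x, y, width) for x, y in window_accesses(handle, win)]


if __name__ == "__main__":
    # 22 by 11 pixel image with a 3 by 3 scan window
    steps = list(memory_accesses(22, 11, 3))
    print(len(steps), steps[0])
```

Splitting the address computation this way keeps both generators trivial; distributing the window's pixels over several memory banks, so that they can be fetched in parallel, would replace `linear_address` with a bank-aware mapping.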



simple MoM* anti machine architecture

[Figure: RAM, Scan Window, Smart memory interface, rDPA]

*) map-oriented machine

MoM anti machine architecture

[Figure: distributed memory banks, scan windows, Smart memory interface, rDPA]



Linear Filter Application

[Figure, panel b): scan window access patterns (r, w, r/w, w/r) on memory banks a and b for successive scan steps.]



Scanline unrolling

[Figure: scan window access patterns (r, r/w) after scanline unrolling.]



90° Rotation of Scan Pattern

[Figure: scan window access patterns (r, w, r/w) on memory banks a and b before and after rotating the scan pattern by 90°; the scan window overlap area is marked.]




Parallelized Merged Buffer Linear Filter Application with example image of x=22 by y=11 pixel

[Figure panels: initial design; after scan line unrolling; after inner scan line loop unrolling; hardw. level access optim.; final design]



Storage scheme manipulation by scan pattern transformations

[Figure, panel c): scan patterns a, b and their transformed versions a', b' mapped onto memory bank no. 0, memory bank no. 1, memory bank no. 2 and memory bank no. 3.]


CGFFT: Nested and Parallel Scan Pattern

outer loop scan pattern:
  HLScan is 3 steps [2, 0]

inner loop compound scan patterns (3 in parallel):
  SP1 is 7 steps [0, 2]
  SP23 is 7 steps [0, 1]
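How an outer scan pattern and an inner scan pattern nest can be sketched with the step counts and step vectors quoted above. The interpretation is an assumption: a pattern of "n steps [dx, dy]" is taken to visit the start position plus n further positions, and the helper names `scan` and `nested_scan` are hypothetical.

```python
def scan(start, steps, vector):
    """Positions visited by `steps` moves along `vector`, including the start."""
    x, y = start
    dx, dy = vector
    return [(x + k * dx, y + k * dy) for k in range(steps + 1)]


def nested_scan(start, outer, inner):
    """Run the inner scan pattern from every position of the outer scan pattern."""
    (outer_steps, outer_vec), (inner_steps, inner_vec) = outer, inner
    return [scan(origin, inner_steps, inner_vec)
            for origin in scan(start, outer_steps, outer_vec)]


if __name__ == "__main__":
    hl_scan = (3, (2, 0))   # HLScan is 3 steps [2, 0]
    sp1 = (7, (0, 2))       # SP1 is 7 steps [0, 2]
    for column in nested_scan((0, 0), hl_scan, sp1):
        print(column)
```

Running several inner patterns in parallel would amount to calling `nested_scan` once per inner pattern with the same outer pattern, each feeding its own memory bank.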


CGFFT: Parallel Scan Pattern Animation



Scan window in real-time image processing (e. g. automotive)



>>> final remarks



Antimatter Search ?

in EE & CS we do not need to search



What is the trend ?

•vN is needed for embedded systems, OS, compilers, Sauerkraut software, non-performance-critical applications, others ….

•vN is obsolete for massive parallelism, except some special application areas

•Anti machine is the way to go for massive parallelism, also data-intensive applications

•Morphware is the way for high performance with short product life cycles, unstable standards

•Data-stream-based Computing is heading for mainstream

–1979 „data streams“ (Kung / Leiserson)

–1997 SCCC (LANL) Streams-C Configurable Computing

–SCORE (UCB) Stream Computations Organized for Reconfigurable Execution

–ASPRC (UCB) Adapting Software Pipelining for Reconfigurable Computing

–2000 Bee (UCB), ...

–Most stream-based multimedia systems, etc.

–Many other areas ....


>> final remarks <<





The Situation in Computing Sciences

• Computing Sciences are in a severe crisis

• New fundamentals and R&D directions are inevitable

• All knowledge needed is readily available ...

• ... even from Computing Sciences

• But curricula are obsolete and have to be upgraded

• Silicon application and EDA provide useful concepts

• Reconfigurable Computing has the remedy



roadmap

old CS lab course philosophy: given an application, implement it by a program

new CS freshman lab course environment: given an application,

a) implement it by writing a program
b) implement it as a morphware prototype
c) partition it into P and Q
   c.1) implement P by software
   c.2) implement Q by morphware
   c.3) implement the P / Q communication interface



Algorithms and Data Structures

... have to go beyond pointers, queues, and stacks

Extend by including
• algorithmic issues in software / morphware / hardware migration
• additional levels of parallelism: chaining, pipelining, systolic, super-systolic, wavefront arrays
• additional data structures and storage organization: the new distributed memory discipline



Computer Organization / Architecture

... have to go beyond von Neumann

Extend by including
• nested machines, address generators
• the anti machine paradigm
• extended taxonomy of platforms: procedural, structural, hardwired, reconfigurable, hybrid systems



Languages and Compilers

... have to go beyond von Neumann

Extend by including
• Configware / flowware compilers
• Procedural / structural co-compilers
• (data-procedural) flowware languages



Conclusion: all knowledge needed is available

•machine paradigm

•anti machine architectural resources

•sequencing methodology: hw & sw

•parallel memory IP core and module generator vendors

•anything else needed

•compilation techniques

•hw / sw partitioning methodology

•languages



>>> thank you <<<<<

thank you for your patience



>>> END <<<




JPEG zigzag scan pattern

*> Declarations

EastScan is
  step by [1,0]
end EastScan;

SouthScan is
  step by [0,1]
end SouthScan;

NorthEastScan is
  loop 8 times until [*,1]
    step by [1,-1]
  endloop
end NorthEastScan;

SouthWestScan is
  loop 8 times until [1,*]
    step by [-1,1]
  endloop
end SouthWestScan;

HalfZigZag is
  EastScan
  loop 3 times
    SouthWestScan
    SouthScan
    NorthEastScan
    EastScan
  endloop
end HalfZigZag;

goto PixMap[1,1]
HalfZigZag;
SouthWestScan
uturn (HalfZigZag)

[Figure: tile with x and y axes showing the scan path in four parts, numbered 1 to 4 and driven by data counters: HalfZigZag, SouthWestScan, and HalfZigZag traversed in reverse.]
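For comparison with the scan pattern declarations above, the same JPEG zigzag order can be generated in a few lines of conventional code. This is a sketch using 0-based coordinates (the declarations above start at PixMap[1,1]); the function name `zigzag` and the diagonal-by-diagonal formulation are choices made here, not taken from the slides.

```python
def zigzag(n):
    """Return the (x, y) positions of an n x n tile in JPEG zigzag order."""
    order = []
    for s in range(2 * n - 1):          # s = x + y selects one anti-diagonal
        xs = range(max(0, s - n + 1), min(s, n - 1) + 1)
        diagonal = [(x, s - x) for x in xs]
        if s % 2 == 1:
            # Odd diagonals are walked south-west: x decreases, y increases.
            diagonal.reverse()
        order.extend(diagonal)
    return order


if __name__ == "__main__":
    for position in zigzag(3):
        print(position)
```

The software version recomputes coordinates with comparisons and arithmetic for every diagonal, which is the kind of address computation overhead that a dedicated address generator takes out of the data path.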



Storage scheme optimization: scanline unrolling

[Figure panels: initial design; after scan line unrolling; after inner scan line loop unrolling; hardw. level access optim.; final design. Scan window access patterns (r, r/w, w/r) on memory banks a and b.]

MoM anti machine architecture

[Figure: x and y axes, handle positions, scan window, scan pattern (high level sequencing), example of intra scan window accesses (low level sequencing), scan window generator.]

Scan line unrolling

[Figure: 90° rotated scan pattern with access patterns (r, r/w) and the scan pattern overlap marked.]
