Swedish INTELECT Summer School on Multiprocessor Systems on Chip
Distributed Memory and Datastream-based Reconfigurable Computing
Reiner Hartenstein
Kaiserslautern University of Technology
Örebro, Aug. 25-27, 2003, 11.00 – 13.00 hrs
© 2003, [email protected] http://hartenstein.de2
Kaiserslautern University of Technology
Semiconductor Revolutions
Makimoto’s Wave (published in 1989): “Mainstream Silicon Application is switching every 10 Years”
[Figure: Makimoto’s Wave, swinging between standard and custom every 10 years from 1957 to 2007; labelled phases: TTL; LSI, MSI; µproc., memory; ASICs, accel’s; reconfigurable]
“The Programmable System-on-a-Chip is the next wave”
vN machine paradigm vs. anti machine paradigm
How’s next Wave ?
[Figure: Makimoto’s Wave (standard vs. custom, 1957 to 2007, FPGAs at 2007) overlaid with Tredennick’s Paradigm Shifts and Hartenstein’s Curve]
Tredennick’s Paradigm Shifts:
• hardwired: algorithm fixed, resources fixed
• procedural programming: algorithm variable, resources fixed
• structural programming: algorithm variable, resources variable
Hartenstein’s Curve: coarse grain RAs; a 4th wave ? No further wave !
vN machine paradigm vs. anti machine paradigm
Mainstream Markets
[Figure: Makimoto’s Wave (standard vs. custom, 1957 to 2007; TTL; LSI, MSI; µproc., memory; ASICs, accel’s; reconfigurable) annotated with mainstream markets: mainframes, PC, and a question mark ("here?") for data streams ...]
technology issue and business model
Trittbrettfahrer (free rider): morphware
The Impact of Makimoto’s Paradigm Shifts
[Figure: Makimoto’s Wave, standard vs. custom, 1957 to 2007: TTL; LSI, MSI; µproc., memory; ASICs, accel’s; reconfigurable]
• Procedural personalization via RAM-based Machine Paradigm: Software Industry’s Secret of Success
• Personalization (CAD) before fabrication
• Structural personalization: RAM-based, before run time
Repeat Success Story by new Machine Paradigm !
Dr. Makimoto: FPL 2000 keynote
Reconfigurable Computing: a second programming domain
Migration of programming to the structural domain
The structural domain has become RAM-based
The opportunity to introduce the structural domain to programmers ... to bridge the gap by clever abstraction mechanisms using a simple new machine paradigm
Ubiquitous embedded systems
Embedded System Engineering (ESE) requires:
• Hardware (HW) / (E)Software (ESW) co-design
• Configware (CW) / ESW co-design
• HW / CW / ESW co-design
ESE becomes the main focus in system design:
ESW becomes main vehicle to product differentiation
Coarse grain vs. Fine grain
Reconfigurability:
• coarse grain (PACT AG, Munich)
• multi grain (e. g. by slice bundling)
• fine grain (FPGAs, rGAs)
Makimoto’s 3rd Wave
• Fine Grain Subsystems (FPGAs):
– 1st half of 3rd wave
– universal (but less efficient)
• Coarse Grain Subsystems:
– 2nd half of 3rd wave
– domain-specific
– much more flexible than 2nd half of 2nd wave
Principle of a Typical FPGA
[Figure: array of CLBs with routing channels; labels: Connection-Point, Tap, FF of hidden RAM]
Routing Overhead in FPGAs
[Figure: FPGA routing fabric; each FF is part of the hidden RAM]
• > 1000 transistors at each cross bar
• > 40 transistors at each switching point
• > 15 transistors at each tap
• most FPGA vendors’ gate count: 1 flipflop of configuration RAM = 4 gates
• Routing Congestion [DeHon]: often 50% or less of CLBs used
Reconfigurability Overhead
[Figure: chip area made of L blocks (area used by application) and S blocks (resources needed for reconfigurability, partly for configuration code storage); “hidden RAM” not shown]
Reconfigurability Overhead
• Fine Grain morphware platforms:
– about 1 of 100 transistors serves the application
– the rest serve reconfigurability
• Coarse Grain platforms:
– if well laid out by structured VLSI design,
– area efficiency is almost like that of hardwired designs
Why Coarse Grain instead of FPGA ?
[Figure: transistors / chip (1000 to 100 000 000 000, log scale) vs. year (1980 to 2010); curves: memory, microprocessor, FPGA physical, FPGA logical, FPGA routed, super systolic (physical, logical); annotated gaps of ~ 10 and ~ 10 000. Sources: Proc ISSCC, ICSPAT, DAC, DSPWorld]
• reduced reconfigurability overhead by up to ~ 1000
• drastically smaller configuration memory
• much faster loading
• a lot more benefits
Throughput vs. Efficiency
[Figure: MOPS / mW (0.001 to 1000, log scale) vs. µ feature size (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07); curves: hardwired, rDPAs (reconfigurable computing)*, FPGAs (reconfigurable logic), DSP, standard microprocessor (instruction set processors). Insets: 1 Bit CLB with area used by application (L) and resources needed for reconfigurability (S); wiring by abutment: 32 Bit example]
T. Claasen et al.: ISSCC 1999
*) R. Hartenstein: ISIS 1997
Throughput vs. Flexibility
[Figure: MOPS / mW (0.001 to 1000, log scale) vs. µ feature size (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07); curves: hardwired, rDPAs (reconfigurable computing)*, FPGAs (reconfigurable logic), DSP, standard microprocessor (instruction set processors); wiring by abutment: 32 Bit example. Inset: throughput vs. flexibility for hard-wired, coarse grain, FPGAs and von Neumann]
T. Claasen et al.: ISSCC 1999
*) R. Hartenstein: ISIS 1997
coarse grain goes far beyond bridging the gap
>> outline <<
• embedded System Design Crisis
• the CS crisis
• datastream-based Computing
• the Anti Machine Paradigm
• application-specific distributed memory
• anti machine architectural resources
• final remarks
http://www.uni-kl.de
Embedded System Design Crisis
[Figure: design cost and product life cycle plotted over the years]
What are the Challenges ? (5)
[ST microelectronics, MorphICs, Dataquest, eASIC]
[Figure: growth factor (1 to 2) vs. months (0, 10, 12, 18); the curves carry time marks of 2y, 3y, 4y, 5y, 10y and 30y]
• Embedded software [DTI* law]
• Communication bandwidth [Hansen’s law]
• Integration density (1.4/year) [Moore’s law]
• design complexity (1.4/year)
• Mask and NRE cost (1.25/year)
• µprocessor integration density (1.2/year)
• designer productivity (1.15/year)
• Memory bandwidth [Patterson’s law] (1.07/year)
• Battery capacity (1.03/year)
*) Department of Trade and Industry, London
new compilation techniques needed ! supported by a new machine paradigm
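The annual growth factors quoted on this slide can be turned into doubling times and into the widening gap between design complexity and designer productivity. The following is only a small sketch of that arithmetic; the function names are illustrative and not part of the talk.

```python
import math

# Annual growth factors as quoted on the slide.
GROWTH_PER_YEAR = {
    "integration density (Moore's law)": 1.40,
    "design complexity": 1.40,
    "mask and NRE cost": 1.25,
    "microprocessor integration density": 1.20,
    "designer productivity": 1.15,
    "memory bandwidth (Patterson's law)": 1.07,
    "battery capacity": 1.03,
}


def doubling_time(factor_per_year):
    """Years needed to double at a constant annual growth factor."""
    return math.log(2) / math.log(factor_per_year)


def gap_after(years, fast=1.40, slow=1.15):
    """Ratio between a fast- and a slow-growing quantity after `years`."""
    return (fast / slow) ** years


if __name__ == "__main__":
    for name, factor in GROWTH_PER_YEAR.items():
        print(f"{name}: doubles every {doubling_time(factor):.1f} years")
    print(f"complexity / productivity gap after 10 years: {gap_after(10):.1f}x")
```

Under these figures, design complexity outgrows designer productivity by roughly a factor of seven per decade, which is the gap the slide's call for new compilation techniques responds to.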
The microelectronics spare part problem
[Figure: IC market volume, IC physical life expectancy / years, and demand / years of availability vs. µ feature size (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07) [Hartenstein 2002]]
• key problem in many application areas: medical, aerospace, automotive, other transportation, military, industrial equipment controllers, et al.
• Original fab line no longer exists
• ICs do not survive storage time
• Demand: several decades of availability
• e. g. car price: ~25% electronics
Mask & NRE cost [ST microelectronics]
Shannon’s Law
• In a number of application areas throughput requirements are growing faster than Moore's law
• Fundamental flaws in software processor solutions
• 32 soft ARM cores fit onto a contemporary FPGA
• Data-stream-based distributed processing is the way to go
Foundries: Adoption Rate By Process [Nick Tredennick]
SoC System level Design: Embedded SW (ESW)
ESE becomes the main focus in system design:
• new design automation from high level descriptions
• HW-(E)SW codesign onto highly programmable platforms (SoC)
• formal verification for (E)SW
• HW-(E)SW-co-verification
• SW synthesis included (SoC)
ESW becomes main vehicle to product differentiation
[Overlay labels on the slide: CW-, CW and, and CW, (ECW), ECW]
ITRS SoC design cost model [ITRS 2001]
[Figure: SoC design cost, "RTL methodology only" vs. "w. future improvements"; improvement steps: tall thin engineer, small block reuse, large block reuse, IC implementation tools, intelligent test bench, ES level methodology; mostly system level issues]
http://public.itrs.net/Files/2001ITRS/Design.pdf
>> CS crisis <<
“EDA industry shifts into CS mentality” [Wojciech Maly]
• patches instead of engineering
• innovation stalled many years ago
• netlist-based: do not care about efficiency, ...
• ... do not care about transistor density
• 85% users hate their tools
Where are we heading ?
[Figure: growth factor (1 to 2) vs. months (0, 10, 12, 18): Embedded software [DTI* law] compared with Moore’s law (1.4/year)]
*) Department of Trade and Industry, London
90% by 2010
10 times more programmers will write embedded applications than computer software by 2010
CS is not prepared: heading toward disaster
Crusty Computing Sciences [David Padua, John Hennessy]
• shrinking supercomputing conferences
• more and more efforts yield only marginal improvements
• dataflow machines dead
• areas fade away
• 98.5% vN-only: this monopoly is the problem
Dead Supercomputer Society [Gordon Bell, keynote at ISCA 2000]
• ACRI • Alliant • American Supercomputer • Ametek • Applied Dynamics • Astronautics • BBN • CDC • Convex • Cray Computer • Cray Research • Culler-Harris • Culler Scientific • Cydrome • Dana/Ardent/Stellar/Stardent
• DAPP • Denelcor • Elexsi • ETA Systems • Evans and Sutherland Computer • Floating Point Systems • Galaxy YH-1 • Goodyear Aerospace MPP • Gould NPL • Guiltech • ICL • Intel Scientific Computers • International Parallel Machines
• Kendall Square Research • Key Computer Laboratories • MasPar • Meiko • Multiflow • Myrias • Numerix • Prisma • Tera • Thinking Machines • Saxpy • Scientific Computer Systems (SCS) • Soviet Supercomputers • Supertek • Supercomputer Systems • Suprenum • Vitesse Electronics
CS: young ? dynamic ?
... but the von Neumann Paradigm is still the dominant doctrine ...
Microelectronics is ignored (except falling cost of computational effort)
... still pushing the basic models from the times of mainframe dinosaurs
after > 10 technology generations ...
• 1st 4004 • 2nd 8008 • 3rd 8086 • 4th 80286 • 5th 80386 • 6th 80486 • 7th P5 (Pentium) • 8th P6 (Pentium Pro / Pentium II) • 9th Pentium III • 10th .... • 11th .......
... the vN Microprocessor is a Methuselah, the steam engine of the silicon age.
computing sciences are ultra conservative … to avoid saying: senile
A re-orientation is overdue
MPU designs more complex
greatly complicates the verification process
chip-level multiprocessing + simultaneous multithreading
many bugs relate to concurrency issues
new kinds of concurrency are becoming important
MPU performance stalled
Moore’s law will stall soon for MPUs
Bill Gates’ law: relative computation time needed doubles every 2 years
had been compensated by Moore’s law
CS: Lacking Sense of Direction ?
“we are o.k. !” (no new direction)
blinders: for ignoring the impact of RC
Stealthy CS Crisis
progress in CS stalled by qualification problems in industry and academia
communication barriers between disciplines
severe software quality problems
often hardware people needed to solve CS problems
What’s the problem ?
Traditional CS: programming is (control-)procedural, instruction-stream-based – sources: software
The typical programmer has problems understanding function evaluation without machine mechanisms .... by signals rippling through a network of transistors.
It’s the gap between the procedural and the structural mind set
[Figure: µprocessor and accelerators]
Crossing the Hardware / Software Chasm [Mike Butts]
What’s the problem ? (2)
The brain hurts on paradigm shift ? no, it can’t ...
Brain usage: procedural-only; structural hemisphere missing
[Figure: µprocessor and accelerators]
Crossing the Hardware / Software Chasm [Mike Butts]
Changing Models of Computing
[Figure: three models of computing, each personalized by downloading into RAM]
• “von Neumann”: data path, instruction sequencer, I / O, RAM; (procedural) Software; software design
• host with hardwired accelerator(s): Software downloaded into RAM, Hardware via CAD; hardware/software co-design
• host with reconf. accelerator(s) (Morphware): Software and (structural) Configware, both downloaded into RAM; configware/software co-design, hardware/configware/software co-design
“Programming” Domains
Platform: Personalization (“Programs”) by / Programming Domain / Communication Paths Setup Time
• Hardware: CAD / Space / Fabrication Time
• Systolic Array: CAD / Time and Space / Fabrication Time
• procedural (e.g. “von Neumann”): Software / Time / Run Time
• Morphware: Configware / Space / Compile Time
• Embedded Morphware: Configware / Software Co-Compilation / Time and Space / Compile Time and Run Time
Terminology: Digital System Platforms clearly distinguished
platform: source running on it / machine paradigm
• hardware: (not running on it) / none
• morphware, fine grain, rGA (FPGA): configware
• morphware, coarse grain, rDPU, rDPA (reconfigurable data stream processor): flowware & configware / anti machine
• data stream processor (hardwired): flowware / anti machine
• instruction stream processor: software / von Neumann machine
There are more Levels of Parallelism
Loop Level (data-stream-based, pipe nets, etc.)
Instruction Level (VLIW etc.)
Logic Level (FPGAs)
RT Level (special architectures etc.)
Process level
ignored by typical CS people & ignored by CS curricula
Complexity: System Level Design Challenge
“abstraction levels must be raised above present-day RT-level
from HW + (processor-dependent embedded) C code level
language infrastructures for complex models (SystemC etc.)
must be leveraged by industry consensus on use-methodology and abstraction levels”
[ITRS 2001]
>> datastream-based computing <<
Computing in space and time
[Figure: matrix-vector example with coefficients a11 ... a33, input data streams x1, x2, x3 and output data streams y1, y2, y3 (initial values y10, y20, y30); "computing in time" is migrated to "computing in space" (systolic arrays etc.) by re-timing and other transformations, followed by placement]
this dichotomy is completely ignored by our CS curricula
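To make the time-to-space migration concrete, here is a minimal software simulation of a linear systolic array computing y = A·x. It is one possible mapping chosen for illustration (PE i owns the result y[i], and x[j] reaches PE i at time step i + j); it is not claimed to be the exact array drawn on the slide.

```python
def systolic_matvec(a, x):
    """Simulate a linear systolic array for y = A * x.

    Assumed mapping: processing element (PE) i accumulates y[i];
    the stream item x[j] arrives at PE i at time step t = i + j.
    """
    m, n = len(a), len(x)
    y = [0] * m
    for t in range(m + n - 1):      # time axis
        for i in range(m):          # space axis: all PEs work in parallel in hardware
            j = t - i               # which stream item is at PE i right now
            if 0 <= j < n:
                y[i] += a[i][j] * x[j]
    return y


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(systolic_matvec(A, [1, 0, -1]))
```

The outer loop is the only sequential part; the inner loop models operations that a systolic array performs simultaneously in space.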
General Stream-based Computing System
heterogeneous Array of rDPUs (reconf. data path units)
[Figure: Kress DPSS [1995] in four numbered parts: an expression tree (operators +, *, sh, xf, - on x, a, y) and the DPU architectures feed the Mapper (simultaneous placement & routing by simulated annealing) and the Scheduler (data streams), yielding a free form pipe network in time and space]
The same mapper for both: reconfigurable, or hardwired
flowware defines ....
... which data item at which time at which port
[Figure: DPA with input data streams and output data streams, each drawn as data items (x) and gaps (-) over time and port #]
flowware history:
• 1980: data streams (Kung, Leiserson)
• 1995: super systolic rDPA (Kress)
• 1996+: SCCC (LANL), SCORE, ASPRC, Bee (UCB), ...
(tutorials and courses available on all this)
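The statement that flowware defines "which data item at which time at which port" can be pictured as a schedule of (time, port, item) triples. The sketch below assumes a skewed wavefront in which port p starts p time steps late, as is common for systolic arrays; the function name and the skew are illustrative assumptions, not taken from the talk.

```python
def flowware_schedule(streams):
    """Return sorted (time, port, item) triples for the given input streams.

    streams[p] lists the data items entering port p, in order.
    Assumption: port p starts p time steps late (skewed wavefront).
    """
    schedule = []
    for port, items in enumerate(streams):
        for k, item in enumerate(items):
            schedule.append((port + k, port, item))
    return sorted(schedule)


if __name__ == "__main__":
    for time, port, item in flowware_schedule([["a0", "a1"], ["b0", "b1"]]):
        print(f"t={time}  port={port}  item={item}")
```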
control-procedural vs. data-procedural
The structural domain is primarily data-stream-based: Flowware
..... mostly not yet modelled that way: most flowware is hidden by its indirect instruction-stream-based implementation
Flowware provides a (data-)procedural abstraction from the (data-stream-based) structural domain
Flowware converts “procedural vs. structural” into “control-procedural vs. data-procedural” ...
... a Trojan horse to introduce the structural domain to the procedural mind set of programmers
Configware / Flowware Compilation
[Figure: a high level source program passes through a wrapper into an intermediate form; the mapper produces configware for the rDPA (r. Data Path Array); the scheduler produces flowware for the data sequencers (address generators) of the asM memory banks (M), which exchange data streams with the rDPA]
students should know that P & R is also a compilation technique
>> the anti machine paradigm <<
Why a dichotomy of machine paradigms?
[Figure, stolen from Bob Colwell: vN: unbalanced; CPU, caches, ..., vN bottleneck]
data stream machine:
• bad message: caches do not help
• good message: no vN bottleneck
• caches not needed
The anti machine has no von Neumann bottleneck
Terminology: DPU versus CPU ...
• DPU: data path unit
• DPA: DPU array
• GA: gate array
• rDPU: reconfigurable DPU
• rDPA: reconfigurable DPA
• rGA: reconfigurable GA
• DPU is no CPU: there is nothing central - like in a DPA
[Figure: a CPU is a DPU plus an instruction sequencer; an (r)DPA is an array of (r)DPUs]
Machine paradigms
[Figure: von Neumann instruction stream machine: CPU = DPU + instruction sequencer, with memory (M) and I/O, driven by Software (instruction stream). Data-stream machine: DPU or rDPU, or (r)DPA, with I/O, fed by data streams from asM** memory banks (M) of a distributed memory architecture*, each with a data address generator (data sequencer), driven by Flowware (and Configware if reconf.)]
• CPU: + instruction stream, - data stream
• DPU: - instruction stream, + data stream
*) the new discipline came just in time: see Herz et al.: Proc. IEEE ICECS 2002
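The contrast between the two paradigms can be sketched in software: a von Neumann machine is sequenced by a program counter and fetches an instruction for every operation, whereas the anti machine is sequenced by data counters (address generators) feeding a data path that was configured beforehand. The toy instruction set, the function names and the example computation (3·x + 1) are all assumptions made for this illustration.

```python
def make_program(n, src, dst):
    """Unrolled toy program: memory[dst+i] = 3 * memory[src+i] + 1."""
    program = []
    for i in range(n):
        program += [("load", src + i), ("mul", 3), ("add", 1), ("store", dst + i)]
    program.append(("halt", None))
    return program


def run_von_neumann(program, memory):
    """Instruction-stream machine: the program counter sequences instructions.
    Returns the number of instruction fetches (including the final halt)."""
    pc, acc, fetches = 0, 0, 0
    while True:
        op, arg = program[pc]
        fetches += 1
        if op == "halt":
            return fetches
        if op == "load":
            acc = memory[arg]
        elif op == "mul":
            acc *= arg
        elif op == "add":
            acc += arg
        elif op == "store":
            memory[arg] = acc
        pc += 1


def run_anti_machine(dpu, memory, src_addresses, dst_addresses):
    """Data-stream machine: data counters sequence the data;
    the DPU is configured before run time, so nothing is fetched per operation."""
    for src, dst in zip(src_addresses, dst_addresses):
        memory[dst] = dpu(memory[src])


if __name__ == "__main__":
    mem_vn = [1, 2, 3, 0, 0, 0]
    print(run_von_neumann(make_program(3, 0, 3), mem_vn), mem_vn)
    mem_anti = [1, 2, 3, 0, 0, 0]
    run_anti_machine(lambda x: 3 * x + 1, mem_anti, range(0, 3), range(3, 6))
    print(mem_anti)
```

Both runs produce the same memory contents; the difference is that only the first one spends memory cycles on an instruction stream.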
>> distributed memory <<
Processor Memory Performance Gap
[Figure: performance (1 to 1000, log scale) vs. year (1980 to 2000); CPU (µProc): 60%/yr., DRAM: 7%/yr.]
Processor-Memory Performance Gap: grows 50% / year
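The quoted 50% / year follows directly from the two growth rates, since 1.60 / 1.07 ≈ 1.495. A short check, using only the figures from the slide (the function names are illustrative):

```python
CPU_GROWTH = 1.60   # µProc: 60% per year (from the slide)
DRAM_GROWTH = 1.07  # DRAM: 7% per year (from the slide)


def gap_growth(cpu=CPU_GROWTH, dram=DRAM_GROWTH):
    """Annual growth factor of the processor-memory performance gap."""
    return cpu / dram


def gap_after(years, cpu=CPU_GROWTH, dram=DRAM_GROWTH):
    """Accumulated gap after `years`, starting from equal performance."""
    return gap_growth(cpu, dram) ** years


if __name__ == "__main__":
    print(f"gap grows by {100 * (gap_growth() - 1):.1f}% per year")
    print(f"gap after 20 years: about {gap_after(20):.0f}x")
```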
Just in time
The new distributed memory discipline: just in time to implement the anti machine.
key issues: power and performance optimization
M. Herz et al. (invited): Memory Organization for Data-Stream-based Reconfigurable Computing; Proc. ICECS 2002
address generators for Flowware execution
[Figure: rDPA (r. Data Path Array) surrounded by asM memory banks (M), each with an address generator, exchanging data streams]
Distributed Memory
Just in time: a new research area: Application-specific distributed memory, e. g. book by F. Catthoor et al. ...
Data address generators - 20 years research
SA: scrambling and descrambling the data ?
Significance of Address Generators
• Address generators have the potential to reduce computation time significantly.
• In a grid-based design rule check a speed-up of more than 2000 has been achieved, compared to a VAX-11/750
• Dedicated address generators contributed a factor of 10 - avoiding memory cycles for address computation overhead
Smart Address Generators
1983 The Structured Memory Access (SMA) Machine
1984 The GAG (generic address generator)
1989 Application-specific Address Generator (ASAG)
1990 The slider method: GAG of the MoM-2 machine
1991 The AGU
1994 The GAG of the MoM-3 machine
1997 The Texas Instruments TMS320C54x DSP
1997 Intersil HSP45240 Address Sequencer
1999 Adopt (IMEC)
Adopt (from IMEC)
• cMMU synthesis environment:
• application-specific ACUs for array index reference
• ACU as a counter modified by multi-level logic filter
• ACU with ASUs from a Cathedral-3 library
• distributed ACU alleviates interconnect overhead (delay, power, area)
• nested loop minimization by algebraic transformations
• AE splitting/clustering
• AE multiplexing to obtain interleaved ASs
• other features
Abbreviations: customized MMU (cMMU); Address Expression (AE); Address Sequence (AS); Address Calculation Unit (ACU); Application-Specific Unit (ASU)
Synthesizable distributed memory architecture
[Figure: Compiler and Scheduler deliver rDPA “instructions” and drive the Sequencers (data stream generators), i. e. the address generators for the anti machine, each attached to a memory bank of the as Memory (data memory)]
>> architectural resources <<
GAG generic address generator Scheme
[Figure: a GAU consists of a Base Slider (B0, B), a Limit Slider (L0, L) and an Address Stepper (A, ΔA) stepping between base "[" and limit "]"; all 3 are copies of the same BSU* stepper circuit]
*) Basic Slider Unit
GAG Slider Model
[Figure: GAU (Generic Address Generator Unit) built from Limit Stepper, Base Stepper and Address Stepper (B0, ΔA, L0, A); the sliders move the base B (floor, starting at B0) and the limit L (ceiling, starting at L0), and the address A steps by ΔA between them]
GAG: Address Stepper
[Figure: GAG = Generic Address Generator. The Address Stepper adds or subtracts (+ / –) the step vector ΔA to the Address A between Base B and Limit L; inputs: Limit, Base, stepVector, maxStepCount, init, tag; blocks: Step Counter, Escape Clause, End Detect (= 0); output: endExec]
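The behaviour of an address stepper can be approximated in a few lines: starting at the base, it repeatedly adds the step ΔA and stops when the address leaves the range between base and limit or when a maximum step count is reached. This is a simplified software sketch built from the labels on the slide (Base, Limit, stepVector, maxStepCount); the escape clause and the slider hardware are not modelled, and the function name is hypothetical.

```python
def address_stepper(base, limit, step, max_steps=None):
    """Yield base, base + step, ... while the address stays between base and limit.

    max_steps plays the role of maxStepCount; None means "no step limit".
    """
    if step == 0 and max_steps is None:
        raise ValueError("a zero step needs max_steps to terminate")
    floor, ceiling = min(base, limit), max(base, limit)
    address, count = base, 0
    while floor <= address <= ceiling and (max_steps is None or count < max_steps):
        yield address
        address += step
        count += 1


if __name__ == "__main__":
    print(list(address_stepper(0, 10, 3)))               # ascending scan
    print(list(address_stepper(10, 0, -4)))              # descending scan
    print(list(address_stepper(0, 100, 5, max_steps=3)))  # stopped by the step count
```

Because the addresses come from a dedicated generator, no memory cycles of the data path are spent on address computation, which is the point made on the "Significance of Address Generators" slide.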
Generic Sequence Examples
[Figure: example address sequences a) to g), generated by a GAG with Limit Slider, Base Slider and Address Stepper (B0, ΔA, L0, A)]
GAG Slider Operation Demo Example
[Figure: address range in the x / y plane between floor (F) and ceiling (C), with base B (from B0), limit L (from L0) and address A]
3-by-3 tile JPEG zigzag scan pattern example
implementation of a JPEG zigzag tile
[Figure: constant sliders]
zigzag tile rotated 45°
rotated zigzag tile scan pattern implementation
[Figure: sliding sliders]
3-by-3 tile rotated JPEG zigzag scan pattern example
[Figure: higher level slider]
GAG Complex Sequencer Implementation
[Figure: a GAG (Generic Address Generator) composed of several GAUs, each with Limit Slider, Base Slider and Address Stepper (B0, ΔA, L0, A), plus SDS, VLIW stack and controller]
GAG Complex Sequencer Implementation (2)
[Figure: six GAUs, each with Limit Slider, Base Slider and Address Stepper (B0, ΔA, L0, A)]
instruction stream-based Compilation Principles
[Figure: source text, parser, scheduler, library, link/load; instruction call placement in a 1-D memory space: execution order by location]
Antimachine: MoM architecture
[Figure: 2-D memory space (x, y) with handle positions and a scan window; example scan pattern]
• scan pattern (high level sequencing): the Handle Position Generator (y-GAG, x-GAG) delivers the handle position
• intra scan window accesses (low level sequencing): Scan Window Generator
• memory accesses go to bank 0, 1, ..., n
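The two sequencing levels can be sketched as two nested generators: a high level one that moves the handle of the scan window along a scan pattern, and a low level one that enumerates the accesses inside the window. The line-by-line scan pattern and the modulo mapping of addresses onto memory banks are assumptions chosen for this illustration; the slides do not specify them.

```python
def handle_positions(width, height, win_w, win_h, step=1):
    """High level sequencing: line-by-line scan of handle positions (x, y)."""
    for y in range(0, height - win_h + 1, step):
        for x in range(0, width - win_w + 1, step):
            yield (x, y)


def window_accesses(handle, win_w, win_h):
    """Low level sequencing: all (x, y) accesses inside one scan window."""
    hx, hy = handle
    for dy in range(win_h):
        for dx in range(win_w):
            yield (hx + dx, hy + dy)


def bank_and_offset(x, y, width, n_banks):
    """Assumed interleaving: consecutive linear addresses go to consecutive banks."""
    linear = y * width + x
    return linear % n_banks, linear // n_banks


if __name__ == "__main__":
    for handle in handle_positions(4, 3, 3, 3):
        accesses = [bank_and_offset(x, y, 4, 2) for x, y in window_accesses(handle, 3, 3)]
        print(handle, accesses)
```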
simple MoM* anti machine architecture
[Figure: RAM, Smart memory interface with Scan Window, rDPA]
*) map-oriented machine
MoM anti machine architecture
[Figure: distributed memory banks, Smart memory interface with several scan windows, rDPA]
Linear Filter Application
[Figure: 3-by-3 scan window moving by one scan step; per-pixel read (r), write (w) and read/write (r/w) accesses, distributed over Bank a and Bank b]
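As a plain software reference for what the scan window computes, the following sketch applies a k-by-k linear filter: at every handle position it reads the window (the r accesses) and writes one result (the w access). Copying the border pixels unchanged and writing into a separate output image are simplifying assumptions; the memory bank assignment shown on the slide is not modelled.

```python
def linear_filter(image, kernel):
    """Apply a square k-by-k kernel (k odd) to a 2-D list of numbers.

    Border pixels, where the window would not fit, are copied unchanged.
    """
    height, width = len(image), len(image[0])
    k = len(kernel)
    r = k // 2
    out = [row[:] for row in image]
    for y in range(r, height - r):          # scan pattern over handle positions
        for x in range(r, width - r):
            acc = 0
            for dy in range(k):             # intra scan window accesses (reads)
                for dx in range(k):
                    acc += kernel[dy][dx] * image[y + dy - r][x + dx - r]
            out[y][x] = acc                 # one write per scan step
    return out


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
    print(linear_filter(image, ones))
```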
Scanline unrolling
[Figure: scan window access pattern (r, r/w) after scan line unrolling]
90° Rotation of Scan Pattern
[Figure: read (r), write (w) and read/write (r/w) accesses of the rotated scan pattern on Bank a and Bank b, with the scan window overlap area]
Linear Filter Application
Parallelized Merged Buffer Linear Filter Application with example image of x=22 by y=11 pixel
[Figure: design stages shown: initial design; after scan line unrolling; after inner scan line loop unrolling; hardw. level access optim.; final design]
Storage scheme manipulation by scan pattern transformations
[Figure: panel c): data a, b and a', b' distributed over memory bank no. 0, no. 1, no. 2 and no. 3]
CGFFT: Nested and Parallel Scan Pattern
outer loop scan pattern:
HLScan is 3 steps [2, 0]
inner loop compound scan patterns (3 in parallel):
SP1 is 7 steps [0, 2]
SP23 is 7 steps [0, 1]
CGFFT: Parallel Scan Pattern Animation
Scan window in real-time image processing (e. g. automotive)
>>> final remarks
Antimatter Search ?
in EE & CS we do not need to search
What is the trend ?
• vN is needed for embedded systems, OS, compilers, Sauerkraut software, non-performance-critical applications, others ….
• vN is obsolete for massive parallelism, except some special application areas
• Anti machine is the way to go for massive parallelism, also data-intensive applications
• Morphware is the way for high performance with short product life cycles, unstable standards
• Data-stream-based Computing is heading for mainstream
– 1979 “data streams” (Kung / Leiserson)
– 1997 SCCC (LANL) Streams-C Configurable Computing
– SCORE (UCB) Stream Computations Organized for Reconfigurable Execution
– ASPRC (UCB) Adapting Software Pipelining for Reconfigurable Computing
– 2000 Bee (UCB), ...
– Most stream-based multimedia systems, etc.
– Many other areas ....
>> final remarks <<
The Situation in Computing Sciences
• Computing Sciences are in a severe crisis
• New fundamentals and R&D directions are inevitable
• All knowledge needed is readily available ...
• ... even from Computing Sciences
• But curricula are obsolete and have to be upgraded
• Silicon application and EDA provide useful concepts
• Reconfigurable Computing has the remedy
roadmap
old CS lab course philosophy: given an application: implement it by a program
new CS freshman lab course environment: given an application:
a) implement it by writing a program
b) implement it as a morphware prototype
c) partition it into P and Q
c.1) implement P by software
c.2) implement Q by morphware
c.3) implement P / Q communication interface
Algorithms and Data Structures
... have to go beyond pointers, queues, and stacks
Extend by including
• algorithmic issues in software / morphware / hardware migration
• additional levels of parallelism: chaining, pipelining, systolic, super-systolic, wavefront arrays
• additional data structures and storage organization: the new distributed memory discipline
Computer Organization / Architecture
... have to go beyond von Neumann
Extend by including
• nested machines, address generators
• the anti machine paradigm
• extended taxonomy of platforms: procedural, structural, hardwired, reconfigurable, hybrid systems
Languages and Compilers
... have to go beyond von Neumann
Extend by including
• Configware / flowware compilers
• Procedural / structural co-compilers
• (data-procedural) flowware languages
Conclusion: all knowledge needed is available
•machine paradigm
•anti machine architectural resources
•sequencing methodology: hw & sw
•parallel memory IP core and module generator vendors
•anything else needed
•compilation techniques
•hw / sw partitioning methodology
•languages
thank you for your patience
>>> END <<<
JPEG zigzag scan pattern
*> Declarations
EastScan is
  step by [1,0]
end EastScan;
SouthScan is
  step by [0,1]
end SouthScan;
NorthEastScan is
  loop 8 times until [*,1]
    step by [1,-1]
  endloop
end NorthEastScan;
SouthWestScan is
  loop 8 times until [1,*]
    step by [-1,1]
  endloop
end SouthWestScan;
HalfZigZag is
  EastScan
  loop 3 times
    SouthWestScan
    SouthScan
    NorthEastScan
    EastScan
  endloop
end HalfZigZag;
goto PixMap[1,1]
HalfZigZag;
SouthWestScan
uturn (HalfZigZag)
[Figure: pixel map (x, y) with the zigzag path in numbered parts 1 to 4 (HalfZigZag), each driven by a data counter]
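The scan pattern declarations above can be replayed in software to check that they cover an 8-by-8 pixel map exactly once. The sketch below mirrors EastScan, SouthScan, NorthEastScan, SouthWestScan and HalfZigZag as Python functions. Two points are assumptions: the "until" condition is tested before each step, and uturn (HalfZigZag) is interpreted as replaying the steps of HalfZigZag in reverse order.

```python
E, S, NE, SW = (1, 0), (0, 1), (1, -1), (-1, 1)  # step vectors [dx, dy]


def scan(pos, vector, times=1, until=None):
    """Take up to `times` steps of `vector`; stop early once until(pos) holds.
    Returns the new position and the list of steps actually taken."""
    steps = []
    for _ in range(times):
        if until is not None and until(pos):
            break
        pos = (pos[0] + vector[0], pos[1] + vector[1])
        steps.append(vector)
    return pos, steps


def east_scan(pos):
    return scan(pos, E)


def south_scan(pos):
    return scan(pos, S)


def north_east_scan(pos):   # loop 8 times until [*,1] step by [1,-1]
    return scan(pos, NE, 8, lambda p: p[1] == 1)


def south_west_scan(pos):   # loop 8 times until [1,*] step by [-1,1]
    return scan(pos, SW, 8, lambda p: p[0] == 1)


def half_zigzag(pos):
    pos, steps = east_scan(pos)
    for _ in range(3):
        for part in (south_west_scan, south_scan, north_east_scan, east_scan):
            pos, taken = part(pos)
            steps = steps + taken
    return pos, steps


def zigzag_path(start=(1, 1)):
    """Positions visited by: goto [1,1]; HalfZigZag; SouthWestScan; uturn(HalfZigZag)."""
    pos, half = half_zigzag(start)
    pos, diagonal = south_west_scan(pos)
    # Assumption: uturn(X) replays the steps of X in reverse order.
    steps = half + diagonal + half[::-1]
    path = [start]
    for dx, dy in steps:
        x, y = path[-1]
        path.append((x + dx, y + dy))
    return path


if __name__ == "__main__":
    path = zigzag_path()
    print(len(path), path[:6], path[-1])
```

Under these assumptions the path starts at [1,1], ends at [8,8] and visits each of the 64 positions once, in the usual JPEG zigzag order.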
Storage scheme optimization: scanline unrolling
[Figure: summary of the MoM anti machine architecture (handle positions, scan window, scan window generator, scan pattern as high level sequencing, intra scan window accesses as low level sequencing) and of the Linear Filter Application storage scheme on Bank a and Bank b: initial design, after scan line unrolling, after inner scan line loop unrolling, hardw. level access optim., final design; 90° rotated scan pattern with scan pattern overlap]