View
247
Download
1
Category
Preview:
Citation preview
Reconfigurable Computing
Reconfigurable Architectures
Chapter 3.1
Prof. Dr.-Ing. Jürgen Teich
Lehrstuhl für Hardware-Software-Co-Design
Reconfigurable Computing
Gerald Estrin Fix-Plus Machine
Vision of a restructurable computer system
Pragmatic problem studies predict gains in computation
speeds in a variety of computational tasks when executed
on appropriate problem-oriented configurations of the
variable structure computer. The economic feasibility of
the system is based on utilization of essentially the same
hardware in a variety of special purpose structures. This
capability is achieved by programmed or physical
restructuring of a part of the hardware.
G. Estrin, B. Bussel, R. Turn, J Bibb (UCLA 1963)
Reconfigurable Computing
3
Gerald Estrin Fix-Plus Machine
Fixed plus Variable
structure computer Proposed by G. Estrin in 1959
Consist of three parts A high speed general purpose
computer (the fix part F).
A variable part (V) consisting of various size high speed digital substructureswhich can be reorganized in problem-oriented special purpose configurations.
The supervisory control (SC) coordinates operations between the fix module and the variable module.
Speed gain over IBM7090 (2.5 to 1000)
Reconfigurable Computing
4
Source: G. Estrin et al., Parallel Processing in a
Restructurable Computer System, IEEE
Transactions on Electronic Computers,
vol 12, no 5, pp. 747-755, 1963.
Gerald Estrin Fix-Plus Machine
The Fixed Part (F) Was initially an IBM 7090, but could be any general purpose computer
The Variable Part (V) Made upon a set of problem-specific optimized functional units in the
basic configuration (trigonometric functions, logarithm, exponentials, n-th
power, roots, complex arithmetic, hyperbolic, matrix operation)
Reconfigurable Computing
5
Two types of basic building blocks The first basic element contains four
amplifiers and associated input logic for signal inversion, amplification, or high-speed storage
The second basic block consists of ten diodes and four output drivers and is for combinatoric application
The basic blocks
Source: G. Estrin et al., Parallel Processing in a
Restructurable Computer System, IEEE Transactions
on Electronic Computers, vol 12, no 5, pp. 747-755, 1963.
Gerald Estrin Fix-Plus Machine
Reconfigurable Computing
6
*The mother board
*The wiring harness
The basic modules can be inserted into any of 36 positions on a mother board.
The connection between the modules is
established through a wiring harness
Function Reconfiguration means changing
some modules
Routing Reconfiguration means changing
parts of the wiring harness
*Source: G. Estrin et al., Parallel Processing in a
Restructurable Computer System, IEEE Transactions
on Electronic Computers, vol 12, no 5, pp. 747-755, 1963.
Gerald Estrin Fix-Plus Machine
Reconfigurable Computing
7
Estrin at work.
Substantial effort on
manual reconfiguration
Source: C. Bobda, Introduction to Reconfigurable Computing, Springer, 2007.
The Rammig Machine
Goal
Investigation of a system, which, with no manual or
mechanical interference, permits the building, changing,
processing and destruction of real (not simulated) digital
hardware
Franz J. Rammig (University of Dortmund 1977)
The concept resulted in the construction of a
hardware editor
Useful to observe a circuit under test
(Hardware Emulation)
Reconfigurable Computing
8
The Rammig Machine
Implementation Outputs of modules connected to
selectors and selector outputs connected to module inputs.
Software-controlled module interconnection
Two main problems to solve: Because the circuit is not hard-wired, a
distortion of the behaviour is possible during reconfiguration
The timing is controlled by the circuit instead of being dictated by an observation mechanism
A time-control must therefore be provided by delay circuits and inertial-delay circuits
Reconfigurable Computing
9
Source: F.J. Rammig, A concept for the editing of
hardware resulting in an automatic hardware-editor,
Design Automation Conference, pp. 187-193, 1977.
PALs and PLAs
• Pre-fabricated building block of many AND/OR gates (or NOR, NAND)
• "Personalized" by making or breaking connections between the gates
Reconfigurable Computing
11
Programmable Array Block Diagram for Sum of Products Form
Inputs
Dense array of AND gates Product
terms
Dense array of OR gates
Outputs
Reconfigurable Computing
12
Example:
Equations
Personality Matrix
Key to Success: Shared Product Terms
1 = asserted in term0 = negated in term- = does not participate
1 = term connected to output0 = no connection to output
Input Side:
Output Side:
Reuse of
terms
F 1
1
0
1
0
0
Outputs Inputs Product t erm A
1
-
1
-
1
B
1
0
-
0
-
C
-
1
0
0
-
F 0
0
0
0
1
1
F 2
1
0
0
1
0
F 3
0
1
0
0
1
A B
B C
A C
B C
A
F0 = A + B CF1 = A C + A BF2 = B C + A BF3 = B C + A
PALs and PLAs
PALs and PLAs
Reconfigurable Computing
13
Example Continued - Unprogrammed device
All possible connections are availablebefore programming
A B C
F0 F1 F2 F3
PALs and PLAs
Reconfigurable Computing
14
Example Continued -Programmed part Unwanted connections are "blown"
Note: some array structureswork by making connectionsrather than breaking them
A B C
F0 F1 F2 F3
AB
BC
AC
BC
A
PALs and PLAs
Reconfigurable Computing
15
Alternative representation for high fan-in structures
Short-hand notationso we don't have todraw all the wires!
x at junction indicatesa connection
Notation for implementation
F0 = A B + A B
F1 = C D + C D
A B C D
AB+AB CD+CD
AB
CD
CD
AB
Unprogrammed device
Programmed device
PALs and PLAs
Reconfigurable Computing
16
Design Example
F1 = A B C
F2 = A + B + C
F3 = A B C
F4 = A + B + C
F5 = A B C
F6 = A B C
Multiple functions of A, B, C
F1 F2 F3 F4 F5 F6
ABC
A
B
C
A
B
C
ABC
ABC
ABC
ABC
ABC
ABC
ABC
A B C
PALs and PLAs
Reconfigurable Computing
17
What is difference between Programmable Array Logic (PAL) andProgrammable Logic Array (PLA)?
PAL concept — implemented by Monolithic MemoriesAND array is programmable, OR array is fixed at fabrication
A given column of the OR arrayhas access to only a subset ofthe possible product terms
PLA concept — Both AND and OR arrays are programmable
PALs and PLAs
Reconfigurable Computing
18
Design Example: BCD to Gray Code Converter
Truth Table
K-maps
Minimized Functions:
A 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
B 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
C 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
D 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
W 0 0 0 0 0 1 1 1 1 1 X X X X X X
X 0 0 0 0 1 1 0 0 0 0 X X X X X X
Y 0 0 1 1 1 1 1 1 0 0 X X X X X X
Z 0 1 1 0 0 0 0 1 1 0 X X X X X X
AB
CD 00 01 11 10
00
01
11
10
D
B
C
A
0 0 X 1
0 1 X 1
0 1 X X
0 1 X X
K-map for W
AB
CD 00 01 11 10
00
01
11
10
D
B
C
A
0 1 X 0
0 1 X 0
0 0 X X
0 0 X X
K-map for X
AB
CD 00 01 11 10
00
01
11
10
D
B
C
A
0 1 X 0
0 1 X 0
1 1 X X
1 1 X X
K-map for Y
AB
CD 00 01 11 10
00
01
11
10
D
B
C
A
0 0 X 1
1 0 X 0
0 1 X X
1 0 X X
K-map for Z
W = A + B D + B CX = B CY = B + CZ = A B C D + B C D + A D + B C D
PALs and PLAs
Reconfigurable Computing
19
Programmed PAL:
4 product terms per each OR gate
Minimized Functions:
W = A + B D + B CX = B CY = B + CZ = A B C D + B C D + A D + B C D
A B C D
A B C D
A
BD
BC
0
0
0
0
B
C
0
0
BC
BCD
AD
BCD
W X Y Z
W = A + B D + B CX = B CY = B + CZ = A B C D + B C D + A D + B C D
Complex Programmable Logic Devices
• Complex PLDs (CPLD) typically combine PAL combinational logic with Flip Flops
– Organized into logic blocks connected in an interconnect
matrix
– Combinational or registered output
• Usually enough logic for simple counters, state
machines, decoders, etc.
• CPLDs logic is not enough for complex operations
• FPGAs have much more logic than CPLDs
• e.g., Xilinx Coolrunner II, etc.
Reconfigurable Computing
20
Xilinx Coolrunner CPLD
Reconfigurable Computing
21
Function Block Interconnection matrixInterconnection matrix
Macrocells for data storage
Macrocells for data storage
Source: Xilinx, Inc. DS090: CoolRunner-II CPLD Family, 2008
Field Programmable Gate Arrays (FPGAs)
Introduced in 1985 by Xilinx
Roughly seen, an FPGA consists of: A set of programmable macro cells
A programmable interconnection network
Programmable input/outputs
Subparts of a (complex) function are implemented
in macro cells which are then connected to build
the complete function
The I/O can be programmed to drive the macro
cell's inputs or to be driven by the macro cell's
outputs
Unlike traditional application-specific integrated
circuit (ASIC), function is specified by the user
after the device is manufactured
Physical structure and programming method is
vendor-dependent
Reconfigurable Computing
22
Programmable
macro cell
Programmable I/O
Programmable routing
FPGA Structure
Typical organization
Symmetrical Array
2 D array of processing elements (PE)
embedded in an interconnection
network
Interconnection points at the
horizontal-vertical intersection
Row based
Rows of Processing elements
Horizontal routing via horizontal
channels
Channels divided in segments
Vertical connections via dedicated
vertical tracks (not on the graphic)
Reconfigurable Computing
23
Symmetrical Array
Row-based
FPGA Structure
Typical organization (cont)
Sea of gates
2 D array of processing elements
No space left aside the PEs for
routing
Connection is done on a separate
layer on top of the cells
Hierarchical
Hierarchically placed Macro cells
Low-level macro cells are grouped to
build the higher-level PEs
Reconfigurable Computing
24
Sea of Gates
Hierarchical
FPGA Programming Technologies
SRAM (LUT-based)
An SRAM is used to store all possible values of a function
Value of a function for a given input is retrieved using the inputs
as SRAM-address
SRAM implementing a function is
called a look-up table (LUT)
A new function is implemented by
writing new values into the LUT
SRAM-based FPGA can
therefore be reprogrammed (configured) on the fly
Since a LUT is volatile, a LUT configuration is lost when
switching off the system
Reconfigurable Computing
25
FPGA Programming Technologies
Anti-fuse An anti-fuse normally presents a high-impedance state
can be “fused” into a low-impedance state when
programmed by a high voltage
The anti-fuse used in each of type of FPGA from
different company differs in construction
small area
lower resistance and parasitic capacitance than
transistors
-> reduce delays in routing
No re-programming possible
Reconfigurable Computing
26
FPGA Programming Technologies
Poly-diffusion Anti-fuse: ACTEL PLICE programmable low-impedance circuit
element
Poly-silicon terminal
Oxide-Nitride-Oxide dielectric
Melting the dielectric establish connection
Metal Anti-fuse:Q-Logic Vialink 2 Metal terminal layers (Titanium-
Tungsten)
Programming points isolated by
amorphous Silicon film
Reconfigurable Computing
27
Source: Microsemi Corp.
Source: Quicklogic Corp.
FPGA Programming Technologies
EEPROM (Flash) The same technology as that
used in EPROM and EEPROM memories.
EPROMs can be erased, but only as a whole.
EEPROM can be selectively re-programmed in-circuit.
EPROM's resistors consume static power.
EEPROM requires more chip area and multiple voltage sources.
Reconfigurable Computing
28
LUT LUTs are used as function generators in
SRAM-based FPGAs
A k-inputs LUT can implement up to different functions
A k-input LUT has 2k SRAM locations
A function is implemented by writing all possible values that the function can take in the LUT
The inputs values are used to address the LUT and retrieve the value of the function corresponding to the input values
k22
Reconfigurable Computing
29
a XOR b a b
0 0 0
0 1 1
1 0 1
1 1 0
0110
ab a xor b
LUT
FPGA Function generators
LUT-Realization
A LUT is basically a multiplexer that evaluates the truth
table stored in the configuration SRAM cells (can be seen
as a one bit wide ROM).0
A0
A1
A2
A3
1
2
3
4
5
6
7
... ...
...
...
...
tradeoff
between
power
and
speed
config
data
config
enable
Vdd
optimized for
lowest power
L
configuration SRAM cell
0
1
F
0
1
F
FF
A0
A1
A2
A3
L
Q
configura-tion cell
stateflip-flop
connection toswitch matrix
FPGA Function generators
30
Reconfigurable Computing
Source: All images from Xilinx, Inc. Virtex-II Platform FPGA Handbook.
Configuration examples
FPGA Function generators
31
Reconfigurable Computing
A3 A2 A1 A0
0: 0 0 0 0 0 0 0 0 0
1: 0 0 0 1 0 0 0 1 0
2: 0 0 1 0 0 0 1 0 0
3: 0 0 1 1 0 1 1 1 0
4: 0 1 0 0 0 1 0 0 0
5: 0 1 0 1 0 1 0 1 0
6: 0 1 1 0 0 1 1 0 0
7: 0 1 1 1 0 1 1 1 0
8: 1 0 0 0 0 0 0 0 1
9: 1 0 0 1 0 0 0 1 1
A: 1 0 1 0 0 0 1 0 1
B: 1 0 1 1 0 1 1 1 1
C: 1 1 0 0 0 1 1 0 1
D: 1 1 0 1 0 1 1 1 1
E: 1 1 1 0 0 1 1 0 1
F: 1 1 1 1 1 1 1 1 1
FPGA Function generators
LUT Example: Implement the function
using:
2-input LUTs
3-input LUTs
4-input LUTs
Reconfigurable Computing
32
AF = ABD + BC BCD +
AB
DB
CD
A
B
C
F
ABD
BCD
ABC
CD
AB
FF
• On real FPGAs: a cluster of LUTs per switch matrix (e.g., eight LUTs and switch matrix form a configurable logic block on Xilinx FPGAs)
clock
0
1
F
0
1
F
A0
A1
A2
A3
FF
clock
0
1
F
0
1
F
A0
A1
A2
A3
FF
clock
clock
0
1
F
0
1
F
A0
A1
A2
A3
FF
clock
0
1
F
0
1
F
A0
A1
A2
A3
FF
clock
...
possible configuration
switch
matrix
LUT
switch matrixmultiplexer
enable
data_in
data_out
I/O element
I/O
pad
configuration
SRAM cell
FPGA Fabric
33
Reconfigurable Computing
• The routing is implemented using pass transistor logic:
• If we only count transistors, it costs more transistors for the
configuration SRAM cells than for the routing logic
0 1 0 0 00 1
0
1
F
A0
...
...
...
0
1
F
A1
A2
A3
FF
clock
LUT
switch matrix multiplexer
a) b)
FPGA Routing
34
Reconfigurable Computing
The ACTEL ACT3 Family (row-based)
35
Reconfigurable Computing
Row-based FPGA
Module rows separated by routing channels
MUX-based macro-cells
C-Module
4x1 MUX + 1 OR + 1 AND
S-Module
4x1 MUX + 1 OR + 1 AND
1 Flip Flop
I/O placed aside the device
Source: All images from Microsemi Corp. Accelerator Series FPGAs – ACT3 Family, 2012.
The ACTEL ACT3 Family (row-based)
36
Reconfigurable Computing
Channels are composed of several segmented routing tracks
Minimum length = module pair width
Maximum length = row width
Long segment if segment width > 3
Connections are anti-fuse based
Horizontal-to-vertical (XF)
Horizontal-to-horizontal (HF)
Vertical-to-vertical (VF)
Fast vertical connection (FF)
Tracks for module inputs are segmented by pass transistor (inactive during normal operation)
Vertical inputs span the channels above and below
Source: All images from Microsemi Corp. Accelerator Series FPGAs – ACT3 Family, 2012.
The ACTEL ACT3 Family (row-based)
37
Reconfigurable Computing
Module outputs have dedicated channels which extend vertically to two channels above and two channels below, except at the bottom and the top
Source: Microsemi Corp. Accelerator Series FPGAs – ACT3 Family, 2012.
The Xilinx Virtex Family (symmetrical array)
Symmetrical-array Based FPGA Macro cells are configurable logic
block (CLBs), placed on line column intersection.
Additional modules exist:
Block RAM for internal use
Digital clock manager (DCM) for user specific clock frequency generation)
Embedded multiplier or DSP units
Global clock Multiplexers
Input output block (IOB) for off-chip communication
Reconfigurable Computing
38
Source: Xilinx, Inc. Virtex-II Platform: Complete Datasheet.
Macro cells are CLBs. A CLB
contains 2 identical slices on
Virtex 6
Reconfigurable Computing
39
2 slices are split in two
columns of 1 slices each
2 slices are summarized in one
column
Bottom-left corner of FPGA
The Xilinx Virtex Family (symmetrical array)
Source: All images from Xilinx, Inc. Virtex-II Platform: Complete Datasheet.
On “Virtex 6” 1 slice contains:
4x 6-inputs LUT
8x FF for storing LUT results
MUX to feed LUT either to a FF or the the output
Carry in and carry out help to construct fast adder circuits using neighbour CLBs
Reconfigurable Computing
40
The Xilinx Virtex Family (symmetrical array)
Source: Xilinx, Inc. Virtex-II Platform: Complete Datasheet.
The Xilinx Virtex Family (symmetrical array)
A CLB accesses the general routing matrix via a switch matrix
Fast connection lines are used for local connections
A switch matrix connects CLB terminal on the routing resource using multiplexers
4 horizontal resources per CLB for on-chip tri-state buses
Each CLB has two tri-state drivers (TBUF) that can drive on chip buses
Each TBUF has its own control pin and its own input pin
Newer Virtex-Devices uses AND-OR based logic for buses, i.e., timing is more predictable
Reconfigurable Computing
41
Source: All images from Xilinx, Inc. Virtex-
II Platform: Complete Datasheet.
The Xilinx Virtex Family (symmetrical array)
IOB for off-chip communication
Programmability allows the use of an IOB by any CLB.
Connection can be input, output or bidirectional.
6 IOB latches for double data rate (DDR) transmission.
One of the DDR registers can be used as input, output or tri-state.
DDR accomplished by the two registers on each path clocked by rising or falling edge from different clock nets.
The two clock signals generated by the DCM.
Reconfigurable Computing
42
Source: Xilinx, Inc. Virtex-II Platform: Complete
Datasheet.
The Actel ProAsic Family (sea-of-gates)
Sea-of-gates style (sea-of-tiles)
Macro cells are EEPROM based tiles
Four levels of hierarchical routing resources.
Local resource connects a tile to one of its 8 neighbours
Long-lines resource provides routing for long distance and high fan-out (spans 1, 2 or 4 tiles). Runs both horizontal and vertical
Very long-line resource spans the entire device
Global network (clocks, reset)
Reconfigurable Computing
43
Source: All images from Microsemi Corp.
ProAsicPLUS Flash Family FPGAs Datasheet.
The Altera Flex family (hierarchical)
Hierarchical-based FPGA
Logic elements (LE) are grouped into Logic array block (LAB), on the higher level
10 LE / LAB for the FLEX8000
LAB arranged as array on the device
Reconfigurable Computing
44
An LE contains:
1 4-input LUT
1 FF
carry-in, carry-out
MUX
additional logicSource: All images from Altera, Inc.
FLEX 10k Embedded Programmable
Logic Device Family Datasheet.
The Altera Flex family (hierarchical)
FastTrack interconnect provides on-chip routing resource
Connections among LEs and adjacent LABs via local interconnect signals
Connection inside each row of LAB is done by a dedicated row interconnect
Each column of LAB is served by a dedicated column interconnect.
LEs can drive the row or column channels
Column interconnect can drive row interconnect.
A signal from the column interconnect must be routed to the row interconnect before entering an LAB
LEs can drive global signals (clocks, reset, asynchronous clear, high fanout, etc.)
Reconfigurable Computing
45
Source: Altera, Inc. FLEX 10k Embedded Programmable Logic
Device Family Datasheet.
The Altera Flex family (hierarchical)
Programmable IO Element (IOE) allows on-chip and off-chip programmable communication
An IOE can be programmed as input, output or bidirectional.
IOE receives data from adjacent interconnect (can be driven by row or column interconnect)
IOE receives its chip enable (ce) from an adjacent LE.
One pin per output element (OE) -> possible open drain emulation
Open drain emulation is provided by:
Driving the data input low
Toggling the OE of each IOE
Reconfigurable Computing
46
Source: Altera, Inc. FLEX 10k Embedded Programmable Logic
Device Family Datasheet.
Hybrid FPGAs
The Xilinx Virtex 5 FXT
Basic structure: Virtex 5
Additional features:
Up to 2 hard-core embedded IBM power pc 440 RISC processors with up to 400 MHz
DSP48E Slices with 25 x 18 complement multiplication and MAC unit
Dual-ported RAM
Integrated Endpoint Block for PCI Express Compliance
Embedded high speed serial RocketIO multi-gigabit transceivers
Reconfigurable Computing
47
Source: Xilinx Inc.: Virtex-II Pro™ Platform FPGAs: Functional Description
Hybrid FPGAs
The Altera Excalibur
Specific features:
One ARM922T 32 bit RISC processor running at 200 MHz
Embedded multipliers
Internal single and dual-ported RAM and SDRAM controller
Expansion bus interface for Flash-RAM connection
Embedded SignalTap logic analyzer
Reconfigurable Computing
48
Source: Altera, Inc. Excalibur Device Overview Datasheet.
Tabula Spacetime Architecture
Reconfigurable Computing
49
Tabula’s 3D Architecture
8 configuration planes
Reconfiguration @ 1.6 GHz
Within netlist reconfiguration
(uses forwarding registers
called „time via“)
8 folds @ 1.6 GHz
200 MHz user clock
400 MHz user clock
Source: T. R. Halfhill, Tabula‘s Time Machine. In Microprocessor Report, Reed Electronic Group, 2010.
Tabula Spacetime Architecture
Example: 32- bit adder; a) conventional, b) time-multiplexed
FA
a4
b4
a5
b5
a6
b6
a7
b7
a0
b0
a1
b1
a2
b2
a3
b3
FAFAFA '0'FAFAFAFAFA
a12
b12
a13
b13
a14
b14
a15
b15
a8
b8
a9
b9
a10
b10
a11
b11
FAFAFAFAFAFAFA
s4
s5
s6
s7
s0
s1
s2
s3
s12
s13
s14
s15
s8
s9
s10
s11
a)
FA
a0
b0
a1
b1
a2
b2
a3
b3
'0'FAFAFA
s0
s1
s2
s3
a4
b4
a5
b5
a6
b6
a7
b7
FA
s4
s5
s6
s7
FA
a8
b8
a9
b9
a10
b10
a11
b11
FAFA
s8
s9
s10
s11
FA
b12
a13
b13
a14
b14
a15
b15
FAFAFA
s12
s13
s14
s15
FA FA FA
FA
a12
time forwarding latch
b)
mc1
mc2
mc3
mc4
x
y
t
basic logic element
micro
configuration
Reconfigurable Computing
50
Source: D. Koch, Partial Reconfiguration on FPGAs: Architecture, Tools and Applications, Springer, 2013.
Tabula Spacetime Architecture
Example: 32- bit adder; c) scheduling
Each micro configuration has its own set of SRAM cells
(which requires area on the die; savings are possible in the logic)
Rapid reconfiguration consumes power (millions of configuration bits)
better suitable for latest CMOS processes
(where static power dominates dynamic power)
c)clk
mc1
mc2
mc3
mc4
configuration
execution
microcycle
Reconfigurable Computing
51
Source: Koch, D. Partial Reconfiguration on FPGAs: Architecture, Tools and Applications, Springer, 2013.
Recommended