Savitribai Phule PUNE UNIVERSITY Reconfigurable Computing (504203) Module IV Notes
© All rights reserved - Not to be copied and Sold
Reconfigurable Devices for Rapid Prototyping, Frequently Reconfigurable systems
and Non-Frequently Reconfigurable systems
1. Rapid prototyping:
In this case, the reconfigurable device is used as an emulator for another digital device, usually an
ASIC.
An emulator in embedded systems refers to hardware/software that duplicates the functions of one embedded system (the guest) on another embedded system (the host), different from the first one, so that the emulated behaviour closely resembles the behaviour of the real system.
The emulation process allows one to test the functional correctness of the ASIC device to be produced, sometimes under real operating and environmental conditions, before production.
The reconfigurable device is only reconfigured to emulate a new implementation of the ASIC
device.
Example: APTIX-System Explorer and the ITALTEL Flexbench systems
2. Non-frequently reconfigurable systems:
The Reconfigurable device is integrated in a system where it can be used as an application-specific
processor.
Such systems are used as a prototyping platform, but can be used as a running environment as well.
These systems are usually stand-alone systems.
The reconfiguration is used for testing and initialization at start-up and for upgrading purposes.
The device configuration is usually stored in an EEPROM or Flash memory from which it is downloaded at start-up to configure the device.
No configuration happens during operation.
Example: The RABBIT System, the Celoxica, RC100, RC200, RC300
3. Frequently reconfigurable systems:
In this category of Reconfigurable Devices, the devices are frequently reconfigured.
These systems are usually coupled with a host processor, which is used to reconfigure the device
and control the complete system.
The reconfigurable device (RD) is used as an accelerator for time-critical parts of applications.
The host processor accesses the RD via function calls.
The reconfigurable part is usually a PCI board attached to the PCI bus, which is used for configuration and data exchange.
Example: The Raptor 2000, the Celoxica RC 1000, RC2000
Compile-time reconfiguration, Run-time reconfiguration
Depending on the time at which the reconfiguration sequences are defined, reconfigurable devices can be classified as:
A. Static / Compile-time Reconfigurable Devices: The Computation and Configuration sequences as
well as the Data exchange are defined at compile time and never change during a computation.
This approach is most interesting for devices which can only be fully reconfigured. However, it can also be applied to partially reconfigurable devices that are logically or physically partitioned into a set of reconfigurable bins.
B. Dynamic/Run-time Reconfigurable Devices: The computation and configuration sequences are not known at compile time. The request to implement a given task becomes known only at run-time and must be handled dynamically.
The reconfiguration process exchanges parts of the device's configuration to adapt the system to changing operational and environmental conditions.
Run-time reconfiguration is a difficult process that must handle side effects such as defragmentation of the device and communication between newly placed modules.
Runtime Reconfigurable Systems & Their Hardware
The management of the reconfigurable device is usually done by a scheduler and a placer that can
be implemented as part of an operating system running on a processor.
The processor can either reside inside or outside the reconfigurable chip.
The scheduler manages the tasks and decides when a task should be executed.
The tasks that are available as configuration data in a database are characterized by their
bounding box and their run-time. The bounding box defines the area that a task occupies on the
device.
The scheduler determines which task should be executed on the RPU and then gives the task to the
placer that will try to place it on the device, i.e. allocate a set of resources for the implementation
of that task.
If the placer is not able to find a site for the new task, then it will be sent back to the scheduler that
can then decide to send it later and to send another task to the placer. In this case, we say that the
task is rejected.
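The scheduler/placer interplay described above can be sketched in software. The sketch below is my own illustration, not an implementation from the notes: task names, sizes, and the first-fit search are all invented for the example. A task carries its bounding box; if the placer finds no free site, the task is rejected back to the scheduler.

```python
from collections import deque

class Placer:
    """Tracks free cells of the device and places rectangular bounding boxes."""
    def __init__(self, width, height):
        self.free = [[True] * width for _ in range(height)]

    def place(self, w, h):
        """Try to find a free w x h site (first fit); return its origin or None."""
        rows, cols = len(self.free), len(self.free[0])
        for r in range(rows - h + 1):
            for c in range(cols - w + 1):
                if all(self.free[r + dr][c + dc]
                       for dr in range(h) for dc in range(w)):
                    for dr in range(h):          # allocate the site
                        for dc in range(w):
                            self.free[r + dr][c + dc] = False
                    return (r, c)
        return None  # no site found: the task is rejected

def schedule(tasks, placer):
    """Hand each (name, width, height) task to the placer; collect rejections."""
    queue, placed, rejected = deque(tasks), [], []
    while queue:
        name, w, h = queue.popleft()
        site = placer.place(w, h)
        (placed if site else rejected).append(name)
    return placed, rejected

# Hypothetical tasks on a 4x4 device: the third no longer fits and is rejected.
placed, rejected = schedule([("fir", 2, 2), ("fft", 2, 2), ("big", 3, 3)],
                            Placer(4, 4))
```

A real scheduler would retry rejected tasks later rather than give up, as the notes describe.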
The host CPU is used for device configuration and data transfer.
Usually, the reconfigurable device and the host processor communicate through a bus that is used for the data transfer between the processor and the reconfigurable device.
The RPU acts like a coprocessor with varying instruction sets accessible by the processor in a
function call.
No matter whether we are dealing with a compile-time or run-time reconfigurable system, the computation and reconfiguration flow is usually the one shown in the figure below:
1. At the beginning of a computation, the host processor configures the reconfigurable device.
2. Then it downloads the segment of the data to be processed by the RPU and gives the start signal to the RPU.
3. The host and the RPU can then process their segments of data in parallel. At the end of the RPU's computation, the host reads the RPU's finish signal.
4. At this point, the data (computation result) can be collected from the RPU memory by the
processor. The RPU can also send the data directly to an external sink. In the computation
flow presented above, the RPU is configured only once. However, in frequently reconfigured
systems, several configurations might be done.
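The four-step flow above can be mirrored by host-side control code. The following is a software mock of my own devising (the RPU object and its methods are invented for illustration; here the "accelerator" simply squares numbers), not an actual driver API:

```python
class MockRPU:
    """Software stand-in for a reconfigurable processing unit."""
    def configure(self, bitstream): self.op = bitstream        # load "configuration"
    def write_memory(self, data):   self.data = list(data)     # download data segment
    def start(self):                self.result = [self.op(x) for x in self.data]
    def finished(self):             return True                # finish signal
    def read_memory(self):          return self.result         # collect results

def run_on_rpu(rpu, bitstream, data):
    rpu.configure(bitstream)       # 1. configure the reconfigurable device
    rpu.write_memory(data)         # 2. download the data segment
    rpu.start()                    #    and give the start signal
    while not rpu.finished():      # 3. host works in parallel, then reads
        pass                       #    the finish signal
    return rpu.read_memory()       # 4. collect the computation result

out = run_on_rpu(MockRPU(), lambda x: x * x, [1, 2, 3])
```

In a frequently reconfigured system, steps 1-4 would repeat with a new bitstream for each configuration.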
Architectures for Reconfigurable computing: TSFPGA, DPGA, Matrix;
<MATRIX Refer file 2>
DPGA
Realm of Application for DPGA
DPGAs, as with any general-purpose computing device supporting rapid selection among instructions, are beneficial in cases where only a limited amount of functionality is needed at any point in time, and where it is necessary to rapidly switch among the possible functions needed.
We look at several commonly arising situations where multicontext devices are preferable to single-context devices,
including:
i. Tasks with limited throughput requirements
ii. Latency limited tasks
iii. Time or data varying logical functions
i. Limited Throughput Requirements
Relative Processing Speeds
Most designs are composed of several sub-components or sub-tasks, each performing a
task necessary to complete the entire application. The overall performance of the design is limited by the processing
throughput of the slowest device. If the performance of the slowest device is fixed, there is no need for the other
devices in the system to process at substantially higher throughputs.
In these situations, reuse of the active silicon area on the non-bottleneck components can improve performance or
lower costs. If we are getting sufficient performance out of the bottleneck resource, then we may be able to reduce
cost by sharing the gates on the non-bottleneck resources between multiple “components” of the original design.
Fixed Functional Requirements
Many applications have fixed functional requirements. Input processing on sensor
data, display processing, or video processing all have task defined processing rates which are fixed. In many
applications, processing faster than the sample or display rate is not necessary or useful. Once we achieve the desired
rate, the rest of the “capacity” of the device is not required for the function. With reuse of active silicon, the residual
processing capacity can be employed on other computations.
I/O Latency and Bandwidth
Device I/O bandwidth often acts as a system bottleneck, limiting the rate at which data
can be delivered to a part. When data throughput is limited by I/O bandwidth, we can reuse the internal resources to
provide a larger effective internal gate capacity. This reuse decreases the total number of devices required in the
system.
ii. Latency Limited Designs
Some designs are limited by latency, not throughput. Here, high throughput may be unimportant. Often it is irrelevant
how quickly we can begin processing the next datum if that time is shorter than the latency through the design. This
is particularly true of applications which must be serialized for correctness (e.g. database updates, resource
allocation/deallocation, adaptive feedback control).
By reusing gates and wires, we can use device capacity to implement these latency-limited operations with fewer resources than would be required without reuse. This allows us to use smaller devices to implement a function or
to place more functionality onto each device.
iii. Data-Dependent Functional Requirements
A characteristic of finite state machines is that the computational task varies over time and as a function of the
input data. At any single point in time, only a small subset of the total computational graph is needed. In a spatial
implementation, all of the functionality must be implemented simultaneously, even though only small subsets are
ever used at once.
Many tasks may perform quite different computations based on the kind of data they receive. A network interface
may handle packets differently based on packet type. A computational function may handle new data differently
based on its range. Data objects of different types may require widely different handling. Rather than providing
separate, active resources for each of these mutually exclusive cases, a multicontext device can use a minimum
amount of active resources, selecting the proper operational behavior as needed.
Architecture
The basic architecture of the DPGA is as shown in the figure.
Each array element is a conventional 4-input lookup table (4-LUT).
Small collections of array elements (in this case 4×4 arrays) are grouped together into subarrays. These
subarrays are then tiled to compose the entire array.
Crossbars between subarrays serve to route inter-subarray connections. A single, 2-bit, global context
identifier is distributed throughout the array to select the configuration for use.
Additionally, programming lines are distributed to read and write configuration memories.
DRAM Memory
The basic memory primitive is a 4×32-bit DRAM array which provides four context configurations for both the LUT and the interconnection network.
The memory cell is a standard three transistor DRAM cell.
Notably, the context memory cells are built entirely out of N-well devices, allowing the memory array to
be packed densely, avoiding the large cost for N-well to P-well separation.
Array Element
The array element is a 4-LUT which includes an optional flip-flop on its output.
Each array element contains a context memory array.
This is the 4×32-bit memory described above. Sixteen bits provide the LUT programming, twelve configure the four 8-input multiplexors which select each input to the 4-LUT, and one selects the optional flip-flop.
The remaining three memory bits are presently unused.
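The 32-bit context word can be checked arithmetically. This is my own back-of-the-envelope tally of the figures just quoted (an 8:1 multiplexor needs 3 select bits):

```python
lut_bits = 2 ** 4            # 16: truth table of a 4-input LUT
mux_bits = 4 * 3             # 12: four 8:1 input muxes, 3 select bits each
ff_bit   = 1                 #  1: optional output flip-flop select
unused   = 32 - (lut_bits + mux_bits + ff_bit)   # bits left over in the word
```

The result matches the three unused bits stated above.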
Subarrays
The subarray organizes the lowest level of the interconnect hierarchy.
Each array element output is run vertically and horizontally across the entire span of the subarray.
Each array element can, in turn, select as an input the output of any array element in its subarray
which shares the same row or column.
This topology allows a reasonably high degree of local connectivity.
This leaf topology is limited to moderately small subarrays since it ultimately does not scale.
Non-Local Interconnects
In addition to the local outputs which run across each row and column, a number of non-local lines are also allocated to each row and column. The non-local lines are driven by the global interconnect.
Each LUT can then pick inputs from among the lines which cross its array element. Each row and
column supports four non-local lines. Each array element could thus pick its inputs from eight global
lines, six row and column neighbor outputs, and its own output. Each input is configured with an 8:1 selector (see fig. 2).
Global Interconnect
Between each subarray, a pair of crossbars routes the subarray outputs from one subarray into the non-local inputs of the adjacent subarray.
Note that all array element outputs are available on all four sides of the subarray.
This means that each crossbar is a 16×8 crossbar which routes 8 of the 16 outputs to the neighboring subarray's 8 inputs on that side.
TSFPGA
Active interconnect area consumed most of the space on traditional, single-context FPGAs. In the DPGA, we saw that adding small, local context memories allowed us to reuse active area and achieve smaller task implementations. Even in these multicontext devices, interconnect consumed most of the area.
The Time-Switched Field Programmable Gate Array (TSFPGA) is a multicontext device designed explicitly
around the idea of time-switching the internal interconnect in order to implement more effective connectivity
with less physical interconnect.
The trick employed in the time-switched register is to have each logical input load its value from the active
interconnect at just the right time. As we have seen, multicontext evaluation typically involves execution of a
series of microcycles. A subset of the task is evaluated on each microcycle, and only that subset requires
active resources in each microcycle. We call each microcycle a timestep and, conceptually at least, number
them from zero up to the total number of microcycles required to complete the task. If we broadcast the
current timestep, each input can simply load its value when its programmed load time matches the current
timestep.
The figure shows a 4-LUT with this input arrangement, which we will call the time-switched input register. Each LUT input can load any value which appears on its input line in any of the last i cycles. The timestep value is ⌈log₂ i⌉ bits wide, as is the comparator. With this scheme, if the entire computation completes in i timesteps, all
retiming is accomplished by simply loading the LUT inputs at the appropriate time – i.e. loading each input
just when its source value has been produced and spatially routed to the destination input.
With this input structure, logical LUT evaluation time is now decoupled from input arrival time. This
decoupling was not true in FPGAs, DPGAs, or even iDPGAs. This decoupling of the input arrival time and
LUT evaluation time allows us to remove the simultaneity constraints.
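The load-on-matching-timestep behaviour can be modelled in a few lines. This is a minimal sketch of my own, not TSFPGA's actual circuit: a register samples the shared line only on the microcycle whose number matches its programmed load time, then holds the value.

```python
class TimeSwitchedInput:
    """Model of a time-switched input register on a shared interconnect line."""
    def __init__(self, load_time):
        self.load_time = load_time   # programmed timestep (ceil(log2 i) bits wide)
        self.value = None

    def clock(self, timestep, line_value):
        if timestep == self.load_time:   # comparator: current timestep vs. program
            self.value = line_value      # latch; held until the next match

# Two LUT inputs sharing one physical line, loaded on different microcycles:
a, b = TimeSwitchedInput(0), TimeSwitchedInput(2)
for t, line in enumerate([5, 7, 9]):     # value on the shared line each timestep
    a.clock(t, line)
    b.clock(t, line)
```

After three microcycles, `a` holds the value broadcast at timestep 0 and `b` the value at timestep 2, even though both watched the same wire.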
Switched Interconnect – Folding
For instance, we could map pairs of LUTs together such that they share input sets. This, coupled with cutting the number of array outputs (Nsa_out) in half, will cut the number of crossbar outputs in half and hence halve the subarray crossbar size. For full connectivity, it may now take us two microcycles to route the connections, delivering the inputs to half the LUTs and half the array outputs in each cycle. In this particular
case we have performed output folding by sharing crossbar outputs. Notice that the time-switched input
register allows us to get away with this folding by latching and holding values on the correct microcycle. The
input register also allows the non-local subarray outputs to be transmitted over two cycles.
We can also perform input folding. With input folding, we pair LUTs so that they share a single LUT output.
Here we cut the number of array inputs (Nsa_in) in half as well. The array crossbar now requires only half as many inputs as before and is, consequently, also half as large in this case.
Again, the latched inputs allow us to load each LUT input value only on the microcycle on which the
associated value is actually being routed through the crossbar. For input folding, we must add an effective pre-crossbar multiplexor so that we can select among the sources which share a single crossbar input.
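The area argument for folding reduces to crosspoint counting. The numbers below are my own illustration (a hypothetical 16-input, 16-output subarray crossbar), showing that halving either side halves the crosspoint count, at the cost of an extra microcycle of routing:

```python
def crosspoints(n_in, n_out):
    """A full crossbar needs one crosspoint per input/output pair."""
    return n_in * n_out

full     = crosspoints(16, 16)   # unfolded subarray crossbar
out_fold = crosspoints(16, 8)    # output folding: half the crossbar outputs
both     = crosspoints(8, 8)     # plus input folding: half the inputs too
```

Applying both foldings quarters the crossbar area while the time-switched input registers absorb the extra microcycles.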
Architecture
The basic TSFPGA building block is the subarray tile (see figure), which contains a collection of LUTs and a
central switching crossbar.
LUTs share output connections to the crossbar and input connections from the crossbar in the folded
manner.
Communication within the subarray can occur in one TSFPGA clock cycle.
Array Element
The TSFPGA array element is made up of a number of LUTs which share the same crossbar outputs and inputs (see figure below).
The LUT output into the crossbar is selected based on the routing context programming.
The LUT input values are stored in time-switched input registers.
The inputs to the array element are run to all LUT input registers. When the current timestep matches the
programmed load time, the input register is enabled to load the value on the array-element input. When
multiple LUTs in an array element take the same signal as input, they may be loaded simultaneously.
Crossbar
The primary switching element is the subarray crossbar.
Each crossbar input is selected from a collection of subarray network inputs and subarray LUT outputs via a
pre-crossbar multiplexor.
Subarray inputs are registered prior to the pre-crossbar multiplexor, and outputs are registered immediately after the crossbar, either on the LUT inputs or before traversing network wires.
Each registered crossbar output is routed in several directions to provide connections to other subarrays or chip I/O.
The network in TSFPGA is folded such that the single subarray crossbar performs all major switching roles:
1. output crossbar – routing data from LUT outputs to destinations or intermediate switching crossbars
2. routing crossbar – routing data through the network between source and destination subarrays
3. input crossbar – receiving data from the network and routing it to the appropriate destination LUT input
Intra-Subarray Switching
Communication within the subarray is simple and takes one clock cycle per LUT evaluation and interconnect traversal.
Once a LUT has all of its inputs loaded, the LUT output can be selected as an input to the crossbar, and the
LUT’s consumers within the subarray may be selected as crossbar outputs.
At the end of the cycle, the LUT’s value is loaded into the consumers’ input registers, making the value
available for use on the next cycle.
Inter-Subarray Switching
A subarray may be connected to other subarrays on a component. A number of subarray outputs are run to each
subarray in the same row and column.
Routing data within the same row or column involves:
1. Route LUT output through crossbar to the outputs headed for the destination subarray.
2. Traverse the wire between subarrays.
3. Select network input with source value as a crossbar source and route through the crossbar to the destination
LUT input.
When data needs to traverse both row and column:
1. Route LUT output to first dimension destination (row, column).
2. Traverse first dimension interconnect.
3. Switch output in second dimension (column, row).
4. Traverse second dimension interconnect.
5. Switch to LUT input and load.
I/O Connections
I/O connections are treated like hierarchical network lines and are routed into and out of the subarrays in a similar
manner.
Each input has an associated subarray through which it may enter the switched network. Similarly, each output is
associated with the crossbar output of some subarray.
Device outputs are composed of time-switched input registers and load values from the network at designated
timesteps like LUT inputs.
Array Control
Two “instruction” values are used to control the operation of each subarray on a per-clock-cycle basis: the timestep and the routing context.
The routing context serves as an instruction pointer into the subarray’s routing memory. It selects the
configuration of the crossbar and pre-crossbar multiplexors on each cycle.
timestep denotes time events and indicates when values should be loaded from shared lines.
Applications of reconfigurable computing: various hardware implementations of pattern matching, such as the Sliding Windows Approach and Automaton-Based Text Searching; Video Streaming
Pattern Matching
Pattern matching can be defined as the process of checking if a character string is part of a longer sequence of characters.
Pattern matching is used in a large variety of fields in computer science. In text processing programs such
as Microsoft Word, pattern matching is used in the search function. The purpose is to match the keyword
being searched against a sequence of characters that build the complete text.
In database information retrieval, the contents of different fields of a given database entry are matched against the sequence of characters that build the user request. Speech recognition and other pattern
recognition tools also use pattern matching as basic operations on top of which complex intelligent
functions may be built, to better classify the audio sequences.
Other applications using pattern matching include dictionary implementation, spam avoidance, network intrusion detection and content surveillance.
The first advantage of using the reconfigurable device here is the inherent fine-grained parallelism that
characterizes the search. Many different words can be searched for in parallel by matching the input text
against a set of words on different paths. The second advantage is the possibility of quickly exchanging the list of searched words by means of reconfiguration.
We define the capacity of a search engine as the number of words that can be searched for in parallel. A
large capacity also means a high complexity of the function to be implemented in hardware, which in turn
means a large amount of resources. The goal in implementing a search engine in hardware is to have a
maximal hardware utilization, i.e. as many words as possible that can be searched for in parallel.
The different hardware implementations of pattern matching are:
1. The Sliding Windows Approach
2. Hash Table-Based Text Searching
3. Automaton-Based Text Searching
The Sliding Windows Approach
One approach for text searching is the so-called sliding window. In the 1-keyword version, the target word
is stored in one register, each character being stored in one register field consisting of a byte. The length of
the register is equal to the length of the word it contains. The text is streamed through a separate shift
register, whose length is the same as that of the keyword. For each character of the given keyword stored as
a byte in one register field, an 8-bit comparator is used to compare this character with the corresponding
character of the text, which streams through the shift register. A hit occurs when all the comparators return
the value true.
The sliding window can be extended to check a match of multiple patterns in parallel. Each target word
will then be stored in one register and will have as many comparators as required.
The main advantage of this approach resides in the reconfiguration. Because each keyword is stored in an
independent register, the reconfiguration can happen without affecting the other words, thus providing the
possibility to gradually and quickly modify the dictionary.
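The sliding-window matcher has a direct software analogue, sketched below with assumed details (text and keyword are mine): the text shifts through a register the length of the keyword, and the per-character comparators AND their results into a hit signal.

```python
from collections import deque

def sliding_window_search(text, keyword):
    """Return the start positions of every occurrence of keyword in text."""
    window = deque(maxlen=len(keyword))   # the shift register of characters
    hits = []
    for pos, ch in enumerate(text):
        window.append(ch)                 # shift one character in
        # all 8-bit comparators in parallel: every register field must match
        if len(window) == len(keyword) and all(
                w == k for w, k in zip(window, keyword)):
            hits.append(pos - len(keyword) + 1)
    return hits

hits = sliding_window_search("the cat sat", "at")
```

In hardware, the `all(...)` reduction is a wide AND gate over the comparator outputs; multiple keywords would each get their own register and comparator bank, all scanning the same shifting text.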
Automaton-Based Text Searching:
It is well known that any regular grammar can be recognized by a deterministic finite state machine (FSM).
In an automaton-based search algorithm, a finite state machine is built on the basis of the target words. The
target words define a regular grammar that is compiled in an automaton acting as a recognizer for that
grammar. When scanning a text, the automaton changes its state with the appearance of characters. Upon
reaching an end state, a hit occurs and the corresponding word is reported as found.
The advantage of the FSM-based search machine is the elimination of the preprocessing step done in many
other methods to remove stop words (such as ‘the’, ‘to’, ‘for’, etc., which do not affect the meaning of
statements) from documents.
When taking into consideration the common prefixes of words, it is possible to save a considerable number of flip-flops. For this purpose, the search machine can be implemented as a single automaton recognizer common to all the target words. As shown in the figure below, the resulting structure is a tree in which a path
from the root to a leaf determines the appearance of a corresponding key word in the streamed text. Words
that share a common prefix use a common path from the root corresponding to the length of their common
prefix. A split occurs at the node where the common prefix ends.
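The prefix-sharing tree can be modelled as a trie. The sketch below is my own software construction (the word list is invented), not the hardware recognizer itself: words with a common prefix share a path from the root, a split occurs where prefixes diverge, and an end marker plays the role of the end state.

```python
def build_trie(words):
    """Compile target words into a prefix-sharing tree of states."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:              # words with a common prefix share this path
            node = node.setdefault(ch, {})
        node["$"] = word             # end state: reaching it reports a hit
    return trie

def scan(text, trie):
    """Follow one transition per character; collect hits at end states."""
    hits = []
    for start in range(len(text)):   # restart the recognizer at each position
        node = trie
        for ch in text[start:]:
            if ch not in node:
                break                # no transition: this path dies
            node = node[ch]
            if "$" in node:
                hits.append(node["$"])
    return hits

hits = scan("formal force", build_trie(["for", "force", "form"]))
```

Note how "for", "force" and "form" share the f→o→r path, so the shared prefix is stored (and compared) only once, mirroring the flip-flop savings described above.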
In the hardware implementation of a group of words with a common prefix, common flip-flops are used to implement the common path (figure 9.3a). For each character in the target alphabet, only one comparator is needed; the comparison thus occurs once, at a particular location in the device. Incoming characters can be sent directly to the set of all possible comparators. Upon matching a particular one, a corresponding signal is sent to all the flip-flops which need the result of that comparison. This method reduces the number of comparators needed by almost the sum of the lengths of all the target words.
The overall structure of the search engine explained above is shown in the figure. It consists of an array of comparators to decode the characters of the FSM alphabet, a state decoder that moves the state machine into its next states, and a preprocessor that maps incoming 8-bit characters to 5-bit characters, thus mapping all characters to lower case.
As case insensitivity is assumed in most applications in information retrieval, the preprocessor is designed to map upper- and lower-case characters to the binary codes 1 to 26 and all other characters to 0.
Characters are streamed to the device in ASCII notation. An incoming 8-bit character triggers the clock and
is mapped to a 5-bit character that is sent to the comparator array. All the 5-bit signals are distributed to all
the comparators that operate in parallel to check if a match of the incoming character with an alphabet
character occurs. If a match occurs for comparator k, the output signal char_k is set to one and all the others are set to zero. If no match occurs, all the output signals are set to zero.
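The case-folding preprocessor just described is easy to model in software (this is my own assumed rendering of the mapping, not the actual circuit): ASCII letters map to codes 1–26 regardless of case, and everything else maps to 0.

```python
def preprocess(ch):
    """Map an incoming character to the 5-bit code described above."""
    if "a" <= ch.lower() <= "z":
        return ord(ch.lower()) - ord("a") + 1   # 'a'/'A' -> 1 ... 'z'/'Z' -> 26
    return 0                                    # any non-letter -> 0
```

Five bits suffice because the largest code, 26, fits in 2^5 = 32 values, with code 0 left for non-letters.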
Video Streaming
Video streaming is the process of performing computations on video data, which is streamed through a given system, picture after picture.
Two main reasons are usually stated as motivation for implementing video streaming in FPGAs:
1. the performance, which results from exploiting the inherent parallelism in the target applications to build a tailored hardware architecture, and
2. the adaptivity, which can be used to adjust the overall system by exchanging parts of the computing
blocks with better adapted ones.
Most video capture modules provide the frames to the system in a pixel-by-pixel manner, leading to a serial computation on the incoming pixels. As many algorithms need the neighbourhood of a pixel to compute
its new value, a complete block must be stored and processed for each pixel. Capturing the neighbourhood of
a pixel is often done using a sliding window data structure with varying size.
A given set of buffers (FIFOs) is used to update the window. The number of FIFOs varies according to the size of the window. In each step, a pixel is read from the memory and placed in the lower-left cell of the window. Apart from the upper-right pixel, which is discarded (i.e. output), all the pixels in the rightmost column of the window are placed at the tail of the FIFOs one level higher.
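The FIFO-backed sliding window can be modelled as follows. This is a simplified sketch with assumed sizes and naming (a fixed 3×3 window over a frame of width W, hence two row FIFOs of length W), not the exact buffer layout of any particular system:

```python
from collections import deque

class SlidingWindow3x3:
    """3x3 neighbourhood over a pixel stream, backed by two row FIFOs."""
    def __init__(self, width):
        # rows[0] holds the row two lines above, rows[1] the row one line above
        self.rows = [deque([0] * width, maxlen=width) for _ in range(2)]
        self.window = [[0] * 3 for _ in range(3)]

    def push(self, pixel):
        for r in range(3):                        # shift window left one column
            self.window[r][0:2] = self.window[r][1:3]
        self.window[0][2] = self.rows[0][0]       # pixel two rows above
        self.window[1][2] = self.rows[1][0]       # pixel one row above
        self.window[2][2] = pixel                 # the newly arrived pixel
        # FIFO update: each departing value moves up one level; the oldest
        # value (the "upper right" of the structure) is discarded by maxlen
        self.rows[0].append(self.rows[1].popleft())
        self.rows[1].append(pixel)
        return self.window

# Stream a 4-wide frame, pixels 1..12, row by row:
w = SlidingWindow3x3(4)
for p in range(1, 13):
    win = w.push(p)
```

After the last pixel, the window holds the bottom-right 3×3 block of the frame; border pixels produce windows padded with the initial zeros, which real systems handle with explicit border policies.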
The processing part of the video pipeline is a normal image processing algorithm combining some of the basic operators such as:
median filtering, basic morphological operations, convolution and edge detection.
The architecture for a video streaming system is usually built on a modular and chained basis.
The chain consists of a set of modules, each of which is specialized for a given computation. The first module in the chain is in charge of capturing the video frames, while the last module outputs the processed frames. The output
can be rendered on a monitor or can be sent as compressed data over a network.
Between the first and the last modules, several computational modules can be used according to the algorithm
implemented. The frames are written alternately to RAM1 and RAM2 by the capture module. The second module collects the pictures from RAM1 or RAM2, whichever is not in use by the first module, builds the sliding windows and passes them to the third module, which processes the pixels and saves them in its own memory or passes them directly to the next module.
This architecture presents a pipelined computation in which the computational blocks are the modules that
process the data frames. RAMs are used to temporarily store frames between two modules, thus allowing a frame to stream from RAM to RAM and the processed pictures to reach the output.
A module can be implemented in hardware or in software according to the goals. While software provides a
great degree of flexibility, it is usually not fast enough to carry out the challenging computations required in video applications. ASICs can be used to implement the computationally demanding parts; however, ASICs do not provide the flexibility needed in many systems. On a reconfigurable platform, the flexibility of a processor can
be combined with the efficiency of ASICs to build a high-performance flexible system.