Savitribai Phule PUNE UNIVERSITY Reconfigurable Computing (504203) Module IV Notes
© All rights reserved - Not to be copied and Sold
Reconfigurable Devices for Rapid Prototyping, Frequently Reconfigurable systems
and Non-Frequently Reconfigurable systems
1. Rapid prototyping:
In this case, the reconfigurable device is used as an emulator for another digital device, usually an
ASIC.
An emulator in embedded systems refers to hardware/software that duplicates the functions of one embedded system (the guest) on another embedded system (the host), different from the first one, so that the emulated behaviour closely resembles the behaviour of the real system.
The emulation process allows one to test the functional correctness of the ASIC device to be produced, sometimes under real operating and environmental conditions, before production.
The reconfigurable device is only reconfigured to emulate a new implementation of the ASIC
device.
Example: APTIX-System Explorer and the ITALTEL Flexbench systems
2. Non-frequently reconfigurable systems:
The Reconfigurable device is integrated in a system where it can be used as an application-specific
processor.
Such systems are used as a prototyping platform, but can be used as a running environment as well.
These systems are usually stand-alone systems.
The reconfiguration is used for testing and initialization at start-up and for upgrading purposes.
The device configuration is usually stored in an EEPROM or Flash memory from which it is downloaded at start-up to configure the device.
No configuration happens during operation.
Example: The RABBIT System, the Celoxica, RC100, RC200, RC300
3. Frequently reconfigurable systems:
In this category of Reconfigurable Devices, the devices are frequently reconfigured.
These systems are usually coupled with a host processor, which is used to reconfigure the device
and control the complete system.
The reconfigurable device (RD) is used as an accelerator for time-critical parts of applications.
The host processor accesses the RD via function calls.
The reconfigurable part is usually a PCI board attached to the PCI bus, which is used for configuration and data exchange.
Example: The Raptor 2000, the Celoxica RC 1000, RC2000
Compile-time reconfiguration, Run-time reconfiguration
Depending on the time at which the reconfiguration sequences are defined, reconfigurable devices can be classified as:
A. Static / Compile-time Reconfigurable Devices: The Computation and Configuration sequences as
well as the Data exchange are defined at compile time and never change during a computation.
This approach is most interesting for devices which can only be fully reconfigured. However, it can also be applied to partially reconfigurable devices that are logically or physically partitioned into a set of reconfigurable bins.
B. Dynamic/Run-time Reconfigurable Devices: The computation and configuration sequences are not known at compile time. The request to implement a given task becomes known only at run-time and must be handled dynamically.
The reconfiguration process exchanges parts of the device's configuration to adapt the system to changing operational and environmental conditions.
Run-time reconfiguration is a difficult process that must handle side effects such as defragmentation of the device and communication between newly placed modules.
Runtime Reconfigurable Systems & Their Hardware
The management of the reconfigurable device is usually done by a scheduler and a placer that can
be implemented as part of an operating system running on a processor.
The processor can either reside inside or outside the reconfigurable chip.
The scheduler manages the tasks and decides when a task should be executed.
The tasks that are available as configuration data in a database are characterized by their
bounding box and their run-time. The bounding box defines the area that a task occupies on the
device.
The scheduler determines which task should be executed on the RPU and then gives the task to the
placer that will try to place it on the device, i.e. allocate a set of resources for the implementation
of that task.
If the placer is not able to find a site for the new task, then it will be sent back to the scheduler that
can then decide to send it later and to send another task to the placer. In this case, we say that the
task is rejected.
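The scheduler/placer interplay described above can be sketched in software. The sketch below is my own illustration, not an implementation from the notes: task names, sizes, and the first-fit search are all invented for the example. A task carries its bounding box; if the placer finds no free site, the task is rejected back to the scheduler.

```python
from collections import deque

class Placer:
    """Tracks free cells of the device and places rectangular bounding boxes."""
    def __init__(self, width, height):
        self.free = [[True] * width for _ in range(height)]

    def place(self, w, h):
        """Try to find a free w x h site (first fit); return its origin or None."""
        rows, cols = len(self.free), len(self.free[0])
        for r in range(rows - h + 1):
            for c in range(cols - w + 1):
                if all(self.free[r + dr][c + dc]
                       for dr in range(h) for dc in range(w)):
                    for dr in range(h):          # allocate the site
                        for dc in range(w):
                            self.free[r + dr][c + dc] = False
                    return (r, c)
        return None  # no site found: the task is rejected

def schedule(tasks, placer):
    """Hand each (name, width, height) task to the placer; collect rejections."""
    queue, placed, rejected = deque(tasks), [], []
    while queue:
        name, w, h = queue.popleft()
        site = placer.place(w, h)
        (placed if site else rejected).append(name)
    return placed, rejected

# Hypothetical tasks on a 4x4 device: the third no longer fits and is rejected.
placed, rejected = schedule([("fir", 2, 2), ("fft", 2, 2), ("big", 3, 3)],
                            Placer(4, 4))
```

A real scheduler would retry rejected tasks later rather than give up, as the notes describe.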
The host CPU is used for device configuration and data transfer.
Usually, the reconfigurable device and the host processor communicate through a bus that is used for the data transfer between the processor and the reconfigurable device.
The RPU acts like a coprocessor with varying instruction sets accessible by the processor in a
function call.
No matter whether we are dealing with a compile-time or run-time reconfigurable system, the computation and reconfiguration flow is usually the one shown in the figure below:
1. At the beginning of a computation, the host processor configures the reconfigurable device.
2. Then it downloads the segment of the data to be processed by the RPU and gives the start signal to the RPU.
3. The host and the RPU can then process their segments of data in parallel. At the end of the RPU's computation, the host reads the RPU's finish signal.
4. At this point, the data (computation result) can be collected from the RPU memory by the
processor. The RPU can also send the data directly to an external sink. In the computation
flow presented above, the RPU is configured only once. However, in frequently reconfigured
systems, several configurations might be done.
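The four-step flow above can be mirrored by host-side control code. The following is a software mock of my own devising (the RPU object and its methods are invented for illustration; here the "accelerator" simply squares numbers), not an actual driver API:

```python
class MockRPU:
    """Software stand-in for a reconfigurable processing unit."""
    def configure(self, bitstream): self.op = bitstream        # load "configuration"
    def write_memory(self, data):   self.data = list(data)     # download data segment
    def start(self):                self.result = [self.op(x) for x in self.data]
    def finished(self):             return True                # finish signal
    def read_memory(self):          return self.result         # collect results

def run_on_rpu(rpu, bitstream, data):
    rpu.configure(bitstream)       # 1. configure the reconfigurable device
    rpu.write_memory(data)         # 2. download the data segment
    rpu.start()                    #    and give the start signal
    while not rpu.finished():      # 3. host works in parallel, then reads
        pass                       #    the finish signal
    return rpu.read_memory()       # 4. collect the computation result

out = run_on_rpu(MockRPU(), lambda x: x * x, [1, 2, 3])
```

In a frequently reconfigured system, steps 1-4 would repeat with a new bitstream for each configuration.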
Architectures for Reconfigurable computing: TSFPGA, DPGA, Matrix;
<MATRIX Refer file 2>
DPGA
Realm of Application for DPGA
DPGAs, as with any general-purpose computing device supporting rapid selection among instructions, are beneficial in cases where only a limited amount of functionality is needed at any point in time, and where it is necessary to rapidly switch among the possible functions needed.
We look at several commonly arising situations where multicontext devices are preferable to single-context devices,
including:
i. Tasks with limited throughput requirements
ii. Latency limited tasks
iii. Time or data varying logical functions
i. Limited Throughput Requirements
Relative Processing Speeds
Most designs are composed of several sub-components or sub-tasks, each performing a
task necessary to complete the entire application. The overall performance of the design is limited by the processing
throughput of the slowest device. If the performance of the slowest device is fixed, there is no need for the other
devices in the system to process at substantially higher throughputs.
In these situations, reuse of the active silicon area on the non-bottleneck components can improve performance or
lower costs. If we are getting sufficient performance out of the bottleneck resource, then we may be able to reduce
cost by sharing the gates on the non-bottleneck resources between multiple “components” of the original design.
Fixed Functional Requirements
Many applications have fixed functional requirements. Input processing on sensor
data, display processing, or video processing all have task defined processing rates which are fixed. In many
applications, processing faster than the sample or display rate is not necessary or useful. Once we achieve the desired
rate, the rest of the “capacity” of the device is not required for the function. With reuse of active silicon, the residual
processing capacity can be employed on other computations.
I/O Latency and Bandwidth
Device I/O bandwidth often acts as a system bottleneck, limiting the rate at which data
can be delivered to a part. When data throughput is limited by I/O bandwidth, we can reuse the internal resources to
provide a larger effective internal gate capacity. This reuse decreases the total number of devices required in the
system.
ii. Latency Limited Designs
Some designs are limited by latency, not throughput. Here, high throughput may be unimportant. Often it is irrelevant
how quickly we can begin processing the next datum if that time is shorter than the latency through the design. This
is particularly true of applications which must be serialized for correctness (e.g. database updates, resource
allocation/deallocation, adaptive feedback control).
By reusing gates and wires, we can use device capacity to implement these latency-limited operations with fewer resources than would be required without reuse. This allows us to use smaller devices to implement a function or
to place more functionality onto each device.
iii. Data-Dependent Functional Requirements
A characteristic of finite state machines is that the computational task varies over time and as a function of the
input data. At any single point in time, only a small subset of the total computational graph is needed. In a spatial
implementation, all of the functionality must be implemented simultaneously, even though only small subsets are
ever used at once.
Many tasks may perform quite different computations based on the kind of data they receive. A network interface
may handle packets differently based on packet type. A computational function may handle new data differently
based on its range. Data objects of different types may require widely different handling. Rather than providing
separate, active resources for each of these mutually exclusive cases, a multicontext device can use a minimum
amount of active resources, selecting the proper operational behavior as needed.
Architecture
The basic architecture of the DPGA is as shown in the figure.
Each array element is a conventional 4-input lookup table (4-LUT).
Small collections of array elements (in this case 4×4 arrays) are grouped together into subarrays. These
subarrays are then tiled to compose the entire array.
Crossbars between subarrays serve to route inter-subarray connections. A single, 2-bit, global context
identifier is distributed throughout the array to select the configuration for use.
Additionally, programming lines are distributed to read and write configuration memories.
DRAM Memory
The basic memory primitive is a 4×32-bit DRAM array which provides four context configurations for both the LUT and the interconnection network.
The memory cell is a standard three transistor DRAM cell.
Notably, the context memory cells are built entirely out of N-well devices, allowing the memory array to
be packed densely, avoiding the large cost for N-well to P-well separation.
Array Element
The array element is a 4-LUT which includes an optional flip-flop on its output.
Each array element contains a context memory array.
This is the 4×32-bit memory described above. Sixteen bits provide the LUT programming, twelve configure the four 8-input multiplexors which select each input to the 4-LUT, and one selects the optional flip-flop.
The remaining three memory bits are presently unused.
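The 32-bit context word can be checked arithmetically. This is my own back-of-the-envelope tally of the figures just quoted (an 8:1 multiplexor needs 3 select bits):

```python
lut_bits = 2 ** 4            # 16: truth table of a 4-input LUT
mux_bits = 4 * 3             # 12: four 8:1 input muxes, 3 select bits each
ff_bit   = 1                 #  1: optional output flip-flop select
unused   = 32 - (lut_bits + mux_bits + ff_bit)   # bits left over in the word
```

The result matches the three unused bits stated above.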
Subarrays
The subarray organizes the lowest level of the interconnect hierarchy.
Each array element output is run vertically and horizontally across the entire span of the subarray.
Each array element can, in turn, select as an input the output of any array element in its subarray
which shares the same row or column.
This topology allows a reasonably high degree of local connectivity.
This leaf topology is limited to moderately small subarrays since it ultimately does not scale.
Non-Local Interconnects
In addition to the local outputs which run across each row and column, a number of non-local lines are also allocated to each row and column. The non-local lines are driven by the global interconnect.
Each LUT can then pick inputs from among the lines which cross its array element. Each row and
column supports four non-local lines. Each array element could thus pick its inputs from eight global
lines, six row and column neighbor outputs, and its own output. Each input is configured with an 8:1 selector (see fig. 2).
Global Interconnect
Between each subarray, a pair of crossbars routes the subarray outputs from one subarray into the non-local inputs of the adjacent subarray.
Note that all array element outputs are available on all four sides of the subarray.
This means that each crossbar is a 16×8 crossbar which routes 8 of the 16 outputs to the neighboring subarray's 8 inputs on that side.
TSFPGA
Active interconnect area consumed most of the space on traditional, single-context FPGAs. In the DPGA, we saw that adding small, local context memories allowed us to reuse active area and achieve smaller task implementations. Even in these multicontext devices, interconnect consumed most of the area.
The Time-Switched Field Programmable Gate Array (TSFPGA) is a multicontext device designed explicitly
around the idea of time-switching the internal interconnect in order to implement more effective connectivity
with less physical interconnect.
The trick employed in the time-switched register is to have each logical input load its value from the active
interconnect at just the right time. As we have seen, multicontext evaluation typically involves execution of a
series of microcycles. A subset of the task is evaluated on each microcycle, and only that subset requires
active resources in each microcycle. We call each microcycle a timestep and, conceptually at least, number
them from zero up to the total number of microcycles required to complete the task. If we broadcast the
current timestep, each input can simply load its value when its programmed load time matches the current
timestep.
The figure shows a 4-LUT with this input arrangement, which we will call the time-switched input register. Each LUT input can load any value which appears on its input line in any of the last i cycles. The timestep value is ⌈log₂ i⌉ bits wide, as is the comparator. With this scheme, if the entire computation completes in i timesteps, all
retiming is accomplished by simply loading the LUT inputs at the appropriate time – i.e. loading each input
just when its source value has been produced and spatially routed to the destination input.
With this input structure, logical LUT evaluation time is now decoupled from input arrival time. This
decoupling was not true in FPGAs, DPGAs, or even iDPGAs. This decoupling of the input arrival time and
LUT evaluation time allows us to remove the simultaneity constraints.
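The load-on-matching-timestep behaviour can be modelled in a few lines. This is a minimal sketch of my own, not TSFPGA's actual circuit: a register samples the shared line only on the microcycle whose number matches its programmed load time, then holds the value.

```python
class TimeSwitchedInput:
    """Model of a time-switched input register on a shared interconnect line."""
    def __init__(self, load_time):
        self.load_time = load_time   # programmed timestep (ceil(log2 i) bits wide)
        self.value = None

    def clock(self, timestep, line_value):
        if timestep == self.load_time:   # comparator: current timestep vs. program
            self.value = line_value      # latch; held until the next match

# Two LUT inputs sharing one physical line, loaded on different microcycles:
a, b = TimeSwitchedInput(0), TimeSwitchedInput(2)
for t, line in enumerate([5, 7, 9]):     # value on the shared line each timestep
    a.clock(t, line)
    b.clock(t, line)
```

After three microcycles, `a` holds the value broadcast at timestep 0 and `b` the value at timestep 2, even though both watched the same wire.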
Switched Interconnect – Folding
For instance, we could map pairs of LUTs together such that they share input sets. This, coupled with cutting the number of array outputs (Nsa_out) in half, will cut the number of crossbar outputs in half and hence halve the subarray crossbar size. For full connectivity, it may now take us two microcycles to route the connections, delivering the inputs to half the LUTs and half the array outputs in each cycle. In this particular
case we have performed output folding by sharing crossbar outputs. Notice that the time-switched input
register allows us to get away with this folding by latching and holding values on the correct microcycle. The
input register also allows the non-local subarray outputs to be transmitted over two cycles.
We can also perform input folding. With input folding, we pair LUTs so that they share a single LUT output.
Here we cut the number of array inputs (Nsa_in) in half as well. The array crossbar now requires only half as many inputs as before and is, consequently, also half as large in this case.
Again, the latched inputs allow us to load each LUT input value only on the microcycle on which the
associated value is actually being routed through the crossbar. For input folding, we must add an effective pre-crossbar multiplexor so that we can select among the sources which share a single crossbar input.
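The area argument for folding reduces to crosspoint counting. The numbers below are my own illustration (a hypothetical 16-input, 16-output subarray crossbar), showing that halving either side halves the crosspoint count, at the cost of an extra microcycle of routing:

```python
def crosspoints(n_in, n_out):
    """A full crossbar needs one crosspoint per input/output pair."""
    return n_in * n_out

full     = crosspoints(16, 16)   # unfolded subarray crossbar
out_fold = crosspoints(16, 8)    # output folding: half the crossbar outputs
both     = crosspoints(8, 8)     # plus input folding: half the inputs too
```

Applying both foldings quarters the crossbar area while the time-switched input registers absorb the extra microcycles.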
Architecture
The basic TSFPGA building block is the subarray tile (see figure), which contains a collection of LUTs and a
central switching crossbar.
LUTs share output connections to the crossbar and input connections from the crossbar in the folded
manner.
Communication within the subarray can occur in one TSFPGA clock cycle.
Array Element
The TSFPGA array element is made up of a number of LUTs which share the same crossbar outputs and inputs (see figure below).
The LUT output into the crossbar is selected based on the routing context programming.
The LUT input values are stored in time-switched input registers.
The inputs to the array element are run to all LUT input registers. When the current timestep matches the
programmed load time, the input register is enabled to load the value on the array-element input. When
multiple LUTs in an array element take the same signal as input, they may be loaded simultaneously.
Crossbar
The primary switching element is the subarray crossbar.
Each crossbar input is selected from a collection of subarray network inputs and subarray LUT outputs via a
pre-crossbar multiplexor.
Subarray inputs are registered prior to the pre-crossbar multiplexor, and outputs are registered immediately after the crossbar, either on the LUT inputs or before traversing network wires.
Each registered crossbar output is routed in several directions to provide connections to other subarrays or chip I/O.
The network in TSFPGA is folded such that the single subarray crossbar performs all major switching roles:
1. output crossbar – routing data from LUT outputs to destinations or intermediate switching crossbars
2. routing crossbar – routing data through the network between source and destination subarrays
3. input crossbar – receiving data from the network and routing it to the appropriate destination LUT input
Intra-Subarray Switching
Communication within the subarray is simple and takes one clock cycle per LUT evaluation and interconnect traversal.
Once a LUT has all of its inputs loaded, the LUT output can be selected as an input to the crossbar, and the
LUT’s consumers within the subarray may be selected as crossbar outputs.
At the end of the cycle, the LUT’s value is loaded into the consumers’ input registers, making the value
available for use on the next cycle.
Inter-Subarray Switching
A subarray may be connected to other subarrays on a component. A number of subarray outputs are run to each
subarray in the same row and column.
Routing data within the same row or column involves:
1. Route LUT output through crossbar to the outputs headed for the destination subarray.
2. Traverse the wire between subarrays.
3. Select network input with source value as a crossbar source and route through the crossbar to the destination
LUT input.
When data needs to traverse both row and column:
1. Route LUT output to first dimension destination (row, column).
2. Traverse first dimension interconnect.
3. Switch output in second dimension (column, row).
4. Traverse second dimension interconnect.
5. Switch to LUT input and load.
I/O Connections
I/O connections are treated like hierarchical network lines and are routed into and out of the subarrays in a similar
manner.
Each input has an associated subarray through which it may enter the switched network. Similarly, each output is
associated with the crossbar output of some subarray.
Device outputs are composed of time-switched input registers and load values from the network at designated
timesteps like LUT inputs.
Array Control
Two “instruction” values are used to control the operation of each subarray on a per-clock-cycle basis: the timestep and the routing context.
The routing context serves as an instruction pointer into the subarray’s routing memory. It selects the
configuration of the crossbar and pre-crossbar multiplexors on each cycle.
timestep denotes time events and indicates when values should be loaded from shared lines.
Applications of reconfigurable computing: various hardware implementations of pattern matching, such as the Sliding Windows Approach and Automaton-Based Text Searching; Video Streaming
Pattern Matching
Pattern matching can be defined as the process of checking if a character string is part of a longer sequence of characters.
Pattern matching is used in a large variety of fields in computer science. In text processing programs such
as Microsoft Word, pattern matching is used in the search function. The purpose is to match the keyword
being searched against a sequence of characters that build the complete text.
In database information retrieval, the contents of different fields of a given database entry are matched against the sequence of characters that build the user request. Speech recognition and other pattern
recognition tools also use pattern matching as basic operations on top of which complex intelligent
functions may be built, to better classify the audio sequences.
Other applications using pattern matching include dictionary implementation, spam avoidance, network intrusion detection and content surveillance.
The first advantage of using the reconfigurable device here is the inherent fine-grained parallelism that
characterizes the search. Many different words can be searched for in parallel by matching the input text
against a set of words on different paths. The second advantage is the possibility of quickly exchanging the list of searched words by means of reconfiguration.
We define the capacity of a search engine as the number of words that can be searched for in parallel. A
large capacity also means a high complexity of the function to be implemented in hardware, which in turn
means a large amount of resources. The goal in implementing a search engine in hardware is to have a
maximal hardware utilization, i.e. as many words as possible that can be searched for in parallel.
The different hardware implementations of pattern matching are:
1. The Sliding Windows Approach
2. Hash Table-Based Text Searching
3. Automaton-Based Text Searching
The Sliding Windows Approach
One approach for text searching is the so-called sliding window. In the 1-keyword version, the target word
is stored in one register, each character being stored in one register field consisting of a byte. The length of
the register is equal to the length of the word it contains. The text is streamed through a separate shift
register, whose length is the same as that of the keyword. For each character of the given keyword stored as
a byte in one register field, an 8-bit comparator is used to compare this character with the corresponding
character of the text, which streams through the shift register. A hit occurs when all the comparators return
the value true.
The sliding window can be extended to check a match of multiple patterns in parallel. Each target word
will then be stored in one register and will have as many comparators as required.
The main advantage of this approach resides in the reconfiguration. Because each keyword is stored in an
independent register, the reconfiguration can happen without affecting the other words, thus providing the
possibility to gradually and quickly modify the dictionary.
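The sliding-window matcher has a direct software analogue, sketched below with assumed details (text and keyword are mine): the text shifts through a register the length of the keyword, and the per-character comparators AND their results into a hit signal.

```python
from collections import deque

def sliding_window_search(text, keyword):
    """Return the start positions of every occurrence of keyword in text."""
    window = deque(maxlen=len(keyword))   # the shift register of characters
    hits = []
    for pos, ch in enumerate(text):
        window.append(ch)                 # shift one character in
        # all 8-bit comparators in parallel: every register field must match
        if len(window) == len(keyword) and all(
                w == k for w, k in zip(window, keyword)):
            hits.append(pos - len(keyword) + 1)
    return hits

hits = sliding_window_search("the cat sat", "at")
```

In hardware, the `all(...)` reduction is a wide AND gate over the comparator outputs; multiple keywords would each get their own register and comparator bank, all scanning the same shifting text.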
Automaton-Based Text Searching:
It is well known that any regular grammar can be recognized by a deterministic finite state machine (FSM).
In an automaton-based search algorithm, a finite state machine is built on the basis of the target words. The
target words define a regular grammar that is compiled in an automaton acting as a recognizer for that
grammar. When scanning a text, the automaton changes its state with the appearance of characters. Upon
reaching an end state, a hit occurs and the corresponding word is reported as found.
The advantage of the FSM-based search machine is the elimination of the preprocessing step done in many
other methods to remove stop words (such as ‘the’, ‘to’, ‘for’, etc., which do not affect the meaning of
statements) from documents.
When taking into consideration the common prefixes of words, it is possible to save a considerable number of flip-flops. For this purpose, the search machine can be implemented as a single automaton recognizer common to all the target words. As shown in the figure below, the resulting structure is a tree in which a path
from the root to a leaf determines the appearance of a corresponding key word in the streamed text. Words
that share a common prefix use a common path from the root corresponding to the length of their common
prefix. A split occurs at the node where the common prefix ends.
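The prefix-sharing tree can be modelled as a trie. The sketch below is my own software construction (the word list is invented), not the hardware recognizer itself: words with a common prefix share a path from the root, a split occurs where prefixes diverge, and an end marker plays the role of the end state.

```python
def build_trie(words):
    """Compile target words into a prefix-sharing tree of states."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:              # words with a common prefix share this path
            node = node.setdefault(ch, {})
        node["$"] = word             # end state: reaching it reports a hit
    return trie

def scan(text, trie):
    """Follow one transition per character; collect hits at end states."""
    hits = []
    for start in range(len(text)):   # restart the recognizer at each position
        node = trie
        for ch in text[start:]:
            if ch not in node:
                break                # no transition: this path dies
            node = node[ch]
            if "$" in node:
                hits.append(node["$"])
    return hits

hits = scan("formal force", build_trie(["for", "force", "form"]))
```

Note how "for", "force" and "form" share the f→o→r path, so the shared prefix is stored (and compared) only once, mirroring the flip-flop savings described above.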
In the hardware implementation of a group of words with a common prefix, common flip-flops are used to implement the common path (figure 9.3a). For each character in the target alphabet, only one comparator is needed; the comparison thus occurs once, at a particular location in the device. Incoming characters can be sent directly to the set of all possible comparators. Upon matching a particular one, a corresponding signal is sent to all the flip-flops which need the result of that comparison. This method reduces the number of comparators needed by almost the sum of the lengths of all the target words.
The overall structure of the search engine explained above is shown in the figure. It consists of an array of comparators to decode the characters of the FSM alphabet, a state decoder that moves the state machine into its next states, and a preprocessor that maps incoming 8-bit characters to 5-bit characters, thus mapping all characters to lower case.
As case insensitivity is assumed in most applications in information retrieval, the preprocessor is designed to map upper- and lower-case characters to the binary codes 1 to 26 and all other characters to 0.
Characters are streamed to the device in ASCII notation. An incoming 8-bit character triggers the clock and
is mapped to a 5-bit character that is sent to the comparator array. All the 5-bit signals are distributed to all
the comparators that operate in parallel to check if a match of the incoming character with an alphabet
character occurs. If a match occurs for comparator k, the output signal char_k is set to one and all the others are set to zero. If no match occurs, all the output signals are set to zero.
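The case-folding preprocessor just described is easy to model in software (this is my own assumed rendering of the mapping, not the actual circuit): ASCII letters map to codes 1–26 regardless of case, and everything else maps to 0.

```python
def preprocess(ch):
    """Map an incoming character to the 5-bit code described above."""
    if "a" <= ch.lower() <= "z":
        return ord(ch.lower()) - ord("a") + 1   # 'a'/'A' -> 1 ... 'z'/'Z' -> 26
    return 0                                    # any non-letter -> 0
```

Five bits suffice because the largest code, 26, fits in 2^5 = 32 values, with code 0 left for non-letters.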
Video Streaming
Video streaming is the process of performing computations on video data, which is streamed through a given system, picture after picture.
Two main reasons are usually stated as motivation for implementing video streaming in FPGAs:
1. the performance, which results from exploiting the inherent parallelism in the target applications to build a tailored hardware architecture, and
2. the adaptivity, which can be used to adjust the overall system by exchanging parts of the computing
blocks with better adapted ones.
Most video capture modules provide the frames to the system in a pixel-by-pixel manner, leading to a serial computation on the incoming pixels. As many algorithms need the neighbourhood of a pixel to compute
its new value, a complete block must be stored and processed for each pixel. Capturing the neighbourhood of
a pixel is often done using a sliding window data structure with varying size.
A given set of buffers (FIFOs) is used to update the window. The number of FIFOs varies according to the size of the window. In each step, a pixel is read from the memory and placed in the lower-left cell of the window. Apart from the upper-right pixel, which is discarded (i.e. output), all the pixels in the rightmost column of the window are placed at the tail of the FIFOs one level higher.
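The FIFO-backed sliding window can be modelled as follows. This is a simplified sketch with assumed sizes and naming (a fixed 3×3 window over a frame of width W, hence two row FIFOs of length W), not the exact buffer layout of any particular system:

```python
from collections import deque

class SlidingWindow3x3:
    """3x3 neighbourhood over a pixel stream, backed by two row FIFOs."""
    def __init__(self, width):
        # rows[0] holds the row two lines above, rows[1] the row one line above
        self.rows = [deque([0] * width, maxlen=width) for _ in range(2)]
        self.window = [[0] * 3 for _ in range(3)]

    def push(self, pixel):
        for r in range(3):                        # shift window left one column
            self.window[r][0:2] = self.window[r][1:3]
        self.window[0][2] = self.rows[0][0]       # pixel two rows above
        self.window[1][2] = self.rows[1][0]       # pixel one row above
        self.window[2][2] = pixel                 # the newly arrived pixel
        # FIFO update: each departing value moves up one level; the oldest
        # value (the "upper right" of the structure) is discarded by maxlen
        self.rows[0].append(self.rows[1].popleft())
        self.rows[1].append(pixel)
        return self.window

# Stream a 4-wide frame, pixels 1..12, row by row:
w = SlidingWindow3x3(4)
for p in range(1, 13):
    win = w.push(p)
```

After the last pixel, the window holds the bottom-right 3×3 block of the frame; border pixels produce windows padded with the initial zeros, which real systems handle with explicit border policies.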
The processing part of the video pipeline is a normal image processing algorithm combining some of the basic operators such as:
median filtering, basic morphological operations, convolution and edge detection.
The architecture for a video streaming system is usually built on a modular and chained basis.
The chain consists of a set of modules, each of which is specialized for a given computation. The first module in the chain is in charge of capturing the video frames, while the last module outputs the processed frames. The output
can be rendered on a monitor or can be sent as compressed data over a network.
Between the first and the last modules, several computational modules can be used according to the algorithm
implemented. The frames are written alternately to RAM1 and RAM2 by the capture module. The second module collects the pictures from RAM1 or RAM2, whichever is not in use by the first module, builds the sliding windows and passes them to the third module, which processes the pixels and saves them in its own memory or passes them directly to the next module.
This architecture presents a pipelined computation in which the computational blocks are the modules that
process the data frames. RAMs are used to temporarily store frames between two modules, thus allowing a frame to stream from RAM to RAM and the processed pictures to reach the output.
A module can be implemented in hardware or in software according to the goals. While software provides a
great degree of flexibility, it is usually not fast enough to carry out the challenging computations required in video applications. ASICs can be used to implement the computationally demanding parts; however, ASICs do not provide the flexibility needed in many systems. On a reconfigurable platform, the flexibility of a processor can
be combined with the efficiency of ASICs to build a high-performance flexible system.