Temporal partitioning methodology optimizing FPGA resources
for dynamically reconfigurable embedded real-time system
C. Tanougast*, Y. Berviller, P. Brunet, S. Weber, H. Rabah
Laboratoire d'Instrumentation Electronique de Nancy, Université de Nancy 1, BP 239, Vandoeuvre-lès-Nancy 54506, France
Received 2 September 2002; revised 22 November 2002; accepted 27 November 2002
Abstract
In this paper we present a new temporal partitioning methodology for the data-path part of an algorithm in reconfigurable embedded system design. This temporal partitioning relies on assessing trade-offs between the time constraint, the design size and the field programmable gate array device parameters (circuit speed, reconfiguration time). The originality of our method is that we use dynamic reconfiguration in order to minimize the number of cells needed to implement the data-path of an application under a time constraint. Our method consists, taking the target technology into account, in evaluating the algorithm area and the operators' execution times from the data flow graph. From these we deduce the right number of reconfigurations and the algorithm partitioning for a Run-Time Reconfiguration implementation. This method avoids oversizing the implementation resources needed. This optimizing approach can be useful for the design of an embedded device or system. Our approach is illustrated by various reconfigurable implementations of a real-time image processing data-path.
© 2003 Elsevier Science B.V. All rights reserved.
Keywords: Field programmable gate arrays; Embedded system; Temporal partitioning; Run-time reconfiguration; Dynamically reconfigurable systems; Image
processing; Design implementation; High-level synthesis
1. Introduction
Since the introduction of field programmable gate arrays
(FPGA), the process of digital systems design has changed
radically [1,2]. Indeed, FPGAs occupy an increasingly significant place in the realization of real-time applications and have allowed the appearance of a new paradigm: hardware as flexible as software.
Dynamically reconfigurable computing consists in the successive execution of a sequence of algorithms on the
same device. The objective is to swap different algorithms
on the same hardware structure, by reconfiguring the FPGA
array in hardware several times in a constrained time and
with a defined partitioning and scheduling [3,4]. Dynamic
reconfiguration offers important benefits for the implemen-
tation of designs. Several architectures have been designed
and have validated the dynamically reconfigurable computing concept for real-time processing [5–9]. However, the optimal decomposition (partitioning) of an algorithm that exploits run-time reconfiguration (RTR) is a domain in which much work remains. Works on temporal partitioning and logic synthesis exploiting dynamic reconfiguration generally focus on the application development approach [10]. Thus, firstly we observe that
the efficiency obtained is not always optimum with respect
to the available spatio-temporal resources. Secondly, the
choice of the number of partitions is never specified.
Thirdly, this can be improved by a judicious temporal
partitioning [11].
We discuss here the partitioning problem for the RTR. In
the task of implementing an algorithm on reconfigurable
hardware, we can distinguish two approaches [10]. The
most common is what we call the application development
approach and the other is what we call the system design
approach. In the first case, we have to fit an algorithm with
an optional time constraint in an existing system made from
a host CPU connected to a reconfigurable logic array. In this
case, the goal of an optimal implementation is to minimize
one or more of the following criteria: processing time,
memory bandwidth, number of reconfigurations and power
consumption. In the second case, we have to implement an
0141-9331/03/$ - see front matter © 2003 Elsevier Science B.V. All rights reserved.
PII: S0141-9331(02)00102-3
Microprocessors and Microsystems 27 (2003) 115–130
www.elsevier.com/locate/micpro
* Corresponding author. Tel.: +33-383-6841-59; fax: +33-383-6841-53.
E-mail addresses: [email protected] (C. Tanougast),
[email protected] (Y. Berviller), [email protected] (P.
Brunet), [email protected] (S. Weber), [email protected] (H.
Rabah).
algorithm with a required time constraint on a system
throughout the design exploration phase. The design
parameter is the size of the logic array that is used to
implement the data-path part of the algorithm. Here an
optimal implementation is the one that leads to the minimal
area of the reconfigurable array.
Previous advanced works in the field of temporal
partitioning and synthesis for RTR architecture [12–19]
focus on application development approach targeting
already designed reconfigurable architecture. These meth-
odologies are used in the domain of existing reconfigurable
accelerators or reconfigurable processors. All these
approaches assume the existence of a resources constraint.
The most important thing here is that the number of
reconfigurable resources is a predefined constant
(implementation constraint). In this strategy, the associated
tools capture the algorithm and the characteristics of the
target platform on which the algorithm will be implemented.
In this case, the goal is to minimize the processing time and/
or the memory bandwidth requirement. Moreover, none of these approaches is yet capable of solving practical DRL (dynamically reconfigurable logic) synthesis problems.
These techniques use in general simplified models of
dynamically reconfigurable systems, which often ignore
the impact of routing or reconfiguration resource sharing in
order to reduce complexity of a DRL design space search.
Among them, there is the GARP project [12]. The goal
of GARP is the hardware acceleration of loops in a C
program by the use of the data path synthesis tool GAMA
[13] and the GARP reconfigurable processor. GARP is a
processor tightly coupled to a custom FPGA-like array
and designed specially to speed-up the execution of
general case loops. The logic array has a DMA feature
and is tailored to implement 32-bit wide arithmetic and logic operations together with the control logic, all of which minimizes the reconfiguration overhead. GAMA is a fast
mapping and placement tool for the data-path implemen-
tation in FPGAs. It is based on a library of patterns for all
possible data-path operators. The SPARCS project [14,15]
is a CAD tool suite tailored for applications development
on multi-FPGAs reconfigurable computing architectures.
Such architectures need both spatial and temporal partitioning; a genetic algorithm is used to solve the spatial partitioning problem. The main cost function used
here is the data memory bandwidth. Other works propose
a strategy to automate the design process that considers all
possible optimizations (partitioning and task scheduling)
that can be carried out from a particular reconfigurable
system [16,17]. Shirazi et al. and Luk et al. [18,19] propose both a model and a methodology to take advantage of common operators in successive partitions. A simple model for specifying, visualizing and developing designs that contain elements reconfigurable at run-time has been proposed. This judicious approach reduces the configuration time and thus the application execution time. But additional logic resources (area) are required to realize an implementation with this approach.
Furthermore this model does not include the timing aspects needed to satisfy real-time constraints, and it does not specify the partitioning of the implementation. Indeed, the algorithm partitioning must be known beforehand to determine the elements that do not need to be reconfigured for the next step. However, this concept is interesting for DRL design simulation, as in Dynamic Circuit Switching (DCS) [20]. This simulation uses virtual multiplexors, demultiplexors and switches, which are implemented to simulate the dynamically reconfigured design.
These interesting works do not pursue the same goal as ours. Our priority is to find the minimal area that allows meeting the time constraint. This is different from searching for the minimal memory bandwidth or execution time that meets a resources constraint.
Here, we propose a temporal partitioning that uses
dynamic reconfiguration of FPGA (also called DRL
Scheduling) to minimize the implementation logic area.
Each partition corresponds to a temporal floorplanning
for DRL embedded systems (Fig. 1) [21]. We search the
minimal floorplan area that implements successively a
particular algorithm. This approach improves the per-
formance and efficiency of the design implementation. In
contrast to previous work, our aim is to obtain, from an
algorithm description, a target technology and implemen-
tation constraints, the characteristics of the platform to
design or to use. This allows avoiding an oversizing of
implementation resources. For example, by summarizing the sparse information found in some articles [22–24], we can assume the following. Suppose we have to implement a design requiring P equivalent gates and taking an area S_FC of silicon in the case of a full custom ASIC design. Then we will need about 10 × S_FC in the case of a standard cell ASIC approach and about 100 × S_FC if we decide to use an FPGA. But the large advantage of the FPGA is, of course, its great flexibility and the speed of the associated design flow. This is probably the main reason to include an FPGA array on System on Chip platforms. Now suppose that a design requires 10% of the gates to be implemented as full custom, 80% as standard cell ASIC and 10% in FPGA cells. By roughly estimating the areas, we come to the following results: the FPGA array will require more than 55% of the die area, the standard cell part more than 44% and the full custom part less than 1%. In such a case it could make sense to try to reduce the equivalent gate count needed to implement the FPGA part of the application. This is interesting because the regularity of the FPGA part of the mask leads to quite easy modularity of the platform with respect to this parameter.
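These percentages follow directly from the assumed 10× and 100× per-gate area ratios; a few lines of arithmetic reproduce them:

```python
# Die-area split for a design with 10% full-custom gates, 80% standard
# cell and 10% FPGA, using the per-gate area ratios assumed above
# (standard cell ~10x and FPGA ~100x the full-custom area).

gate_share = {"full_custom": 0.10, "std_cell": 0.80, "fpga": 0.10}
area_ratio = {"full_custom": 1, "std_cell": 10, "fpga": 100}

areas = {k: gate_share[k] * area_ratio[k] for k in gate_share}
total = sum(areas.values())
for k, a in areas.items():
    print(f"{k}: {100 * a / total:.1f}% of die")
```

The FPGA part dominates (about 55%) even though it holds only 10% of the gates, which motivates shrinking the FPGA gate count.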
Embedded systems design can benefit in several ways from our approach based on RTR FPGAs. The most obvious is the possibility to frequently update the digital hardware functions. But we can also use the dynamic resource allocation feature to instantiate each operator only for the minimal required time. This enhances silicon efficiency by reducing the reconfigurable array's area [25,26]. Our goal is the definition of an
RTR methodology, included in the architectural design
flow, which allows minimizing the FPGA resources
needed for the implementation of a time constrained
algorithm. So the challenge is twofold: firstly, to find trade-offs between flexibility and algorithm implementation efficiency in the programmable logic array; secondly, to obtain computer-aided design techniques for optimal synthesis which include dynamic reconfiguration in an implementation.
The rest of this paper is organized as follows. Section 2
provides a formal definition of our partitioning problem. In
Section 3 we present the partitioning strategy. In Section 4
we illustrate the application of our method with two image-processing algorithms. In the first example, we carry out the method in a fully automatic way, as we will describe. In the second example, we apply our method while showing possible further evolutions, the aim being to enable improvements of the suggested method. In Sections 5 and 6 we discuss the approach compared to architectural synthesis, conclude and present future work.
2. Problem formulation
Currently the partitioning for RTR is often made at
boundaries of algorithm operators. This means that, for
example the partitioning of the image processing algorithm
is made at image operators like filters, edge detection
operators, and so on. In contrast with this decomposition, we
work on the global algorithm independently of image
operators. Our method is based on elementary arithmetic
and logic operators of the algorithm (adders, subtractor,
multiplexers, registers etc.) of the algorithm. The analysis of
the operators leads to a register transfer level (RTL)
decomposition of the DFG. This partitioning is then
independent of high-level operators (macro-operators).
The RTR partitioning for a real-time application can be classified as a spatio-temporal problem, that is, as a time-constrained problem with dynamic resource allocation, in contrast with scheduling for RTR [27]. We then make the following formulation of the application. Firstly, the algorithm can be modeled as an acyclic data flow graph (DFG), denoted here by G(V, E), where the set of vertices V = {O_1, O_2, ..., O_m} corresponds to the arithmetic and logical operators and the set of directed edges E = {e_1, e_2, ..., e_p} represents the data dependencies between operations. Secondly, the application has a critical time constraint T. The problem to solve is the following:
For a given FPGA family we have to find the set {P_1, P_2, ..., P_n} of subgraphs of G such that

\bigcup_{i=1}^{n} P_i = G,    (1)

and that allows the algorithm to execute while meeting the time constraint T and the data dependencies modeled by E, and requires the minimal amount of FPGA cells. The number of FPGA cells used, which is an approximation of the area of the array, is given by Eq. (2), where P_i is one among the n partitions:

\max_{i \in \{1,\ldots,n\}} \mathrm{Area}(P_i).    (2)
The FPGA resources needed by a partition i are given by Eq. (3), where M_i is the number of elementary operators in partition P_i and Area(O_k) is the amount of resources needed by operator O_k:

\mathrm{Area}(P_i) = \sum_{k \in \{1,\ldots,M_i\}} \mathrm{Area}(O_k).    (3)

Fig. 1. Temporal partitioning with a minimized floorplan area.
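As an illustrative sketch (the data structures and area values are our own assumptions, not from the paper's tools), Eqs. (2) and (3) can be computed over a partitioned DFG as follows:

```python
# Illustrative model of Eqs. (2) and (3): a partition is a list of
# elementary operators, each carrying an area in FPGA cells.

def partition_area(partition):
    """Eq. (3): Area(Pi) = sum of Area(Ok) over the operators in Pi."""
    return sum(area for _, area in partition)

def array_area(partitions):
    """Eq. (2): the FPGA array must hold the largest partition."""
    return max(partition_area(p) for p in partitions)

# Hypothetical 3-partition decomposition of a small DFG
P = [
    [("cmp8", 16), ("mux8", 8)],
    [("add9", 10), ("abs8", 7)],
    [("sub9", 10), ("reg8", 8)],
]
print([partition_area(p) for p in P], array_area(P))   # [24, 17, 18] 24
```

The objective of the partitioning is then to make the largest partition, hence the value of Eq. (2), as small as possible.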
3. Temporal partitioning
A typical search sequence of temporal partitions includes
the following steps:
1. Definition of the constraints and of the type of the design (dynamically configurable fixed-resource platform or reconfigurable embedded design): time constraint, data-block size, bandwidth bottleneck, memory size, power consumption, and the target technology.
2. DFG capture using design entry.
3. Determination of temporal partitioning.
4. Generation of the configuration bitstream of each
temporal floorplan for the final design.
Here, we are only interested in dynamically reconfigurable embedded designs. In our case, the temporal partitioning methodology in the design flow is depicted in Fig. 2. Our method, which determines the minimal floorplan area, is structured in three parts. First, from the target technology library and the constraint parameters, we compute an approximation of the number of partitions; then we deduce the boundaries of each temporal floorplan; finally we refine, when possible, the final partitioning. The constraint parameters are used to adjust the temporal partitioning solution until all design constraints are met. This method can be seen as a heuristic approach.
3.1. Number of partitions
In order to reduce the search domain we first estimate the
minimum number of partitions that we can make and the
resources by partition. To do this, we use an operator library
that is target dependent. This library allows associating two attributes with each vertex of the graph G. These attributes are t_i and Area(O_i): respectively, the maximal path delay and the number of elementary FPGA cells needed for operator O_i. The symbol i represents the index of a particular operator in the DFG. These two quantities are functions of the size (number of bits) of the data to process.
If we know the size of the initial data to process, it is easy to
deduce the size at each node by a ‘software execution’ of the
graph with the maximal value for the input data.
Furthermore, we make the following assumptions. The data to process are grouped in blocks of N data. The number of operations to apply to each data item in a block is deterministic (i.e. not data dependent). There are pipeline registers between all nodes of the graph. The reconfiguration time of the FPGA technology used can be approximated by a linear function of the area of the functional units being downloaded, which is realistic in practice. The configuration speed is a constant and a characteristic of the FPGA (see Section 4).
Thus, the minimal operating time period t_omax is given by

t_{o\max} = \max_{i \in \{1,\ldots,m\}} (t_i)    (4)

where {1, ..., m} is the set of all operators of the data-path G and t_i is the execution time of operator O_i.

Fig. 2. General outline of the temporal partitioning in the design flow.

The total number C of cells used by the application is given by

C = \sum_{i \in \{1,\ldots,m\}} \mathrm{Area}(O_i)    (5)
The main constraint is the need for real-time processing. We suppose that the algorithm is partitioned in n steps corresponding to n execution–reconfiguration pairs. For a given FPGA family characterized by a reconfiguration speed V, the working frequency of each step must verify the following inequality:

\sum_{i=1}^{n} s_i + N \cdot \sum_{i=1}^{n} te_i \le T - \frac{1}{V} \sum_{i=1}^{n} \mathrm{Area}(P_i)    (6)

where Area(P_i) is the number of the FPGA's logic cells needed by P_i, N is the number of data to process, n the number of reconfigurations (n ∈ ℕ), T is the upper limit of the processing time (time constraint), s_i corresponds to the prologue and epilogue of the pipeline in partition P_i, and te_i is the elementary processing time of a data item in the ith step. In one partition, the elementary processing time corresponds to the longest execution time among all operators present in the partition P_i. So we have:

te_i = \max_{k \in \{1,\ldots,M_i\}} (t_k)    (7)
Using Eqs. (4)–(6), we obtain the minimum number of partitions n as given by Eq. (8), and the corresponding optimal size C_n (number of cells) of each partition by Eq. (9):

n = \mathrm{Round}\left( \frac{T}{(N + s) \cdot t_{o\max} + C/V} \right)    (8)

where V is the configuration speed expressed in cells/s, C the number of cells required to implement the entire DFG, N the number of data in a block, s the total latency cycles of the whole data-path and t_omax the propagation delay of the slowest operator in the DFG. This time is fixed by the maximum delay between two successive vertices of graph G, thanks to the fully pipelined process.

C_n = \frac{C}{n}    (9)

The first term of the denominator in Eq. (8) is the effective processing time of one data block and the second term is the time consumed to load the n configurations (total reconfiguration time of G).
In most application domains, like image processing (see Section 4), we can neglect the impact of the latency time in comparison with the processing time (N ≫ number of pipeline stages s). So we can approximate Eq. (8) by Eq. (10):

n \approx \mathrm{Round}\left( \frac{T}{N \cdot t_{o\max} + C/V} \right)    (10)
A limit to the use of dynamic reconfiguration of FPGAs has been exhibited [28], because the impact of the reconfiguration time depends on the size of the data block to process. Two conditions must be satisfied to allow dynamic reconfiguration. First, the sample frequency of the data must be less than the maximum hardware computation frequency. Secondly, the data must be computed in blocks of significant size to reduce the reconfiguration overhead. In this case, silicon reduction can be achieved; but the major advantage of dynamic reconfiguration is the possibility of changing the algorithms (which can be data dependent) in real time and of optimizing the implementation area of designs.
From the analysis of the value of n, we can extract some information on the parameters characterizing the system needed to realize the implementation. We describe the different cases below:

1) n > 1: This means it is possible to realize an RTR partitioning. In this case, an optimization is possible and yields a reduction of the implementation logic area with the technology used. n corresponds to the number of partitions that we are sure to obtain.

2) n ≤ 1: In this case, RTR partitioning cannot ensure the time constraint. Here, only a static implementation, with or without parallel processing, allows the time constraint to be respected.

If n = 1: this means that a static implementation is sufficient to meet the constraints with the technology used.

If n < 1: this means that to ensure the constraint it is necessary to realize a static implementation with processing parallelism. 1/n gives the degree of processing parallelism (in practice 1/n ∈ ℕ). In this case, it is necessary to determine whether the target technology is suitable for the implementation of the application.
The pseudo algorithm for the determination of n and C_n is given below. We annotate the DFG by using the operator library. We traverse all the nodes in the DFG, accumulating the area and searching for the maximal execution time among the operators of the DFG.

V, N, T ← constants  // capture constant parameters of target and constraints
G ← DFG              // DFG capture of the application
C ← 0                // total area variable
TO ← 0               // maximal operator time variable
for each node Ni in G
  TO ← max(TO, Ni.ti)  // keep current maximum of execution times
  C ← C + Ni.Area      // add area of current node
end for
n ← Round(T / ((N · TO) + C/V))  // compute n (Eq. (10))
Cn ← C/n
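A minimal Python sketch of this computation, using Eq. (10); the numeric values are illustrative assumptions (in particular the 50 ns effective operator delay), not measurements from the paper:

```python
# Sketch of the n / Cn computation (Eqs. (9) and (10)); the numeric
# values below are assumptions for illustration, not measurements.

def num_partitions(T, N, t_omax, C, V):
    """Minimum number of partitions n (Eq. (10)) for time constraint T (s),
    block size N, slowest-operator delay t_omax (s), total area C (cells)
    and configuration speed V (cells/s)."""
    return round(T / (N * t_omax + C / V))

T = 40e-3            # 40 ms time constraint (one image period at 25 fps)
N = 512 * 512        # pixels per image block
t_omax = 50e-9       # assumed effective delay of the slowest operator
C = 467              # cells of the whole DFG (edge-detector example)
V = 1365e3           # AT40K-like configuration speed, cells per second

n = num_partitions(T, N, t_omax, C, V)
Cn = round(C / n)    # optimal partition size, Eq. (9)
print(n, Cn)         # → 3 156
```

Each partition then has to fit in about C_n cells, which is the floorplan area the method tries to minimize.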
3.2. Initial partitioning
A pseudo algorithm of the partitioning scheme is given
below.
G ← data flow graph of the application
P1, P2, ..., Pn ← empty partitions
for i in {1...n}
  C ← 0
  while C < Cn
    leaf ← First_Leave(G)
    append(Pi, leaf)
    C ← C + leaf.Area
    remove(G, leaf)
  end while
end for
We traverse the graph again from the leaves to the root(s), accumulating the sizes of the covered nodes until the sum is as close as possible to C_n. These covered vertices make up the first partition. We remove the corresponding nodes from the graph (command 'remove(G, leaf)'). The goal of this removal is to avoid considering the same node twice and to ensure the data dependencies. Then we iterate the covering until the remaining graph is empty. The partitioning is then finished.
There is a great degree of freedom in the implementation
of the First_Leave( ) function, because there are usually
many leaves in a DFG. The only strong constraint is that the
choice must be made in order to guarantee the data
dependencies across the whole partition.
The choice of a node among the leaves of the DFG can be carried out in several ways. Since the leaves are terminal nodes, they meet the data dependency constraints. So the reading of the leaves of the DFG can be random or ordered, for example. In our case it is ordered: the DFG is represented as a two-dimensional table containing the parameters relating to its operators, and First_Leave( ) proceeds in the reading order of this table (left to right) (Fig. 3). In our case, the pseudo algorithm of the First_Leave( ) function is given below.
M[ ][ ] ← G  // DFG captured as a table
for M[i][j] in M
  if M[i][j] ≠ 0
    first_leave(G).area ← M[i][j].area
    first_leave(G).tk ← M[i][j].tk
  end if
end for
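The greedy covering of Section 3.2 can be sketched in Python as follows; the operator names and area values are hypothetical, and an ordered list stands in for the left-to-right table traversal of First_Leave( ):

```python
# Hypothetical sketch of the greedy initial partitioning of Section 3.2.
# Operators are (name, area) pairs kept in an ordered list, standing in
# for the left-to-right table traversal of First_Leave().

def partition_dfg(operators, n, Cn):
    """Split the ordered operator list into n partitions of ~Cn cells."""
    partitions = [[] for _ in range(n)]
    remaining = list(operators)  # leaves are consumed front to back
    for p in partitions:
        area = 0
        while remaining and area < Cn:
            op = remaining.pop(0)  # First_Leave: next leaf in reading order
            p.append(op)
            area += op[1]
    # Any leftover operators go to the last partition.
    partitions[-1].extend(remaining)
    return partitions

ops = [("cmp", 16), ("mux", 8), ("add", 9), ("abs", 7), ("sub", 9), ("reg", 8)]
parts = partition_dfg(ops, n=3, Cn=19)
print([sum(a for _, a in p) for p in parts])   # → [24, 25, 8]
```

Because the accumulation only stops once the running sum reaches C_n, partition areas land near C_n rather than exactly on it, which is why the refinement step of Section 3.3 exists.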
3.3. Refinement after implementation
After placement and routing of each partition that was
obtained in the initial phase we are able to compute the
exact processing time. It is also possible to take into account
the value of the synthesized frequency close to the maximal
processing frequency for each partition.
The analysis of the gap between the total processing time (configuration and execution) and the time constraint permits a decision about the partitioning. If it is necessary to reduce the number of partitions, or possible to increase it, we return to the step described in Section 3.2 with a new value for n. Otherwise the partitioning is considered final (see Fig. 2). In this case, if time remains after the execution of all partitions, but not enough to add one, we absorb this slack by reducing the working frequency in one or more partitions. Thus, we limit the power consumption of the application.
This heuristic partitioning is the best that we can obtain
with this straightforward method. However, we do not have
enough criteria to compare our heuristic approach to the
existing methods based for example on mathematical
resolution. For this reason we cannot conclude that our
method always leads to an optimal solution.
4. Applications to image processing
In this section, we illustrate and model the parameters needed to apply our method in the field of image processing. This application area is a good choice for our approach because the processing is generally characterized by regular operators (data-paths) and the data are naturally organized in blocks (the images). Large blocks of data to process lead to a reduced overhead of the reconfiguration time; otherwise, the time dedicated to the processing is mainly spent on reconfigurations. Moreover, there are many low-level processing algorithms that can be modeled by a data flow graph, and the time constraint is usually the image acquisition period. We assume that the images are taken at a rate of 25 per second with a spatial resolution of 512 × 512 pixels and that each pixel gray level is an eight-bit value. Then, we have a time constraint of 40 ms per image to ensure real-time processing.
4.1. Algorithms
We illustrate our method with two image processing algorithms. The first algorithm used here is a 3 × 3 median filter followed by an edge detector; its general view is given in Fig. 4.

Fig. 3. Labeling and reading of the DFG.
In this example, we consider a separable median filter [29] and a Sobel operator. The median filter provides the median value of three vertically successive horizontal median values. Each horizontal median value is simply the median value of three successive pixels in a line. This filter eliminates impulse noise while preserving edge quality. The principle of the implementation is to sort the pixels of the 3 × 3 neighborhood by their gray level value and then to use only the median value (the 5th of the 9). This operator consists of eight-bit comparators and multiplexers. A Sobel operator achieves the gradient computation. This corresponds to a convolution of the image by successive application of two mono-dimensional filters: the vertical and horizontal Sobel operators, respectively. The final gradient value of the central pixel is the maximum of the absolute values of the vertical and horizontal gradients. The line delays are made with components external to the FPGA (Fig. 4).
The second algorithm is an estimator of the normal optical flow and is frequently used as an apparent motion estimator in a Log–Polar image sequence [30]. This normal optical flow estimator is composed of Gaussian and averaging filters, followed by temporal and spatial derivatives and an arithmetic divider. The general view of this data-path is given in Fig. 5.
4.2. Temporal partitioning
4.2.1. Reconfiguration speed evaluation
The FPGA family targeted in these examples is the Atmel AT40K family of dynamically reconfigurable FPGAs. These devices can be totally configured in less than 1 ms. The reconfiguration speed is expressed in area per time unit. In practice it is a proportionality constant between the configuration time and the number of used logic cells in the FPGA. We express the reconfiguration speed by the following equation:

V = \frac{C_{\max}}{T_G}    (11)

where C_max is the number of logic cells in the FPGA and T_G is the time for a full reconfiguration of the FPGA. Each configuration time depends on the quantity of logic cells used for each step. We express the area in terms of logic cells of the FPGA used. In our case, the AT40K20's capacity of 819 cells leads to a total reconfiguration time lower than 0.6 ms at 33 MHz with 8 bits of configuration data [31] (see Section 4.3). These FPGAs have a configuration speed of about 1365 cells per millisecond and have a partial reconfiguration mode (called 'mode 4').
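A quick sketch of Eq. (11) with the figures quoted above (819 cells, full reconfiguration in about 0.6 ms); the 467-cell example area is the edge-detector DFG of Section 4.2.2:

```python
# Reconfiguration speed V of Eq. (11) for the AT40K20 figures quoted
# above (819 cells, full reconfiguration in about 0.6 ms).

def reconfig_speed(c_max, t_full):
    """V = Cmax / TG, in cells per second."""
    return c_max / t_full

V = reconfig_speed(c_max=819, t_full=0.6e-3)   # cells/s
print(V / 1e3)          # ≈ 1365 cells per millisecond

# The configuration time of a partition is then linear in its area:
t_config = 467 / V      # e.g. the 467-cell edge-detector DFG
print(round(t_config * 1e3, 3))   # ≈ 0.342 ms
```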
4.2.2. Dataflow graph annotation
The analysis of the FPGA datasheet allows us to obtain the characteristics given in Table 1 for some operator types processing data of size D bits [31]. In this table, Tcell is the propagation delay of one cell, Trout is the intra-operator routing delay and Tsetup is the flip-flop setup time. The same considerations apply to other dynamically reconfigurable device technologies or systems. From the characteristics given in the datasheet [31], we obtain the following values as a first estimation of the execution time of usual elementary operators (Table 2).
Fig. 4. General view of the images edge detector.
Fig. 5. General view of the normal optical flow estimator.
In practice, there is a linear relationship between the estimated individual execution time and the real execution time. Indeed, when an operator is included in a daisy chain of operators, a routing delay is needed between two successive nodes. Fig. 6 shows a plot of the estimated individual execution time versus the real execution time for several usual low-level operators. Those operators have been implemented individually in the FPGA array between registers. We then measured the obtained real execution times and rounded them to the nearest integer (Fig. 6).

From this observation, the estimated individual execution time allows obtaining an approximation of the real execution times of the operators contained in the data-path. With the FPGA technology used, this coefficient has a value of about 1.5 (see Fig. 6). The results are more exact when the algorithm is regular, such as a data-path that is a simple cascade of operators; the evaluation of the routing in the general case is a more difficult task. The execution time after implementation of a regular graph does not depend on the type of operator. A weighting coefficient binds the real execution time to the estimated one. This coefficient approximates the routing delay between operators as a function of the estimated execution time. So these routing delays are a constant portion of the duration of each step, which is realistic and easy to exploit analytically.

With these estimations, and by taking into account the increase of data size caused by the processing, we can annotate the dataflow graph. Then we can deduce the number and the characteristics of all the operators. Table 3 gives these data for the first algorithm example. In this table, the execution time is an estimation of the real execution time.
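The weighting described above can be sketched as follows (the 1.5 coefficient is the empirical AT40K value read from Fig. 6; the comparator delay is the Table 2 estimate):

```python
# Sketch of the ~1.5 routing weight the text describes: the real
# execution time is estimated from the individual (pre-routing) delay.

ROUTING_COEFF = 1.5   # empirical value for the AT40K (Fig. 6)

def real_exec_time(estimated_ns):
    """Approximate post-routing delay of an operator in a data-path."""
    return ROUTING_COEFF * estimated_ns

# e.g. the eight-bit comparator of Table 2 (27.34 ns estimated)
print(round(real_exec_time(27.34), 2))   # ≈ 41.01 ns
```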
From these data, we deduce the number of partitions needed to implement a dedicated data-path in an optimized way. Thus, for the edge detector, among all operators of the data-path we can see that the slowest operator is an eight-bit comparator and that we have to reconfigure 467 cells. Using Eq. (10), we obtain a value of three for n. The size of each partition (C_n) should be about 156 cells. Table 4 summarizes the estimation for an RTR implementation of the algorithm.

For the data-path part of the second algorithm (normal optical flow), we can again estimate the logic resources required for the implementation (see Table 5). Among all operators of this data-path, we can see that the slowest operator is a fifteen-bit subtractor and that we have to reconfigure 863 cells. Using Eq. (10), we obtain a value greater than three for n. Here, it is possible to implement this global data-path with about 264 cells in each partition (C_n). Table 6 summarizes the estimation for an RTR implementation of the algorithm.

In practice, the partitioning of the data-path must also take into account the memory bandwidth bottleneck. That is why the best practical partitioning needs to keep the data throughput in accordance with the performance of the memory used.
4.2.3. Optimizing memory bandwidth
The first aim of our methodology is to create partitions with areas as homogeneous as possible. But the designer must also avoid exceeding technology constraints such as the memory bandwidth. To help the designer, the tool provides the data size requirement at each level of the DFG. In this way, we can see where the partitioning might fail or yield an optimal solution. Moreover, in order to automate this refinement, the program can defer or anticipate the partitioning of the DFG so as to place the cut where the memory bandwidth requirement is minimal. The search is restricted to a neighborhood of the theoretical partitioning points. This neighborhood is adjustable to keep the compromise between homogeneous areas and memory
Table 1
Usual operator characterization (AT40k)

D-bit operator                        Number of cells   Estimated execution time
Multiplication or division by 2^k     0                 0
Adder or subtractor                   D + 1             D(Tcell + Trout) + Tsetup
Multiplexer                           D                 Tcell + Tsetup
Comparator                            2D                (2D - 1)Tcell + 2Trout + Tsetup
Absolute value (two's complement)     D - 1             D(Tcell + Trout) + Tsetup
Additional synchronization register   D                 Tcell + Tsetup
Table 2
Estimated execution time of some eight-bit operators in AT40k technology

Eight-bit operator                                       Estimated execution time (ns)
Comparator                                               27.34
Multiplexer                                              5
Absolute value                                           22.07
Adder, subtractor                                        16.46
Combinatory logic with inter-propagation logic cell      17
Combinatory logic without inter-propagation logic cell   5
Fig. 6. Estimated time versus real execution time of some operators in
AT40k technology.
C. Tanougast et al. / Microprocessors and Microsystems 27 (2003) 115–130
bandwidth minimization (Fig. 7). The memory bandwidth (data size and real working frequency) along the DFG is known between each line of operators. When we create the DFG, each edge receives a size argument that represents the width of the bus between two nodes (operators). So, when we add the size arguments of all edges crossing a line of nodes, we obtain the total data size in use at this point of the DFG.
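The bandwidth bookkeeping described above can be sketched as follows; the toy DFG and edge widths are illustrative assumptions:

```python
# Hedged sketch: the data size "in use" at a given DFG level is the sum
# of the bit widths of all edges crossing that level, which the tool
# compares against the available memory bandwidth.
from typing import NamedTuple

class Edge(NamedTuple):
    src_level: int   # DFG level of the producing node
    dst_level: int   # DFG level of the consuming node
    width: int       # bus width in bits

def cut_width(edges, level: int) -> int:
    """Total number of bits crossing the boundary after `level`."""
    return sum(e.width for e in edges
               if e.src_level <= level < e.dst_level)

def best_cut(edges, target: int, radius: int) -> int:
    """Pick, in an adjustable neighborhood of the theoretical cut
    point, the level that minimizes the bandwidth requirement."""
    candidates = range(target - radius, target + radius + 1)
    return min(candidates, key=lambda lv: cut_width(edges, lv))

# Toy DFG: two 8-bit inputs, one 10-bit intermediate result.
edges = [Edge(0, 1, 8), Edge(0, 2, 8), Edge(1, 2, 10)]
```

`best_cut` mirrors the refinement step in which the program defers or anticipates a cut within a neighborhood of the theoretical partitioning point.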
By applying the method described in Section 3, we
obtain a first partitioning for our two examples. Fig. 8 shows
the boundaries of each partition for the edge detector. Fig. 9
represents the first partitioning of the normal optical flow
estimator.
4.3. Implementation results
4.3.1. Hardware platform
In order to illustrate our method, we tested this partitioning methodology on the ARDOISE1 architecture [6,28]. This platform consists of three identical modules: two frame grabber modules and a dynamic calculator module. Each Ardoise module contains one FPGA and two 1 MB SRAM memory banks used as scratch memory. Here, we use only one Ardoise module (Fig. 10). It is built around a dynamically configurable AT40K20 FPGA and contains two local memories. This FPGA can be configured in less than 1 ms and includes an equivalent of 20 K gates. Each local memory is connected to the FPGA with 32 data bits. The configuration can be made from a Flash memory and through a 'controller' FPGA.
This FPGA module (computing unit) is dedicated to intensive computing. The temporal partitions are implemented in this FPGA. Partial results are stored in the local memories while computing and reconfiguring. For this purpose, the computing unit receives the input pixels from the image acquisition system, processes them and writes them into one of its local memories. Next, after a new configuration, it reads and processes the data from the last memory and writes the results into the other local memory. The duration of this procedure is the duration of the current image processing, which must not exceed the frame period.
Our method is not specifically aimed at such resource-constrained architectures. Nevertheless, the results obtained in terms of used resources and working frequency remain valid for any AT40k-like array.
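The ping-pong execution scheme described above can be sketched as follows; `configure` and `process` are placeholders for the real platform calls, and the partition descriptors are purely illustrative:

```python
# Hedged sketch of the compute/reconfigure cycle on an Ardoise-like
# module: partial results alternate between the two local memory banks
# while the FPGA is reconfigured between temporal partitions.
def run_rtr(partitions, frame, configure, process):
    banks = [list(frame), []]        # bank 0 holds the input image
    src, dst = 0, 1
    for cfg in partitions:
        configure(cfg)               # load the next temporal partition
        banks[dst] = process(cfg, banks[src])
        src, dst = dst, src          # swap read/write banks
    return banks[src]                # last written bank holds the result

# Toy usage: two "partitions" applied to a two-pixel frame.
result = run_rtr(
    ["inc", "dbl"], [1, 2],
    configure=lambda cfg: None,
    process=lambda cfg, data: [x + 1 for x in data] if cfg == "inc"
                              else [2 * x for x in data])
```

The bank swap after each step is what lets the next configuration read the previous partition's partial results.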
4.3.2. Implementation results
Table 7 summarizes the implementation results of the edge detector algorithm. We notice that a dynamic execution in three steps can be achieved in real time. This is in accordance with our estimation (Table 4). We can note that a fourth partition is not feasible, because the allowed maximal operator execution time would be less than 34 ns. If we analyze the remaining time, we find that one more partition does not allow real-time processing. The maximal number of cells per partition allows us to determine the functional density gain factor obtained by the RTR
Table 3
Number and characteristics of the operators of the edge detector (on AT40K)

Operator                       Quantity   Size (bits)   Area (cells)   Execution time (ns)
Comparator                     7          8             16             41
Multiplexer                    9          8             8              8
Absolute value                 2          11            10             34
Subtractor                     1          8             9              25
                               1          10            11             30.5
Adder                          1          8             9              25
                               2          9             10             27.5
                               1          10            11             30.5
Multiplication by 2            2          8             0              Routing
                                          9             0              Routing
Division by 4                  2          11            0              Routing
Register (pipeline or delay)   13         8             8              8
                               4          9             9              8
                               5          10            10             8
                               1          11            11             8
Table 4
Resources estimation for the image edge detector

Total area (cells)   Operator execution time (ns)   Step number (n)   Area by step (cells)   Reconfiguration time by step (µs)
467                  41                             3                 156                    114
Table 5
Number and characteristics of the operators of the normal optical flow estimator (on AT40K)

Operator                       Quantity   Size (bits)   Area (cells)   Execution time (ns)
Multiplexer                    7          15            15             8
Absolute value                 2          9             8              30
Subtractor                     2          9             10             27.5
                               8          15            16             44.3
Adder                          2          9             10             27.5
                               1          10            11             30.5
                               2          11            12             33.5
Multiplication by 2            3          8             0              Routing
Multiplication by 128          1          8             0              Routing
Division by 2                  7          9–15          0              Routing
Division by 8                  1          12            0              Routing
Register (pipeline or delay)   57         1             1              8
                               16         8             8              8
                               6          9             9              8
                               2          10            10             8
                               2          11            11             8
                               2          12            12             8
                               2          13            13             8
                               2          14            14             8
                               12         15            15             8
1 This project involves ten French research labs, including our laboratory, and is supported by the French agency for education, research and technology.
implementation [25,26]. In this example, the gain factor in terms of functional density is approximately 3. Fig. 8 represents each partition successively implemented in the reconfigurable array of FPGA resources for the edge detector.
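Under the assumption that both the static and the RTR implementations meet the same real-time constraint, the reported functional density gains reduce to an area ratio: the static data-path area over the largest temporal partition (the array size the RTR design actually needs). This small sketch reproduces the reported figures from Tables 7 and 8:

```python
# Hedged sketch: functional density gain as an area ratio, valid only
# under the assumption that both designs meet the same time constraint.
def density_gain(static_cells: int, partition_cells) -> float:
    """Static data-path area over the largest partition area."""
    return static_cells / max(partition_cells)

gain_edge = density_gain(467, [152, 156, 159])        # edge detector, ~3
gain_flow = density_gain(863, [209, 354, 336])        # optical flow, ~2.44
```

This matches the paper's figures of approximately 3 for the edge detector and about 2.44 for the first optical flow partitioning.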
Table 8 summarizes the implementation results of the normal optical flow estimator. Here, we obtain a gain in functional density of about 2.44 with respect to the global implementation of this data-path (static implementation) for real-time processing. We also note that a static implementation of this algorithm is not possible with our Ardoise module: the available logic area is insufficient (only 819 cells). So, this approach can be used on a platform with insufficient fixed-size FPGA resources. Fig. 9 represents each partition implemented successively in the reconfigurable array for the normal optical flow estimator.
Here it is possible to refine the partitioning of this algorithm (see Fig. 2). We can note that a fourth partition is feasible because the remaining time after processing, in comparison with the time constraint, is greater than 12 ms. So, if we consider one more partition with the slowest operator execution time estimate of the data-path (44.3 ns, see Table 6), we obtain a processing time of 11.6 ms and a remaining time of 1.2 ms for the reconfiguration. Thus, a partitioning in four steps is possible.
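A quick arithmetic check of this feasibility argument, assuming 512 x 512 images:

```python
# Hedged check: one more partition running the slowest operator over a
# full image, plus its reconfiguration, must fit in the time remaining
# under the frame period. Image size and reconfiguration time assumed
# from the figures reported in the text and tables.
N_PIX = 512 * 512

def extra_partition_feasible(remaining_s: float, t_op_ns: float,
                             t_reconf_s: float) -> bool:
    return N_PIX * t_op_ns * 1e-9 + t_reconf_s <= remaining_s

t_extra = N_PIX * 44.3e-9                     # ~11.6 ms of processing
ok = extra_partition_feasible(12e-3, 44.3, 190e-6)
```

The ~11.6 ms figure matches the processing time quoted above, and the fourth partition indeed fits in a remaining time greater than 12 ms.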
In this case, the size of each partition (Cn) should be about 216 cells. We keep the first partition described in the first implementation in order to obtain the same processing time. Moreover, the resources of this step are close to the new size estimated for each partition (Cn). Then, we have decomposed the two last partitions into three parts. The divider has been split in two in order to homogenize the number of resources in each partition. This way of proceeding allows real-time processing. Table 9 summarizes the implementation results of the normal optical flow estimator algorithm with four partitions.
We notice that a dynamic execution with four steps can be achieved in real time. This is in agreement with our estimation. We obtain a total processing time of about 38 ms and a functional density gain near 3.
However, an implementation by partitioning in five steps leaves a timing margin too small for real-time operation. Moreover, with these partitions, we need a bigger memory to store the intermediate results than for the previous partitioning. Indeed, there are many ways to partition the algorithm with our strategy. Obviously, the best solution is to find the partitioning that leads to the same number of cells used in each step while taking into account the memory bandwidth (searching for the minimal memory bandwidth) (see Section 4.2). In our applications we needed a maximum bandwidth of 30 bits. The Ardoise module provides a local bandwidth of 32 bits, which is sufficient to carry out the theoretical partitioning. Nevertheless, we have to adopt a final partitioning that has the lowest bandwidth and that preserves the homogeneity of the resources in each partition. Fig. 11 represents the four partitions successively implemented in the reconfigurable array for the normal optical flow estimator.
From our analysis of the data flow graph, we deduced the resource requirements and speed of the various operators. This leads to the total processing time, from which we deduce the optimized partitioning. This allows a RTR implementation of the DFG that enhances the functional capacity. However, in order to know the real cost of our method, it is necessary to take into account the memory size needed to store the intermediate results, and the resources needed by the read and write counters (pointers), the configuration controller and the small associated state machine.
4.3.3. Dynamic configuration and memory controller
In order to know the real resources needed and to compare our method with an architectural synthesis (AS) (see Section 5), it is necessary to quantify the cost of reconfiguration in terms of control and memory management. This controller reads/writes data in the local memories and loads the configuration of the next temporal floorplan into the FPGA. In our case, this controller is a static
Fig. 7. Final refined partitioning which takes into account memory bandwidth.
Table 6
Resources estimation for the normal optical flow estimator

Total area (cells)   Operator execution time (ns)   Step number (n)   Area by step (cells)   Reconfiguration time by step (µs)
863                  44.3                           3.27              264                    200
design module. This controller is mapped in the first partition and remains in place during all the processing steps. The use of partial reconfiguration allows keeping the controller in place while it manages the execution of the application in the remaining space of the FPGA.
Generally, if we have enough memory bandwidth, we can estimate the cost of the control part in the following way. The memory resources must be able to store two images (we assume constant-flow processing): a memory size of 256 Kpixels. We have supposed that the memory access time is shorter than the slowest operator execution time. That is why it is preferable to store the images in on-chip memory, so as not to affect the execution performance of the partitions.
Fig. 9. Partitioning used to implement the normal optical flow estimator.
Fig. 8. Partitioning used to implement the image edge detector dataflow graph.
The controller needs two counters to address the memories and a state machine for the control of the RTR and the management of the memories for read or write access (Fig. 12).
In our case, the controller consists of two 18-bit counters (images of 512² pixels), a state machine with five states, a 4-bit register to capture the number of partitions (we assume a number of reconfigurations lower than 16), a counter indicating the number of partitions, a 4-bit comparator and a toggle flip-flop indicating which alternate buffer memory to read and which to write. With the targeted FPGA structure (see Section 4), the logic area of the controller in each configuration stage requires fewer than 50 logic cells. So, 50 cells must be added to each partition area given in Section 4.3.2.
4.3.4. Summary of needed resources
The resource characteristics of the platform to use or to design for the first example algorithm are: memory size: 256 Kwords of 30 bits; computing area (reconfigurable computing area + controller area): 209 cells; memory bandwidth: 19 bits with a minimal read/write frequency of 24.8 MHz. For the second algorithm, the memory size and the controller logic area are the same as for the previous algorithm. However, in this case, the computing area needs 344 cells and the memory bandwidth is 30 bits with a minimal read/write frequency of 25.8 MHz.
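These computing areas are consistent with adding the roughly 50-cell static controller to the largest partition of each implementation (Tables 7 and 9), as this small check shows:

```python
# Hedged consistency check: the platform's computing area is the
# largest temporal partition plus the static controller (~50 cells),
# which remains in place across all configurations.
CONTROLLER_CELLS = 50

def computing_area(partition_cells) -> int:
    return max(partition_cells) + CONTROLLER_CELLS

area_edge = computing_area([152, 156, 159])        # edge detector: 209
area_flow = computing_area([209, 241, 248, 294])   # optical flow: 344
```

Both values match the figures quoted above for the two example algorithms.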
5. Discussion
The objective of the Algorithm/Architecture Adequation is to find an efficient matching between an algorithm and an architecture. The aim is to realize an optimal implementation that satisfies the constraints (real time, logic area, etc.). In this Section, we analyze the gains and costs of the use of dynamic reconfiguration. We can compare our method to the more classical architectural synthesis, which is based on the reuse of operators by adding control at different times [32,33]. Indeed, the goal of the two approaches is the minimization of hardware resources with respect to a time constraint. If we consider our method with respect to an architectural synthesis, the RTR can be illustrated by Fig. 13.
Fig. 13 represents the needed area with respect to the processing time for RTR. The optimization with AS can divide the logical resources by two at the cost of an additional cycle time for the processing. This observation applies only if we do not take into account the controller and memory resources that are needed for AS. These additional resources are difficult to estimate in AS because they depend heavily on the algorithm. In the case of RTR, if we consider critical times identical to those of AS, we notice the same optimization level. Nevertheless, the logical resources for RTR control and memory management (hidden or not in the FPGA) are easy to deduce. Moreover, in contrast with AS, whose control grows with the level of reuse of the operators, these control resources remain relatively
Fig. 10. Ardoise module.
Table 7
Implementation results of the edge detector with an AT40K

Partition number   Number of cells   Operator execution time (ns)   Partition reconfiguration time (µs)   Partition processing time (ms)
1                  152               40.1                           111                                   10.5
2                  156               40.3                           114                                   10.6
3                  159               36.7                           116                                   9.6
Table 8
Implementation results of the optical flow estimator with an AT40K

Partition number   Number of cells   Operator execution time (ns)   Partition reconfiguration time (µs)   Partition processing time (ms)
1                  209               27.1                           160                                   7.1
2                  354               38.7                           260                                   10.15
3                  336               37.8                           250                                   9.91
Table 9
Implementation results of the normal optical flow estimator after refined partitioning

Partition number   Number of cells   Operator execution time (ns)   Partition reconfiguration time (µs)   Partition processing time (ms)
1                  209               27.1                           160                                   7.1
2                  241               38.7                           180                                   10.15
3                  248               38.7                           180                                   10.15
4                  294               37.8                           190                                   9.91
Fig. 11. Implementation of the normal optical flow estimator with four partitions.
identical whatever the number of partitions may be. On the other hand, the time for the reconfigurations of the FPGA between partitions is added to the processing time. However, with the use of partial reconfiguration, this time is constant for a given algorithm. Thus, according to the application and the time constraint, it is interesting to determine when it is better to use RTR or AS for an implementation on FPGA.
We can make several observations. When architectural synthesis is applied, the operators must be dimensioned for the largest data size even if such a size is rarely processed (generally only after many processing passes). Similarly, even if an operator is not frequently used, it must be present (and thus consume resources) for the whole processing duration. These drawbacks, which no longer exist for a run-time-reconfigurable architecture, generate an increase in logical resource needs. Furthermore, the resource reuse can increase routing delays compared with a fully spatial data path, and thus decrease the global architecture efficiency. But, if we use the dynamic resource allocation features of FPGAs, we instantiate only the needed operators at each instant (temporal locality [10]) and ensure that the relative placement of operators is optimal for the current processing (functional locality [10]).
Nevertheless, this approach also has some costs. Firstly, in terms of silicon area, an FPGA needs between five and ten times more silicon than a standard cell ASIC (the ideal target for architectural synthesis) for the same equivalent gate count, and with lower speed. But this cost is not too important if we consider the ability to make big modifications of the hardware functions without any change of the hardware part. Secondly, in terms of memory throughput, with respect to a fully static implementation, our approach requires an increase by a factor of at least n, the number of partitions. And thirdly, in terms of power consumption, both approaches are equivalent if we neglect both the over-clocking needed to compensate for reconfiguration durations and the consumption outside the FPGA. Indeed, to a first approximation, power consumption scales linearly with processing frequency and functional area (number of toggling nodes), and we multiply the first by n and divide the second by n.
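This first-order equivalence can be stated as a one-line identity; it is a simplification that, as noted above, ignores reconfiguration overhead and consumption outside the FPGA:

```python
# Hedged first-order power model: dynamic power scales linearly with
# clock frequency and with active (toggling) area, so scaling frequency
# by n while dividing area by n leaves the product unchanged.
def relative_power(freq_scale: float, area_scale: float) -> float:
    return freq_scale * area_scale

n = 3                                    # number of partitions
rtr_vs_static = relative_power(n, 1 / n)  # ~1.0: equivalent to first order
```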
6. Conclusion and future works
We proposed a method for the temporal partitioning of a dataflow graph that makes it possible to minimize the array size of an FPGA by using the dynamic reconfiguration feature. This approach increases the silicon efficiency by processing at the maximal allowed frequency on the smallest area that satisfies the real-time constraint. The method is based, among other steps, on an estimation of the number of possible partitions using a characterized (speed and area) library of operators for the target FPGA. Our method takes into account the memory bandwidth bottleneck and the memory size for the storage of the intermediate results. This approach yields the main parameters characterizing the architecture model which implements a particular algorithm from the constraints. This is very interesting for the development and optimization of embedded designs. This methodology for RTR implementation of a design is adapted to reflect layout constraints imposed by the target technology and avoids an oversizing of the resources needed. We illustrated the method by applying
Fig. 12. General view of configuration and memories controller.
Fig. 13. Evolution of spatio-temporal resources in RTR implementation.
it to two image processing algorithms and by real implementation on the Atmel AT40K FPGA target technology.
Our approach must be compared with other, more powerful optimization tools such as simulated annealing, genetic algorithms and so on. We are currently working on a more accurate resource estimation that takes into account the memory management part of the data path and also checks whether the available memory bandwidth is sufficient. We have started to automate the partition search procedure, called DAGARD (which means, in French, automatic partitioning of DFGs for dynamically configurable systems), which is roughly a graph covering function. Currently, only the partition estimation part and the generation of the boundaries between partitions are automated. Our perspective is an automatic synthesis of the memory and configuration controller by our tool. Indeed, from the partition computation, it is easy to automatically generate a fixed-resources controller for a particular implementation. Moreover, it is necessary to take into account the power consumption, which is an important parameter for embedded systems design. This estimation depends on the target technology and the working environment, such as the working frequency and the resources needed for each partition [34,35]. However, it is mainly on the data-path part (regular processing) that the 'adequation' of RTR for an implementation must be judged, because a controller is more effectively realized with a programmable sequencer like a microprocessor. We also study the possibility of including an automatic exploration of architectural solutions for the implementation of arithmetic operators.
Another future work is to determine when it is better to use dynamic configuration or architectural synthesis for the optimization of an implementation (see Section 5).
References
[1] S. Hauck, The role of FPGA in reprogrammable systems, Proceedings
of IEEE 86 (4) (1998) 615–638.
[2] S. Brown, J. Rose, FPGA and CPLD architectures: a tutorial, IEEE
Design and Test of Computers 13 (2) (1996) 42–57.
[3] S.A. Guccione, D. Levi, The advantages of run-time reconfigura-
tion, in: J. Schewel, P.M. Athanas, S.A. Guccione, S. Ludwig, J.T.
McHenry (Eds.), Reconfigurable Technology: FPGAs for Comput-
ing and Applications, Proceedings of SPIE 3844, SPIE—The
International Society for Optical Engineering, Bellingham, WA,
1999, pp. 87–92.
[4] P. Lysaght, J. Dunlop, Dynamic reconfiguration of FPGAs, in: W.
Moore, W. Luk (Eds.), More FPGA’s, Oxford, Abingdon, England,
1994, pp. 82–94.
[5] S.C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe, R. Taylor, PipeRench: a reconfigurable architecture and compiler, IEEE Computer April (2000).
[6] L. Kessal, D. Demigny, N. Boudouani, R. Bourguiba, Reconfigurable
hardware for real time image processing, Proceedings of the
International Conference on Image Processing, IEEE ICIP, Vancou-
ver 3 (2000) 159–173.
[7] B. Kastrup, A. Bink, J. Hoogerbrugge, ConCISe: a compiler-driven
CPLD-based instruction set accelerator, in: K.L. Pocek, J.M. Arnold
(Eds.), Proceedings of the Seventh Annual IEEE Symposium on Field-
Programmable Custom Computing Machines, 1999, pp. 92–101.
[8] J.E. Vuillemin, P. Bertin, D. Roncin, M. Shand, H.H. Touati, P.
Boucard, Programmable active memories: reconfigurable systems
come of age, IEEE Transactions on VLSI Systems 4 (1) (1996)
56–69.
[9] M.J. Wirthlin, B.L. Hutchings, A dynamic instruction set computer,
IEEE Symposium on FPGA’s Custom Computing Machines, Napa,
CA April (1995) 99–107.
[10] X. Zhang, K.W. Ng, A review of high-level synthesis for dynamically
reconfigurable FPGA’s, Microprocessors and Microsystems 24 (2000)
199–211. Elsevier.
[11] C. Tanougast, Methodologie de partitionnement applicable aux
systemes sur puce a base de FPGA, pour l’implantation en
reconfiguration dynamique d’algorithmes flot de donnees, PhD
Thesis, Faculte des Sciences, Universite de Nancy I, 2001.
[12] T.J. Callahan, J. Hauser, J. Wawrzynek, The GARP architecture and C
compiler, IEEE Computer 33 (4) (2000) 62–69.
[13] T.J. Callahan, P. Chong, A. DeHon, J. Wawrzynek, Fast module
mapping and placement for data paths in FPGA’s, Proceedings of
the ACM/SIGDA Sixth International Symposium on Field
Programmable Gate Arrays, Monterey, CA February (1998)
123–132.
[14] S.V. Srinivasan, S. Govindarajan, R. Vemuri, Fine-grained and
coarse-grained behavioral partitioning with effective utilization of
memory and design space exploration for multi-FPGA architectures,
IEEE Transactions on VLSI Systems 9 (1) (2001) 140–158.
[15] M. Kaul, R. Vemuri, Optimal temporal partitioning and synthesis
for reconfigurable architectures, International Symposium on Field-
Programmable Custom Computing Machines April (1998)
312–313.
[16] R. Maestre, F. Kurdahi, M. Fernadez, R. Hermida, N. Bagherzadeh, H. Singh, A framework for reconfigurable computing: task scheduling and context management, IEEE Transactions on VLSI Systems 9 (6) (2001) 858–873.
[17] M. Karthikeya, P. Gajjala, B. Dinesh, Temporal partitioning and
scheduling data flow graphs for reconfigurable computer, IEEE
Transactions on Computers 48 (6) (1999) 579–590.
[18] N. Shirazi, W. Luk, P.Y.K. Cheung, Automating production of run-
time reconfiguration designs, in: K.L. Pocek, J. Arnold (Eds.),
Proceedings of IEEE Symposium on FPGA’s Custom Computing
Machines, IEEE Computer Society Press, 1998, pp. 147–156.
[19] W. Luk, N. Shirazi, P.Y.K. Cheung, Modeling and optimizing run-
time reconfiguration systems, in: K.L. Pocek, J. Arnold (Eds.),
Proceedings of IEEE Symposium on FPGA’s Custom Computing
Machines, IEEE Computer Society Press, 1996, pp. 167–176.
[20] P. Lysaght, J. Stockwood, A simulation tool for dynamically reconfigurable field programmable gate arrays, IEEE Transactions on VLSI Systems 4 (3) (1996) 381–390.
[21] M. Vasilko, DYNASTY: a temporal floorplanning based CAD framework for dynamically reconfigurable logic systems, in: P. Lysaght, J. Irvine, R. Hartenstein (Eds.), Lecture Notes in Computer Science, vol. 1673, 1999, pp. 124–133, Glasgow, UK.
[22] A. Cataldo, Hybrid architecture embeds Xilinx FPGA core into IBM
ASICs, EE Times Jun 24, 2002, http://www.eetimes.com/story/
OEG20020624S0016.
[23] J. Becker, R. Hartenstein, M. Herz, U. Nageldinger, Parallelization in
co-compilation for configurable accelerators, proceedings of Asia and
South Pacific Design Automation Conference, ASP-DAC’98 Yoko-
hama, Japan (1998).
[24] A. Dehon, Comparing computing machines, Proceedings of the SPIE
3526 (1998) 124–133.
[25] M.J. Wirthlin, B.L. Hutchings, Improving functional density using
run-time circuit reconfiguration, IEEE Transactions on VLSI Systems
6 (2) (1998) 247–256.
[26] J.G. Eldredge, B.L. Hutchings, Run-time reconfiguration: a method
for enhancing the functional density of SRAM-based FPGA’s,,
Journal of VLSI Signal Processing 12 (1996) 67–86.
[27] M. Vasilko, D. Ait-Boudaoud, Scheduling for dynamically reconfi-
gurable FPGAs, In Proceeding of International Workshop on Logic
and Architecture Synthesis, IFIP TC10 WG10.5, Grenoble, France
(1995) 328–336.
[28] D. Demigny, L. Kessal, R. Bourguiba, N. Boudouani, How to use high
speed reconfigurable fpga for real time image processing?, Proceed-
ings of IEEE Conference on Computer Architecture for Machine
Perception, IEEE Circuit and Systems, Padova September (2000)
240–246.
[29] N. Demassieux, Architecture VLSI pour le traitement d’images: Une
contribution a l’etude du traitement materiel de l’information, PhD
Thesis, Ecole nationale superieure des telecommunications (ENST),
1991.
[30] C. Tanougast, Y. Berviller, S. Weber, Optimization of motion
estimator for run-time reconfiguration implementation, in: J. Rolim
(Ed.), Parallel and Distributed Processing, Lecture Notes in Computer
Science, vol. 1800, Springer, 2000, pp. 959–965.
[31] Atmel AT40k Datasheet.
[32] N. Togawa, M. Ienaga, M. Yanagisawa, T. Ohtsuki, An area/time
optimizing algorithm in high-level synthesis for control-base hard-
wares, Proceedings of the ASP-DAC 2000, IEEE circuits and systems,
Yokohama, Japan (2000) 309–312.
[33] A. Sharma, R. Jain, Estimating architectural resources and perform-
ance for high-level synthesis applications, IEEE Transactions on
VLSI Systems 1 (2) (1993) 175–190.
[34] A. Bogliolo, R. Corgnati, E. Macii, M. Poncino, Parameterized RTL
power models for soft macros, IEEE Transactions on VLSI Systems 9
(6) (2001) 880–887.
[35] A. Garcia, W. Burleson, J.L. Danger, Power modelling in field
programmable gate arrays (FPGA), in: P. Lysaght, J. Irvine, R.
Hartenstein (Eds.), Lecture Notes in Computer Science, vol. 1673,
1999, pp. 396–404, Glasgow, UK.
Camel Tanougast received his PhD degree in Microelectronics and Electronic Instrumentation from the University of Nancy 1, France, in 2001. He is currently a researcher at the Electronic Instrumentation Laboratory of Nancy (LIEN). His research interests include the design and implementation of real-time processing architectures, FPGA design and Terrestrial Digital Television (DVB-T).
Yves Berviller received the PhD degree in electronics in 1998 from the Henri Poincare University, Nancy, France. He is currently an assistant professor at Henri Poincare University. His research interests include computer vision, system-on-chip development and research, FPGA design and Terrestrial Digital Television (DVB-T).
Philippe Brunet received an MSc degree from the University of Dijon, France, in 2001. He is currently a PhD student in Electronic Engineering at the Electronic Instrumentation Laboratory of Nancy (LIEN), University of Nancy 1. His main interests are FPGA design and computer vision.
Serge Weber received the PhD degree in electronics in 1986 from the University of Nancy, France. In 1988 he joined the Electronics Laboratory of Nancy (LIEN) as an Associate Professor. Since September 1997 he has been Professor and Manager of the Electronic Architecture group at LIEN (University Henri Poincare Nancy 1). His research interests are reconfigurable and parallel architectures for image and signal processing and for intelligent sensors.
Hassan Rabah received the PhD degree in electronics in 1992 from the Henri Poincare University, Nancy, France. He is currently an assistant professor at Henri Poincare University. His research interests include system-on-chip development and research, digital signal processing and sensor applications.