
J. Becker, M. Platzner, S. Vernalde (Eds.): FPL 2004, LNCS 3203, pp. 617–626, 2004. © Springer-Verlag Berlin Heidelberg 2004

Real Time Optical Flow Processing System

Javier Díaz, Eduardo Ros, Sonia Mota, Richard Carrillo, and Rodrigo Agis

Departamento de Arquitectura y Tecnología de Computadores, E.T.S.I. Informática, Universidad de Granada, Periodista Daniel Saucedo Aranda s/n, 18071 Granada, Spain

{jdiaz, eros, smota, rcarrillo, ragis}@atc.ugr.es

Abstract. We describe an optical flow processing system that works as a virtual motion sensor. It is based on an FPGA device; this enables easy modification of the configuration parameters to adapt the sensor to different motion speeds, lighting conditions and other environmental factors. We call it a virtual sensor because it consists of a conventional camera as front-end and a processing FPGA device which embeds the frame grabber, the optical flow algorithm implementation, the output module and some configuration and storage circuitry. To the best of our knowledge, this paper represents the first description of a fully working optical flow processing system that includes accuracy and processing speed measurements to evaluate the platform performance.

1 Introduction

Optical flow algorithms have been widely described in the literature. Some authors have addressed a comparative study of their accuracy on synthetic sequences [1]. Their evaluation on real-world sequences is difficult because the real optical flow of such sequences is unknown. We have chosen to implement a classical gradient model based on the work of Lucas & Kanade (L&K) [1, 2]. Several authors have pointed out the good trade-off between accuracy and efficiency of this model, which is an important factor when deciding which approach is most suitable for a real-time processing system. For example, in [1] the L&K algorithm provides very accurate results. Liu et al. [3] evaluate the efficiency vs. accuracy trade-off of different optical flow approaches and conclude that L&K is a good candidate. Finally, McCane et al. [4] also give L&K a good score and conclude that this approach requires affordable computational power. This has motivated other authors to focus on the L&K algorithm [5, 6].

In this paper we describe a hardware implementation of the L&K algorithm. Other authors have recently described hardware implementations of optical flow algorithms [7, 8, 9], but most of them do not provide results to evaluate the performance of the system, i.e. the accuracy and the computation speed. Our approach is a fully working system at conventional camera frame rates of 30 Hz, with image sizes of 320x240 pixels. To the best of our knowledge, this is the first description of such a system and therefore represents the state of the art in this area.


2 Optical Flow Model

Although the original algorithm was proposed as a method to estimate the disparity map in stereo image pairs [2], we have implemented Barron's description of the L&K algorithm, which can be applied to optical flow computation [1]. Several modifications have been added to improve the feasibility of the hardware implementation. Instead of using temporal FIR filters, we have implemented IIR filters as described in [10]. A FIR approach is also feasible on the prototyping platform used, because it includes four independent memory banks, but an IIR approach is much easier to configure for different time constants (only a single coefficient needs to be modified). On the other hand, the FIR approach requires variable memory resources and memory accesses depending on the time constant of the temporal filter.

Another modification is adopted to provide estimations when the aperture problem appears. In this situation, as described in [11], a small modification allows the model to provide an estimation in the direction of maximum gradient.

In the following equations we briefly describe the computations on which the L&K approach is based. We will refer to these computational stages when describing the system architecture. A detailed description of the L&K model is provided in [1, 2]. The algorithm belongs to the gradient-based techniques, which are characterized by a gradient search performed on extracted spatial and temporal derivatives. Under the assumption of constant luminance over time, the L&K method constructs a flow estimation based on first-order derivatives of the image. By least-squares fitting, the model extracts the motion estimate under the hypothesis of similar velocity values in a neighbourhood Ω of a central pixel. W(x) is a window that weights the constraints with higher weights near the centre of the spatial neighbourhood Ω.
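For reference, the weighted least-squares functional minimized by this fit (standard in the L&K formulation [1, 2]; notation as defined above) is:

$$E(\mathbf{v}) = \sum_{\mathbf{x} \in \Omega} W^2(\mathbf{x}) \left[ \nabla I(\mathbf{x}, t) \cdot \mathbf{v} + I_t(\mathbf{x}, t) \right]^2$$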

The known solution to this problem is:

$$\mathbf{v} = \left[ A^T W^2 A \right]^{-1} A^T W^2 \mathbf{b} \qquad (1)$$

$$A = \left[ \nabla I(\mathbf{x}_1), \ldots, \nabla I(\mathbf{x}_n) \right]^T, \quad W = \operatorname{diag}\left[ W(\mathbf{x}_1), \ldots, W(\mathbf{x}_n) \right], \quad \mathbf{b} = -\left[ I_t(\mathbf{x}_1), \ldots, I_t(\mathbf{x}_n) \right]^T \qquad (2)$$

An inherent limitation of these models appears in blank-wall or aperture-problem situations. In these cases the problem has no solution (the matrix A^T W^2 A is not invertible) and the model cannot provide any motion estimation. For this reason we have added a small constant α to the matrix diagonal, following [9], which allows us to estimate the normal velocity field in situations where the 2-D velocity cannot be extracted due to the lack of contrast information. The corresponding term of equation (1) is therefore computed with expression (3).

$$A^T W^2 A = \begin{bmatrix} \sum_{\mathbf{x} \in \Omega} W^2 I_x^2 + \alpha & \sum_{\mathbf{x} \in \Omega} W^2 I_x I_y \\ \sum_{\mathbf{x} \in \Omega} W^2 I_x I_y & \sum_{\mathbf{x} \in \Omega} W^2 I_y^2 + \alpha \end{bmatrix} \qquad (3)$$

Summarizing, we have to compute the 2x2 matrix of equation (3), its inverse andthe 2x1 matrix indicated in expression (4).


$$B = A^T W^2 \mathbf{b} = -\begin{bmatrix} \sum_{\mathbf{x} \in \Omega} W^2 I_x I_t \\ \sum_{\mathbf{x} \in \Omega} W^2 I_y I_t \end{bmatrix} \qquad (4)$$

Gaussian smoothing is the pre-processing stage before the computation of the image derivatives used in the matrix elements of equations (3) and (4). It reduces image noise and generates higher correlation between adjacent pixels. Typically, Gaussian spatio-temporal filters of 2 pixels variance plus a 5-pixel temporal derivative are used. These temporal operations require the storage of 15 images for the entire process. This is hardly affordable in embedded systems; therefore, as indicated in [10], a more efficient implementation can be obtained using IIR temporal recursive smoothing and derivative filters. In this way, the temporal storage requirement is reduced to 3 frames and the computation time is improved, at the cost of slightly reduced accuracy. For an extensive discussion of how to design the IIR filters see [10].
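The exact recursive filter design follows [10]. As a rough illustration of why an IIR scheme needs so little storage, a first-order recursive smoother paired with a frame-difference derivative might look like the sketch below; the single coefficient and the derivative pairing are our assumptions, not the design of [10]:

```python
import numpy as np

class TemporalIIR:
    """First-order recursive temporal smoothing with an associated
    temporal derivative estimate. Only one previous smoothed frame
    needs to be stored, and changing the time constant amounts to
    changing a single coefficient."""

    def __init__(self, shape, tau=2.0):
        self.a = 1.0 / tau             # single tunable coefficient
        self.s_prev = np.zeros(shape)  # previous smoothed frame

    def step(self, frame):
        # Recursive smoothing: s[t] = a*I[t] + (1 - a)*s[t-1]
        s = self.a * frame + (1.0 - self.a) * self.s_prev
        # Temporal derivative approximated from successive smoothed frames
        dt = s - self.s_prev
        self.s_prev = s
        return s, dt
```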

3 Hardware Implementation

Nowadays, software real-time computation of simple optical flow algorithms on small images is possible thanks to the outstanding computational power of PCs. The drawback is that it is difficult to adapt these systems for use as embedded solutions. In the approach presented here, the motion computation chip can be regarded as part of a smart sensor. Several hardware technologies can be used to implement such an embedded system. Application-specific integrated circuits (ASICs) can achieve sufficient power to allow real-time computation, but they are an expensive option. DSPs represent a valid alternative, but when considerable computational power is needed a single DSP is not powerful enough and a multiprocessor scheme must be designed [12]. The solution we propose is based on programmable logic devices (FPGAs). These circuits allow us to design a customized DSP circuit in a single chip of high computational power through intensive use of their intrinsic parallelism and pipelining resources. As we show in later sections, the proposed solution uses this technology to implement a real-time hardware device capable of working as a PC coprocessor or as a smart sensor in embedded applications.

For our design we have used the RC1000-PP board from Celoxica [13] and Handel-C [14] as the hardware specification language. This board is connected to the PC by a PCI bus and can be used as a hardware accelerator or as a prototyping board. It contains a 2-million-gate Virtex-E FPGA and four 2 MB SRAM memory banks accessible in parallel.

3.1 System Implementation Overview

The efficient implementation of the algorithm on an FPGA device requires exploiting the intrinsic processing parallelism of this kind of device. We use a segmented pipeline architecture, as shown in Fig. 1.


Fig. 1. Coarse pipeline processing architecture.

The basic computational stages in Fig. 1 can be summarised as follows:

S0. Frame grabber: receives the pixels from the camera and stores them in one of the external memory banks, using a double-buffering technique to avoid timing problems.
S1. Spatial Gaussian smoothing filters.
S2. IIR temporal filter: provides the temporal derivative and the spatio-temporally smoothed images.
S3. Spatial derivatives stage.
S4. Construction of the least-squares matrices of equations (3) and (4).
S5. Custom floating-point unit: the final velocity estimation needs the computation of a matrix inversion, which requires a division operation. At this stage the bit width of the incoming data is significant and expensive arithmetic operations are required. Fixed-point arithmetic then becomes too expensive, so we have designed a customized floating-point unit.

The computation bit depth increases throughout the pipeline structure. For example, for a high-precision system with low degradation, we use 8 bits in the first two stages, 12 bits in the third and fourth stages, 24 bits in the construction of the least-squares matrices and 25 bits in the floating-point unit, although a less hardware-expensive configuration has also been tested with good results. The computation of the least-squares matrices (S4) is the stage most expensive in computational resources. Different parallelism strategies can be adopted at this point.

The basic parameters of the pipeline structure are the latency (L) and the maximum number of cycles (MNC) required by the longest stage, which is the limiting factor of the computing speed. The circuit scheme gives us a basic relationship between the MNC and the system clock frequency (fclk) from which the computing speed in pixels per second (pps) follows: pps = fclk / MNC. For example, with fclk = 27 MHz and MNC = 10 cycles, the pipeline can sustain 2.7 Mpps (2700 Kpps).

Because stages S4 and S5 are the most expensive, the following subsections focus on their implementation and architectural design strategy.


3.2 Least Square Matrices Construction

This is a critical stage where the trade-off between efficiency and cost can be studied extensively. Equations (3) and (4) require the generation of five products: Ix², Iy², IxIy, IxIt and IyIt. We then have to compute a weighted sum with window W over a neighbourhood of size wx x wy. Due to memory limitations we store the Ix, Iy and It values instead of the five cross products. Therefore the operations to perform are: (a) product computation for all the elements in a neighbourhood, requiring 5 x wx x wy multiplications; (b) row convolutions, of which there are 5 x wy; and (c) column convolutions, of which there are 5. The scheme of these operations is shown in Fig. 3.

Fig. 3. Least-squares matrices construction circuit for a 3x3 neighbourhood.

This is an important stage where we can bias the trade-off between efficiency and hardware cost. The important parameters to choose are the neighbourhood weighted-sum area, the number of multiplication units and the number of row/column convolution units.

For example, if we use a 3x3 neighbourhood, we can use between 1 and 45 multipliers, 1 to 15 row convolution units and 1 to 5 column convolution units. This choice allows us to compute the weighted-sum values in one clock cycle with a highly parallel hardware unit, or to compute them sequentially. Results using different configurations are shown in section 4; a software sketch of the separable computation follows.
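As an illustration of the separable row/column scheme, here is a minimal NumPy sketch; the function and variable names are ours, and it models the arithmetic only, not the parallel circuit of Fig. 3:

```python
import numpy as np

def weighted_sums(Ix, Iy, It, w_row, w_col):
    """Compute the five weighted neighbourhood sums of equations (3)-(4)
    as separable row/column convolutions of the cross products.
    w_row, w_col: 1-D window weights (e.g. a 3- or 5-tap binomial)."""
    products = (Ix * Ix, Iy * Iy, Ix * Iy, Ix * It, Iy * It)
    sums = []
    for p in products:
        # A row convolution followed by a column convolution is
        # equivalent to a full 2-D weighted sum with a separable window.
        rows = np.apply_along_axis(np.convolve, 1, p, w_row, mode='same')
        full = np.apply_along_axis(np.convolve, 0, rows, w_col, mode='same')
        sums.append(full)
    return sums  # [S_xx, S_yy, S_xy, S_xt, S_yt]
```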

3.3 Final Velocity Calculation Using a Custom Floating Point Unit

At this stage expression (1) is computed. Until now the arithmetic operations have been performed using integer or fixed-point arithmetic with truncation. Convolution operations work well with this representation, but when the bit depth grows too large a floating-point data representation becomes better suited for hardware implementation. This is done with a customized superscalar floating-point unit whose


architecture is illustrated in Fig. 4. Since the previous stage uses a high bit depth (24 bits) to preserve the computation accuracy, this is a very expensive stage in terms of hardware resources. It is therefore a critical stage that strongly affects the accuracy vs. processing speed trade-off.

This stage computes the inverse of a matrix and the multiplication of a (2x2) matrix by a (2x1) vector. This calculation involves the basic arithmetic operations: subtraction, multiplication and division. The hardware structure of the unit developed to compute this is shown in Fig. 4.

Fig. 4. Floating-point unit scheme

When arithmetic operations are performed with a large bit depth, the signal delays associated with the carry lines degrade the global system performance, decreasing the maximum system frequency. To avoid this, pipelined arithmetic operators or sequential iterative operators can be used. The first option completes the computation in a few (1 or 2) clock cycles after a given latency, at an expensive cost in hardware resources. The second option takes several clock cycles, thus degrading the MNC of the system, but allows the same hardware to be reused in each iteration. We define a system which uses single-cycle floating-point circuits, because these work at the desired maximum clock frequency (without becoming the limiting stage), for all the operations except division, which is harder to implement. We have used a sequential hardware divider instead of a pipelined one; it needs 21 cycles to compute the division of 25-bit floating-point numbers, so on its own its MNC would be too high and would severely limit the pipeline performance. The chosen solution uses up to 3-way division units and, depending on the required system performance, we can synthesize more or fewer ways. Each floating-point unit needs: one to five fixed-point to floating-point converters; one to six 25-bit floating-point multipliers; one to three subtractors; and one to two divider units (if an n-way divider scheme is used, then n to 2n divider units). Results using different configurations are shown in section 4; a sketch of the interleaving idea follows.
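To see why a multi-way divider balances the pipeline, consider this simple toy model of round-robin dispatch (our own illustration, not the Handel-C implementation): each sequential divider is busy for 21 cycles, but with three of them accepting operands in turn a new division can start every 7 cycles on average, matching the MNC of the other expensive stages:

```python
DIV_LATENCY = 21   # cycles per sequential 25-bit division
N_WAYS = 3         # number of interleaved divider units

def issue_schedule(n_divisions):
    """Return the cycle at which each division can be issued when
    dispatching round-robin over N_WAYS sequential dividers."""
    free_at = [0] * N_WAYS          # cycle when each divider is next free
    issue = []
    for i in range(n_divisions):
        way = i % N_WAYS
        t = free_at[way]            # wait until that divider is free
        issue.append(t)
        free_at[way] = t + DIV_LATENCY
    return issue

# With back-to-back requests: 3 issues per 21 cycles, i.e. one per 7 cycles.
print(issue_schedule(6))  # [0, 0, 0, 21, 21, 21]
```

In the pipeline, operands arrive one per MNC cycles, so with MNC >= 7 each divider receives a new operand only every 21 cycles and never stalls the upstream stages.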

4 Hardware Resources Consumption Study

The system has been designed in a very modular way: the parallelism and the bit accuracy of the different stages can be easily modified. Thanks to the high level of abstraction that Handel-C provides [14], it is easy to manage the parallelism of the computing circuits and the bit depth at the different stages. Table 1 summarizes the hardware resources of the different stages on a XCV2000E-6 Virtex FPGA for a concrete implementation (called HSHQ in the following section).


Table 1. Hardware requirements of the individual sub-circuits on a Virtex XCV2000E-6. Note that the sum of the device percentages in the first column is larger than 100%; this is because these data were obtained from partial compilations, in which the synthesis tool makes generous use of the available resources. When the whole design is compiled it consumes 99% of the device.

Stage | Slices / (% of device) / equivalent gates | Computing cycles | ISE max. clock frequency (MHz) | Memory requirements / (% of device)
Spatial Gaussian (17 taps) | 220 / (1%) / 270,175 | 8 | 29.2 | 16 / (10%)
IIR filter | 134 / (1%) / 51,971 | 7 | 38.5 | 3 / (1%)
Spatial derivative convolution | 287 / (1%) / 121,296 | 7 | 28.0 | 7 / (4%)
Least-squares matrices construction | 15,288 / (79%) / 642,705 | 10 | 20.3 | 24 / (15%)
Superscalar floating-point unit | 5,720 / (29%) / 90,993 | 10 | 17.4 | 0

The last two stages have the largest MNC values. Note that lower MNCs are possible for the other stages, but there is no reason to improve them given the existing limiting stages. The results of the Xilinx timing analyser are not always accurate; in fact, it can underestimate the speed at which a circuit can run, leading the place-and-route tool to take much longer than it needs to. The maximum frequency allowed by the system has therefore been measured experimentally, and it is 10-20 MHz higher than the very conservative results given by ISE. This arises because the analyser looks at the static logic path rather than the dynamic one (see [15]), which is why we measure the maximum working frequency experimentally.

One important topic is the range of possible system configurations. We have evaluated several configurations to explore different trade-offs between accuracy, hardware cost and computing speed. All these configurations use the same basic architecture but different parallelism levels. Table 2 summarises the main properties of the different configurations. The ones using a 5x5 averaging window for the least-squares matrix neighbourhood we call high quality (HQ) approaches, and the ones using a 3x3 window, medium quality (MQ). Other modifiable parameters are the smoothing and spatial derivative filter sizes: the HQ and MQ models include 5-pixel derivative filters and 9-pixel Gaussians, while a low cost (LQ) version uses 3-pixel derivatives and a Gaussian filter of the same size.

If we fix the optical flow quality of the system, another factor to take into account is the performance vs. hardware cost trade-off. If the system works with maximum parallelism, the MNC is 10. Lower-cost approaches are possible if we reduce the parallelism level, thus increasing the MNC. For example, we implemented a high-speed (HS) version with MNC = 10 cycles using a three-way division unit and maximum parallelism. A slower version was implemented by reducing the parallelism. We call this


version medium speed (MS). Finally, we implemented a low-speed (LS) version. Table 2 summarises the performance and hardware costs of these systems.

Table 2. Performance and hardware cost of different configurations on a Virtex 2000-E FPGA (2 million gates and 640 Kbits of internal memory). (Kpps: kilopixels per second; Fps: frames per second.) All performance values were measured using a system clock frequency of fclk = 27 MHz. These measurements (Kpps and Fps) are underestimations because the measured computing time also includes data transmission to the prototyping board.

Version | % device occupation | % on-chip memory | Kpps | Image resolution | Fps (fclk = 27 MHz) | Max. fclk (MHz)
HSHQ | 99 | 17 / 31 | 1776 | 160x120 / 320x240 | 95 / 24 | 35
HSMQ | 65 | 16 / 31 | 1776 | 160x120 / 320x240 | 97 / 24 | 35
MSMQ | 43 | 16 | 625 | 160x120 | 33 | 35
LSLQ | 36 | 8 | 400 | 120x90 | 38 | 35

It is important to note that in our experiments data transmission to the prototyping board over the PCI bus takes about 40% of the total processing time; therefore higher frame rates can be expected using a direct connection between the camera and the FPGA. For instance, as explained in section 3.1, the theoretical throughput of the HSHQ version is 2700 Kpps. This topic is discussed at length in [16].

So far we have shown the system flexibility and the trade-off between number of gates and system performance. Another important topic is scalability at the level of functional units. All our results assume that only one computational unit is used. Local image processing algorithms can take advantage of the FPGA's partitioning possibilities: we can synthesize several computational units on the same FPGA (or on several) and compute larger images in real time. If a memory buffer is used, it is straightforward to assign a small image area to each computational unit and run them in parallel; the computational power is then increased by a factor equal to the number of computational units running in parallel. Within the pipelined computing structure, the same scalability principle has been used in the floating-point unit design, where we have implemented a three-way superscalar division unit. This reduces the number of cycles required by this stage from 21 to 7, yielding a well-balanced pipeline computing architecture.
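As a sketch of this unit-level scalability (our own illustration, not part of the paper's design): splitting the frame into horizontal stripes, with a few rows of overlap so that each unit can compute the neighbourhoods at its borders, lets k identical units work in parallel. The overlap needed depends on the filter supports.

```python
def stripe_bounds(height, n_units, overlap=4):
    """Split `height` rows into n_units stripes with `overlap` extra
    rows at each internal boundary (enough for the filter supports)."""
    base = height // n_units
    bounds = []
    for k in range(n_units):
        start = max(0, k * base - overlap)
        stop = min(height, (k + 1) * base + overlap)
        bounds.append((start, stop))
    return bounds

print(stripe_bounds(240, 4))  # [(0, 64), (56, 124), (116, 184), (176, 240)]
```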

5 Performance Evaluation

As mentioned in the introduction, the accuracy of the optical flow computation on real-world sequences is difficult to assess because the real flow of these sequences is unknown. Therefore, to evaluate the accuracy of our design, which depends on the bit depth of the different stages, we have adopted the test scheme and the synthetic sequence from the comparative study of Barron et al. [1]. The results using the HSHQ approach are summarized in Table 3.

The first row of Table 3 gives the accuracy of the L&K algorithm computed on a standard PC using double-precision variables and adopting the IIR


filters, using the error measure proposed in [17, 18]. The second row shows the performance obtained by our hardware implementation. It can be seen that the accuracy is reasonable, taking into account that fixed-point variables and restricted bit depths are used in this approach.
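For reference, the angular error measure of [17, 18] treats each flow vector (u, v) as a 3-D direction (u, v, 1) and takes the angle between the estimated and true directions. A minimal sketch (our own helper, not part of the paper's tool chain):

```python
import numpy as np

def angular_error_deg(u_est, v_est, u_true, v_true):
    """Angular error (degrees) between estimated and ground-truth flow,
    measured in the 3-D space of (u, v, 1) direction vectors [17, 18]."""
    num = u_est * u_true + v_est * v_true + 1.0
    den = np.sqrt((u_est**2 + v_est**2 + 1.0) *
                  (u_true**2 + v_true**2 + 1.0))
    cos_angle = np.clip(num / den, -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle))
```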

Table 3. Yosemite sequence results using the Fleet angular error measure [17, 18].

Model | Average error | Standard deviation | Density (%) | Parameters
L&K IIR software vs. real flow | 15.91° | 11.5° | 100 | λmin = 0, σxy = 0.8, τ = 2, α = 1
Hardware implementation vs. real flow | 18.30° | 15.8° | 100 | λmin = 0, σxy = 0.8, τ = 2, α = 1

We have also compared the performance of the software and hardware implementations using sinusoidal grating sequences. We used different stimulus frequencies (f0 = 0.02 and f0 = 0.05) and velocities (V = 0.25 ppf and V = 1 ppf). In these tests the hardware obtained results very similar to those of the software approach (less than 5% error in the speed estimation). The software implementation (standard C) runs at 30 fps on 160x120-pixel images on an AMD 1800+.
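As an illustration, a drifting sinusoidal grating with spatial frequency f0 (cycles/pixel) and speed V (pixels/frame) can be generated as follows; this is our own test-stimulus sketch, and since the paper does not specify the grating orientation a horizontal drift is assumed:

```python
import numpy as np

def grating_sequence(width=160, height=120, n_frames=30, f0=0.05, V=1.0):
    """Drifting sinusoidal grating: the ground-truth flow is (V, 0)
    at every pixel, which makes speed errors easy to measure."""
    x = np.arange(width)
    frames = []
    for t in range(n_frames):
        row = 128 + 127 * np.sin(2 * np.pi * f0 * (x - V * t))
        frames.append(np.tile(row, (height, 1)).astype(np.uint8))
    return frames
```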

6 Conclusions and Future Work

The system described here shows how an optical flow estimation circuit can be implemented on an FPGA platform, used as a special-purpose DSP, to achieve real-time computation. The paper describes a scalable architecture that can work with large image data at video frame rate.

Table 3 summarises the comparison between the software and hardware results on the Yosemite sequence test, using unthresholded results. It can be seen that the performance of the hardware is only slightly worse (a 2.39° increase in error) than the software version with 64-bit data precision. The results of the hardware implementation described in this paper are in the range of other software approaches considered in the study of Barron et al. [1]. Therefore the performance of the hardware is of reasonable quality, given that it computes in real time (at a speed of 1776 Kpps).

In the future we plan to address two main goals. The first is to study the bit depth needed for different applications; although this hardware approach is being used on real-world sequences with satisfactory results, it uses a high bit depth that could be reduced significantly. The second goal is to use multiscale computation to handle faster motion properly. Classical gradient models estimate velocities well for speeds below one pixel per frame, but faster motion produces temporal aliasing. The basic solution consists of computing motion at higher frame rates (which needs special and expensive cameras); alternatively, using multiscale approaches, the same cameras can be used.


Acknowledgments. This work has been supported by the V EU research framework funds through the European Project ECOVISION (IST-2001-32114).

References

[1] J. Barron, D. Fleet, S. Beauchemin: Performance of optical flow techniques. International Journal of Computer Vision, Vol. 12, No. 1, pp. 43-77, 1994.

[2] B. Lucas, T. Kanade: An iterative image registration technique with an application to stereo vision. Proc. DARPA Image Understanding Workshop, pp. 121-130, 1981.

[3] H. Liu, T. H. Hong, M. Herman, T. Camus, R. Chellappa: Accuracy vs. efficiency trade-offs in optical flow algorithms. Computer Vision and Image Understanding, Vol. 72, No. 3, pp. 271-286, 1998.

[4] B. McCane, K. Novins, D. Crannitch, B. Galvin: On benchmarking optical flow. Computer Vision and Image Understanding, Vol. 84, pp. 126-143, 2001.

[5] S. Baker, I. Matthews: Lucas-Kanade 20 years on: a unifying framework. International Journal of Computer Vision, Vol. 56, No. 3, pp. 221-255, March 2004.

[6] A. El Gamal: Optical flow estimation using high frame rate sequences. Proceedings of the 2001 International Conference on Image Processing, Vol. 2, pp. 925-928, 2001.

[7] P. Cobos, F. Monasterio: FPGA implementation of the Horn & Schunck optical flow algorithm for motion detection in real-time images. Proc. of the XIII Design of Circuits and Integrated Systems Conference, pp. 616-621, 1998.

[8] P. Cobos, F. Monasterio: FPGA implementation of the Camus correlation optical flow algorithm for real-time images. Proc. of Int. Conf. on Vision Interface, pp. 7-9, 2001.

[9] S. Maya-Rueda, M. Arias-Estrada: FPGA processor for real-time optical flow computation. Lecture Notes in Computer Science, Vol. 2778, pp. 1103-1016, 2003.

[10] D. J. Fleet, K. Langley: Recursive filters for optical flow. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 17, No. 1, pp. 61-67, 1995.

[11] E. P. Simoncelli, E. H. Adelson, D. J. Heeger: Probability distributions of optical flow. Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Maui, Hawaii, June 1991.

[12] T. Rowekamp, M. Platzner, L. Peters: Specialized architectures for optical flow computation: a performance comparison of ASIC, DSP, and multi-DSP. Proc. ICSPAT'97, 1997.

[13] www.celoxica.com

[14] Handel-C Language Reference Manual. Celoxica, 2003.

[15] Celoxica Application Note AN 68 v1.1: Timing Analysis and Optimisation of Handel-C Designs for Xilinx Chips.

[16] D. Benitez: Performance of reconfigurable architectures for image-processing applications. Journal of Systems Architecture, Vol. 49, No. 4-6, pp. 193-210, 2003.

[17] D. J. Fleet, A. D. Jepson: Computation of component image velocity from local phase information. International Journal of Computer Vision, Vol. 5, No. 1, pp. 77-104, 1990.

[18] D. J. Fleet: Measurement of Image Velocity. Kluwer Academic Publishers, 1992.