
Vision-Based Egomotion Estimation on FPGA for Unmanned Aerial Vehicle Navigation

Maria E. Angelopoulou, Member, IEEE, and Christos-Savvas Bouganis, Member, IEEE

Abstract—The use of Unmanned Aerial Vehicles (UAVs) in commercial and warfare activities has intensified over the last decade. One of the main challenges is to enable UAVs to become as autonomous as possible. A vital component towards this direction is the robust and accurate estimation of the egomotion of the UAV. Egomotion estimation can be enhanced by equipping the UAV with a video camera, which enables a vision-based egomotion estimation. However, the high computational requirements of vision-based egomotion algorithms, combined with the real-time performance and low power consumption requirements that are related to such an application, cannot be met by general-purpose processing units. This work presents a system architecture that employs a Field Programmable Gate Array (FPGA) as the main processing platform connected to a low-power CPU that targets the problem of vision-based egomotion estimation in a UAV. The performance evaluation of the proposed system, using real data captured by a UAV's on-board camera, demonstrates the ability of the system to render accurate estimation of the egomotion parameters, meeting at the same time the real-time requirements imposed by the application.

Index Terms—Unmanned Aerial Vehicle (UAV), navigation, Field Programmable Gate Array (FPGA), egomotion estimation, optical flow

I. INTRODUCTION

UNMANNED AERIAL VEHICLES (UAVs) are highly suitable when aerial operations, such as inspection and surveillance, are required, and the presence of a pilot is dangerous, impossible or simply expensive [1]. This pertains to a wide range of applications, including search and rescue, aerial mapping and defence operations.

Early UAVs were deployed under minimum autonomy and were mainly involved in non-complex tasks. Modern UAVs aim at higher levels of autonomy, where the UAV can follow a pre-specified path defined through waypoints, performing flight stabilisation and other low-level operations. An integral part of an autonomous UAV is its navigation system and its supporting subsystems. The navigation system utilises information from various subsystems in order to estimate the UAV location and orientation. Such subsystems are a GPS-based subsystem, which provides the absolute coordinates of the UAV at any point in time, and a subsystem that allows the estimation of relative translation and rotation of the UAV using gyroscopes and accelerometers. Fusing information from both of these subsystems allows an enhanced estimate of the UAV position and orientation, as well as enhanced robustness in case one of the subsystems fails.

© 2013 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to [email protected]

In order to enhance further the estimation of the UAV position and orientation and its robustness to failures, a subsystem that computes the above information based on visual cues can be incorporated. Even though such subsystems have been utilised in land-based robots with great success [2], their inclusion in a UAV presents different challenges and specifications. The implementation of such a vision-based 3D motion estimation system entails strict specifications on a number of system parameters. These include the weight and size of the processing platform, its power consumption, the limited error margin that can be tolerated for the output 3D motion vectors and rotation parameters, and the real-time throughput requirements, which are due to the high operational speed of the UAV.

Specifically, the vision-based estimation of UAV motion in the 3D space is performed as follows. The collection of the required visual data is performed with an on-board digital camera that is looking downwards, and which is capable of capturing video at high frame-rate and high spatial resolution (Fig. 1). In the captured video frames, the points of the 3D scene are projected on the 2D image plane. Thus, the relative motion between adjacent frames is described with a set of 2D motion vectors that comprise a 2D motion field, known as the optical flow map. To extract the 3D motion from the 2D optical flow map, a process of solving the inverse problem is performed. The process of computing the motion of the vehicle in the 3D space is known as egomotion estimation [3].

This work proposes a system architecture capable of performing real-time vision-based UAV egomotion estimation, which is implemented on a Field Programmable Gate Array (FPGA) that is coupled with a low-power CPU. The overall system is fully compliant with the target specifications. Specifically, an FPGA offers a compact, lightweight and low-power processing platform, which can be programmed to perform a dedicated task. In image processing applications, this typically renders large processing gains, making the FPGA appropriate for UAV mounting and for performing egomotion estimation tasks. While the low-level and computationally intensive processing is executed on the FPGA, the low-power CPU is responsible for the high-level control of the system.

In the following sections, a detailed description of our system is presented, and the various design choices are analysed. Key characteristics of FPGA design, which optimise the utilisation of such devices and lead to large performance gains, are exploited. These include the exploitation of the parallel structure of the algorithm and its mapping to such a device, as well as the use of custom number representation and word-length


Fig. 1. (a) The Barnard Microsystems InView unmanned aircraft that is employed for the tests is equipped with an on-board digital camera. (b) The captured video sequence is used for UAV egomotion estimation purposes.

optimisation for efficient utilisation of the available resources. The above enable the implementation of an efficient, real-time system that renders highly accurate outputs, while coping with the strict timing constraints of real-time navigation.

This paper is structured as follows. Sect. II gives a brief account of vision-based egomotion and presents the system specifications and assumptions. A review of the related literature is presented in Sect. III, along with a justification of our algorithmic choices. Sect. IV presents an overview of our system, and Sections V-VII describe the individual units of the proposed real-time hardware architecture. Sect. VIII focuses on the performance evaluation of the proposed system, and Sect. IX concludes the paper.

II. BACKGROUND

A. System Specifications

The target system aims to estimate the UAV's 3D motion using solely vision cues, based on the frames of a continuous video stream that is captured by an on-board digital camera. As fully autonomous navigation is targeted, all processing is performed on board the UAV. The target UAV (Fig. 1) is equipped with an Atom processor that controls the high-level UAV functions and an FPGA coprocessor that takes as input the output of an on-board camera (Fig. 2). The vehicle has a wingspan of 4 m and a weight of 19.5 kg.

The accuracy in estimating the 3D motion parameters depends on the frame-rate of the captured images, their spatial resolution, the height of the UAV and its ground velocity at that time. Without loss of generality, we assume a UAV that flies in a straight line, horizontally, at a height H (in m) with a velocity v (in m/sec). Moreover, let's assume that the system utilises a camera system that produces frames with frame-rate a frames/sec and spatial resolution w × h pixels, having a focal length f (in mm). Finally, let's assume an image sensor with pixel width wp mm. Then, a pixel displacement of dp between the two frames gives rise to a ground displacement dxground for the UAV as follows:

$$dx_{ground} = \frac{w_p}{f} \times \frac{v}{a} \times H \times d_p$$

TABLE I
RELATION OF ALTITUDE AND GROUND DISPLACEMENT OF A UAV

Altitude (m)    Ground-Truth Displacement (m)
10              0.0017
100             0.0171
1000            0.1714

The above expression indicates that the higher the operating frame-rate and the spatial resolution of the images are, the more accurate the estimation of the UAV's motion parameters would be, and/or the UAV would be allowed to fly with a higher velocity while providing the same accuracy in the parameter estimation. Please note that the focal length also has an impact on the view angle of the camera and, as a consequence, on how much information is captured by the system. High values of focal length would be advisable when such a system is to be utilised by UAVs that fly at high altitude, whereas low values - and a larger field of view - are required when the UAV is intended to fly at low altitude. Assuming a common camera system with parameters a = 25 frames/sec, wp = 0.006 mm and f = 35 mm, Table I demonstrates the ground displacement of a UAV, which flies with a ground velocity of 25 m/sec, as a function of its altitude, when a 1-pixel displacement (dp = 1) occurs between consecutive frames.
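As an illustration of the relation above, the following Python sketch (an illustrative reconstruction, not part of the original paper) evaluates the ground displacement for the camera and flight parameters assumed in the text, reproducing the entries of Table I.

```python
# Sketch: ground displacement corresponding to a pixel displacement d_p,
# using the camera/flight parameters assumed in the text (Table I).
def ground_displacement(H, d_p=1, w_p=0.006, f=35.0, v=25.0, a=25.0):
    """H: altitude (m), d_p: pixel displacement, w_p: pixel width (mm),
    f: focal length (mm), v: ground velocity (m/s), a: frame-rate (frames/s)."""
    return (w_p / f) * (v / a) * H * d_p

for H in (10, 100, 1000):
    print(H, round(ground_displacement(H), 4))  # 0.0017, 0.0171, 0.1714
```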

B. Overview of Vision-Based Egomotion Estimation

The egomotion unit estimates the 3D camera motion through the information that is extracted from the captured 2D frames. Initially, the 2D optical flow map, which includes the 2D projection of the inter-frame motion across the image plane, is constructed. The construction can be either dense or sparse. To perform egomotion estimation, a sparse optical flow is usually constructed, as it leads to less computationally expensive and more robust results. In constructing such a sparse optical flow map, a feature selection process, also referred to as feature detection, is firstly performed, followed by a feature tracking module that calculates the 2D motion vectors [4]. Thus, a set of distinctive features are selected (first step) and the tracking operations (second step) are performed only on those features. The egomotion unit utilises the extracted information regarding the displacement of the tracked features between the two frames, in order to compute the 3D motion parameters of the system.

Figure 2 illustrates the high-level block diagram of the proposed vision-based navigation system. It consists of the Optical Flow Calculation and the Egomotion Estimation units. As sparse optical flow is considered, the former is further divided in two processing blocks, namely the Feature Selection (FS) [4] and the Feature Tracking (FT) blocks [5], [6]. All processing is performed on a pair of two adjacent frames, frame k and frame k + 1. The FS block identifies on the reference frame, namely frame k + 1, the N distinctive features xn = [xn yn]^T, for n ∈ [1, N]. Then, the FT block tracks these N features on the adjacent frame k, thus producing the motion vectors x'n = [x'n y'n]^T, for n ∈ [1, N].


[Fig. 2 diagram: Frame k+1 and Frame k (via a 1-frame delay) feed the Optical Flow Calculation unit, comprising Feature Selection and Feature Tracking, whose output drives Egomotion Estimation.]

Fig. 2. Overview of the system.


III. ALGORITHMIC CONSIDERATIONS

A. Optical Flow Estimation

The high computational load of optical flow estimation makes it inappropriate for implementation on general-purpose processors, when real-time applications with strict timing, power and size constraints are targeted. Thus, alternative solutions such as implementing optical flow calculation on Graphics Processing Units (GPUs) [7], [8] and specialized image processors [9] that can increase the processing speed compared to a CPU-based-only system have been investigated for their application to UAVs.

Recently, FPGA-based solutions to the optical flow estimation problem have been proposed [10]–[14], as they can offer higher processing speeds compared to GPU-based implementations for a fraction of the power. FPGA implementations of the Lucas and Kanade feature tracker [15], [16] are presented in [17] and [18]. In [12], the Horn and Schunck optical flow algorithm [19] is implemented, while [13] presents a tensor-based optical flow [20] FPGA implementation.

Contrary to the above works that construct a dense optical flow map, this work considers a sparse optical flow map. This decreases the number of features that are considered, thus reducing the number of required computations, and excludes non-distinctive features that can potentially disorientate the tracker and subsequently the final egomotion estimation.

1) Feature Selection: A number of well-established feature detectors can be found in the literature. They are mainly divided into two major classes. The detectors of the first class evaluate the image pixels as possible members of the selected feature set, based on the value of a corner response function across the image. The most well-established detectors of this class include the Harris detector [21] and the Shi and Tomasi method [4], in addition to the difference of Gaussians function that is applied in scale-space within the SIFT framework [22]. The detectors of the second class examine a small image patch in order to check if it is similar enough to a corner. These include the Smallest Univalue Segment Assimilating Nucleus (SUSAN) [23] approach, the method of [24] that considers a Bresenham circle around each pixel, the approach of [25], which is based on training a three-layer neural network, and the recently proposed Features from Accelerated Segment Test (FAST) methodology [26]–[28]. The detectors of this class typically achieve high speed due to their computational simplicity. However, they are not robust to the presence of noise.

Targeting a real-life application, robustness to noise is particularly important for a vision-based egomotion system, and thus a feature detector from the first class is employed. Among the detectors of this class, the Shi and Tomasi method is chosen, which is computationally simpler than [21] and [22] and suitable for hardware implementation, as will be demonstrated in the following sections. In fact, the Shi and Tomasi detector derives from a simplification of the Harris corner detector [4], but still typically exhibits higher performance [28].

The Shi and Tomasi framework is next briefly presented. Let F be the frame on which feature selection is applied, and let Fx and Fy denote the derivatives of F with respect to the x and y axes. The method of [4] considers a (2 × ωf + 1)² feature selection window, where ωf ∈ N, and computes for each pixel a 2 × 2 matrix G, as follows:

$$G = \sum_{x=p_x-\omega_f}^{p_x+\omega_f} \sum_{y=p_y-\omega_f}^{p_y+\omega_f} \begin{pmatrix} F_x^2(x,y) & F_x(x,y)F_y(x,y) \\ F_x(x,y)F_y(x,y) & F_y^2(x,y) \end{pmatrix} \qquad (1)$$

where (px, py) denote the pixel coordinates. A minimum eigenvalue λmin of the G matrix is then calculated at each pixel, as follows:

$$\lambda_{min} = \frac{g_{00} + g_{11}}{2} - \frac{\sqrt{4 g_{01} g_{10} + (g_{00} - g_{11})^2}}{2} \qquad (2)$$

where gij denotes the i-th row and j-th column entry of matrix G. The calculated λmin values are then sorted in descending order, and the pixels that correspond to the N largest λmin values are finally selected as the N features that the method returns.
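As a reference for the computation in (1)-(2), the following Python sketch (an illustrative software reimplementation, not the paper's hardware code) evaluates the minimum eigenvalue map over a frame; the derivative kernel (1/2, 0, -1/2) follows the convention used later in Sect. V.

```python
import numpy as np

def min_eigenvalue_map(F, omega_f=1):
    """Shi-Tomasi corner response: per-pixel minimum eigenvalue of G (eqs. 1-2).
    F: 2D array (grayscale frame). Returns lambda_min for interior pixels."""
    F = F.astype(float)
    # Central-difference derivatives with the (1/2, 0, -1/2) kernel (cf. Sect. V).
    Fx = np.zeros_like(F); Fy = np.zeros_like(F)
    Fx[:, 1:-1] = (F[:, 2:] - F[:, :-2]) / 2.0
    Fy[1:-1, :] = (F[2:, :] - F[:-2, :]) / 2.0
    h, w = F.shape
    lam = np.zeros_like(F)
    for py in range(omega_f + 1, h - omega_f - 1):
        for px in range(omega_f + 1, w - omega_f - 1):
            wx = Fx[py - omega_f:py + omega_f + 1, px - omega_f:px + omega_f + 1]
            wy = Fy[py - omega_f:py + omega_f + 1, px - omega_f:px + omega_f + 1]
            g00 = np.sum(wx * wx); g11 = np.sum(wy * wy); g01 = np.sum(wx * wy)
            # Eq. (2): minimum eigenvalue of the symmetric 2x2 matrix G.
            lam[py, px] = (g00 + g11) / 2.0 - np.sqrt(4 * g01 * g01 + (g00 - g11) ** 2) / 2.0
    return lam
```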

2) Feature Matching: Most of the aforementioned feature detectors come with a certain feature descriptor that is determined by the type of the employed detection approach. For example, the approach of [26], which considers a 16-pixel circle around each candidate pixel, utilizes the vector of the 16 intensities of that circle as a feature descriptor. These descriptors enable a feature matching process that produces the 2D motion vectors. Thus, the following processing chain executes: feature selection is applied on both frames of the input frame-pair, the descriptors of the selected features are computed, and feature matching is performed based on the descriptor values [26]. Feature matching is a cumbersome process that introduces significant latency. In addition, the level of success of feature matching depends on the amount of overlap between the real-world features of the feature sets that are extracted from the two frames. That is, for a real-world feature to be correctly tracked, it should be detected in both frames, otherwise the feature is lost. Changes in the perspective in the two frames, in addition to noise and blur that are typical in video sequences captured by a UAV, decrease the amount of overlap of the real-world features in the two sets. Thus, such egomotion systems should support high frame-rates in order to reduce the observed variation of the tracked features between successive frames, while special care should be given to the features close to the borders of the frame.

To avoid feature mismatching and keep the computational load of feature matching low, this work applies feature selection only on the most recent of the two consecutive video frames, as demonstrated in Fig. 2. A search is then applied within a search window around the selected pixel coordinates on the second frame. To cover a large range of velocities and tackle the robustness-accuracy trade-off that limits the performance of traditional feature trackers, a multi-resolution processing framework [6] is employed. In particular, a multi-resolution pyramid is constructed, and a bottom-up tracking processing is applied [6]. To decrease the tracking latency and increase the system throughput, the presented system employs optical flow calculation at the pixel level instead of the sub-pixel accuracy that is employed by the system of [6]. This keeps the number of processing cycles low, while giving sufficient accuracy for the subsequent egomotion estimation unit. In addition, to have a fixed number of cycles per search, the search window is scanned in a pixel-by-pixel fashion. In this manner, the Lucas and Kanade optical flow calculation step, which is employed within the multi-resolution processing framework of [6], is in effect transformed to a straightforward Block Matching (BM) [5] process that is applied along the levels of the pyramid.

B. Egomotion Estimation

A number of egomotion estimation methods are found in the literature [29]. The most well-established among these are those of Jepson and Heeger [30], Tomasi and Shi [31], Kanatani [32], Bruss and Horn [33] and Prazdny [34]. The above methods address the general-purpose egomotion problem, usually applied in ground robots, and are not a good match for the target application, as they entail high computational costs and do not use domain-specific information, e.g. the large distance between the camera and the ground. In fact, given the large distance between camera and scene and the real-time processing rate associated with real-time UAV navigation, the effective variations of distance and real-world 3D motion between adjacent video frames can be constrained, thus leading to simpler problem formulations and restricting the possible solutions to the egomotion problem.

To estimate egomotion in real time, the recent works of [35] and [36] use FPGA-based platforms for the on-line computation of the 3D motion of a ground vehicle that is moving on a planar surface. When a ground vehicle is considered, due to the limited degrees of freedom that specify its displacement at every instance, the 3D motion model is simplified to only two parameters, namely forward translation and yaw [35]. Such a motion model is insufficient for an aerial vehicle, which is our target application. Moreover, contrary to [35] and [36], which consider dense optical flow, in this work a sparse optical flow map is constructed and employed for egomotion estimation.

The recently proposed algorithm of [3] assumes that the distance between the camera and the scene remains constant between two adjacent video frames, all the objects in the scene have similar distance from the camera, and the camera translations/rotations are not large between adjacent frames. These assumptions are realistic for the target application, where the processing frame-rate is high, and the scene is at a large distance from the imaging system. The above simplifying assumptions render accurate 3D motion vectors with reduced computational load, as will be shown in Sect. VIII.

According to [3], the egomotion system performs as follows. The 2D motion approximation model [37] of a pixel xn = [xn, yn]^T, where xn, yn denote the x- and y-coordinates, is formed as:

$$\mathbf{x}'_n = \mathbf{c} + \begin{pmatrix} a_1 & a_2 \\ -a_2 & a_1 \end{pmatrix} \mathbf{x}_n + \begin{pmatrix} q_1 & q_2 & 0 \\ 0 & q_1 & q_2 \end{pmatrix} \begin{pmatrix} x_n^2 \\ x_n y_n \\ y_n^2 \end{pmatrix}, \qquad (3)$$

where c = [c1 c2]^T, a = [a1 a2]^T, and q = [q1 q2]^T correspond to the translational, affine, and quadratic camera motion, respectively. x'n denotes the translation vector of pixel xn [37]. Let m denote the 2N × 1 vector that derives by stacking the N x'n vectors, let p = [c1 c2 a1 a2 q1 q2]^T, and let M denote the 2N × 6 matrix that associates x'n with p. Then, vector p is computed as p = (M^T M)^{-1} M^T m.
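To make the least-squares step concrete, the following Python sketch (an illustrative reconstruction under the model of (3), not the paper's implementation) builds the 2N × 6 matrix M from the tracked features and solves for p. The paper forms p = (MᵀM)⁻¹Mᵀm explicitly; the sketch uses a numerically equivalent least-squares solver.

```python
import numpy as np

def estimate_p(x, x_prime):
    """x: (N,2) feature coordinates, x_prime: (N,2) their translation vectors.
    Returns p = [c1, c2, a1, a2, q1, q2] via least squares on the model of eq. (3)."""
    N = x.shape[0]
    M = np.zeros((2 * N, 6))
    m = x_prime.reshape(-1)                     # stack the N x'_n vectors into a 2N vector
    for n, (xn, yn) in enumerate(x):
        # Row for the x-component of x'_n: c1 + a1*xn + a2*yn + q1*xn^2 + q2*xn*yn
        M[2 * n]     = [1, 0, xn,  yn, xn * xn, xn * yn]
        # Row for the y-component of x'_n: c2 - a2*xn + a1*yn + q1*xn*yn + q2*yn^2
        M[2 * n + 1] = [0, 1, yn, -xn, xn * yn, yn * yn]
    # Equivalent to p = (M^T M)^{-1} M^T m, computed with a stable solver.
    p, *_ = np.linalg.lstsq(M, m, rcond=None)
    return p
```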

The 3D motion model consists of a translation component described by parameters (A, B, C), a planar rotation β around the optical axis and an out-of-plane rotation described by (θ, α) [3]. The 3D motion parameters are computed based on the elements of p as described in (4), where the justification for the mappings is given in [3].

$$\theta = \begin{cases} -\arctan(q_1/q_2) & \text{if } q_2 > 0 \\ -\arctan(q_1/q_2) + \pi & \text{if } q_2 < 0 \\ \pi/2 & \text{if } q_2 = 0 \text{ and } q_1 > 0 \\ -\pi/2 & \text{if } q_2 = 0 \text{ and } q_1 \le 0 \end{cases}$$

$$\alpha = \sqrt{q_1^2 + q_2^2}, \qquad \beta = a_2, \qquad A = c_1 + \alpha \sin\theta, \qquad B = c_2 - \alpha \cos\theta, \qquad C = -a_1 \qquad (4)$$
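The mapping of (4) is simple enough to state directly in code. The Python sketch below is an illustration only; in particular, the planar rotation β is taken here from the affine parameter a2, which is this sketch's reading of the mapping rather than a statement from [3].

```python
import numpy as np

def p_to_3d_motion(p):
    """Map p = [c1, c2, a1, a2, q1, q2] to (A, B, C, theta, alpha, beta) per eq. (4).
    Sketch assumption: the roll beta is taken from the affine parameter a2."""
    c1, c2, a1, a2, q1, q2 = p
    if q2 > 0:
        theta = -np.arctan(q1 / q2)
    elif q2 < 0:
        theta = -np.arctan(q1 / q2) + np.pi
    else:
        theta = np.pi / 2 if q1 > 0 else -np.pi / 2
    alpha = np.hypot(q1, q2)          # magnitude of the out-of-plane rotation
    beta = a2                         # planar rotation around the optical axis
    A = c1 + alpha * np.sin(theta)    # translation components corrected for rotation
    B = c2 - alpha * np.cos(theta)
    C = -a1                           # translation along the optical axis
    return A, B, C, theta, alpha, beta
```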

IV. OVERVIEW OF THE PROPOSED SYSTEM

The target system utilises a low-power CPU and an FPGA board. Modern UAVs are usually equipped with a CPU-based platform, which runs the main navigation code of the system and fuses information from the various sensors. The proposed solution aims to equip such a system with a vision-based egomotion module, meeting the outlined specifications by employing an FPGA device.

Our target is to implement most of the system on the FPGA. However, the matrix inversion step (Sect. III-B) requires considerably high numerical accuracy and is thus ideally performed in floating point arithmetic. On the FPGA, matrix inversion would take up a large number of the available resources, which we decided to avoid, given that matrix inversion is not part of the critical path, being executed only once per frame. Due to the above, the following Software/Hardware partitioning of the system is employed.

The FPGA device is responsible for processing the captured frames, building the optical flow map and constructing the 6 × 6 linear system, which includes the M^T M matrix and the M^T m vector (Sect. III-B). These are sent to the CPU for further processing. The CPU is responsible for providing the solution to the 6 × 6 linear system and for further utilising this information in the navigation algorithm.


[Fig. 3 diagram: the camera connects to the CPU, which communicates over USB 2.0 with the FPGA board hosting the FPGA and four RAM banks.]

Fig. 3. High-level view of the proposed system.

[Fig. 4 diagram: the incoming frame is written to one of RAM 0-3 (write enables we0-we3), while the others feed the Feature Selection, Multiresolution Block Matching and Egomotion Estimation units on the FPGA.]

Fig. 4. The high-level block diagram of the proposed system (FPGA board).


Figure 3 illustrates a high-level view of the proposed system. For practical reasons, the camera and the frame capturing are controlled by the CPU. The captured frames are then sent through a USB 2.0 interface to the FPGA board for further processing. The FPGA device performs the computationally expensive part of the algorithm and returns a set of data to be further processed by the CPU.
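One consequence of this partitioning is that the FPGA never has to store the full 2N × 6 matrix M: it can accumulate MᵀM and Mᵀm feature by feature as motion vectors are produced, so that only a 6 × 6 system is shipped to the CPU. The Python sketch below illustrates this accumulation (it is a software illustration of the idea, not the hardware description; the row construction follows the model of eq. (3)).

```python
import numpy as np

def accumulate_normal_equations(features):
    """features: iterable of ((xn, yn), (dx, dy)) pairs from the optical flow unit.
    Returns (MtM, Mtm) so that only a 6x6 system has to be solved for p."""
    MtM = np.zeros((6, 6))
    Mtm = np.zeros(6)
    for (xn, yn), (dx, dy) in features:
        # The two rows of M contributed by this feature (model of eq. (3)).
        rx = np.array([1, 0, xn,  yn, xn * xn, xn * yn])
        ry = np.array([0, 1, yn, -xn, xn * yn, yn * yn])
        MtM += np.outer(rx, rx) + np.outer(ry, ry)
        Mtm += rx * dx + ry * dy
    return MtM, Mtm

# On the CPU side, p is then obtained from the 6x6 system:
# p = np.linalg.solve(MtM, Mtm)
```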

A. FPGA board

To enable the real-time processing of each and every one of the incoming video frames, a quadruple buffering scheme is proposed, which employs four off-chip memory banks for the storage of frames. The high-level block diagram of the FPGA system consists of four off-chip Random Access Memories (RAMs) and three on-chip processing units (Fig. 4). The bit-width of the incoming pixels is 8 bits, while the memory locations of the off-chip RAMs are 32-bit wide. Therefore, the pixels are stored in groups of 4, thus increasing the effective memory bandwidth by a factor of 4.

The implemented frame-buffering scheme is controlled by a four-state Finite State Machine (FSM), which allocates at each frame period a particular role to each one of the four RAMs. Specifically, at each frame period, one of the RAMs stores the incoming frame, by appropriately setting the write enable signals (i.e. we0-we3), whereas the other three feed the processing blocks with data from the three previous frames. The most recent of these three frames, frame F, is the frame on which feature selection is being executed during the current frame period, and thus feeds the Feature Selection (FS) block. The other two frames, namely I and J, participate in the multi-resolution Block Matching (BM) and are read in parallel by the multi-resolution BM block. In particular, frame I is the reference frame that contains the selected feature set, whereas J is the second frame on which a search process is applied around the selected feature coordinates. Thus, at each frame period each RAM can have one of four possible roles. These are the following:

1) Load RAM: It is loaded with the newly arriving frame.
2) Feature Selection (FS) RAM: It contains frame F and thus feeds the FS processing block.
3) Feature Memory (FM) RAM: It contains frame I and thus feeds the first input of the BM block. It is directly connected to an on-chip Block RAM (BRAM), the FM BRAM, which stores the pixel neighborhood that is required to compose the selected feature up to the coarsest resolution.
4) Inter-Pass Memory (IPM) RAM: It contains frame J and thus feeds the second input of the BM block. It is directly connected to the IPM BRAM, which is an on-chip RAM that stores the search area around the selected feature coordinates.

TABLE II
THE FOUR STATES OF THE HIGH-LEVEL FSM

State \ RAM id    0       1       2       3
0                 Load    IPM     FM      FS
1                 FS      Load    IPM     FM
2                 FM      FS      Load    IPM
3                 IPM     FM      FS      Load

At each frame period, the high-level FSM allocates to the RAMs the aforementioned roles by controlling the quadruple frame-buffering scheme that is described in Table II. When State 3 is reached, the FSM goes back to State 0 and this four-state loop is constantly repeated as video processing continues.
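As an illustration of the scheduling in Table II, the sketch below (a behavioral model of the role rotation shown in the table, not a description of the actual HDL) cycles the four roles over the four RAM banks.

```python
ROLES = ["Load", "IPM", "FM", "FS"]   # roles of RAM ids 0..3 in state 0 (Table II)

def ram_roles(state):
    """Return the role of each RAM id (0..3) for the given FSM state (0..3).
    Each state rotates the assignment of Table II by one RAM."""
    return {ram: ROLES[(ram - state) % 4] for ram in range(4)}

# Behavioral check: in every state exactly one RAM is being loaded while the
# other three serve the FS, FM and IPM reads of the processing pipeline.
for state in range(4):
    print(state, ram_roles(state))
```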

V. THE FEATURE SELECTION UNIT

A. Feature Selection Approach

The feature selection algorithm of [4] treats the frame globally by selecting the best N features of the entire frame. This global treatment, which involves brute-force feature selection, may result in the concentration of the N selected features on a particular area or on a limited number of regions of high textural information. The motion vectors that are then computed on these features are also positioned in specific frame areas, leaving the rest of the frame without any motion information. This unbalanced distribution of the motion vectors makes the egomotion calculation less robust to noise and occlusions, as has been shown in a previous work by the authors [38], where the impact of applying geometric constraints to the feature selection for UAV navigation is investigated.

In order to scatter the motion vectors all over the frame, this work proposes a block-based feature selection approach that initially breaks down the frame into image blocks, the FS blocks. For each one of these blocks, the feature selection process returns the frame coordinates that correspond to the best feature of the particular block. The proposed system breaks down the frame into N blocks and renders for each block the frame coordinates, which correspond to the feature that maximises a desired objective function similar to [4].
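A compact way to express this block-based selection is to take, per FS block, the pixel with the largest λmin. The Python sketch below is an illustration (it consumes a per-pixel minimum-eigenvalue map such as the one produced by the earlier Shi-Tomasi sketch; it is not the hardware implementation).

```python
import numpy as np

def block_based_feature_selection(lam, block_size):
    """lam: per-pixel minimum-eigenvalue map (e.g. from the Shi-Tomasi sketch above).
    Returns one (row, col) feature per FS block: the pixel with the largest lambda_min."""
    h, w = lam.shape
    features = []
    for by in range(0, h - block_size + 1, block_size):
        for bx in range(0, w - block_size + 1, block_size):
            block = lam[by:by + block_size, bx:bx + block_size]
            dy, dx = np.unravel_index(np.argmax(block), block.shape)
            features.append((by + dy, bx + dx))
    return features
```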


B. Sparse Versus Dense Feature Set

This work considers a sparse feature set, which can significantly decrease the number of features that are employed in subsequent processing blocks, thus reducing the overall computational load. In particular, a dense feature set would render a dense optical flow map that would, for VGA frame sizes, consist of 307200 motion vectors. Such a large data volume would overburden the interface between the optical flow estimation unit and the subsequent egomotion estimation block and would increase the computational cost of egomotion estimation. In the current section, we show that the accuracy of the egomotion estimation output, which can be achieved with the suggested block-based sparse feature selection approach, is similar to the output accuracy that is obtained when dense optical flow is employed. For computing dense optical flow, the well-established Horn and Schunck optical flow algorithm [19], [39] has been implemented.

The graphs of Fig. 5 demonstrate the mean of the Normalized Root Mean Square Error (NRMSE) for the egomotion parameters between their estimated and ground-truth values, for various feature selection (FS) computation schedules and values of N. The input of the ground-truth experiment comprises an 11-frame semi-synthetic test sequence. This has been generated by applying both rotational and translational motion, of various angles and magnitudes, on the first frame of the captured video sequence of VGA (480×640) frame size, which is shown in Fig. 1. In this manner, the ground-truth egomotion parameters are known and are used to estimate the error.

As Fig. 5(a) demonstrates, when considering sparse optical flow, brute-force feature selection renders significantly higher egomotion estimation error values, compared to the suggested block-based approach. In addition, for N ≥ 280, which corresponds to an image block size of 30 (Table III), the errors of the brute-force feature selection approach are considerably higher than when employing dense optical flow. For N ∈ [96, 280], the block-based approach renders an output error value that is similar to the case where dense optical flow is employed. On the contrary, to obtain such low error with brute-force feature selection, a large number of features is required, as illustrated in Fig. 5(b), where N = {vs/2^4, vs/2^3, vs/2^2, vs/2}, with vs denoting the number of pixels of a VGA frame.

Based on the above error analysis, we have decided to employ block-based sparse feature selection, which renders similar final output accuracy as dense optical flow, for values of N as low as the [96, 280] range (Table III), for VGA frame sizes, while decreasing the computational load for subsequent egomotion estimation. A detailed assessment of the proposed approach, in addition to an evaluation of the required hardware resources, is presented in Sect. VIII.

C. Architecture of the Feature Selection Unit

The high-level block diagram of the Feature Selection unit is demonstrated in Fig. 6. The proposed system utilises a block-based feature extraction process, combined with the Shi and Tomasi framework [4]. A feature selection window of size 3 × 3, i.e. ωf = 1, is sufficient for feature selection [6] and is thus employed in the implemented system, as it reduces the hardware requirements.

[Fig. 5 plots: NRMSE versus N; (a) Sparse Brute-Force FS, Sparse Block-Based FS and Dense Feature Set; (b) Sparse Brute-Force FS and Dense Feature Set.]

Fig. 5. Mean NRMSE of the estimated egomotion parameters for the indicated FS approaches. The illustrated values of N are those employed for VGA frame size in the considered test cases, as described in Table III.

[Fig. 6 diagram: a block-based zig-zag RAM address generator reads frame F from the FS RAM; derivative calculation blocks (Fx, Fy windows) feed the Calculation of G and Minimum Eigenvalue blocks, and the Synchronizer and Max MinEig Coordinates blocks write the selected feature coordinates to the Coordinates BRAM.]

Fig. 6. High-level block diagram of the implemented Feature Selection unit.


Frame F is read from the FS RAM block by block, and each FS block is internally scanned in a zig-zag manner, as demonstrated in Fig. 7. Each FS block is brought on chip with an additional contour that is ωf + 1 pixels wide, since ωf adjacent pixels plus 1 pixel for the derivative calculations should be considered for the eigenvalue computation at the block's edges. For each pixel of the FS block, a minimum eigenvalue λmin of the G matrix [4] is calculated, and the pixel with the maximum λmin is the selected feature for that block.

The derivatives Fx and Fy of (1) are calculated using the derivative kernels (1/2, 0, −1/2) and (1/2, 0, −1/2)^T, respectively. A circular buffering structure as wide as the FS block, which consists of 2 line-buffers that feed a 3 × 1 register window, precedes the vertical derivative calculation. This provides at each cycle a 3-pixel column on which the vertical derivative kernel is applied. Each of the calculated Fx and Fy pixel streams is inserted in a circular buffering structure with the same width as the image block, which contains 2 line-buffers feeding a (2 × ωf + 1)² register window.


Fig. 7. Block-based zig-zag scanning of the frame in the FS off-chip RAM.



The Fx and Fy derivative windows are used by the Calculation of G block, which implements (1) by performing simple additions and multiplications. Specifically, 3 × (2 × ωf + 1)² multiplications are required, and thus 27 multipliers and 24 adders are used for ωf = 1.

The Minimum Eigenvalue block calculates at each pixel the λmin value of G, by implementing (2). The implementation includes 4 adders, 2 multipliers and 1 square root calculation. The latter is computed with the digit-by-digit binary square root algorithm [40]. While λmin is calculated for each pixel, the Synchronizer block also produces information regarding the frame coordinates of the pixel whose λmin is currently generated, and whether such a pixel corresponds to a contour pixel of the FS block and should be discarded. The above information is used by the Max MinEig Coordinates block, which identifies the pixel with the maximum λmin in the current block and stores its coordinates in the location of the Coordinates BRAM that is reserved for that block.
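For reference, a common textbook formulation of the digit-by-digit binary square root is sketched below in Python; it is a generic shift-and-subtract version assumed for illustration, not taken from [40].

```python
def isqrt_digit_by_digit(n, bits=32):
    """Integer square root by the digit-by-digit binary method.
    Processes two input bits per iteration, mirroring a shift-and-subtract datapath."""
    assert n >= 0
    root, rem = 0, 0
    for i in range(bits // 2 - 1, -1, -1):
        rem = (rem << 2) | ((n >> (2 * i)) & 0b11)   # bring down the next two bits
        trial = (root << 2) | 1                      # candidate: append bit 1 to the root
        root <<= 1
        if trial <= rem:
            rem -= trial
            root |= 1                                # accept the bit
    return root                                      # floor(sqrt(n))

assert isqrt_digit_by_digit(144) == 12
assert isqrt_digit_by_digit(145) == 12
```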

VI. THE BLOCK MATCHING UNIT

A. Architecture of the Multiresolution BM

To increase the search range of feature tracking, while keeping the computational cost low, multiresolution BM is employed. Let I denote the reference frame on which feature selection was applied at the previous frame period, and let J be the adjacent frame on which the selected features of I are tracked. To calculate the motion vectors between frames I and J, a multiresolution pyramid is constructed [6], and bottom-up BM processing is applied. In the remainder of the paper, High Resolution (HR) refers to the pixels of the original frames, which correspond to the finest resolution, whereas Low Resolution (LR) refers to the computed pixels of the lower levels of the multiresolution pyramid.

Let Lm denote the lowest level of the multiresolution pyramid, and let L ∈ [0, Lm] denote the pyramid level on which BM processing is currently applied. Moving to higher pyramid levels, with L < Lm, the resolution becomes finer, with the upper level L0 being occupied by the HR frame, which is the original frame. In particular, an LR pixel at level L corresponds to an area of 2^L × 2^L HR pixels on the original frame, and thus for the highest level it is L0 = 0. However, the size in pixels of the BM integration window remains constant for all pyramid levels. Due to the constant window size in pixels and the coarser resolution of the pixels at the lower pyramid levels, the integration window covers at lower levels a larger HR area on the original frame. Thus, the larger the HR area of frames I and J that participates in the BM processing, the larger the motion magnitudes that can be captured. The underlying HR area that is considered gradually becomes smaller moving to upper pyramid levels, until, for L0, the LR pixels of the integration window are actually the HR pixels of the original frames. Thus, at the upper levels small motions can be accurately determined.

[Fig. 8 diagram: streams of LR pixels at level L from the FM RAM/BRAM and the IPM RAM/BRAM (via circular buffers) fill 5×5 register windows that feed the RMSE/Displacement Calculation block, producing the displacement d and the next-level guess g^{L-1}.]

Fig. 8. High-level block diagram of the Multiresolution Block Matching unit.


The multiresolution BM processing begins from the lowest pyramid level, where rough motion estimates are calculated. These estimates are then gradually refined at the higher pyramid levels. Let εL denote the block matching error function at level L. Moreover, let d^L = [d^L_x d^L_y]^T be the residual pixel displacement vector that minimizes εL, and let g^L = [g^L_x g^L_y]^T be the initial guess for the pixel displacement vector at level L. Then, the initial guess g^{L−1} that is propagated to the next level L − 1 is:

$$\mathbf{g}^{L-1} = 2(\mathbf{g}^{L} + \mathbf{d}^{L}) \qquad (5)$$

The final output d for the currently processed feature derives as follows:

$$\mathbf{d} = \mathbf{g}^{0} + \mathbf{d}^{0} \qquad (6)$$

where g^0 and d^0 denote the initial guess and the calculated residual displacement vectors associated with the highest pyramid level.
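The refinement loop of (5)-(6) can be summarized by the following Python sketch (a software illustration of the coarse-to-fine search, with a hypothetical block_match helper that returns the residual displacement d^L minimizing the matching error within the search window; it is not the hardware datapath).

```python
import numpy as np

def pyramidal_block_match(block_match, feature, L_m, search_radius):
    """Coarse-to-fine displacement estimation following eqs. (5)-(6).
    block_match(level, feature, guess, search_radius) is assumed to return the
    residual displacement d^L (a length-2 array) that minimizes the match error."""
    g = np.zeros(2)                         # initial guess at the deepest level L_m
    for L in range(L_m, 0, -1):
        d = block_match(L, feature, g, search_radius)
        g = 2 * (g + d)                     # eq. (5): propagate the guess to level L-1
    d0 = block_match(0, feature, g, search_radius)
    return g + d0                           # eq. (6): final displacement d
```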

B. BM subsystem overview

Figure 8 presents the high-level block diagram of the multiresolution BM unit that performs feature matching. Matching is applied only for the features that have been selected by the Feature Selection block (Sect. V), and for each feature the processing is executed up to level L0 of the pyramid.

Two types of windows are needed at each resolution for multiresolution BM to be executed, namely the feature window and the search window. Within the search window, a moving tracking window slides in search of the best match with the reference feature window. This scanning process is executed as shown in Fig. 10. The sizes of the tracking and feature windows are the same and remain constant for all pyramid levels. This size is (2 × ωb + 1)², where ωb ∈ N. Typically ωb > ωf [6]. In Fig. 8, ωb = 2, and thus a 5 × 5 integration window is employed.

C. On-Chip BRAMs

The FM and IPM BRAMs store the HR pixels, which contribute to the formation of the LR pixels of the tracking and feature windows, respectively. Figure 9 visualizes the difference in the number of HR pixels that are stored in the FM BRAM and are required at each level. The same applies for the IPM BRAM, whose scanning is shown in Fig. 10.



Fig. 9. The content of the FM BRAM that is read at levels L2, L1, and L0.


Fig. 10. The beginning and ending of the IPM RAM/BRAM scanning process.


Each LR pixel of level L spans an area of 2^L × 2^L HR pixels, but the sizes of the feature and tracking windows remain constant at every pyramid level. Therefore, the number of HR pixels that should be brought on chip to form the LR pixels at the required resolution varies among different levels. Specifically, as shown in Fig. 9, for ωb = 2 and Lm = 2, this number is equal to ((2 × ωb + 1) × 2^{L2})², i.e. 20 × 20 pixels for level L2, whereas for the finest level L0, it is 5 × 5 HR pixels. Similarly, for the search window, the number of HR pixels for L2 is ((2s + (2 × ωb + 1)) × 2^{L2})², where s ∈ N is a parameter determining the size of the search window. Thus for L2, the search window is 44 × 44 pixels large, whereas for L0 it is only 11 × 11 pixels.
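The window dimensions quoted above can be checked with a few lines of Python (the value s = 3 is inferred here from the 44 × 44 and 11 × 11 figures; it is not stated explicitly in the text).

```python
def window_sizes(L, omega_b=2, s=3):
    """HR-pixel footprint of the feature window and the search window at pyramid level L."""
    feature = (2 * omega_b + 1) * 2 ** L            # side of the feature-window footprint
    search = (2 * s + (2 * omega_b + 1)) * 2 ** L   # side of the search-window footprint
    return feature, search

print(window_sizes(2))   # (20, 44) -> 20x20 and 44x44 HR pixels at level L2
print(window_sizes(0))   # (5, 11)  -> 5x5 and 11x11 HR pixels at level L0
```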

D. Multiresolution Processing

Figure 11 shows a more detailed block diagram of the multiresolution Block Matching unit, and illustrates the different treatment in the processing at the deepest level. In particular, the deepest level, which is L2 in Fig. 11, is the first processing level of multiresolution BM, and thus the inputs of the associated processing blocks are read directly from the FM and IPM off-chip RAMs. The HR pixels that are brought on chip are used right away, to perform BM at the deepest level, and are also stored in the on-chip BRAMs to be used by subsequent levels. In fact, all HR pixels that are needed at the upper levels are included in the pixel neighborhood that is brought on chip for the processing of Lm, as has been explained in the previous paragraph and can be observed in Fig. 9. Therefore, for L < Lm, all information that is required for the tracking of the particular feature is available on chip, which enables data reuse and leads to high performance gains.

[Fig. 11 diagram: FM/IPM RAM and BRAM addressing units read frames I and J; L2 and L1 Weighted Sum blocks fill the 5×5 feature and tracking windows, and the RMSE/Displacement Calculation blocks together with a Controller produce the output d.]

Fig. 11. The Block Matching unit as implemented for three pyramid levels.


For L = Lm, the FM RAM Addressing unit generates the addresses corresponding to the HR pixel neighborhood that surrounds the selected feature, whose coordinates are stored in the Coordinates BRAM of the Feature Selection unit (Fig. 6). The raw pixels that are brought on chip are both stored in the FM BRAM (Fig. 9), to be used by subsequent pyramid levels, and inserted into the Lm Weighted Sum unit. To produce a single Lm LR pixel, this unit adds up 2^{2Lm} weighted HR pixels, which is equivalent to implementing low-pass filtering and subsampling at a rate of 2^{Lm} in each direction. The calculated LR pixels are inserted in the (2 × ωb + 1)² feature window, i.e. a 5 × 5 register window for ωb = 2. For L < Lm, the FM BRAM Addressing unit selects a smaller area of (2 × ωb + 1) × 2^L HR pixels (Fig. 9).

In a similar manner, the IPM RAM is selectively read based on the addresses generated by the IPM RAM Addressing control unit. The scanned area includes the entire search area at level Lm, which comprises the search window. The LR pixels that are generated by the Lm Weighted Sum unit are inserted in a Circular Buffering unit, which includes (2 × ωb) BRAMs that feed a (2 × ωb + 1)² tracking window in a circular buffering structure. Thus, for ωb = 2, 4 BRAMs feed a 5 × 5 register window.

Contrary to the feature window, whose content remains unaltered during the processing of a given pyramid level, the content of the tracking window is updated at each cycle and contains the LR pixels that are included in the moving window of Fig. 10. The Root Mean Square Error (RMSE) between the tracking and feature windows is also calculated at each cycle. The minimum RMSE for level L corresponds to the displacement that renders the best match. This displacement indicates the initial guess g^{L−1} that is inserted in the IPM BRAM Addressing unit, for the selection of the appropriate (2 × ωb + 1) × 2^{2L} search window at level L − 1. In this manner a loop is formed, which is repeated for as many iterations as pyramid levels. The final output d for the currently processed feature derives as described in (6).



The block matching search for the current feature at a given pyramid level is over when all RMSE values between the pixels of the feature window and the various contents of the tracking window have been calculated within the corresponding search window. The block matching output corresponds to the position within the search window where the content of the tracking window gives the smallest RMSE with respect to the content of the given feature window. If more than one position happens to render the same RMSE value within the same search window, the final output is the position which is the closest to the center of the search window.
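A software rendition of this search, including the tie-break toward the centre of the search window, might look as follows (an illustrative sketch; the window extraction and scan order are simplified compared to the hardware).

```python
import numpy as np

def best_match(feature_win, search_area, omega_b=2):
    """Scan a tracking window of size (2*omega_b+1)^2 over search_area and return the
    displacement (dy, dx) of the window with the smallest RMSE against feature_win.
    Ties are resolved in favour of the position closest to the search-area centre."""
    k = 2 * omega_b + 1
    H, W = search_area.shape
    centre = np.array([(H - k) / 2.0, (W - k) / 2.0])
    best = None
    for y in range(H - k + 1):
        for x in range(W - k + 1):
            window = search_area[y:y + k, x:x + k]
            rmse = np.sqrt(np.mean((window - feature_win) ** 2))
            dist = np.linalg.norm(np.array([y, x]) - centre)
            if best is None or (rmse, dist) < (best[0], best[1]):
                best = (rmse, dist, y, x)
    # Convert the top-left corner of the best window into a displacement
    # relative to the centre of the search area.
    return best[2] - (H - k) // 2, best[3] - (W - k) // 2
```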

VII. THE FINAL EGOMOTION ESTIMATION STEP

Once the 2D motion field is extracted, the FPGA device constructs the 6 × 6 linear system as described in Sect. III-B and computes the corresponding M^T M matrix and M^T m vector. The above information is then fed into the CPU, which is responsible for the high-level control of the egomotion estimation unit, and the solution of the 6 × 6 linear system is computed by the CPU. Such a system architecture is beneficial, as (i) the solution of the linear system is not in the critical path, and as such there is no need for it to be executed on the FPGA, and (ii) this architecture provides more flexibility to the user on how the retrieved information will be utilized and later combined with other sensors (e.g. Global Positioning System (GPS), Inertial Measurement Unit (IMU)), for a better position estimation.

To increase computational robustness, instead of relying on a single linear system, which may not be well-conditioned, multiple linear systems are considered, and a process of averaging output parameters is employed. In particular, the FS blocks (Sect. V) are categorized into µ groups and, for every frame that has been processed, the FPGA sends to the CPU µ M^T M matrices and M^T m vectors. The CPU then solves the corresponding 6 × 6 linear systems. A standard Gaussian elimination method is applied to solve the obtained systems. The obtained parameters of (3) are then averaged, taking into account only the linear systems that are well-conditioned, and mapped to the 3D motion space using (4). A further step that maps the above parameters to Euler angles is performed for easier integration with the rest of the system.
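The CPU-side averaging could be sketched as follows (illustrative only; the well-conditioned test is implemented here with a condition-number threshold, which is an assumption since the paper does not state the criterion, and a generic solver stands in for the Gaussian elimination step).

```python
import numpy as np

def average_well_conditioned(systems, cond_threshold=1e6):
    """systems: list of (MtM, Mtm) pairs, one per interleaved group of FS blocks.
    Solves each sufficiently well-conditioned 6x6 system and returns the average p,
    or None if no group yields a reliable estimate for this frame."""
    solutions = [np.linalg.solve(MtM, Mtm)
                 for MtM, Mtm in systems
                 if np.linalg.cond(MtM) < cond_threshold]   # keep well-conditioned systems only
    if not solutions:
        return None
    return np.mean(solutions, axis=0)   # averaged p, to be mapped to 3D motion via eq. (4)
```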

To compute accurate 3D egomotion outputs for each group of FS blocks, its members should be scattered uniformly across the frame [38]. Thus, the µ groups are interleaved with each other, as demonstrated in Fig. 12 for µ = 4. Experiments have shown that a value of µ = 4 is effective for VGA frame sizes and FS block sizes of 44 × 44 pixels [38].

VIII. PERFORMANCE EVALUATION

The proposed system has been tested under various conditions using real captured video sequences from a camera that was placed on board a UAV. To be able to provide a reference flying path, a synthetic test sequence was generated using the actual frames of the captured video sequences. The above method allows the full performance evaluation of the developed system under various conditions. The obtained sequences have a certain degree of blur and noise, which reflects the challenges in processing these sequences. Figure 13 shows a frame from the captured sequences¹.

¹The data used for the evaluation of the system can be downloaded from http://cas.ee.ic.ac.uk/people/ccb98/research.php


Fig. 12. Forming µ groups of FS blocks for robust egomotion estimation.


Fig. 13. A frame of the captured sequences used for performance evaluation.


The target FPGA board is an RC300 Celoxica board that hosts a Xilinx Virtex-II 6000 FPGA. Handel-C was employed as the developing environment, which allowed early high-level exploration of the design space and fast prototyping of the system. To implement the host system, we used a netbook that was equipped with an Atom processor, interfacing with the FPGA board through USB 2.0. The selection of the FPGA board was based on its support for high external memory bandwidth due to its four Zero Bus Turnaround (ZBT) SRAM banks, and the selection of the Atom-based netbook was based on its weight and low-power requirements.

In total, 15 test scenarios have been considered, each corresponding to a particular image size and number of features, which is dictated by the size of the Feature Selection block (FS Block size). Table III illustrates these cases.

A. Resource Usage

The utilisation of the hardware resources is almost independent of the employed image sizes and FS block sizes across all 15 scenarios of Table III. In particular, the number of BRAMs is fixed at 30, which is 20% of the BRAMs available on the target Virtex-II 6000 FPGA device, while the utilisation of the number of FPGA slices has a marginal variation across the various configurations. In general, the number of utilised slices tends to slightly decrease for smaller image sizes and larger FS block sizes, and thus fewer feature points. However, the variation of the number of FPGA slices is marginal, and the occupied slices are estimated to be at 41% of the available Virtex-II 6000 slices for all 15 test scenarios. A detailed break-down of the utilised resources is given in Table IV.



TABLE III
TEST CASES

Test:            1     2     3     4     5     (Image Size 768×1024)
FS Block Size:   60    80    100   120   140
No. of Features: 192   108   63    48    35

Test:            6     7     8     9     10    (Image Size 480×640)
FS Block Size:   30    40    50    60    70
No. of Features: 280   165   96    70    48

Test:            11    12    13    14    15    (Image Size 240×320)
FS Block Size:   15    20    25    30    35
No. of Features: 234   140   88    54    40

TABLE IV
AVERAGE RESOURCE UTILISATION TARGETING A VIRTEX-II 6000 DEVICE FROM XILINX

Resource Type       Used     Available   Utilisation
# Occupied Slices   13,923   33,792      41%
# Block RAMs        30       144         20%
# MULT 18x18        59       144         40%

TABLE V
ATOM PC CHARACTERISTICS

Processor   Atom N270 @ 1.6GHz
Memory      1GB DDR2 800/6400


B. Throughput

The maximum frame-rate supported by the system is of great importance, as it dictates the system performance in the estimation of the motion parameters. The performance of our system was compared against two other software-only solutions. Figure 14 summarises the achieved frame-rates, for the test cases of Table III and for the three processing platforms stated in the legend. The first platform, referred to as Atom SW in the legend of Fig. 14, is a software version of the algorithm that runs on the Atom processor whose characteristics are given in Table V. The second solution, referred to as Intel SW in the legend, utilises a desktop PC equipped with an Intel Core 2 6300 @ 1.86GHz processor. Finally, the third platform, which is referred to as Atom HW in the legend, comprises the proposed FPGA-based solution.

The comparison of Fig. 14 demonstrates that the proposed system, which utilises the FPGA as an accelerator, can achieve significantly higher frame-rates for all of the test cases described in Table III. Such high frame-rates allow the UAV to better estimate its position than when employing either of the two considered software-only solutions under the same conditions. Comparing the Atom SW and Atom HW solutions, Fig. 14 indicates that the proposed FPGA-based system can deliver a frame-rate that is 20× higher than the software-only Atom system, implying an equivalently improved resolution in the motion parameter estimation under given conditions.

[Figure 14 plot: frame-rate (frames/s) versus test number (1-15) for the Atom HW, Intel SW and Atom SW platforms.]

Fig. 14. Frame-rate for the test cases of Table III and for the processing platforms indicated in the legend.


Figure 14 demonstrates that the frame-rate is mainly affected by the frame size and not by the number of features, or equivalently the FS block size (Table III), that is employed. This effect is clearly visible in Fig. 14 when moving from test case 10 to test case 11, i.e. from a 480×640 to a 240×320 frame size, where a significant frame-rate increase is observed.
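A back-of-the-envelope check of this observation is sketched below, under the assumption that the throughput is bound by the number of pixels streamed per frame rather than by the number of tracked features; the absolute frame-rates of Fig. 14 also depend on the memory interface and are not reproduced here.

```python
# Expected frame-rate ratio when the pipeline is assumed to be pixel-bound:
# halving each image dimension quarters the pixel count per frame.
def expected_frame_rate_ratio(size_a, size_b):
    """Ratio of frame-rates expected when moving from frame size A to frame size B."""
    return (size_a[0] * size_a[1]) / (size_b[0] * size_b[1])

# Moving from test case 10 (480x640) to test case 11 (240x320):
print(expected_frame_rate_ratio((480, 640), (240, 320)))  # 4.0
```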

C. Motion Parameter Estimation Analysis

To assess the performance of the system in evaluating the 3D UAV motion, two sets of tests are performed. The first set focuses on the performance evaluation of the system and its sensitivity to the actual motion of the UAV under the specific configuration of capturing video with 320×240 spatial resolution. This resolution was chosen because it allows the system to obtain high frame-rates, thus allowing its deployment on UAVs that operate at high velocities or close to the ground. The second set of tests focuses on the system performance evaluation under the various configuration parameters (Table III).

1) System Performance Under Various Motion Conditions: The performance of the proposed system was investigated under two motion conditions. Table VI summarises the range of possible displacements across all parameters under those conditions. In Condition 1, the motion between adjacent frames is more restricted than in Condition 2, thus modelling a UAV that flies at low velocity, or an on-board camera that operates at a high frame-rate. In Condition 2, the motion of the UAV between two adjacent frames is allowed to take larger values, thus modelling a faster-moving UAV.

Figures 15 and 16 show the histograms of the errors in the estimation of the translation and rotation parameters under the two aforementioned conditions. In most cases, the error remains adequately small: within 0.5 degrees for the rotation parameters and within 1 pixel for the translation parameters. Comparing the performance of the proposed system for the two motion conditions, it can be concluded that the smaller the 3D UAV motion, the more accurately it is estimated by the system.


TABLE VI
PARAMETER RANGES FOR FIRST SET OF TESTS

Parameter      Condition 1       Condition 2
θ (degrees)    [-0.58, 0.58]     [-1.18, 1.18]
ψ (degrees)    [-0.57, 0.57]     [-1.17, 1.17]
φ (degrees)    [-1.43, 1.43]     [-2.86, 2.86]
x (pixels)     [-2.38, 2.38]     [-4.76, 4.76]
y (pixels)     [-2.38, 2.38]     [-4.76, 4.76]
z (pixels)     [-0.99, 0.99]     [-1.99, 1.99]
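For reference, the sketch below shows one way per-frame test motions could be drawn from the ranges of Table VI; uniform sampling is an assumption made purely for illustration, not a statement about how the test data were actually generated.

```python
import random

# Ranges of Table VI, Condition 1: rotations in degrees, translations in pixels.
CONDITION_1 = {
    "theta": (-0.58, 0.58), "psi": (-0.57, 0.57), "phi": (-1.43, 1.43),
    "x": (-2.38, 2.38), "y": (-2.38, 2.38), "z": (-0.99, 0.99),
}

def sample_motion(ranges, rng=random):
    """Draw one inter-frame motion uniformly from the given parameter ranges
    (uniform sampling is an illustrative assumption)."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

print(sample_motion(CONDITION_1))
```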

[Figure 15 plot: six histogram panels of per-instance errors in theta, psi, phi (degrees) and the x, y, z coordinates (pixels), with the number of instances on the vertical axis.]

Fig. 15. Histograms of the errors of the rotation and translation parameters under Condition 1 (see Table VI).


2) System Performance Under Various System Configurations: A subset of the configurations of the system (Tests 6 up to 15; see Table III) is evaluated using the above semi-synthetic test sequences, with white Gaussian noise added to the frames. For each configuration, three noise scenarios are considered: no additional noise, 20 dB SNR and 10 dB SNR. In all cases, the trajectory of the UAV is the same, and the only differences are the configuration of the system and the resolution of the image sensor (i.e. 240×320 and 480×640). The results for the 768×1024 resolution are omitted due to space limitations. The range of the parameters is shown in Table VII. Note that the parameters have the same range as for Condition 2 in Table VI, but here x, y and z are reported in focal length units.
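A minimal sketch of such a noise injection is given below, assuming the SNR is defined as the ratio of the mean signal power of the frame to the variance of the added Gaussian noise; the exact definition used in the experiments is not stated here, so this is only one plausible reading.

```python
import numpy as np

def add_awgn(frame, snr_db, rng=np.random.default_rng()):
    """Add white Gaussian noise to an 8-bit greyscale frame at the requested
    SNR (dB), assuming SNR = 10*log10(mean signal power / noise variance)."""
    frame = frame.astype(np.float64)
    signal_power = np.mean(frame ** 2)
    noise_var = signal_power / (10.0 ** (snr_db / 10.0))
    noisy = frame + rng.normal(0.0, np.sqrt(noise_var), frame.shape)
    return np.clip(noisy, 0.0, 255.0)
```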

[Figure 16 plot: six histogram panels of per-instance errors in theta, psi, phi (degrees) and the x, y, z coordinates (pixels), with the number of instances on the vertical axis.]

Fig. 16. Histograms of the errors of the rotation and translation parameters under Condition 2 (see Table VI).

TABLE VII
PARAMETER RANGES FOR THE SECOND SET OF TESTS

Parameter                 Range
θ (degrees)               [-1.18, 1.18]
ψ (degrees)               [-1.18, 1.18]
φ (degrees)               [-2.86, 2.86]
x (focal length units)    [-0.11, 0.11]
y (focal length units)    [-0.11, 0.11]
z (focal length units)    [-0.05, 0.05]


Figure 17 demonstrates the Mean Square Error (MSE) in the estimation of the 3D motion parameters for the above tests and noise levels. Note that the translation parameters are reported in units of focal length in order to allow a performance comparison across cameras of different image resolution, as the two cameras have the same focal length (i.e. capture the same view). These values correspond to absolute translations, independent of the image resolution of the camera. The best results are obtained when using a 240×320 camera resolution, and more specifically with configuration 11. This configuration uses the largest number of features, leading to a robust estimation of the UAV egomotion. Close inspection of the results shows that a system employing a 480×640 resolution camera can achieve a better estimation of the position of the UAV when there is only in-plane motion (pure translation). However, the system's performance deteriorates when there is out-of-plane motion, as in the above experiment. This is attributed to the performance of the BM subsystem (Sect. VI-D), as the detected feature undergoes a more radical distortion and a larger tracking window is required for its localisation. The increased noise levels affect the system as expected in the case of the 240×320 camera resolution, whereas for the 480×640 camera resolution the injected noise does not have a clear impact on the parameters of interest. Moreover, the obtained results show the significance of selecting suitable values for the system parameters, as they have a considerable impact on the overall system performance and can even lead to a complete failure of the system, as illustrated by the configuration of Test 10, for which the selected features cannot be tracked correctly.
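The sketch below illustrates the error metric assumed to underlie Fig. 17: translation errors are divided by the focal length expressed in pixels before squaring, so that the resulting MSE is comparable across the two sensor resolutions. The function and variable names are illustrative only.

```python
import numpy as np

def translation_mse_focal_units(estimated_px, ground_truth_px, focal_length_px):
    """MSE of a translation parameter, with per-frame errors converted from
    pixels to focal length units so that different resolutions are comparable."""
    err = (np.asarray(estimated_px) - np.asarray(ground_truth_px)) / focal_length_px
    return float(np.mean(err ** 2))

def rotation_mse_degrees(estimated_deg, ground_truth_deg):
    """MSE of a rotation parameter, reported directly in degrees."""
    err = np.asarray(estimated_deg) - np.asarray(ground_truth_deg)
    return float(np.mean(err ** 2))
```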

D. Vision-based UAV navigation

The above numerical assessment is confirmed with a real-data experiment, which employs the 50-frame video sequence of Fig. 1. Figure 18 illustrates the evenly distributed optical flow that is obtained when block-based feature selection is employed, for frames 44-45 of the sequence and N = 100.


[Figure 17 plot: six panels of MSE versus test case (6-15): (a) x, (b) y and (c) z in focal length units; (d) θ, (e) ψ and (f) φ in degrees.]

Fig. 17. MSE in the indicated parameters, for various test configurations (6-15) and noise levels: no added noise (blue), 20 dB SNR (green), 10 dB SNR (red).

Fig. 18. Computed optical flow map for frames 44-45, for N = 100, using block-based feature selection.

Figure 19 demonstrates the computed egomotion parameters when applying the indicated feature selection approaches on the real-data sequence of Fig. 1, for N = 100. To map the translation parameters to metres, a focal length of 50×10⁻³ m and a mean UAV height of 150 m are considered. For each parameter set, Fig. 20 presents a flight simulation that visualises the estimated UAV motion for the last 20 frames of the sequence, i.e. frames 31-50. The UAV trajectory for brute-force feature selection (Fig. 20(a)) exhibits unrealistic discontinuities, while Fig. 20(b) illustrates the smooth flight trajectory rendered by the proposed block-based approach.
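Under the usual pinhole model, an image-plane displacement maps to metres on the ground by scaling with the scene depth, which for a downward-looking camera is approximately the flying height. A minimal sketch of this conversion, using the stated focal length of 50 mm and mean height of 150 m, is given below; note that when the translation is already expressed in focal length units, the focal length cancels out. This is an illustrative reading of the mapping, not the paper's exact procedure.

```python
# Pinhole-model conversion assumed here: an image-plane shift dx corresponds to
# a ground displacement of dx * (height / focal_length).
FOCAL_LENGTH_M = 50e-3   # 50 mm, as stated in the text
MEAN_HEIGHT_M = 150.0    # mean UAV height above the ground, as stated in the text

def image_shift_to_metres(dx_image_m, height_m=MEAN_HEIGHT_M, f_m=FOCAL_LENGTH_M):
    """Ground translation (m) for an image-plane shift given in metres on the sensor."""
    return dx_image_m * height_m / f_m

def focal_units_to_metres(dx_focal_units, height_m=MEAN_HEIGHT_M):
    """Ground translation (m) when the shift is expressed in focal length units."""
    return dx_focal_units * height_m

# Example: a shift of 0.01 focal length units corresponds to about 1.5 m on the ground.
print(focal_units_to_metres(0.01))
```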

[Figure 19 plot: per-frame traces of Θ, α, β (degrees) and A, B, C (m) versus frame number; (a) with brute-force feature selection, (b) with block-based feature selection.]

Fig. 19. Computed egomotion parameters for the real-data test, when employing the indicated FS methods and N = 100.

E. Comparison to existing work

Research in egomotion estimation has mainly focused on ground robots [35], [36], [41], for which the application requirements differ significantly from those of UAV egomotion estimation. Namely, in the UAV case, the problem can be simplified, as it can be safely assumed that the detected feature points lie on the same plane relative to the camera due to the altitude of the UAV. Moreover, due to the high flying velocity of a UAV, the required target frame-rate of the system is higher than for a ground robot.


[Figure 20 plot: 3D visualisations of the simulated flight path; (a) with brute-force feature selection, (b) with block-based feature selection.]

Fig. 20. Flight simulation for the parameters of Fig. 19, for frames f31-f50, when employing the indicated FS methods and N = 100.


In [36], the authors present a SoC system for the egomotion estimation of a ground vehicle, using a Virtex-4 FPGA (XC4FX60-10). Their system performs a dense optical flow estimation followed by the R2-Iterative algorithm to remove erroneous optical-flow estimates. The resulting system consumes around 80% of the available resources and achieves a maximum of 28 frames/sec, using a camera with 320×256 pixel resolution. In [41], the egomotion estimation of a walking robot is performed as part of a larger system. The employed motion model limits the degrees of freedom to four and thus has limitations in differentiating between rotation and translation along the x and y axes. In [35], the authors present an FPGA-based platform for the on-line computation of the 3D motion of a ground vehicle moving on a planar surface, having simplified the 3D motion model to only two parameters, namely forward translation and yaw.

Closer to the proposed work is that of [42], where the authors propose an insect-inspired motion detector for high-speed vision systems, which can be applied to the egomotion estimation of a UAV. The method is based on the construction of several receptive fields that respond to specific motion patterns. Even though the system achieves frame-rates similar to those of the proposed system, its performance evaluation has only been conducted with synthetic images containing well-defined patterns, and no quantifiable results are reported.

The proposed work deviates from the existing literature by taking advantage of the nature of the egomotion estimation problem in UAVs. Thus, a suitable, lightweight algorithm for the egomotion estimation of a UAV is employed, and the proposed architecture achieves frame-rates of up to 200 frames/sec (for 320×240 resolution). This allows the deployment of the system on UAVs that fly at high speeds or close to the ground, while maintaining the six-degrees-of-freedom motion model.

IX. CONCLUSIONS AND FUTURE WORK

This paper presents the design and architecture of a real-time system that estimates the egomotion of a UAV. By utilising special-purpose hardware, namely an FPGA device, real-time performance is achieved, thus allowing the inclusion of the proposed system in fast-flying UAVs. The presented architecture makes efficient use of the available FPGA resources, leading to a design that consumes approximately 41% of a Xilinx Virtex-II 6000 device. The assessment of the proposed system demonstrates the high performance gains that are achieved by utilising such accelerators in a CPU-based platform. This allows the UAV to fly at high speeds, while minimising its power consumption footprint. Future work will focus on the "intelligent" selection of the feature points to track for the optical flow estimation by utilising real-time machine learning techniques [43], [44], aiming to further improve the robustness of the system. Furthermore, real-time super-resolution techniques, such as in [45], will be investigated for a more accurate estimation of the feature points.

ACKNOWLEDGMENT

The research leading to these results was co-funded by the Technology Strategy Board UK. The work was substantiated in cooperation with Barnard Microsystems Ltd, who manufactured the UAV.

REFERENCES

[1] B. Sinopoli, M. Micheli, G. Donato, and T. Koo, "Vision Based Navigation for an Unmanned Aerial Vehicle," in IEEE International Conference on Robotics and Automation (ICRA), vol. 2, May 2001, pp. 1757–1764.
[2] C. F. Olson, L. H. Matthies, M. Schoppers, and M. W. Maimone, "Rover Navigation Using Stereo Ego-Motion," Robotics and Autonomous Systems, vol. 43, pp. 215–229, 2003.
[3] C. Jonchery, F. Dibos, and G. Koepfler, "Camera Motion Estimation Through Planar Deformation Determination," Journal of Mathematical Imaging and Vision, vol. 32, no. 1, pp. 73–87, Sep. 2008.
[4] J. Shi and C. Tomasi, "Good Features to Track," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 1994, pp. 593–600.
[5] K. W. Lim and J. B. Ra, "Improved Hierarchical Search Block Matching Algorithm by Using Multiple Motion Vector Candidates," Electronics Letters, vol. 33, no. 21, pp. 1771–1772, Oct. 1997.
[6] J.-Y. Bouguet, "Pyramidal Implementation of the Lucas Kanade Feature Tracker: Description of the Algorithm," Microprocessor Research Labs, Intel Corporation, 2002.
[7] M. Durkovic, M. Zwick, F. Obermeier, and K. Diepold, "Performance of Optical Flow Techniques on Graphics Hardware," in IEEE International Conference on Multimedia and Expo (ICME), Jul. 2006, pp. 241–244.
[8] J. Bodily, B. Nelson, Z. Wei, D.-J. Lee, and J. Chase, "A Comparison Study on Implementing Optical Flow and Digital Communications on FPGAs and GPUs," ACM Transactions on Reconfigurable Technology and Systems, vol. 3, no. 2, pp. 6:1–6:22, May 2010.
[9] M. V. Correia and A. C. Campilho, "Real-Time Implementation of an Optical Flow Algorithm," in International Conference on Pattern Recognition (ICPR), vol. 4, 2002, pp. 247–250.
[10] J. Díaz, E. Ros, R. Agís, and J. L. Bernier, "Superpipelined High-Performance Optical-Flow Computation Architecture," Computer Vision and Image Understanding, vol. 112, no. 3, pp. 262–273, Dec. 2008.
[11] J. Díaz, E. Ros, F. Pelayo, E. M. Ortigosa, and S. Mota, "FPGA-Based Real-Time Optical-Flow System," IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 2, pp. 274–279, Feb. 2006.
[12] J. L. Martín, A. Zuloaga, C. Cuadrado, J. Lázaro, and U. Bidarte, "Hardware Implementation of Optical Flow Constraint Equation Using FPGAs," Computer Vision and Image Understanding, vol. 98, no. 3, pp. 462–490, 2005.


[13] Z. Wei, D.-J. Lee, and B. Nelson, "FPGA-Based Real-Time Optical Flow Algorithm Design and Implementation," Journal of Multimedia, vol. 2, no. 5, pp. 38–45, Sep. 2007.
[14] H. Niitsuma and T. Maruyama, "High Speed Computation of the Optical Flow," in Image Analysis and Processing (ICIAP), Sep. 2005, pp. 287–295.
[15] J. L. Barron, D. J. Fleet, and S. S. Beauchemin, "Performance of Optical Flow Techniques," International Journal of Computer Vision, vol. 12, no. 1, pp. 43–77, 1994.
[16] B. D. Lucas and T. Kanade, "An Iterative Image Registration Technique with an Application to Stereo Vision," in International Joint Conference on Artificial Intelligence (IJCAI), Apr. 1981, pp. 674–679.
[17] F. Barranco, M. Tomasi, J. Díaz, M. Vanegas, and E. Ros, "Pipelined Architecture for Real-Time Cost-Optimized Extraction of Visual Primitives Based on FPGAs," Digital Signal Processing, vol. 23, no. 2, pp. 675–688, 2013.
[18] ——, "Parallel Architecture for Hierarchical Optical Flow Estimation Based on FPGA," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, no. 6, pp. 1058–1067, 2012.
[19] B. K. P. Horn and B. G. Schunck, "Determining Optical Flow," Artificial Intelligence, vol. 17, pp. 185–203, 1981.
[20] Z. Wei, D.-J. Lee, B. Nelson, and M. Martineau, "A Fast and Accurate Tensor-Based Optical Flow Algorithm Implemented in FPGA," in IEEE Workshop on Applications of Computer Vision (WACV), Feb. 2007, pp. 18–23.
[21] C. Harris and M. Stephens, "A Combined Corner and Edge Detector," in Alvey Vision Conference, Sep. 1988, pp. 147–151.
[22] D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, Nov. 2004.
[23] S. M. Smith and J. M. Brady, "SUSAN - A New Approach to Low Level Image Processing," International Journal of Computer Vision, vol. 23, no. 1, pp. 45–78, May 1997.
[24] M. Trajkovic and M. Hedley, "Fast Corner Detection," Image and Vision Computing, vol. 16, no. 2, pp. 75–87, Feb. 1998.
[25] P. G. T. Dias, A. A. Kassim, and V. Srinivasan, "A Neural Network Based Corner Detection Method," in IEEE International Conference on Neural Networks (ICNN), vol. 4, Nov. 1995, pp. 2116–2120.
[26] E. Rosten and T. Drummond, "Fusing Points and Lines for High Performance Tracking," in IEEE International Conference on Computer Vision, vol. 2, Oct. 2005, pp. 1508–1511.
[27] E. Rosten, R. Porter, and T. Drummond, "Faster and Better: A Machine Learning Approach to Corner Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 1, pp. 105–119, Jan. 2010.
[28] E. Rosten and T. Drummond, "Machine Learning for High-Speed Corner Detection," in European Conference on Computer Vision (ECCV), May 2006, pp. 430–443.
[29] T. Y. Tian, C. Tomasi, and D. J. Heeger, "Comparison of Approaches to Egomotion Computation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 1996, pp. 315–320.
[30] A. Jepson and D. J. Heeger, "Linear Subspace Methods for Recovering Translational Direction," in Spatial Vision in Humans and Robots, 1992, pp. 39–62.
[31] C. Tomasi and J. Shi, "Direction of Heading from Image Deformations," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 1993, pp. 422–427.
[32] K. Kanatani, "3-D Interpretation of Optical Flow by Renormalization," International Journal of Computer Vision, vol. 11, no. 3, pp. 267–282, Dec. 1993.
[33] A. R. Bruss and B. K. Horn, "Passive Navigation," Computer Vision, Graphics, and Image Processing, vol. 21, no. 1, pp. 3–20, Jan. 1983.
[34] K. Prazdny, "On the Information in Optical Flows," Computer Vision, Graphics, and Image Processing, vol. 22, no. 2, pp. 239–259, May 1983.
[35] Z. Wei, D.-J. Lee, B. E. Nelson, and J. K. Archibald, "Hardware-Friendly Vision Algorithms for Embedded Obstacle Detection Applications," IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, no. 11, pp. 1577–1589, Nov. 2010.
[36] M. Vanegas, L. Rubio, M. Tomasi, J. Díaz, and E. Ros, "On-Chip Ego-Motion Estimation Based on Optical Flow," in International Symposium on Applied Reconfigurable Computing (ARC), Mar. 2011, pp. 206–217.
[37] J. M. Odobez and P. Bouthemy, "Robust Multiresolution Estimation of Parametric Motion Models," Journal of Visual Communication and Image Representation, vol. 6, no. 4, pp. 348–365, Dec. 1995.
[38] M. Angelopoulou and C.-S. Bouganis, "Feature Selection with Geometric Constraints for Vision-Based Unmanned Aerial Vehicle Navigation," in IEEE International Conference on Image Processing (ICIP), Sep. 2011, pp. 2357–2360.
[39] A. Bruhn, J. Weickert, and C. Schnörr, "Lucas/Kanade Meets Horn/Schunck: Combining Local and Global Optic Flow Methods," International Journal of Computer Vision, vol. 61, no. 3, pp. 211–231, Feb. 2005.
[40] K. K. Parhi, Digital Signal Processing for Multimedia Systems, T. Nishitani, Ed. New York, NY, USA: Marcel Dekker, Inc., 1999.
[41] A. Laika, J. Paul, C. Claus, W. Stechele, A. Auf, and E. Maehle, "FPGA-Based Real-Time Moving Object Detection for Walking Robots," in IEEE International Workshop on Safety, Security and Rescue Robotics (SSRR), July 2010, pp. 1–8.
[42] T. Zhang, H. Wu, E. Borst, K. Kühnlenz, and M. Buss, "An FPGA Implementation of Insect-Inspired Motion Detector for High-Speed Vision Systems," in IEEE International Conference on Robotics and Automation (ICRA), 2008, pp. 335–340.
[43] M. Papadonikolakis and C. Bouganis, "A Scalable FPGA Architecture for Non-Linear SVM Training," in International Conference on Field Programmable Technology (FPT), 2008, pp. 337–340.
[44] ——, "Novel Cascade FPGA Accelerator for Support Vector Machines Classification," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, pp. 1040–1052, 2012.
[45] M. E. Angelopoulou, C. Bouganis, P. Y. Cheung, and G. A. Constantinides, "Robust Real-Time Super-Resolution on FPGA and an Application to Video Enhancement," ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 2, 2009.

Maria E. Angelopoulou received the M.Eng. degree from the Department of Electrical and Computer Engineering, University of Patras, Greece, and the PhD degree from the Department of Electrical and Electronic Engineering, Imperial College London, UK, in 2005 and 2008 respectively. Upon completion of her PhD, she became a postdoctoral research associate with the Circuits and Systems Research Group of the Department of Electrical and Electronic Engineering of Imperial College London, and worked on FPGA-based vision methods for real-time UAV navigation. In October 2009, she joined the Products Development Department of Intracom Defense Electronics, Athens, Greece, as an R&D engineer on hardware-based wireless multimedia communications. Since April 2011, she has been a researcher with the Communications and Signal Processing Research Group of the Department of Electrical and Electronic Engineering of Imperial College London, working on computer vision for autonomous vehicles. Her main interests are in computer vision, digital signal processing and real-time reconfigurable architectures. She has received several awards from the Greek State Scholarships Foundation and the Technical Chamber of Greece. She is a member of the IEEE, the IEEE Signal Processing Society and the Technical Chamber of Greece.

Christos-Savvas Bouganis is a Senior Lecturer with the Electrical and Electronic Engineering Department, Imperial College London, London, U.K. He has published over 40 research papers in peer-refereed journals and international conferences, and he has contributed three book chapters. His current research interests include the theory and practice of reconfigurable computing and design automation, mainly targeting digital signal processing algorithms. He currently serves on the program committees of many international conferences, including FPL, FPT, DATE, SPPRA, and VLSI-SoC. He is an Editorial Board Member of the IET Computers and Digital Techniques and the Journal of Systems Architecture. He has served as the General Chair of ARC in 2008 and the Program Chair of the IET FPGA Designers Forum in 2007. He is a member of the IEEE and ACM.