
An FPGA implementation of the SURF algorithm for the ExoMars programme


G. Lentaris, I. Stamoulias, D. Diamantopoulos, K. Siozios, and D. Soudris

School of ECE, National Technical University of Athens, Greece
[email protected], [email protected],

{diamantd,ksiop,dsoudris}@microlab.ntua.gr

Abstract. Achieving highly accurate feature extraction in a short period of time is very important for space applications based on computer vision algorithms, such as those of the ExoMars programme of ESA. The paper describes a HW/SW co-design scheme using FPGAs to speed up the SURF algorithm when executed on a CPU of low computational power. It performs algorithmic analysis and restructures the SURF steps to lead to an efficient hardware implementation, while at the same time it exploits the advantages of the CPU. The HW architecture is implemented on a Xilinx Virtex 6 FPGA achieving real-time performance with low hardware cost and highly accurate results.

Keywords: SURF, HW/SW co-design, FPGA, Rover Navigation

1 Introduction

A number of space missions to Mars have been designed during the last 15 years to provide significant information regarding the planet's environment and evidence about exobiology. The ExoMars programme [7], established by the European Space Agency (ESA), is designed to further investigate the Martian environment and to demonstrate new technologies paving the way for future Martian sample-return missions. The ExoMars rover will rely on autonomous navigation and localization systems using computer vision (CV) algorithms, which will allow high mobility and facilitate sampling from remote sites. The SPARTAN project of ESA (SPAring Robotics Technologies for Autonomous Navigation) targets the efficient implementation of CV algorithms to support the ExoMars rover.

Robust navigation over long distances requires an accurate method for tracking the rover's position on the Martian surface. To this end, the Visual Odometry process determines the robot's position and orientation by analyzing the associated camera images. It relies on feature extraction to detect and describe multiple visual features across frames. The features are matched between consecutive frames to deduce their 3D distance and estimate the robot's motion. The most desirable properties for a feature extractor are repeatability and robustness with respect to noise. In the SPARTAN project, we implement the SURF algorithm (Speeded Up Robust Features) [1] to acquire both fast and accurate results. Compared to earlier feature extraction algorithms, SURF shows

improved performance due to the use of integrals for image convolution, its Hessian matrix-based metric used during detection, and its distribution-based descriptor. To meet the constraints imposed by ESA, we implement SURF using HW/SW co-design, achieving both sufficient speed-up of the process and highly accurate motion predictions.

The main efforts of related works on the hardware implementation of SURF have focused mainly on the detector part of the algorithm [3][4][5][6]. In [3], only the Fast-Hessian detector part of SURF is implemented in HW, while the descriptor is handled entirely by SW executed on the PowerPC-440 of the XC5VFX70 FPGA. The presented hardware block supports images of up to 1024×1024 pixels and uses 2 SURF octaves. The work of [4] also implements only the feature detector in HW to process 640×480 images at 56 frames per second. In [5], a complete detector and descriptor stage of SURF has been implemented and validated on a Virtex 5 XC5VFX70T FPGA. Again, a PowerPC was used to run a software-optimized version of SURF and to communicate with the HW cores speeding up the detection process.

The current paper presents the design and development of SPARTAN-SURF. Section 2 describes the HW/SW co-design, while Section 3 presents the architecture of SPARTAN-SURF. Implementation results are given in Section 4.

2 Feature Extraction with HW/SW co-design

Our approach to developing the Feature Extraction engine of the SPARTAN project is based on a HW/SW co-design exploiting the best of both worlds: 1) the highly accurate arithmetic of a CPU, its increased memory capacity, and the short development time, with 2) the parallelization capabilities of the FPGA allowing significant speed-up of image processing. Based on a detailed study of the SURF algorithm [1] and an extensive profiling of the OpenSURF implementation [2], we perform a HW/SW partitioning tailored to the SPARTAN project specifications, as shown in Fig. 1. Based primarily on the execution time and secondarily on the communication and memory requirements of each SURF function, we combine certain algorithmic steps to create the modules Coarse Detection and Box Description, which will be executed on HW.

Coarse Detection generates the response maps and performs a first, rough selection of interest points via non-maxima suppression. Box Description computes the 4 values describing a given sub-area (1/16th) of the interest point. The entire point descriptor is constructed via 16 successive box descriptions. On the SW side, Interpolation refines the locations of the interest points (ipts) detected by the HW, Orientation determines their direction, whereas Ipts Disassembler and Assembler divide each ipt into the 16 boxes sent to HW for description and then collect the results. These four modules consume little time when executed on SW, while they perform highly accurate refinement of the results by using floating-point arithmetic and advanced math operations. By contrasting their low SW requirements (time) with their high HW requirements (expected LUTs and DSPs), we decide to implement them on SW.

Fig. 1. HW/SW Partitioning of the SURF algorithm for the SPARTAN project.

Coarse Detection and Box Description consume the majority of the execution time of the original SURF (approx. 82%). Furthermore, they only involve add/sub and multiplication operations, which can be supported by fixed-point arithmetic. In contrast to the other SURF functions requiring complex floating-point math operations, these modules allow for significant HW cost reduction without compromising the quality of the final results.

Besides time optimization, the proposed partitioning scheme also considers communication and memory constraints. To this end, SW Ipts Disassembling and Assembling support HW in processing the ipts in a divide-and-conquer fashion, such that only a small part of the integral image needs to be stored on-chip at each time. To reduce the data communicated between SW and HW, first of all, we avoid transmitting the response maps (generated on HW due to their increased complexity). We do so by also implementing on HW (within Coarse Detection) the non-maximum scale-space suppression function of SURF. Otherwise, this low-complexity SW function would receive from HW an excessive amount of data (similar to 3 images of 512×384 pixels). To further reduce communication, the Integral Image (fig. 1) will be computed both on SW and HW, potentially in parallel, with low SW and HW cost (transmitting the integrals at 100 Mbps is expected to consume more time than computing them, by approx. ×5).

3 SPARTAN-SURF Architecture

The proposed architecture is divided into three major parts (fig. 1): Integral Image, Feature Detection (coarse detection, interpolation, orientation), and Feature Description (ipts disassembling, box description, ipts assembling). The following sub-sections present their design, focusing mainly on the HW modules as well as on our SW modifications of the OpenSURF implementation [2].

3.1 Integral Image

The HW Integral Image serves as a small pipe transforming the incoming pixels to integral values, which are then forwarded to the remaining HW modules. In principle, it computes the recursive function

f(x, y) = 0, if x < 0 or y < 0
f(x, y) = I(x, y) + f(x−1, y) + f(x, y−1) − f(x−1, y−1), otherwise

where (x, y) are coordinates on the image and I is the intensity value of the (x, y) pixel. That is, the module combines each incoming (x, y) pixel with the 3 previously computed integral values that are adjacent to the (x, y) position. It inputs 512×384 image pixels (in raster-scan order, 1 per cycle) and outputs 512×384 integral values (also in raster-scan order, 1 per cycle) in a continuous pipelined fashion with 5 cycles of latency. The 5-stage pipeline utilizes 3 adders/subtractors to combine the values, 8 registers to temporarily store the values and synchronize their continuous flow, and 1 FIFO memory (circular buffer) to store 1 row of the integral image (the one previous to the currently processed). The registers within the pipe are cleared periodically based on a simple counter of the input pixels, such that the values f(x−1, y), f(x−1, y−1), and/or f(x, y−1) are set to zero whenever the input pixel lies on an image border.
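
For reference, a minimal software equivalent of this recurrence could read as follows (plain C++, not the 5-stage HW pipe; the function name and the raster-scan array layout are choices of the sketch):

```cpp
#include <cstdint>
#include <vector>

// Software reference of the integral-image recurrence used by the HW pipe.
// 'pixels' holds the 8-bit image in raster-scan order (width*height values).
std::vector<uint32_t> integralImage(const std::vector<uint8_t>& pixels,
                                    int width, int height) {
    std::vector<uint32_t> ii(width * height, 0);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // Terms outside the image borders are treated as zero,
            // mirroring the periodic register clearing of the HW pipe.
            uint32_t left = (x > 0)          ? ii[y * width + (x - 1)]       : 0;
            uint32_t up   = (y > 0)          ? ii[(y - 1) * width + x]       : 0;
            uint32_t diag = (x > 0 && y > 0) ? ii[(y - 1) * width + (x - 1)] : 0;
            ii[y * width + x] = pixels[y * width + x] + left + up - diag;
        }
    }
    return ii;
}
```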

3.2 Feature Detection

The detection part of the proposed design implements four basic SURF functions: the generation of the response maps (most computationally intensive), the non-maximal suppression of the responses, the interpolation, and the orientation. The first two are implemented on HW as Coarse Detection (CD), while the remaining two are executed on SW.

Coarse Detection Functionality The CD module (Fig. 2) consists of four main components: the Response Calculator (RC), the Response Keeper (RK), the Non-Maximal Suppressor (NMS), and the Control Unit (CU). During execution, the CU issues successive requests to the RC to calculate specific sets of responses. Each response is forwarded directly to the response map memory within RK. At regular time intervals, RK forwards the stored layers of the response map to NMS for detecting interest points (ipts). Figure 2 shows the overall architecture of the module, which also includes the Integral Memory (IM), the Box Integral (BI), and an output memory buffering the ipts.

Functionally, the CD module iteratively processes the image in sections. Each section is a horizontal stripe of 512×100 values. The stripe slides downwards by one row (once per iteration) until the entire image is covered and all ipts are detected. The key idea behind this approach is that SURF actually processes remote parts of the image independently. Therefore, we can break down and re-order its steps such that they can be executed reusing the HW memory, achieving significant resource optimization. In practice, we examine the image for ipts in a row-by-row fashion, storing on-chip only the integral image stripe and the response map stripes which envelop the row under examination (the central row of the sliding window).

Fig. 2. The HW architecture of Coarse Detector

Each iteration of the CD module process consists of three steps:

1. input of a new integral image row (i.e., movement of the sliding window),
2. calculation of every response lying on the central row of the sliding window of the integral image,
3. detection of any interest point lying on the central row of the sliding window of the response map.

At step 1, the CU will issue a request for a new integral row (512 values) to the external memory. The incoming values are temporarily stored in the Integral Memory, which utilizes a custom 512×100 circular buffer to overwrite the oldest row with the incoming row. At step 2, every response with filter size up to 99 (coverage area of 99×99 pixels) lying on the central row of the 512×100 window is calculated. Each response is stored at a corresponding layer within the response map in RK. At step 3, the RK will forward a burst of 3×3×3 scale-space cubes to the NMS. In a pipelined fashion, NMS will find the maximum value per cube and compare it to a given threshold to identify an interest point. The location of each detected ipt and its size are stored in the output ipt memory.

To reduce the on-chip memory, analogously to the sliding window of the integral image, the RK holds only a narrow stripe of each response layer. Specifically, it is sufficient to store 3-9 rows from each layer, depending on the number of octaves that the layer participates in (Fig. 2) [1]. For each layer we develop a sliding window moving downwards on the response map (with circular buffers). RK uses certain response rows to form a "line" of cubes as follows: it stacks 3 rows from the bottom layer, 3 from the middle, and 3 from the top to form multiple, center-adjacent, collinear 3×3×3 cubes (the top/middle/bottom layers vary depending on the examined octave). At step 3 of the iteration, 6 lines of cubes (2 per octave) are forwarded to NMS.

The selected 512×100 size of our sliding window supports the computation of the first three SURF octaves, i.e., the creation of 8 distinct response map layers (with filter sizes 9, 15, 21, 27, 39, 51, 75, and 99). Octaves 4 and 5 require no HW speed-up and, moreover, contribute very few ipts (e.g., 3 to 6) to SPARTAN at an increased HW cost (they double the on-chip memory).

Coarse Detection architecture details To support high-throughput computation during step 2 of Coarse Detection, the Integral Memory is based on a parallel memory organization allowing the access of 8 distinct values in a single cycle. The organization utilizes 4 memory banks spatially interleaved by the mapping

BANK(x, y) = x mod 2 + 2 × (y mod 2)
ADDR(x, y) = x div 4 + y × 128

That is, each incoming integral value is mapped, according to its image position, to one of the 4 banks via simplified bit-wise operations. The proposed mapping allows the parallel read of 4 values located at the 4 corners of any even-number-sized square, anywhere on the 2-D plane. Therefore, it supports the calculation of one box integral per cycle (for any box filter used during the process). We double this throughput by using dual-port memories.
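
For reference, a box integral over the integral image reduces to four corner look-ups combined with three add/sub operations, which is exactly what one parallel read supplies; a minimal sketch (array layout and bounds handling are choices of the sketch) is:

```cpp
#include <cstdint>
#include <vector>

// Sum of the pixel intensities inside the rectangle [x1, x2] x [y1, y2],
// computed from four corner values of the integral image 'ii' (raster order,
// 'width' values per row). Terms falling outside the image are treated as
// zero, as in the HW border handling.
int64_t boxIntegral(const std::vector<uint32_t>& ii, int width,
                    int x1, int y1, int x2, int y2) {
    auto at = [&](int x, int y) -> int64_t {
        return (x < 0 || y < 0) ? 0 : static_cast<int64_t>(ii[y * width + x]);
    };
    return at(x2, y2) - at(x1 - 1, y2) - at(x2, y1 - 1) + at(x1 - 1, y1 - 1);
}
```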

The calculation of the box integral values is performed in a pipelined fashion by the Box Integral (BI) component (Fig. 2). BI combines the 8 outputs of the Integral Memory and forwards 2 box integral values to the Response Calculator (RC). RC forms the Dxx, Dyy, and Dxy filter values and combines them to produce the final value of a requested response [1]. Since one complete response calculation requires 8 box integrals (2 for Dxx, 2 for Dyy, and 4 for Dxy), our design can sustain a computation rate of 1 response per 4 cycles. The RC pipeline uses fixed-point arithmetic with 22 bits to achieve highly accurate results.
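
The combination performed by RC corresponds to the approximated Hessian determinant of SURF [1]; a hedged floating-point sketch is given below (the HW uses 22-bit fixed-point arithmetic, and the filter-area normalization of [1] is omitted here):

```cpp
// Approximated Hessian determinant of SURF [1], with the usual 0.9 weight
// on the mixed derivative Dxy. Inputs are the three filter responses formed
// by RC from the box integrals.
inline float hessianResponse(float Dxx, float Dyy, float Dxy) {
    const float wDxy = 0.9f * Dxy;
    return Dxx * Dyy - wDxy * wDxy;
}
```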

The CU issues multiple successive requests to the RC to generate an entire row of responses corresponding to a specific layer of the response map. During step 2, up to 8 new response rows will be generated, one for each of the 8 distinct layers within RK (Fig. 2). The exact layers to be augmented with a new row at each iteration are decided on the fly, based on a combination of the current position of the sliding window and the subsampling factor of each response layer.

During step 3, the CU will issue to RK up to 6 commands referring to scale-space scans (for ipts detection), which are centered around 6 certain layers and 1 y-position (the one currently examined). Such a command triggers the generation of a line of scale-space cubes to be forwarded to NMS. For each cube, 3 layers contribute 3 response rows each. Layers 15, 27, and 51 use a predetermined subset of their rows (e.g., rows 4-5-6, or 3-5-7, or 1-5-9, for layer 27) depending on the current octave under examination. In a pipelined fashion, RK starts reading responses at the left side of the map and terminates at its right side. One new cube is completely forwarded whenever 9 new response values (a vertical slice) are forwarded (the remaining 18 values have already been forwarded as part of the previous, overlapping, cube). RK accesses the 3 layers in parallel (stored in distinct memories), and thus one new cube is forwarded to NMS every 3 cycles.

NMS inputs the 3×3×3 cubes in a pipelined fashion and detects their maximum value in 7 consecutive pipeline stages, which perform simple comparisons and intermediate value forwarding. The sustained throughput is one cube examination per 3 cycles. In case the maximum value is located at the center of the cube and is greater than a given threshold (an input to SURF), then a new interest point is detected at the position {x, y, layer} corresponding to the examined scale-space cube. The new ipt is stored in the output memory, which forwards all results to the CPU upon completion of Coarse Detection.
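
A hedged software equivalent of the check performed on each cube is sketched below (whether the comparison with the neighbouring responses is strict is an assumption of the sketch):

```cpp
#include <array>

// Scale-space check of NMS: an interest point is reported only if the centre
// of the 3x3x3 cube exceeds the detection threshold and is larger than all
// 26 neighbouring responses.
bool isInterestPoint(const std::array<float, 27>& cube, float threshold) {
    const float centre = cube[13];          // index 13 = centre of a 3x3x3 cube
    if (centre <= threshold) return false;
    for (int i = 0; i < 27; ++i)
        if (i != 13 && cube[i] >= centre) return false;
    return true;
}
```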

3.3 Feature Description

The description part of the proposed design implements the Haar-wavelet-based processing of the detected interest points [1]. It utilizes both HW and SW modules (Fig. 2) to divide each ipt into 16 smaller areas, which are then processed sequentially, starting from the top and moving to the bottom of the image.

The purpose of the proposed design approach is twofold. First, similarly to the Coarse Detector, we reuse the HW resources to achieve significant memory reduction by developing a dedicated sliding window (we develop a distinct sliding window to support the completely different requirements of the Descriptor and to allow the two HW modules to operate in parallel). Second, breaking down the problem of describing an interest point into describing several smaller boxes reduces the area required to be cached on HW to almost 1/16th. Therefore, given the aforementioned 13 scales of SPARTAN-SURF and the worst-case orientation of the interest point, the worst-case height of an interest point's box is reduced to 130 rows, and our sliding window has a size of 512×130 integral values.

SW modifications To support the HW module, we perform certain SW modifications on OpenSURF [2]. We develop the Ipts Disassembler component (Fig. 2) to divide each ipt into its 16 main squares, namely its "boxes" (1 ipt = 4×4 boxes). The set of all boxes of the image (up to 1600) is sorted according to their y coordinate. This sorting allows the HW descriptor to process the boxes in order, i.e., to use a sliding window which always moves downwards on the image, unloading data and reusing memory (memory optimization).

We implement a custom data structure on SW to sort and store the boxes for future reference in linear time (avoiding execution bottlenecks). Specifically, we use one 1-D array of lists: each box of the set is pushed into the list connected to the cell corresponding to its y coordinate. Additionally, we store an ID number for each box, its (x, y) coordinates and scale, and the sine and cosine of the dominating orientation (angle) of its ipt (computed on the CPU using floating-point numbers, instead of on the FPGA, which would utilize an increased amount of HW resources). These data are transmitted to the FPGA starting from the left-most list.
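
A hedged C++ sketch of such a bucket structure follows; the type and field names are illustrative, not the actual SPARTAN-SURF code:

```cpp
#include <list>
#include <vector>

// Boxes are bucketed by their y coordinate so that they can be streamed to
// the HW sliding window in top-to-bottom order; insertion is O(1) per box,
// so the whole set is organized in linear time.
struct Box { int id, x, y; float scale, sinA, cosA; };

class BoxBuckets {
public:
    explicit BoxBuckets(int imageHeight) : buckets_(imageHeight) {}
    void insert(const Box& b) { buckets_[b.y].push_back(b); }
    // Visit all boxes ordered by y (top of the image first).
    template <typename F>
    void forEachInOrder(F&& visit) const {
        for (const auto& bucket : buckets_)
            for (const auto& b : bucket) visit(b);
    }
private:
    std::vector<std::list<Box>> buckets_;
};
```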

The Ipts Assembler receives the box descriptors from HW and stores them at the corresponding positions within our custom data structure. Thereafter, it normalizes the HW results and forms the final 64-value descriptor of each ipt.

Box Descriptor Overview The HW module of SPARTAN-SURF description consists of seven components (Fig. 3): the Boxes Memory (BM), the SampleX & SampleY Computation (SSC), the Component of Memories (CM), the HaarX & HaarY Calculator (HHC), the Gauss Computation (GC), the Box Descriptor Computation (BDC), and the Descriptors Memory (DM). It processes the received boxes iteratively by moving the sliding window downwards on the integral image (by 1 row at each iteration) and describing all possible boxes lying on the central row of the window (after checking their y-coordinate).

Fig. 3. The HW architecture of Box Descriptor

The Boxes Memory stores all the information regarding the boxes to be described. That is, at the beginning of each Descriptor process, the BM component interfaces with the CPU to receive and store multiple 6-tuples of the form {x, y, scale, sin, cos, ID}, one for each box to be described. Each 6-tuple is received in three 32-bit words and is unpacked and placed in a local FIFO memory. During execution, whenever a box description is completed, the CU will pop a new box from the FIFO to initiate the next box description. The process terminates when the FIFO is empty.

The SampleX & SampleY Computation generates 81 coordinate pairs specifying the 81 "samples" required for the description of a box [2]. It uses the information associated with each box (the 6-tuple) to compute X = round(x + (−j · scale · sin + i · scale · cos)), where (i, j) are integers identifying each sample (i.e., counters) and X is the actual image coordinate of the sample (Y is computed analogously).
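
A hedged sketch of the two expressions is given below; the Y formula follows the standard 2-D rotation, since the text only states that Y is computed analogously to X:

```cpp
#include <cmath>

// Rotate the (i, j) sample grid by the dominant orientation of the interest
// point (given as sin/cos), scale it, and translate it to the box position.
inline int sampleX(float x, float scale, float s, float c, int i, int j) {
    return static_cast<int>(std::lround(x + (-j * scale * s + i * scale * c)));
}
inline int sampleY(float y, float scale, float s, float c, int i, int j) {
    // Assumed analogous rotation for the y axis.
    return static_cast<int>(std::lround(y + ( j * scale * c + i * scale * s)));
}
```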

The Component of Memories acts as a local cache of the integral values, i.e., it implements our sliding window on the integral image. Additionally, during the description of a box, it inputs the aforementioned 81 sample coordinate pairs in a pipelined fashion. For each one, it accesses 8 integral values from its local memory organization. The addresses of the 8 values are computed via a straightforward add/sub combination of the sample coordinates with the scale of the current ipt. The parallel memory provides all 8 values with a single-cycle access, facilitating the high-throughput computation of the HaarX-HaarY characterization of each sample (1 per cycle).

The parallel memory storing the 512×130 stripe of the integral image utilizes 16 true dual-port banks and is organized based on the non-linear mapping

BANK(x, y) = (x + y × 3) mod 16
ADDR(x, y) = x div 16 + (y mod 130) × (512/16)

The above organization allows for a single-cycle access of the 8 integral values required for the computation of any Haar wavelet used by SURF (i.e., the dx and dy responses), for up to 13 scales (besides scale 8), and from anywhere on the image. Note that this requirement implies random access to multiple rectangle-shaped patterns and, hence, increases the complexity of the organization.
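
Written out in plain C++ for clarity, the two mapping functions mirror the equations above; in HW the divisions and moduli by powers of two reduce to simple bit selections:

```cpp
// Bank/address mapping of the descriptor's 512x130 integral-image stripe
// spread over 16 banks (512/16 = 32 words per row per bank).
inline int bankD(int x, int y) { return (x + y * 3) % 16; }
inline int addrD(int x, int y) { return x / 16 + (y % 130) * (512 / 16); }
```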

The HaarX & HaarY Calculator inputs the 8 integrals coming from the parallel memory to calculate the HaarX and HaarY results for 1 sample [2]. In a pipeline, it utilizes adders and subtractors to process 1 sample per cycle.
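
Functionally, a Haar response is the difference of two box integrals over the two halves of the wavelet window. A hedged sketch for HaarX, reusing the boxIntegral helper sketched in Sect. 3.2, is given below; the exact window placement and sign convention are assumptions of the sketch (HaarY is the same construction along y):

```cpp
#include <cstdint>
#include <vector>

// Defined in the boxIntegral sketch of Sect. 3.2.
int64_t boxIntegral(const std::vector<uint32_t>& ii, int width,
                    int x1, int y1, int x2, int y2);

// dx Haar response of one sample centred at (x, y) with wavelet size s (even):
// sum of the right half of the window minus the sum of its left half.
int64_t haarX(const std::vector<uint32_t>& ii, int width, int x, int y, int s) {
    const int64_t right = boxIntegral(ii, width, x,         y - s / 2,
                                      x + s / 2 - 1, y + s / 2 - 1);
    const int64_t left  = boxIntegral(ii, width, x - s / 2, y - s / 2,
                                      x - 1,         y + s / 2 - 1);
    return right - left;
}
```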

The Gauss Computation (GC) generates the values used to weight each one of the 81 samples. The generation is based on the distance of the sample from the center of its box. To reduce the hardware complexity of the weighting function, we perform certain simplifications on OpenSURF regarding the rounding of the scale value and of the coordinate pair used to calculate the exponential terms. As a result, we avoid implementing exponentiation and division circuits to process numerous combinations of arbitrary inputs. Instead, we reduce the set of possible inputs for each sample to a few rounded numbers (i.e., 13 and 25), stored in LUTs, and use only one multiplication for their final combination.
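
The simplification can be pictured as below; this is only our reading of the scheme (two small tables of precomputed exponential terms combined by one multiplier), and the table contents and the index derivation are assumptions:

```cpp
#include <array>

// Two small tables of precomputed Gaussian terms (sizes 13 and 25, as stated
// above); their contents would be filled offline from the rounded inputs.
std::array<float, 13> gaussLutA{};
std::array<float, 25> gaussLutB{};

// One multiplication combines the two table entries into the sample weight,
// which is what the single HW multiplier of GC performs.
inline float gaussWeight(int idxA, int idxB) {
    return gaussLutA[idxA] * gaussLutB[idxB];
}
```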

Finally, the Box Descriptor Computation inputs the HaarX & HaarY values of the samples and weights them to produce the descriptor of the box. The box descriptor consists of 4 values, i.e., Σdx, Σdy, Σ|dx|, and Σ|dy|, which are computed via summation over the 81 samples. The responses dx and dy are computed separately for each sample using 20-bit fixed-point multipliers. Specifically, for dx (and similarly for dy), it computes dx = gauss weight · (−HaarX · sin + HaarY · cos).
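
A hedged per-sample sketch of this accumulation is given below; the dy rotation is assumed to be analogous to the stated dx expression, and floating point is used for readability where the HW uses 20-bit fixed point:

```cpp
#include <cmath>

// Running sums forming the 4-value descriptor of one box.
struct BoxDescriptor { float sum_dx = 0, sum_dy = 0, sum_abs_dx = 0, sum_abs_dy = 0; };

// Weight and rotate one sample's Haar responses, then accumulate it into the
// four box-descriptor sums (called once for each of the 81 samples).
inline void accumulateSample(BoxDescriptor& d, float haarX, float haarY,
                             float gaussWeight, float s, float c) {
    const float dx = gaussWeight * (-haarX * s + haarY * c);
    const float dy = gaussWeight * ( haarX * c + haarY * s);  // assumed analogous to dx
    d.sum_dx     += dx;
    d.sum_dy     += dy;
    d.sum_abs_dx += std::fabs(dx);
    d.sum_abs_dy += std::fabs(dy);
}
```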

4 Implementation Results

The proposed HW architecture was developed using parametric VHDL and implemented on a Xilinx XC6VLX240T-2 FPGA. The resource utilization of each component is shown in Table 1, together with the total cost of the two HW modules. Clearly, the most memory-consuming components are the Integral Memory of the Coarse Detector (44 RAMB36E1) and the Memory Component of the Box Descriptor (104 RAMB36E1). Overall, the Coarse Detector utilizes only 2% of the FPGA slices, 1.1% of the LUTs, 0.8% of the registers, 13% of the RAM, and 0.5% of the DSP blocks. The Box Descriptor utilizes 6.5% of the slices, 2.8% of the LUTs, 32% of the RAM, and 1.7% of the DSP blocks. The maximum operating frequency is 175 MHz, facilitating real-time execution in the context of SPARTAN: one 512×384 image is processed in approx. 13 ms (8 ms for coarse detection and 5 ms for describing 200×16 ipt boxes).

To test the quality of the results, we compare the SPARTAN-SURF output to the OpenSURF output. For the 200 most important interest points (those crucial for SPARTAN), our tests show that 99% of the ipts detected by the proposed solution are exactly the same as those of OpenSURF (same location on the image). The deviation is due to the fixed-point arithmetic used in HW. Regarding description, we employ the procedure of OpenSURF used to match ipts between successive frames. Our tests show that 92% of the matches of SPARTAN-SURF are equal to the matches provided by OpenSURF. Hence, overall, SPARTAN-

Table 1. Resource Utilization of SPARTAN-SURF on Xilinx XC6VLX240T-2 FPGA

component                   slices   LUTs    Registers   RAMBs   DSPs
Response Calculator            137     375         464       0      4
Response Keeper & Memory       180     383         492       9      0
Coarse Detector (total)        777   1,667       2,286      54      4
Box Descriptor (total)       2,468   4,348       5,692     137     13

SURF provides a sufficient number of ipt matches for the rover to perform accurate localization.

5 Conclusion

The current paper presented the HW/SW co-design of the SURF algorithm for use in rover navigation. Taking into consideration the specifications of the SPARTAN project for the ExoMars programme (ESA), the design is based on the FPGA speed-up of the computationally intensive kernels of SURF, as well as on executing the involved floating-point mathematical operations on the CPU. The algorithmic steps were analyzed and restructured to allow efficient implementation on HW. The architecture was implemented on a Xilinx XC6VLX240T-2 FPGA, achieving real-time performance with low HW cost and highly accurate results.

Acknowledgments This work is supported by the ESA (European Space Agency) project SPAring Robotics Technologies for Autonomous Navigation (SPARTAN) (ESA/ESTEC ITT Reference AO/1-6512/10/NL/EK).

References

1. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-Up Robust Features (SURF). In: Computer Vision and Image Understanding, Vol. 110, Issue 3, 346–359 (2008)

2. Evans, C.: Notes on the OpenSURF Library (2009). URL: http://www.chrisevansdev.com/computer-vision-opensurf.html

3. Svab, J., Krajnik, T., Faigl, J., Preucil, L.: FPGA Based Speeded Up Robust Features. In: IEEE International Conference on Technologies for Practical Robot Applications (TePRA), Woburn, Massachusetts, USA, 35–41 (2009)

4. Bouris, D., Nikitakis, A., Papaefstathiou, I.: Fast and Efficient FPGA-Based Feature Detection Employing the SURF Algorithm. In: 18th IEEE Annual Int'l Symp. on Field-Program. Custom Comput. Mach. (FCCM), Charlotte, N. Carolina, 3–10 (2010)

5. Schaeferling, M., Kiefer, G.: Flex-SURF: A Flexible Architecture for FPGA-Based Robust Feature Extraction for Optical Tracking Systems. In: Int'l Conf. on Reconfig. Computing and FPGAs (ReConFig), Cancun, Quintana Roo, Mexico, 458–463 (2010)

6. Battezzati, N., Colazzo, S., Maffione, M., Senepa, L.: SURF algorithm in FPGA: A novel architecture for high demanding industrial applications. In: Design, Automation & Test in Europe (DATE), Dresden, Germany, 161–162 (2012)

7. ESA Robotic Exploration of Mars. URL: http://exploration.esa.int