IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, ACCEPTED

High-Performance Vision-Based Navigation on SoC FPGA for Spacecraft Proximity Operations

George Lentaris, Ioannis Stratakos, Ioannis Stamoulias, Dimitrios Soudris, Manolis Lourakis and Xenophon Zabulis

Abstract—Future autonomous spacecraft rendezvous with uncooperative or unprepared objects will be enabled by vision-based navigation, which imposes great computational challenges. Targeting short duration missions in low Earth orbit, this paper develops high-performance avionics supporting custom computer vision algorithms of increased complexity for satellite pose tracking. At algorithmic level, we track 6D pose by rendering a depth image from an object mesh model and robustly matching edges detected in the depth and intensity images. At system level, we devise an architecture to exploit the structure of commercial System-on-Chip FPGAs, i.e., Zynq7000, and the benefits of tightly coupling VHDL accelerators with CPU-based functions. At implementation level, we employ our custom HW/SW co-design methodology and an elaborate combination of digital circuit design techniques to optimize and map efficiently all functions to a compact embedded device. Providing significant performance per Watt improvement, the resulting VBN system achieves a throughput of 10−14 FPS for 1 Mpixel images, with only 4.3 Watts mean power and 1U size, while tracking ENVISAT in real-time with only 0.5% mean positional error.

Index Terms—FPGA, space avionics, active debris removal, computer vision, pose estimation, tracking, HW/SW co-design

I. INTRODUCTION

PROXIMITY operations in space are maneuvers conducted by an orbiting chaser spacecraft in order to approach and remain close to other space resident targets. Traditionally, proximity operations have been concerned with specially prepared, cooperative targets with controlled attitude. However, near future space-tug, on-orbit servicing and active debris removal missions [1], [2], [3], will also involve uncooperative/unprepared targets whose motion is not known accurately. As practical limitations preclude real-time teleoperation, to cope with any target state uncertainties while minimizing the risk of collision, the chaser should be maneuvered by a Guidance, Navigation, and Control (GNC) system with increased autonomy. Such a system should benefit from advanced Vision Based Navigation (VBN) [4]. However, advanced VBN demands significant on-board processing power, mandating the use of novel, high-performance avionics with one order of magnitude higher throughput than that of space-grade CPUs.

Advanced VBN involves intricate vision algorithms for estimating the relative pose between two spacecraft [5].

G. Lentaris, I. Stratakos, I. Stamoulias and D. Soudris are with the Department of Electrical and Computer Engineering, National Technical University of Athens (NTUA), Greece, e-mail: [email protected].

M. Lourakis and X. Zabulis are with the Institute of Computer Science, Foundation for Research and Technology – Hellas (FORTH), Greece, e-mail: [email protected]

Manuscript received August 2018

Relative pose refers to the position and orientation parameters defining the geometrical relationship between two objects. In lieu of visual markers and prior attitude stabilization, reliable pose estimation for uncooperative or unprepared targets will also require high-definition cameras and multiple frames per second (FPS) throughput. To execute complex algorithms on such high volumes of data, HW accelerators and HW/SW co-design are necessary. Field Programmable Gate Arrays (FPGAs) are the most promising platforms for this task, as they provide increased performance per Watt and high parallelization capabilities [6]. FPGAs are already being used in space and the market offers multiple space-qualified devices. Furthermore, following the latest trend of using Commercial Off The Shelf (COTS) components in certain commercial and/or cost-constrained missions, more powerful FPGAs should be considered by exploring COTS device families.

With these considerations and the support of the European Space Agency (ESA), we design vision algorithms and devise avionics solutions customized for active debris removal (ADR) via short-term, low Earth orbit missions [3]. We examine the approach with ENVISAT¹ and focus on VBN methods relying on passive cameras. In such settings, due to the partial shielding by Earth's magnetosphere and the short flight duration, radiation tolerance becomes less critical, hence enabling COTS-based solutions and, in particular, modern high-density SRAM System on Chip (SoC) FPGAs. The requirements set by ESA are to define the most appropriate platform in terms of size/mass/power and develop a HW/SW system able to process at least 1 Mpixel images at 10 FPS, i.e., 5−10x faster than the expected GNC control rate.

In this paper, we explicitly address the HW/SW implementation of a vision system for tracking the relative pose of ENVISAT at distances 20−50m by processing monocular grayscale images. We co-design the pose estimation algorithm, the HW/SW system architecture and the VHDL accelerators, and demonstrate our proof-of-concept implementation on Xilinx Zynq7000 SoC FPGA. Assuming small motion between successive frames, our pose tracking algorithm evolves an estimate of target pose by first using it to render a target model and then updating it with the aid of matches established among the edges detected in the rendered depth map and the intensity edges extracted from each frame. The proposed architecture adapts to the various HW parts embedded in the SoC and exploits parallelization at multiple levels via parametric VHDL circuit design.

¹An inoperable Earth observation satellite measuring 26×10×5 m³, a debris of uncertain structural integrity and tumbling rate (∼2°/sec); shown in Fig. 3.

The system is tested with a dataset comprised of synthetic videos of ENVISAT and its evaluation shows system acceleration in the area of 19x vs embedded processors, 10−14 FPS for 1024×1024 image resolution, 4.3 Watts mean power consumption on a 1U-size board, and average position estimation error in the range of 0.5% of the distance between ENVISAT and the camera.

The main contributions of the paper are a) to develop solutions for a novel application in space, b) to demonstrate the feasibility of a low-cost, low-power, low-size, high-rate, and high-definition VBN embedded system, c) to describe a custom methodology leading to efficient HW/SW implementations that exploit the capabilities of the SoC and improve the utilization of HW for use in avionics, d) to combine multiple techniques at scheduling and VHDL level (e.g., data partitioning, sliding windows, on-the-fly processing with minimal buffers, word-length optimization with fixed- and floating-point arithmetic, deep pipelining, parallelization at bit-/data-/instruction-/task-level) for efficient circuit design, e) to develop a pose estimation algorithm that can accommodate a complex object model without redesign/preprocessing and can robustly handle mismatches. Finally, we note that the proposed architecture/algorithm is adaptable and can support VBN for scenarios and targets other than ADR and ENVISAT.

The rest of the paper is organized as follows. Section II discusses prior work whereas Section III describes our methodology and the proposed system architecture. Our pose estimation algorithm is outlined in Section IV and its HW/SW development on SoC FPGA is detailed in Section V. Section VI provides evaluation results. The paper concludes in Section VII.

II. RELATED WORK

Pose refers to the six position and orientation parameters defining the transformation between the coordinate system attached to a rigid object and a reference coordinate system. Pose tracking exploits the temporal continuity of an object's motion to estimate its pose parameters over time. Using an object model to track pose is a widely studied topic [7]. A popular approach has been to employ an approximate object pose to predict the appearance of the model and then refine this pose using matches established between predicted and actually detected features. The most influential such tracker is RAPiD [8], one of the earliest to operate in real-time on common hardware. As a result, several of its attributes have been retained, e.g. in [9], [10], [11], [12], [13], [14].

An important aspect in which techniques for pose tracking in space differ concerns whether the tracked target is cooperative or not. A cooperative/prepared target spacecraft carries easily detectable special markers (e.g., light emitting diodes or retroreflectors), arranged in a special pattern for facilitating pose estimation by the chaser spacecraft [15]. Methods such as [16], [17] that exploit structural features already present on some satellites are also classified as cooperative. Cooperative tracking is more accurate and robust but not universally applicable and might constrain the chaser to approach from angles ensuring the visibility of the employed pattern or feature.

When uncooperative pose determination is pursued, another important choice concerns the type of natural features employed for tracking. Sparse interest points extracted from local patches are very popular in terrestrial or planetary applications [18]. However, they are fairly sensitive to ambient light changes and lack of strong texture, which are particularly common occurrences in orbit environments. In contrast, image edges are defined by sharp intensity variations and are moderately robust against noise and illumination or viewpoint changes. They can also be accurately localized and are inferable even from weakly textured objects. Hence, edges are routinely preferred as the primary type of feature for tracking in space, e.g. [4], [19], [12], [13], [20]. A detailed overview of the state-of-the-art in spacecraft pose determination is in [21] and techniques for initial pose estimation are compared in [22].

Uncooperative spacecraft pose estimation errors reported in the literature differ considerably in the procedures, test data and metrics employed for evaluation. Yet, we briefly include here some figures to indicate the currently achievable performance. Positional errors under 2% of the camera-target range and rotational errors up to 3° are achieved in [4]. The GNFIR algorithm [19] yields 0.2-0.4m RMS position error for a target ranging 3-30m and attitude errors 2-4° in roll/pitch, and 0.5° in yaw. For a target ranging between 20-70m, [12] advertises around 0.5m error in position and 1° in attitude, whereas for a target range between 20-76m, [13] reports lateral (i.e., X, Y) position errors less than 1% of range, longitudinal errors around 4-5% and attitude errors between 1-5°.

Concerning the HW platforms suitable for our task, the market offers a number of space-qualified components that we surveyed in [6]. However, the vast majority of flight heritage HW is based on radiation-hardened CPUs, such as PowerPC and LEON, which are 1–3 orders of magnitude slower than terrestrial CPU/GPUs. Hence, when targeting more than 10 FPS, complex vision algorithms for tracking the pose of satellites [12], [13], or generic objects [23], are usually implemented on GPUs. On high-end desktop CPUs, processing 1 Mpixel images for pose estimation can be as slow as 5 FPS [20]. Since desktop CPUs/GPUs are currently not a viable solution for space avionics, engineers consider DSP and FPGA platforms as accelerators [6]. Even COTS devices are being examined, especially FPGAs amenable to mitigation of the radiation effects via sophisticated design/programming. NASA supports such research, with certain teams proposing and prototyping platforms in Zynq7000 [24], [25]. The delivered HW meets challenging constraints of spacecraft design, e.g., it can even fit in a CubeSat of form factor 1U (=10×10×11.35cm³), or consume electrical power in the area of 2–5 Watts (comparable or even better than space-grade CPUs). However, these works focus more on the HW aspects and less on algorithms. Vision algorithms have been implemented on FPGAs achieving 1–20 FPS [18], [26], but solve different problems (i.e., visual odometry for rover/UAV) rather than accelerating pose estimation. Note that, on rad-hard CPUs, even with the latest devices, time measurements in [6] for the algorithm of [18] show less than 1 FPS even for 0.2 Mpixel images. For the given problem and scenario, to the best of our knowledge, the current paper presents the first complete VBN solution on a compact SoC FPGA with such high image resolutions and frame rates.

III. METHODOLOGY AND SYSTEM DESCRIPTION

The aforementioned publications and our own research indicate that conventional space-grade CPUs are not sufficiently powerful for the high-performance VBN processing required in future uncooperative/unprepared space proximity scenarios. Instead, alongside the algorithmic design, we must consider new platforms to facilitate one order of magnitude faster execution, e.g., via optimized HW accelerators. In this work, to tackle all aspects of the problem, we relied on the methodology summarized in the following interdependent steps:

1) Selection of Computational Platform
   a) high-level system architecture definition
   b) benchmarking and comparison
2) Design of Computer Vision Algorithm
   a) dataset and requirements definition
   b) algorithmic development on generic CPU
3) Development on Target HW
   a) profiling and low-level architecture design
   b) coding and optimization
   c) testing and tuning

The first step considers the plethora of factors related to the system-level design of a spacecraft's electronics: radiation-hardening, power consumption, mass/size, connectivity, dependability, market availability, re-programmability, and foremost, processing power. These factors are considered against the specifics of each space mission, prioritizing certain criteria depending on its nature/goals. For instance, for a short flight in low Earth orbit, radiation-hardening becomes less important than the real-time image processing needed for autonomous rendezvous operations. In this case, various possibilities open up for the use of COTS devices. Thus, the first step of our methodology is further divided in two parts: the high-level architectural definition of the platform and the benchmarking analysis for selecting the most suitable components according to the given criteria.

Regarding the high-level architecture, we examined multiple topologies involving single- or multi-core CPUs, interconnected with peripherals, GPU, FPGA, or DSP co-processors, altogether forming heterogeneous platforms consisting of one or more distinct chips. We performed a wide survey of the available devices in the market, both commercial and space-grade, and we reported the results in [6]. Overall, favoring miniaturization and fast data transfers among the compute nodes, we opt for single System-on-Chip solutions.

Going one step further, we performed a comparative study with extensive benchmarking to select the best type of co-processor for VBN [6]. Based on the preparatory work for step 2 of our methodology, we identified a variety of representative image processing kernels, which were implemented as common benchmarks on a number of diverse platforms, i.e., rad-hard CPU, embedded CPU, desktop CPU, mobile GPU, high-end DSP, FPGA, and desktop GPU. The execution results were used to compare, primarily, the throughput and power consumption of these platforms. The analysis gradually focused on 28nm technology nodes and is reported in detail in [6].

Fig. 1. Proposed system on top of Zynq7000 SoC FPGA for VBN support.

Overall, we concluded that assuming a budget of 10 Watts would allow for at least one order of magnitude faster execution than conventional space-grade CPUs by utilizing many-core DSPs, mobile GPUs, or FPGA devices. This gain suffices to meet the requirements of future VBN for increased image resolution and high frame-rate, e.g., 1 Mpixel at 10 FPS. Among the three device categories, FPGAs simultaneously provide the highest performance-per-Watt and the highest throughput (with the same device). More specifically, for high-definition images, HW/SW co-processing on Zynq7045 outperforms the Myriad2 DSP by ∼10x in speed, or the 66AK2H14 DSP by ∼10x in performance-per-Watt, or a high-end mobile GPU by at least 2x in speed and 2x in power consumption. By also considering its improved board connectivity and radiation mitigation amenability, we selected the Zynq7000 FPGA SoC for developing our VBN system [6].

We complete here the first step of our methodology by proposing the high-level architecture depicted in Fig. 1. Our aim is to efficiently match the various HW components already embedded in the SoC to the functionality required by VBN and/or the On Board Computer (OBC), while also exploiting the structure of Zynq7000 to facilitate techniques for increasing the system's dependability. More specifically, following a master-slave approach, we propose isolating the 2 CPU cores and assigning almost distinct tasks and connections to each, i.e. the master core will execute generic OBC tasks to offload and communicate with the main processor (or even play the role of OBC in low-cost spacecraft), while the slave core will be dedicated to computer vision and image processing. During the GNC procedure, CPU0 will invoke CPU1 for executing the heavy task of pose estimation and will receive only a minimal set of results, e.g., those requested by the control algorithm. Furthermore, to offload the local processor (PS), CPU1 will be the only core linked via AXI4 to Zynq's PL (Fig. 1), which will include VHDL accelerators for the most intensive kernels of pose estimation. Additionally, Zynq's PL will include a central VHDL arbiter for handling data transfers and scheduling the accelerators. The PS-PL communication will consist of two individual links, one for simple control commands (AXI4-lite) and one for transferring application data over DMA (AXI4-stream). Regarding the off-chip connections, both cores will be able to access the external RAM, however, only the master will access the peripherals for communicating with other external components (e.g., network port, storage device). Such isolation can be imposed via the Xilinx tools at the initial system configuration, as follows: enable only a limited number of peripherals and route them explicitly to CPU0 through the 256KB on-chip memory (not to be accessed by CPU1), allow the shared L2 to link with the RAM, disable the PS-PL interrupts from CPU0, enable the exception interrupts between CPU0 and CPU1. Finally, the camera will be connected directly to the PL pins to facilitate low-latency transfer and fast pre-processing of substantial amounts of pixel data without engaging the local CPU (PS). On top of this high-level architecture, depending on mission requirements, the developer can employ modular redundancy on the PL side, embedded watchdogs in Zynq, read-back checks, or any other reliability technique [24].

The second and third steps of our approach are explained separately in the next two sections, together with details of the methodology and actual development involved in each part.

IV. ALGORITHM DESIGN

The pose tracking algorithm developed in this work adopts a model-based approach to continuously update an estimate of an object's pose relative to the camera. It is inspired by the RAPiD tracker [8], which is discussed in more detail next. For each frame, RAPiD requires an object pose prediction and assumes that its difference from the true pose is small. Therefore, matches can be efficiently established using 1D local searches for image edges along directions perpendicular to predicted edges. To keep computational overhead low, matching is limited to a set of predetermined control points. Control points are sparsely selected on the tracked 3D object so that they are likely to project on high-contrast image edges. The predicted projections of control points are matched to edge pixels (edgels) and their displacements in directions perpendicular to edges are calculated. Each such displacement yields a linear constraint on the change in object pose parameters.

The standard RAPiD algorithm assumes that control points are manually sampled offline along the edges of a crude object model and in areas of rapid albedo change. Furthermore, it requires the visibility of control points to be managed externally and makes no provisions for robustness against mismatches and occlusions. To remedy these shortcomings, several improvements have been proposed, e.g. [9], [10], [11]. Nevertheless, all of them share the common drawback of requiring that the tracked object is modeled with a simple wireframe consisting of a small number of straight edges, whose projections are sampled to define control points. Such a choice ensures computational efficiency but imposes strong constraints on permissible object models. This is because detailed CAD object models that might be available need to be manually redesigned and only their most salient edges retained to be suitable for tracking.

Here, as shown in Fig. 2, we propose to dynamically select control points by combining information from a depth map rendered from the object model and the edges of an input image (cf. Sec. IV-A-IV-C). Irrespectively of the model's complexity, rendering automatically determines visibility and facilitates the generation of control points. This approach provides increased applicability and flexibility, as no constraints are imposed on the type of the employed 3D model and any tedious preprocessing for the control points is eliminated. We also ensure resilience to erroneous control points via a combination of robust regression techniques (cf. Sec. IV-D). Our tracking algorithm assumes the initial object pose to be available and focuses on evolving it from frame to frame. The algorithm was entirely developed and refined in C. Fig. 3 illustrates our target mesh model and visualizes sample output. The following subsections present the algorithm's key aspects, whilst a detailed description can be found in [20].

Fig. 2. Proposed pose tracking algorithm: main functions & example dataflow (1024×1024 grayscale input; 35K-triangle/20K-vertex mesh model; rendered depth map, depth edges, intensity edges; ∼5K control-point vectors).

A. Edge detection

The Canny edge detector [27] is widely used for detecting edges in images. It combines the following sequence of steps: a) smoothing to reduce noise, b) differentiation to compute image gradients, c) non-maximum suppression to remove spurious edges by retaining only maxima of the gradient magnitude in the gradient direction and d) hysteresis thresholding to track edges with a pair of thresholds on gradient magnitudes. Hysteresis identifies strong edges using a high threshold on gradient magnitude, suppresses all weak edges below a low threshold and retains the intermediate edges that are connected to strong edges. For customization to our HW implementation, we chose the 3x3 Sobel kernels for smoothing and differentiation and set the hysteresis thresholds to 1.8µ and 1µ, where µ is the median of gradient magnitudes, computed from a corresponding histogram with 4K, 16-bit bins. Canny's algorithm is employed in our tracker to detect both intensity and depth edges, hence reusing its resources in HW.
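
The threshold-selection step above maps to a few lines of C. The sketch below is a minimal illustration assuming 16-bit gradient magnitudes binned into 4096 histogram bins (16 magnitude values per bin); the function name and the integer arithmetic are ours, not taken from the paper's code.

    #include <stdint.h>
    #include <stddef.h>

    #define HIST_BINS 4096                 /* 4K bins covering 16-bit magnitudes */
    #define BIN_WIDTH (65536 / HIST_BINS)  /* 16 magnitude values per bin        */

    /* Derive the hysteresis thresholds as 1.8*mu and 1.0*mu, where mu is the
     * median gradient magnitude estimated from a histogram of magnitudes.     */
    static void canny_thresholds(const uint16_t *mag, size_t n,
                                 uint16_t *t_low, uint16_t *t_high)
    {
        uint32_t hist[HIST_BINS] = {0};
        for (size_t i = 0; i < n; i++)
            hist[mag[i] / BIN_WIDTH]++;

        /* Walk the histogram until half of the samples have been covered. */
        size_t half = n / 2, seen = 0;
        uint32_t mu = 0;
        for (int b = 0; b < HIST_BINS; b++) {
            seen += hist[b];
            if (seen >= half) {
                mu = (uint32_t)b * BIN_WIDTH + BIN_WIDTH / 2;   /* bin centre */
                break;
            }
        }

        uint32_t high = (mu * 9) / 5;                  /* 1.8*mu in integer math */
        *t_low  = (uint16_t)mu;                        /* 1.0*mu                 */
        *t_high = (uint16_t)(high > 65535 ? 65535 : high);
    }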

B. Depth map rendering

Supplied with a triangle mesh model and a relative pose, depth rendering produces an image whose pixel values are distances rather than intensities. Every pixel in the rendered depth image encodes the distance to the nearest point on the model's surface that projects on it. To save on-chip memory in our HW implementation, we represent depth images with 16-bit pixels, using an average 32-bit depth offset for the entire image and a 16-bit deviation for each depth pixel. Depth images are produced with rasterization rendering, a technique that determines which parts of a model are visible to the camera, excluding those outside the field of view or being self-occluded. Rasterization projects mesh triangles onto the image by projecting their vertices, using bounding box traversal to determine the image pixels that are inside the projected triangle and computing the distance of the 3D triangle from each such pixel. Multiple triangles projecting on the same pixel are handled with Z-buffering, which retains the projection of the triangle closest to the camera. Rasterization involves just geometry calculations (i.e., no texture mapping/shading) and uses [28] to calculate ray–triangle intersections. Automatic, off-line mesh decimation is applied to avoid very high triangle counts. Thus, for customization in HW, the triangles in our initial ENVISAT model were reduced from 120K to 35K without perceivable change in appearance and pose accuracy.
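
To make the memory saving concrete, the following C fragment sketches one way an offset-plus-deviation depth encoding could look. It is an illustrative reconstruction under our own assumptions (signed 16-bit deviations, clamped to the representable range), not necessarily the exact scheme used on the FPGA.

    #include <stdint.h>
    #include <stddef.h>

    /* Store a depth image as one 32-bit offset (the average depth, in fixed-point
     * units) plus a 16-bit signed deviation per pixel, halving on-chip storage
     * compared to 32-bit depths. */
    typedef struct {
        int32_t  offset;      /* average depth of the frame        */
        int16_t *deviation;   /* per-pixel depth minus the offset  */
    } depth16_image;

    static void depth16_encode(const int32_t *depth, size_t n, depth16_image *out)
    {
        int64_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += depth[i];
        out->offset = (int32_t)(sum / (int64_t)n);

        for (size_t i = 0; i < n; i++) {
            int32_t d = depth[i] - out->offset;
            if (d >  32767) d =  32767;      /* clamp to int16_t range */
            if (d < -32768) d = -32768;
            out->deviation[i] = (int16_t)d;
        }
    }

    static int32_t depth16_decode(const depth16_image *img, size_t i)
    {
        return img->offset + img->deviation[i];
    }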

C. Edge matching

To obtain measurements from imagery, the object's mesh model is first rendered with the latest pose estimate to produce a depth map. Depth edges are then detected in this depth map and matched with the intensity edges found in an input image. Owing to the well-known aperture problem, which states that the component of edge motion tangent to the edge itself is not observable through a finite-size aperture, the matching location of a depth edgel cannot be fully determined. Instead, only its perpendicular displacement from the corresponding depth edge is measurable. As this matching is local, it can cope with parts of the object being undetected or out of view.

The search for matching intensity edges is one-dimensional and confined in the vicinity of the detected depth edges, along their gradient (i.e., perpendicular to edges). It proceeds until either an intensity edgel with the same orientation is found or a maximum distance has been traced. For customization, to keep complexity low, the search is performed up to a distance of 5 pixels in the horizontal, vertical, or diagonal direction that is closest to the actual depth gradient direction. Without compromising the accuracy of pose estimation, this 8-way search strategy decreases the complexity of edge matching, e.g., up to 4x compared to searching along the gradient vector, and simplifies implementation on HW. Matched edgels define the control points whose partial, normal displacements between predicted and actual edge locations are employed in Sec. IV-D to refine object pose. The number of generated control points ranges from 500 to 5000, which represents an increase by one to two orders of magnitude compared to the typical number of control points in the original RAPiD [8].
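
A minimal C sketch of the 8-way search described above, assuming 4-bit edge maps in which zero means "no edgel" and values 1-8 encode the quantized orientation; the offset table, the ± search order and the return convention are our own illustrative choices.

    #include <stdint.h>

    /* Offsets of the 8 quantized gradient directions (E, NE, N, NW, W, SW, S, SE). */
    static const int DX[8] = { 1,  1,  0, -1, -1, -1,  0,  1 };
    static const int DY[8] = { 0, -1, -1, -1,  0,  1,  1,  1 };

    #define MAX_DIST 5   /* search radius in pixels along the quantized direction */

    /* For a depth edgel at (x, y) with quantized orientation 'dir' (1..8), search
     * the intensity edge map along +/- the quantized gradient direction for an
     * edgel with the same orientation. Returns the signed distance of the nearest
     * match, or MAX_DIST + 1 if none is found within range. */
    static int match_edgel(const uint8_t *intensity_edges, int width, int height,
                           int x, int y, uint8_t dir)
    {
        int dx = DX[dir - 1], dy = DY[dir - 1];

        for (int d = 0; d <= MAX_DIST; d++) {
            for (int s = (d == 0 ? 1 : -1); s <= 1; s += 2) {   /* try -d, then +d */
                int xx = x + s * d * dx, yy = y + s * d * dy;
                if (xx < 0 || yy < 0 || xx >= width || yy >= height)
                    continue;
                if (intensity_edges[yy * width + xx] == dir)
                    return s * d;        /* nearest match wins */
            }
        }
        return MAX_DIST + 1;             /* no match within the search range */
    }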

D. Pose refinement

Let ξ = (ω_x, ω_y, ω_z, Δt_x, Δt_y, Δt_z)ᵀ be a small rotation and translation motion applied to the tracked object. As shown in [20], the image projections m and m′ of a point P on the object at two successive time instants are linearly related with m′ = m + Aξ, where A is a 2×6 matrix whose elements are associated with the coordinates of P. Combined with the perpendicular distance d between points m and m′ along the unit gradient vector n, this equation yields nᵀAξ = d.

Fig. 3. Mesh model used for tracking (left) and depth edges detected after rendering with the estimated pose, overlaid in red on an input frame (right).

Thus, every control point provides one constraint on the infinitesimal incremental motion ξ, hence six such points suffice to estimate it. We accommodate more than six constraints in a least squares manner by minimizing

    ξ = arg min_ξ Σ_i (n_iᵀ A_i ξ − d_i)².    (1)

Solving Eq. (1) with ordinary least squares is not recommended, as the latter is sensitive to outlying residuals caused by various types of error. To ensure that pose refinement is immune to outliers, estimation is carried out in a robust regression framework that allows invalid measurements to be identified and discarded, preventing them from corrupting the pose estimate. We achieve robustification by substituting the summation operation in Eq. (1) with the median:

    ξ = arg min_ξ med_i (n_iᵀ A_i ξ − d_i)².    (2)

This results in the Least Median of Squares (LMedS) estimator [29], which can tolerate up to 50% erroneous constraints. Unlike least squares, however, LMedS has no analytical solution. To overcome this, random sets of at least six constraints are repetitively sampled. Solving Eq. (1) for each such sample yields an estimate of ξ which is used to compute the median of the squared residuals. The estimate giving rise to the minimum median is retained as the minimizer of Eq. (2). To improve the precision of the LMedS estimate, the least squares estimate obtained from its corresponding inliers is computed. If fewer than 50% outliers are expected, LMedS can be generalized to the Least Quantile of Squares (LQS) estimator by employing a quantile other than the median; we used the 70th percentile.
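
For concreteness, the sketch below outlines an LMedS sampling loop of the kind described above. The helpers solve_ls(), residual() and median_of() are hypothetical placeholders (the paper does not expose this interface), and the sample count is an arbitrary illustrative value.

    #include <stdlib.h>
    #include <string.h>
    #include <math.h>

    #define SAMPLE_SIZE 6        /* minimal number of constraints per hypothesis */
    #define NUM_SAMPLES 200      /* number of random samples (placeholder)       */

    /* Hypothetical helpers assumed to exist elsewhere: */
    double residual(const double xi[6], int i);                  /* n_i^T A_i xi - d_i      */
    void   solve_ls(const int *idx, int n, double xi_out[6]);    /* LS fit from constraints */
    double median_of(double *v, int n);                          /* median (may reorder v)  */

    /* LMedS: keep the hypothesis whose squared residuals have the smallest median. */
    void lmeds_pose(int num_constraints, double xi_best[6])
    {
        double best_med = HUGE_VAL;
        double *sq = malloc(num_constraints * sizeof *sq);

        for (int s = 0; s < NUM_SAMPLES; s++) {
            int idx[SAMPLE_SIZE];
            for (int k = 0; k < SAMPLE_SIZE; k++)        /* random minimal sample    */
                idx[k] = rand() % num_constraints;

            double xi[6];
            solve_ls(idx, SAMPLE_SIZE, xi);              /* candidate motion         */

            for (int i = 0; i < num_constraints; i++) {  /* score on all constraints */
                double r = residual(xi, i);
                sq[i] = r * r;
            }
            double med = median_of(sq, num_constraints);
            if (med < best_med) {
                best_med = med;
                memcpy(xi_best, xi, sizeof(double) * 6);
            }
        }
        free(sq);
    }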

For an extra level of robustness, we employ an M-estimator. M-estimators mitigate the effect of any remaining outliers by replacing the squared residuals with a symmetric function ρ() that has a minimum at zero:

    ξ = arg min_ξ Σ_i ρ(n_iᵀ A_i ξ − d_i).    (3)

To down-weight excessively large residuals, the function ρ() is chosen to increase less steeply than quadratically. The particular M-estimator employed in this work is the Lp norm, i.e. ρ(x) = |x|^p with p = 1.5. Eq. (3) is solved with the Iteratively Reweighted Least Squares (IRLS) algorithm, applied to the LMedS inliers of Eq. (2). The estimated ξ is integrated with a linear Kalman filter which better predicts the pose estimate that drives rendering.
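
A rough C illustration of the IRLS reweighting for ρ(x) = |x|^p: each residual receives the weight |r|^(p−2), which for p = 1.5 down-weights large residuals. The weighted solver, iteration count and numerical guard are our own placeholders.

    #include <math.h>

    #define IRLS_ITERS 10
    #define EPS 1e-9          /* guards the weight of near-zero residuals */

    /* Hypothetical helpers assumed to exist elsewhere: */
    double residual(const double xi[6], int i);                          /* n_i^T A_i xi - d_i */
    void   solve_weighted_ls(const double *w, int n, double xi_out[6]);  /* weighted LS fit    */

    /* IRLS for the Lp M-estimator: each squared residual is weighted by
     * w_i = |r_i|^(p-2), so that sum(w_i * r_i^2) approximates sum(|r_i|^p). */
    void irls_lp(int n, double p, double xi[6], double *w)
    {
        for (int it = 0; it < IRLS_ITERS; it++) {
            for (int i = 0; i < n; i++) {
                double r = fabs(residual(xi, i));
                w[i] = pow(r > EPS ? r : EPS, p - 2.0);
            }
            solve_weighted_ls(w, n, xi);   /* re-fit with the updated weights */
        }
    }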

To measure pose error, we employed a single figure metric amounting to the average misalignment between the model's vertices at the true and estimated pose. More specifically, for the true {R_g, t_g} and estimated {R_e, t_e} pose, the alignment error is defined as

    E = (1/N) Σ_{i=1}^{N} ‖(R_g x_i + t_g) − (R_e x_i + t_e)‖,    (4)

where x_i denote the N vertices of the mesh model. Apart from evaluating performance, we also employed this metric to assess the effect of our customizations in SW and HW, paying attention so that they did not impact the baseline implementation error by more than 0.01% on average (wrt ENVISAT distance). We also report separate errors for the translation and rotation components. Specifically, position error is the norm of the difference between the translational components of the estimated and true pose; its ratio to the camera-target distance is the relative position error. Rotational error is the angle of the rotation that aligns the estimated with the true rotation.
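
A direct C translation of Eq. (4), assuming row-major 3×3 rotation matrices and our own function and parameter names:

    #include <math.h>
    #include <stddef.h>

    /* Apply pose {R, t} to a 3D point: y = R*x + t (R is row-major 3x3). */
    static void transform(const double R[9], const double t[3],
                          const double x[3], double y[3])
    {
        for (int r = 0; r < 3; r++)
            y[r] = R[3*r]*x[0] + R[3*r+1]*x[1] + R[3*r+2]*x[2] + t[r];
    }

    /* Alignment error of Eq. (4): mean distance between the mesh vertices
     * transformed by the true pose {Rg, tg} and the estimated pose {Re, te}. */
    double alignment_error(const double *verts, size_t N,
                           const double Rg[9], const double tg[3],
                           const double Re[9], const double te[3])
    {
        double sum = 0.0;
        for (size_t i = 0; i < N; i++) {
            double a[3], b[3];
            transform(Rg, tg, &verts[3*i], a);
            transform(Re, te, &verts[3*i], b);
            double dx = a[0]-b[0], dy = a[1]-b[1], dz = a[2]-b[2];
            sum += sqrt(dx*dx + dy*dy + dz*dz);
        }
        return sum / (double)N;
    }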

V. DEVELOPMENT ON SOC FPGA

To meet the high-performance requirements of autonomous GNC, the set of algorithms designed in the previous section must be accelerated while retaining the accuracy of pose estimation and adhering to the constraints of the underlying HW. We tackle this multi-faceted challenge with our custom HW/SW co-design methodology, which elaborates on the third step of the approach outlined in Section III as follows:

i) porting of SW on native CPU (Zynq's ARM)
ii) SW profiling and complexity analysis
iii) SIMD coding for native CPU (NEON)
iv) HW/SW partitioning and scheduling
v) parallel HW architecture design
vi) parametric VHDL development
vii) HW/SW integration and communication (AXI4)
viii) testing and tuning (return to step iv until specs are met)

The main purpose of the first two steps is to evaluate the capabilities of the embedded CPU in comparison to the computational demands of the specific algorithm. We perform a platform-dependent analysis to facilitate step iv, where we additionally consider the application requirements to identify the kernels that must be accelerated on FPGA. Before HW/SW partitioning, we also explore the possibility of utilizing ARM's NEON engine for executing a few lightweight functions instead of porting them on PL, i.e., we accelerate part of the SW code via SIMD instructions to exploit the full capabilities of the PS. After partitioning and scheduling of operations/communications, we design efficient parallel architectures for each HW component and use parametric VHDL to develop its circuits. Parameterization facilitates step viii, where we fine-tune the system to adapt specifically to the given problem, e.g., we refine datapath widths and/or parallelization factors to balance all speed−cost−accuracy trade-offs. Notice that we tune/test the fully integrated system and, moreover, we use custom datasets (described in Section VI) to emulate the given application scenario. In case of failure, i.e., violation of time or resources or accuracy constraints, our methodology tracks back to step iv for further adaptation and optimization.

A. Profiling, SIMD SW development and HW/SW Partitioning

Aiming at proper partitioning and efficient architectural design, we analyze the complexity of the algorithms with respect to a) execution time per function, b) number of generic calculations/operations (e.g., multiplications and divisions), c) type of operations and variables (e.g., float or integer values, dynamic range), d) memory footprint (including access patterns), e) communication requirements per function (I/O data), f) amenability to parallelization and/or streaming processing, and g) programming complexity (e.g., library dependencies, increased function calls). Our analysis involves both manual study and automated tools, such as the Valgrind profiler and the recording of CPU timestamps between major algorithmic tasks. Given the nature of our target SoC, the above criteria a+b+f point out strong candidates for FPGA implementation, criteria d+g point out strong candidates for CPU, whereas criteria c+e tip the scales when finalizing the partitions. In other words, functions with increased execution time and parallelization amenability are considered for FPGA acceleration, whereas functions with increased programming complexity and irregular accesses to large memories are better served by the flexibility of the CPU model. Functions that do not clearly fall in one of these categories are mapped to partitions according to secondary criteria, e.g., amount of data exchanged with other functions (to avoid creating PS-PL communication bottlenecks) or type of operations involved (to avoid excessive use of floating-point arithmetic in FPGA).
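
The timestamp-based part of this profiling can be as simple as bracketing each major task with a monotonic clock read, as in the C sketch below; the task name and reporting format are purely illustrative.

    #include <stdio.h>
    #include <time.h>

    /* Milliseconds elapsed on a monotonic clock; used to time major tasks. */
    static double now_ms(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e3 + ts.tv_nsec * 1e-6;
    }

    /* Example bracketing of one pipeline stage (illustrative task name). */
    void profile_example(void)
    {
        double t0 = now_ms();
        /* render_depth_map(...);  <- task under measurement */
        double t1 = now_ms();
        printf("rendering: %.1f msec/frame\n", t1 - t0);
    }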

Table I reports succinctly the main results of our algorithmic analysis. Being representative of the target scenario, the results refer to the single-threaded C code ported on ARM Cortex-A9 @667MHz (compiled with gcc -O3) and configured with our final parameters and processing sequences of synthetic images depicting different phases of the ENVISAT approach (at 30m and 50m away from the camera, with 1024x1024 8-bit pixels, cf. Fig. 3 right). The 5 algorithms were analyzed to their key functions and the execution time is reported in an additive fashion. The I/O varies across rows in the table depending on the generation of internal data and the number of function calls. As shown in Table I, the complexity is data-dependent for the majority of the functions and increases as the object approaches the camera. This is due to the increase in the object's apparent size, which begins by covering 6% of the image at 50m and grows to 20% at 30m, i.e., the number of edgels grows from 6K to 11K to generate up to 5K matches between successive frames. To account for the worst case, we now focus on column 3 (columns 4 and 5 also refer to 30m). Clearly, the most intensive algorithms are rendering and edge detection, which both consume almost 90% of the total time. This is mostly due to the repetitive checking of generated pixels inside the projected triangles of the model and the calculation of image gradients (including their 8-way quantized orientation). Pose estimation consumes up to 250msec when provided with up to 5K points (columns 2 and 3 are for 1K and 2K points, respectively).

TABLE I
PROFILING ON ARM A9 (ZYNQ) FOR 1 MPIXEL IMAGES AT 30m & 50m

algorithm     time/frame   time/frame   I/O      RAM    other
 −function    (cam@50m)    (cam@30m)    (Mbit)   (MB)   issue*
imaging       82 msec      82 msec      16.8     1      FX, SP
 −fetch       56 msec      56 msec      16.8
 −enhance     26 msec      26 msec      16.8
rendering     343 msec     869 msec     52       7      FL+FX, HP, EM
 −initialize  0.1 msec     0.1 msec     0.01
 −projtrian   7.5 msec     10 msec      125
 −gen+chk     206 msec     544 msec     4211
 −pixdepth    51 msec      153 msec     659
 −other       78.4 msec    162 msec     ≤ 1
edgedetect    303 msec     326 msec     12.6     18     HP+SP, FX
 −gradient    210 msec     210 msec     41.9
 −suppress    44 msec      44 msec      67.1
 −hysterss    16 msec      29 msec      67.1
 −other       33 msec      43 msec      33.5
matching      38 msec      39 msec      17.6     2.2    SP, FX
pose estim    51 msec      99 msec      ≤ 1.6    ≤ 1    PC, EM, FL
 −LQS-fit     16 msec      32 msec      ≤ 1.6
 −LSinlier    33 msec      65 msec      ≤ 1.6
 −other       2 msec       2 msec       ≤ 1.6
total**:      1120 msec    1741 msec    8.4      44     −

*abbreviations: FL=floating-point arithmetic, FX=small fixed-point variables, HP=highly parallelizable, SP=streaming processing, PC=high programming complexity, EM=expensive math (e.g., divisions, trigonometric)

**total = imaging + rendering + 2×detection + matching + pose estimation

However, fine-tuning shows that 1000−1300 is the optimal amount of control points with respect to output pose accuracy, and hence, the final cost of this function can be fixed at around 50−60 msec.

SIMD SW development

Our analysis advanced even further to examine the possibility of SIMD acceleration via the NEON engine of ARM. We focused on imaging and matching, which allow for streaming processing of small variables, but their execution time is relatively small for definitively mapping them on FPGA. We developed parallel SW code by following different approaches and we achieved an average speedup of up to 3x (vs the -O3 -ftree-vectorize executable, otherwise it appears as 8x). More specifically, we tested individually, a) NEON intrinsics, b) NEON inline assembly and prefetching mechanisms. The former approach provides a 2x speedup, whereas the latter provides up to 3x. However, due to the architectural peculiarities of NEON, more important than selecting the coding approach proved to be the parallelization technique, as explained next.

The matching function inputs a set of ‘source’ edgels and searches for each one's nearest correspondence within a set of ‘target’ edgels. We deduced that it is more efficient to process in parallel multiple ‘source’ edgels, rather than the ‘target’ edgels of one ‘source’. That is, instead of creating a long instruction to examine in parallel all 9 candidate ‘targets’ of a single ‘source’ edgel, we create an instruction that handles 16 distinct ‘source’ edgels to examine one distinct ‘target’ for each. Hence, the entire search involves 9 such instructions per 16 ‘source’ edgels. This technique is more efficient because, first, we utilize all 16 lanes of NEON's datapath instead of 9, second, we improve the data access patterns, and third, we manage to incorporate the sequential portion of the algorithm (determine the nearest match by comparing distances) in the logic of the already 9 serialized steps (instead of inserting extra instructions after the parallel examination of 9 candidates). The inefficient technique results in practically zero acceleration due to penalties such as ARM↔NEON register transferring, NEON pipeline hazards, or unconditional fetching of all 9 candidates without early termination. In contrast, our optimized technique decreases the execution time of matching to 13 msec.
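
The essence of the restructuring is the loop order: the outer loop runs over the 9 candidate offsets and the inner loop over a batch of 16 consecutive ‘source’ edgels, so that the inner loop maps onto the 16 lanes of a NEON vector. The scalar C sketch below conveys the idea in simplified form; unlike the real matcher, it assumes the 16 sources of a batch share the same candidate layout, and all names and data structures are ours.

    #include <stdint.h>

    #define BATCH 16          /* sources processed together: one per NEON lane */
    #define NCAND 9           /* candidate 'target' positions per source       */
    #define NO_MATCH 0xFF

    /* For one batch of 16 'source' edgels, find for each the first candidate
     * 'target' with the same orientation. cand[k][i] is candidate k of source i,
     * with candidates ordered by increasing distance. With this loop order the
     * inner loop is a 16-wide compare-and-select that a vectorizer (or hand-
     * written NEON intrinsics) can turn into a handful of SIMD instructions. */
    void match_batch(const uint8_t src[BATCH],
                     const uint8_t cand[NCAND][BATCH],
                     uint8_t match[BATCH])
    {
        for (int i = 0; i < BATCH; i++)
            match[i] = NO_MATCH;

        for (int k = 0; k < NCAND; k++)            /* 9 serialized steps   */
            for (int i = 0; i < BATCH; i++)        /* 16 lanes in parallel */
                if (match[i] == NO_MATCH && cand[k][i] == src[i])
                    match[i] = (uint8_t)k;         /* keep the nearest candidate */
    }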

HW/SW Partitioning

Considering all of the above results on the PS of Zynq, we partition our algorithms by prioritizing the criterion of time: our constraint is 10 FPS or 100msec per frame. Since NEON provides acceleration in the area of 3x, rendering and edge detection remain one order of magnitude away from the requirements and must be accelerated by FPGA with a target speedup of 50−100x. Such acceleration is feasible due to the nature of both algorithms involving highly-parallel operations and/or streaming processing on pixel basis. One exception is the hysteresis function, which includes an inherently sequential tracing of edges; however, due to the low volume of weak edgels in our images (1−3K) and the possibility to relax the order of tracing, hysteresis can be pipelined without delaying the FPGA execution. Moreover, the arithmetic of detection is ideal for VHDL implementation as it relies on custom-length fixed-point variables and logical operations. In contrast, the arithmetic of rendering is very demanding and our word-length exploration showed that it is ineffective to revert from floating- to fixed-point operations in most of its functions (high dynamic range variables), except for checking pixel coordinates for inclusion in a triangle, where we managed to adapt to 17-bit words. As a result, rendering is expected to have significantly increased FPGA cost.

Further partitioning of these two algorithms is avoided due to the high I/O rates of their internal functions (Table I). Isolating such a function would require PS-PL communication of several Gbps to complete on time, e.g., within a 10msec goal. The only exception is the initialization of rendering, which inputs limited information to calculate the camera matrix and optical center of a particular frame. This calculation is based on expensive and infrequently employed mathematical operations, which when implemented on HW increase the already huge cost of rendering by 33% (Section VI). Hence, in the second iteration of our methodology (Section V, step viii), the initialization was irrevocably mapped on CPU. Lastly for rendering and detection, we note that their memory requirements challenge the FPGA implementation and are tackled via certain optimization techniques (Section V-B), such as word-length minimization of data stored on-chip, on-the-fly processing with minimal buffering, image splitting in stripes and sliding windows.

Despite the time decrease achieved via SIMD on NEON, matching was also mapped on FPGA in our final partitioning.

Fig. 4. HW architecture at function/data level: partitions and scheduling (top), interconnection/dataflow of SW & HW components at PS & PL side (bottom).

The highly streaming nature of its process allows us to install it in a pipelined fashion after detection and practically diminish its time cost via limited FPGA resources (we process the data on-the-fly while they are being sent to the PS anyway).

Finally, the pose estimation algorithm was mapped on CPU because, a) fine-tuning restricts execution to a manageable amount of time, b) it has increased programming complexity. The latter involves irregular paths in the computation graph with calls to specialized math libraries (BLAS/LAPACK) and expensive arithmetic operations (trigonometry, logarithms, etc). Such algorithms consume an excessive amount of HW resources with low HW utilization, i.e., they result in inefficient HW design as in the above case of rendering's initialization. We note that, in the context of this paper, we also implement imaging on CPU and we pipeline it with the processing of the previous image (feasible since it requires less than 100msec).

B. HW architecture and scheduling

Based on our analysis, we proceed to system-level HW/SW architecture design. Fig. 4, top, summarizes the HW/SW partitioning and the proposed scheduling of operations. First, each new frame is stored temporarily in the external memory of the system. Subsequently, we execute rendering on FPGA. To tackle the limited on-chip RAM challenge, we divide the depth image in 4 disjoint stripes of 1024×256 16-bit pixels each, and we render the stripes sequentially by reusing the same FPGA resources. The rendering component utilizes a 1024×256 buffer to store the depths, which are updated by the algorithm in an unpredictable order preventing their continuous streaming to the remainder of the pipeline. Upon completion, the stripe is forwarded pixel-by-pixel to the edge detection component, which processes the depth image on-the-fly and decreases storage requirements from 16 to 5 bits per pixel. This decrease enables us to store on-chip an entire 1024×1024 edgel map and then trace its edges seamlessly via hysteresis. Notice that rendering and edge detection are executed pseudo-parallely, in 4 steps, with detection operating while the depths are being sent to the CPU (to extract information about the control points after perpendicular matching).

The next phase in scheduling concerns edge detection on the intensity image. The 1024×1024 8-bit pixels are sent from PS to PL, where they are forwarded directly to our edge detection component for on-the-fly processing. We reuse the same HW component, but switch to a distinct buffer to store the second edge map. Before and after detection, the HW component exchanges information with the CPU related to hysteresis (thresholds and histogram). Upon completion, the maps of the intensity and depth images are forwarded in raster-scan order, in parallel, to the perpendicular matching component, which finds their matching edgel (x, y) coordinates on-the-fly and sends them to the CPU. Note that, for all components, our architectural approach to perform pipelining and on-the-fly processing of the data while being transferred, with minimal buffering and balanced throughputs, leads to substantial masking of the communication time and improved HW utilization via parallelization at task-level.

The last part of scheduling concerns processing on CPU. Each (x, y) match is augmented with information regarding depth and derivatives (calculated by SW locally at (x, y)) to form a control point input to pose estimation. This augmentation on SW alleviates the need to store excessive amounts of depths and derivatives on HW until after matching, at the expense of a negligible time cost (‘auxiliary SW’, Table II). In addition to pose estimation, CPU handles the Canny thresholds and the rendering initialization for the next image.

The architecture is depicted in Fig. 4, bottom. The buses/connections among the 3 HW accelerators facilitate the above schedule. The most notable on-chip memories are indicated by the 1 plus 3 rounded boxes shown in Fig. 4. The latter 3 have similar size and store the two edge maps and the depth image stripe, as explained above. The former 1 stores the model's vertices. In particular, the model's triangles are stored at the PS and are streamed to the HW renderer upon every call, whereas the model's vertices are stored locally in ROM to support their random access pattern. The PS-PL communication is realized via the SoC's AXI interconnect. Overall, our HW-SW integration involves two clock domains on FPGA, one at 100MHz for PS-PL communication and another at 200MHz for the accelerators, a custom HW arbiter handling and interfacing the diverse ports of our HW components, and SW drivers for AXI. The following subsections describe the design and VHDL development of each HW component.

1) PS-PL Communication: To transfer data between the CPU and VHDL accelerators, we utilize the AXI4-lite and AXI4-stream of Zynq. In particular, we use AXI4-lite to control/synchronize our HW and SW components via small custom packets. Each packet holds a single 32-bit word and is sent through the hierarchy of AMBA interconnections of PS. The packets transfer reset & sync commands and Canny's threshold values.

Fig. 5. Proposed HW architecture of Canny, with deep pipelining, on-the-fly processing, limited use of buffers, and recursive edge tracing with a small stack.

We prefer AXI4-lite for these small transfers, because it offers direct access to PL with limited cycles latency and an overhead of one handshake per word.

For the most important part of communication we use the AXI4-stream. Targeting high bandwidth, this link relies on Direct Memory Access (DMA) and direct connection to the memory controller of PS. Each stream consists of multiple consecutive 32-bit words sent at 100 MHz. The overhead of AXI4-stream is the additional configuration of DMA at the start of each transfer. Hence, we use AXI4-stream only for high volume data to avoid handshakes between words. At each iteration (Fig. 4), we transfer successively, a) 38 bytes to initialize rendering, b) 444 Kbytes for the satellite's model, c) the 2-Mbyte depth image, d) the 1-Mbyte intensity image, e) a histogram of 8 Kbytes, and f) matches of ∼10 Kbytes.

On top of the SoC parts embedded in Zynq, to facilitate the above PS-PL links, we install VHDL and SW drivers [30]. For crossing clock domains on PL and assembling the 32-bit words of AXI4, we employ 4 dual-clock FIFOs and a custom VHDL arbiter. We developed the arbiter as a Finite State Machine (FSM) to also incorporate the aforementioned schedule a–f and execute the algorithmic functions in sync with the CPU. Overall, we measured the communication at ∼3 Gbps. Such bandwidth implies 5.6 msec to transfer a 16-bit 1024×1024 depth image and is in balance with the processing throughput of our VHDL edge detection, i.e., our schedule is properly designed towards efficient masking/parallelization at task level.

2) Perpendicular Matching: The VHDL component is custom designed for pipelined processing of the data while being transferred to PS. It inputs the two maps in raster-scan order, 1+1 edgel per cycle, and outputs a 32-bit packet designating any occasionally detected match (ultimately, only the matches are sent to the CPU). Internally, it utilizes a serial-to-parallel (S2P) structure to extract the 9×9 region of each ‘target’ edgel entering the pipeline. The S2P consists of 9 FIFO RAMB18s and 9×9 registers connected linearly in a deep pipeline, as shown in the leftmost box of Fig. 5 for the case of Canny. The 9×9 registers output a window of 81 edgels which, essentially, slides over the image map by one location per cycle. Each window is forwarded in parallel to a big multiplexer controlled by the orientation of the corresponding ‘source’ edgel entering the component. Notice that, for synchronization purposes, the 4-bit ‘source’ edgels enter in one RAMB36 FIFO properly configured by Xilinx tools for depth=5095. The multiplexer outputs 9 candidate edgels, which are compared in parallel to the ‘source’ edgel. The comparators are connected via a multiplexing structure giving precedence to the middle candidates. In case of equal orientation at distance ∈ [−4, 4], the component outputs 〈x, y, distance〉, where x, y are being tracked via counters synchronized to the input rate. Therefore, with minimal on-chip memory and negligible logic resources (see Table II of the evaluation section), the proposed unit completes perpendicular matching, on-the-fly, with exactly the same accuracy as the CPU but eliminating the SW time cost.
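A behavioral C model of the matching rule is sketched below: for each ‘source’ edgel, the ‘target’ map is probed at offsets −4..+4 along the direction perpendicular to the source orientation, and the closest offset with equal orientation is reported. The perpendicular step table and the precedence order are illustrative assumptions.

```c
/* Behavioral model of the perpendicular-matching rule. The 'target' edge
 * map holds an 8-way orientation per edgel (0xFF = no edgel); the step
 * table (perpendicular to each quantized orientation) and the precedence
 * order are illustrative assumptions. */
#include <stdint.h>

#define W 1024
#define H 1024
#define NO_EDGEL 0xFF

static const int perp_dx[8] = { 0, 1, 1,  1,  0, -1, -1, -1 };
static const int perp_dy[8] = { 1, 1, 0, -1, -1, -1,  0,  1 };

/* Returns 1 and the signed distance if a target edgel with the same
 * orientation o lies within 4 pixels along the perpendicular direction. */
static int perp_match(const uint8_t tgt[H][W], int x, int y, uint8_t o, int *dist)
{
    for (int r = 0; r <= 4; ++r) {          /* middle candidates first */
        for (int s = -1; s <= 1; s += 2) {
            int d  = (s < 0) ? -r : r;
            int xx = x + d * perp_dx[o];
            int yy = y + d * perp_dy[o];
            if (xx >= 0 && xx < W && yy >= 0 && yy < H && tgt[yy][xx] == o) {
                *dist = d;
                return 1;
            }
            if (r == 0) break;              /* d = 0 is tested only once */
        }
    }
    return 0;
}
```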

C. VHDL development: Edge Detection

To detect edges in both the intensity and depth images, we design on HW the Canny algorithm customized in Section IV. We implement on FPGA one module configured for 16-bit 1024×1024 images. The module inputs either 16-bit depth values, or 8-bit pixels from the camera with zero-bit-padding. Processing is performed in an almost streaming fashion with a window sliding down the image, alleviating the need for more sophisticated data handling. In both cases, it outputs a 1024×1024 4-bit map, where each pixel denotes the existence of an edgel and its 8-way quantized orientation.

From a HW design perspective, the algorithm is divided into three parts: a) the fully streaming calculation of gradients and non-maximal suppression, b) the statistics on gradients for threshold calculation, which implies a second iteration/pass of the image, i.e., local storage that would cancel the benefits of the above streaming operation, and c) the hysteresis requiring a recursive computation on the initial edgel map. Each part is tackled individually as explained next.

First, during the image transfer to PL, the gradients are computed on-the-fly inside a deep pipeline with a throughput of one pixel/gradient per cycle. As shown in the leftmost part of Fig. 5, the pixels enter in burst mode, in raster-scan order, and are forwarded to two distinct components for calculating in parallel the Ix and Iy partial image derivatives. Based on Sobel’s 3×3 operator, these two components implement the same 2D convolution, but with their kernel rotated by 90°. To facilitate streaming processing and avoid a second pass of the image, we disregard the separability of the filter; instead, we employ a serial-to-parallel buffer to output one 3×3 pixel window per cycle (the S2P buffer is shared between Ix and Iy, Fig. 5, left). With each new pixel entering Canny, the 3×3 window slides in raster-scan order and allows full parallelization of the 3×3 kernel multiplication. The 2D multiplication is implemented via hardwired, parallel, bit-shift operations and the results are accumulated by a 4-stage adder tree to generate one 〈Ix, Iy〉 vector per cycle. This 32-bit vector is forwarded to the next component, which computes the 16-bit magnitude and 3-bit orientation of each gradient.
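The following C reference model captures the arithmetic of this stage (one 3×3 Sobel window per pixel, the same kernel rotated by 90° for Ix and Iy); border handling by clamping is an illustrative choice and the model ignores the streaming/pipelining aspects.

```c
/* Reference model of the gradient stage: one 3x3 Sobel window per pixel,
 * identical kernels rotated by 90 degrees for Ix and Iy. Border clamping is
 * an illustrative choice; the FPGA maps the +-1/+-2 taps to shifts/adds. */
#include <stdint.h>

#define W 1024
#define H 1024

static inline int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

static void sobel_gradients(const uint16_t img[H][W],
                            int32_t Ix[H][W], int32_t Iy[H][W])
{
    static const int kx[3][3] = { { -1, 0, 1 }, { -2, 0, 2 }, { -1, 0, 1 } };
    static const int ky[3][3] = { { -1, -2, -1 }, { 0, 0, 0 }, { 1, 2, 1 } };

    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x) {
            int32_t gx = 0, gy = 0;
            for (int j = -1; j <= 1; ++j)
                for (int i = -1; i <= 1; ++i) {
                    int p = img[clampi(y + j, 0, H - 1)][clampi(x + i, 0, W - 1)];
                    gx += kx[j + 1][i + 1] * p;
                    gy += ky[j + 1][i + 1] * p;
                }
            Ix[y][x] = gx;
            Iy[y][x] = gy;
        }
}
```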


More specifically, the magnitude Gm = √(Ix² + Iy²) is calculated by a pair of multipliers and a square-root unit, altogether forming a 24-stage pipeline with continuous flow. In parallel, the orientation Go ∈ [0, 7] is calculated by comparing the signed tangent of the input vector to the tangents of 3π/8 and π/8. For optimization purposes, we replace the Iy/Ix divider with constant multipliers (for Ix·tan 3π/8 and Ix·tan π/8) integrated in an 8-stage pipeline together with 6 comparators and extra registers synchronizing Go to Gm. Subsequently, each Gm value is conditionally suppressed by examining its 2 neighbors in the direction of Go. In particular, a second serial-to-parallel buffer extracts the 3×3 region of each gradient (1 per cycle). All 9 values are forwarded in parallel to a multiplexer controlled by Go, which in turn feeds 3 comparators determining whether the current Gm is a local maximum. Afterwards, in the 〈Gm, Go〉′ stream (Fig. 5), we compare the unsuppressed Gm to two thresholds for distinguishing between ‘weak’ and ‘strong’ edgels. The thresholding operation also decreases storage requirements to 5 bits per pixel. Streaming terminates at the RAM storing the edge map, which is initially filled in 1024×1024 cycles.
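A behavioral sketch of the division-free orientation quantization is given below; it reproduces the comparisons against Ix·tan π/8 and Ix·tan 3π/8 in floating point, while the particular sector numbering is an assumption.

```c
/* Illustrative division-free 8-way orientation: instead of atan(Iy/Ix),
 * compare |Iy| against |Ix|*tan(pi/8) and |Ix|*tan(3pi/8) and resolve the
 * sector from the gradient signs. The sector numbering is an assumption. */
#include <math.h>
#include <stdint.h>

static uint8_t quantize_dir(int32_t ix, int32_t iy)
{
    const float T1 = 0.41421356f;   /* tan(pi/8)  */
    const float T3 = 2.41421356f;   /* tan(3pi/8) */
    float ax = fabsf((float)ix), ay = fabsf((float)iy);

    if (ay <= ax * T1)              /* near-horizontal gradient: 0 or 180 deg */
        return (ix >= 0) ? 0 : 4;
    if (ay >= ax * T3)              /* near-vertical gradient: 90 or 270 deg  */
        return (iy >= 0) ? 2 : 6;
    if (ix >= 0)                    /* diagonal sectors, resolved by quadrant */
        return (iy >= 0) ? 1 : 7;
    return (iy >= 0) ? 3 : 5;
}
```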

The second part of Canny concerns the once-per-image updating of the aforementioned hysteresis thresholds. The calculation uses a histogram of gradients of the current image. To avoid storing all gradients on-chip and to decrease Canny’s execution time in half, we exploit the temporal redundancy of successive frames: we calculate on-the-fly the threshold pair for the current image, but use it for the next image. Due to frame affinity, our tests show that successive threshold values differ by ∼10% and that our modification affects only 3% of each edgel set, whereas the overall result of pose estimation remains practically the same. As shown in Fig. 5, a histogram component monitors the gradients while flowing in the main pipeline. By using a read-write loop over a small memory (2 RAMBs), including registers and adders, we group the 16-bit values in 4K bins and also calculate their median. Results are sent to the CPU, which refines and handles the thresholds throughout the sequence with practically zero cost.
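A possible software analogue of this threshold update is sketched below; the 4096-bin grouping and median search follow the text, whereas the scale factors that turn the median into the (high, low) pair are illustrative assumptions.

```c
/* Sketch of the once-per-frame threshold update: histogram the 16-bit
 * gradient magnitudes into 4096 bins, locate the median and derive a
 * (high, low) hysteresis pair to be used on the *next* frame. The scale
 * factors turning the median into thresholds are illustrative assumptions. */
#include <stddef.h>
#include <stdint.h>

#define NBINS 4096                       /* 65536 / 4096 = 16 values per bin */

static void update_thresholds(const uint16_t *mag, size_t n,
                              uint16_t *hi, uint16_t *lo)
{
    uint32_t hist[NBINS] = { 0 };
    for (size_t i = 0; i < n; ++i)
        hist[mag[i] >> 4]++;

    size_t acc = 0;
    int median_bin = 0;
    for (int b = 0; b < NBINS; ++b) {    /* cumulative sum up to the median */
        acc += hist[b];
        if (acc >= n / 2) { median_bin = b; break; }
    }

    uint32_t median = (uint32_t)median_bin << 4;
    uint32_t h = 3 * median;             /* illustrative scale factors */
    *hi = (uint16_t)(h > 65535 ? 65535 : h);
    *lo = (uint16_t)(*hi / 2);
}
```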

The third part of Canny concerns edge tracing on the initial map (after thresholding, Fig. 5). Our analysis of the ENVISAT images shows a manageable amount of initial ‘strong’ edgels, e.g., 10K, with limited fetches of adjacent edgels, e.g., 100K. Therefore, to avoid scanning the entire map, we prefer storing all initial strong edgels in a local LIFO memory and then, recursively, fetching their neighbors to update the connected weak edgels. The former operation is performed on-the-fly, during the storage of the initial map, by monitoring the main flow to the map’s RAM. The latter is performed after the initial map is filled, as follows. An FSM continuously pops edgels from the local stack (1 per 8 cycles), fetches sequentially their 8 neighbors from the map’s RAM, and checks their strength without stalling the pop procedure. When a ‘weak’ edgel is fetched, it is pushed to the LIFO and it is marked as ‘strong’ in the map’s RAM. Notice that the push-pop order in the local stack is not strict (all adjacent edgels are eventually fetched) and allows continuous flow in a fully utilized pipeline. Edge tracing completes 10−20× faster than the initial map creation.
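A behavioral C model of the stack-based tracing is given below; the map encoding (0 = none, 1 = weak, 2 = strong) and the unbounded host-side stack are illustrative simplifications of the fixed-size LIFO used on the FPGA.

```c
/* Behavioral model of the stack-based edge tracing: seed a LIFO with all
 * 'strong' edgels, then pop edgels and promote any 8-connected 'weak'
 * neighbour. Encoding 0 = none, 1 = weak, 2 = strong is an illustrative
 * choice; weak edgels left unpromoted are discarded afterwards. */
#include <stdint.h>
#include <stdlib.h>

#define W 1024
#define H 1024

void trace_edges(uint8_t map[H][W])
{
    int32_t *stack = malloc(sizeof(int32_t) * W * H);  /* worst-case size */
    int top = 0;
    if (!stack) return;

    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            if (map[y][x] == 2)
                stack[top++] = y * W + x;

    while (top > 0) {
        int p = stack[--top], x = p % W, y = p / W;
        for (int j = -1; j <= 1; ++j)
            for (int i = -1; i <= 1; ++i) {
                int xx = x + i, yy = y + j;
                if (xx < 0 || xx >= W || yy < 0 || yy >= H)
                    continue;
                if (map[yy][xx] == 1) {      /* weak neighbour: promote & push */
                    map[yy][xx] = 2;
                    stack[top++] = yy * W + xx;
                }
            }
    }
    free(stack);
}
```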

Fig. 6. Proposed HW architecture of rendering, with 4x parallelization at pixel level, floating-point & deeply-pipelined components, and parallel memories.

D. VHDL development: Rendering

To generate depth images from a 3D model with the last estimated pose, we design via parametric VHDL the renderer customized in Section IV-B. For sufficient acceleration and in accordance to our analysis, we design our engine to project up to 1 triangle per cycle and render up to 4 pixels per cycle. To support such throughput, we rely on memory interleaving, deep pipelining at vertex/pixel level, and module parallelization. Moreover, due to the dynamic change of workload per model triangle, certain parts of the pipeline are designed to support arbitrary stalling. The proposed architecture (Fig. 6) consists of five parts: a) the I/O controller, b) the perspective projection of a triangle, c) generation of pixels inside the box bounding the triangle, d) depth computation of each pixel in the triangle, and e) updating of the depth image.

The I/O controller receives from CPU and stores all model vertices at the beginning of a sequence (the model can change during the chaser’s approach). Each 〈X, Y, Z〉 vertex is stored in three distinct dual-port RAM memories allowing the parallel fetch and fast projection of a triangle. With each new image, the controller receives the initialization values (intrinsic calibration matrix and pose R, t). With each new stripe, the controller receives all model triangles in a stream, up to one triangle per cycle, and forwards them in the processing pipeline. Note that, a) it is impractical to distinguish and send only a subset of triangles per stripe, and b) the overhead of handling all triangles with each new stripe is negligible, because the majority of the complexity lies in the processing of pixels and triangles projecting outside the current stripe are filtered out early in our pipeline. The input stream is paused depending on each triangle’s size (pixels to be rendered) to allow time for pixel generation to complete; stalling is achieved by using input buffers and carefully interconnected clock-enable signals throughout the pipeline.

The perspective projection of each triangle (Fig. 6) utilizes single-precision floating-point units, i.e., 11 multipliers, 9 adders and 1 reciprocal unit, which are all integrated in a 67-stage pipeline for fine-grained register-transfer level operations. We employ three such components to project in parallel all 3 vertices of a triangle. Hence, the component can project up to one triangle per cycle. In addition, we develop a fourth pipeline to project the geometric center of a triangle when it is too small to be reliably intersected by the casted ray. In practice, when not stalled, due to the input rate from AXI4 and the access time of the model’s RAM, the processing rate increases up to 1 triangle projection per 2 cycles.
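For reference, the per-vertex arithmetic amounts to the standard pinhole projection sketched below (zero skew assumed); on the FPGA the same operations are unrolled into the 67-stage floating-point pipeline, and the struct layout shown here is illustrative.

```c
/* Reference arithmetic for projecting one model vertex with the pinhole
 * camera (zero skew assumed): Xc = R*X + t, then u, v from the intrinsics
 * and a single reciprocal, as in the FPGA pipeline. */
typedef struct { float x, y, z; } Vec3;

typedef struct {
    float K[3][3];   /* intrinsic calibration matrix          */
    float R[3][3];   /* rotation of the last pose estimate    */
    float t[3];      /* translation of the last pose estimate */
} Camera;

static void project_vertex(const Camera *c, Vec3 X,
                           float *u, float *v, float *depth)
{
    /* camera-frame coordinates Xc = R*X + t */
    float xc = c->R[0][0]*X.x + c->R[0][1]*X.y + c->R[0][2]*X.z + c->t[0];
    float yc = c->R[1][0]*X.x + c->R[1][1]*X.y + c->R[1][2]*X.z + c->t[1];
    float zc = c->R[2][0]*X.x + c->R[2][1]*X.y + c->R[2][2]*X.z + c->t[2];

    /* perspective division via a single reciprocal */
    float inv_z = 1.0f / zc;
    *u = c->K[0][0] * xc * inv_z + c->K[0][2];
    *v = c->K[1][1] * yc * inv_z + c->K[1][2];
    *depth = zc;
}
```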

The pixel generation component (Fig. 6) inputs the projected triangle and outputs all pixels residing inside its bounding box. It is designed with 17-bit fixed-point arithmetic. First, with a small pipeline of comparators, it computes the projected triangle’s minimum and maximum image coordinates to examine whether its bounding box resides partially in the current image stripe. Second, it generates 4 neighboring pixels in parallel, either in a 4×1 or in a 2×2 vector of integer 〈x, y〉 pairs. We dynamically select between the row or square format depending on the size of the bounding box: if width>height and pixels>4, then we use the 4×1 rows, otherwise, we use 2×2 squares. Accordingly, by initializing two 11-bit counters, multiple quadruplets are generated continuously, one per cycle, until the entire box is scanned (the input of a new triangle pauses here, if necessary). Indicatively, for ENVISAT at 30m, a bounding box can include up to 22K pixels, although half of the bounding boxes include up to 40 pixels.

The four generated pixels are forwarded to four modules computing their distance from the camera, in parallel (Fig. 6). Each module utilizes 6 fixed-point multipliers, 14 fixed-point subtractors, 2+1 fixed-float converters, 42 floating-point multipliers, 38 floating-point adders/subtractors, 2 floating-point dividers, 1 floating-point square root unit, and 1 floating-point reciprocal unit. All these units are integrated in a 307-stage pipeline. Functionally, it involves a check on a pixel’s validity (whether it is within the projected triangle) and a series of algebraic operations on vectors followed by a Euclidean distance calculation to estimate its depth [28]. Each of the four components outputs the depth by converting to a 16-bit fixed-point value, and hence, decreasing storage requirements in half. In addition to the four calculators (Fig. 6), we employ a fifth component to handle small triangles that cannot be intersected by the casted ray. Similarly, this component utilizes 23 floating-point units integrated in a 124-stage pipeline.
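A compact C reference of the underlying ray-triangle test of [28] is given below; for a unit-length ray direction, the returned t is the Euclidean distance of the intersection, i.e., the depth value computed by each module (the vector types and epsilon are illustrative).

```c
/* Reference C version of the Moller-Trumbore ray-triangle test [28]:
 * returns 1 on a hit and the distance t along the (unit-length) ray. */
#include <math.h>

typedef struct { float x, y, z; } Vec3;

static Vec3  vsub(Vec3 a, Vec3 b) { Vec3 r = { a.x-b.x, a.y-b.y, a.z-b.z }; return r; }
static float vdot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
static Vec3  vcross(Vec3 a, Vec3 b)
{
    Vec3 r = { a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x };
    return r;
}

static int ray_triangle(Vec3 orig, Vec3 dir, Vec3 v0, Vec3 v1, Vec3 v2, float *t)
{
    const float EPS = 1e-7f;
    Vec3 e1 = vsub(v1, v0), e2 = vsub(v2, v0);
    Vec3 p  = vcross(dir, e2);
    float det = vdot(e1, p);
    if (fabsf(det) < EPS) return 0;          /* ray (nearly) parallel to triangle */

    float inv = 1.0f / det;
    Vec3  s   = vsub(orig, v0);
    float u   = vdot(s, p) * inv;
    if (u < 0.0f || u > 1.0f) return 0;

    Vec3  q = vcross(s, e1);
    float v = vdot(dir, q) * inv;
    if (v < 0.0f || u + v > 1.0f) return 0;

    *t = vdot(e2, q) * inv;                  /* distance along the ray */
    return *t > 0.0f;
}
```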

The final module updates the depth image (Z-buffer) via 16-bit fixed-point arithmetic and a parallel memory organization that allows conflict-free access to any pixel quadruplet. The depth memory utilizes 4 banks {A, B, C, D} interleaved in the x direction and shifted linearly by 2 in the y direction of the image (i.e., ABCDABCD... for the first image row, CDABCDAB... for the second row, etc.), so that any 4 neighboring pixels are mapped to distinct banks. The pixel located at (x, y) is stored in bank(x, y) = ((y mod 2)·2 + x) mod 4, at address(x, y) = y·256 + x div 4.

Fig. 7. Pose estimation (alignment error) of our HW/SW system for three ENVISAT distances. The HW/SW accuracy is practically equal to that of SW (cf. 3 similar versions plotted with thin lines). In all 3 cases, error < 40cm.

At each cycle, we compare the 4 stored values to the 4 newly calculated distances, and we update the RAM contents with the smaller values. The update component is pipelined to operate seamlessly even for successive reads/writes to the same RAM address. Upon completion of the algorithm, the stripe is output in raster-scan order and the I/O controller clears the 4-bank memory.
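The mapping can be checked exhaustively with a few lines of C, confirming that every 4×1 and 2×2 quadruplet inside a 1024×256 stripe touches four distinct banks, so the four depth calculators can access the Z-buffer in the same cycle without conflicts.

```c
/* Exhaustive check of the 4-bank Z-buffer mapping for a 1024x256 stripe:
 * every 4x1 and 2x2 pixel quadruplet must map to four distinct banks. */
#include <assert.h>
#include <stdio.h>

static int bank(int x, int y)    { return ((y % 2) * 2 + x) % 4; }
static int address(int x, int y) { return y * 256 + x / 4; }

int main(void)
{
    for (int y = 0; y < 256; ++y)
        for (int x = 0; x < 1024; ++x) {
            if (x + 3 < 1024) {                       /* 4x1 quadruplet */
                int m = (1 << bank(x, y)) | (1 << bank(x + 1, y)) |
                        (1 << bank(x + 2, y)) | (1 << bank(x + 3, y));
                assert(m == 0xF);
            }
            if (x + 1 < 1024 && y + 1 < 256) {        /* 2x2 quadruplet */
                int m = (1 << bank(x, y)) | (1 << bank(x + 1, y)) |
                        (1 << bank(x, y + 1)) | (1 << bank(x + 1, y + 1));
                assert(m == 0xF);
            }
        }
    printf("example: pixel (5, 3) -> bank %d, address %d\n", bank(5, 3), address(5, 3));
    return 0;
}
```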

VI. SYSTEM EVALUATION

We implemented the entire vision-based navigation system on the Xilinx Zynq 7Z100 FPGA hosted on a Mini Module Plus board. We employ a pair of synthetic sequences which realistically simulate the motion of ENVISAT and the solar illumination. They have a length of 1000 frames each, a resolution of 1024×1024 8-bit pixels, 40° FoV and known ground truth pose. In the first sequence, ENVISAT rotates out of plane and remains at approximately 50 meters from the camera. In the second, the camera is initially 30m away from ENVISAT and gradually approaches to 20m. We note that, since the camera specifics/interface are not the primary concern of this work, we assume pre-stored images being fetched and pipelined by CPU0 in less than 82msec each (Section V-A). We also note that, since this work concerns primarily the high-performance potential of the proposed system, which is examined prior to verifying its mission-dependent reliability, we omit demonstrating here any additional radiation mitigation technique for Zynq [24].

Our evaluation begins with the accuracy of the HW/SW implementation. As expected, due to HW optimizations such as floating- to fixed-point arithmetic modifications and word-length minimization, the HW/SW output deviates slightly from the initial all-SW output. In particular, Canny on HW generates 96−98% the same sets of edgels as Canny on SW, with their equality increasing to 99% when we consider only the strongest edgels.


TABLE II
FINAL SYSTEM: RESOURCES ON SOC FPGA XC7Z100IFFG900-2L

component         LUT      DFF      DSP    RAM36   time (ms)
rendering         93383    148071   966    224     11±4
edge detection    2948     3174     4      346.5   6
perp. matching    298      389      -      5.5     5
commun.+misc.     2895     3777     -      6       3 Gbps
pose estimation   ARM core #1                      55±10
auxiliary SW      ARM core #1                      0.5
total             99524    155411   970    582     83±14
                  (36%)    (28%)    (48%)  (77%)   (∼12 fps)

Therefore, since usually the most prominent edgels are matched between successive frames, we get up to 99.9% the same matches by the HW and SW versions (notice that the perpendicular matching function on HW is fully equivalent to SW). The HW rendering outputs a projected model, the shape of which is 99.7% equal to the SW output, with 96% of the pixels having less than 1mm difference in depth (and 98% having less than 10mm). In total, when we combine all three HW modules, we derive sets of control points on ENVISAT which are 90% identical to the sets derived by the SW version. Therefore, as Fig. 7 shows qualitatively for three representative segments of the sequences, the pose estimates of the entire HW/SW system are practically equal to those of the original SW. Quantitatively, when taking into account our entire dataset, the all-SW and the HW/SW results have the same error in terms of mean, std and range. More specifically, when considering the alignment metric of Eq. (4), we obtain a mean error of 17cm when ENVISAT is 50m away from the camera, which decreases to 7cm at 20m distance. This alignment error comprises 2−50cm position error and 0.5°−3° rotation error, with the highest deviations presented at far distances. Overall, the alignment error of our HW/SW system is around 0.5% and less than the desired 1% of the camera–ENVISAT distance. With the additional metrics described in Sec. IV-D, our pose tracker has a mean relative position error of 0.53% and a mean rotation error of 0.51°. We note that at very short ranges (i.e., 10m or less), and due to most of ENVISAT and its edges being outside the camera field of view, the proposed algorithm might need to be replaced by one relying on more abundant visual features, possibly from a different sensor modality such as LIDAR.

Table II reports the FPGA resources required by each component and the utilization of Zynq xc7z100. In terms of logic, rendering proves to be the most expensive algorithm, by one order of magnitude, with 822 DSPs and 82K LUTs devoted to calculating the distance-to-triangle of the ENVISAT model. This is mainly due to the floating-point arithmetic and the 4x parallelization employed at pixel level. Edge detection with fixed-point arithmetic is far less demanding, with half of its logic resources devoted to the calculation of gradients. Memory is more reasonably distributed between rendering and edge detection, partly due to our image partitioning technique (Section V-B), which also tackles the RAMB size problem and enables us to fit all functions in a single FPGA. Out of the 347 RAMBs in edge detection, 160 are devoted to the edge map of the intensity and 160 to the map of the depth image (both 1024×1024 5-bit), whereas 20 are for the stack memory of hysteresis. Out of the 224 RAMBs in rendering, 128 are devoted to storing a 1024×256 16-bit stripe of the depth image, whereas 96 are for storing the vertices of ENVISAT’s model (20K vertices for 35K triangles). The remaining memory utilization on FPGA was significantly decreased via on-the-fly processing of nodes/pixels/edgels and avoiding intermediate buffering. Perpendicular matching has negligible cost, mainly due to the relatively small parallelization and straightforward operations involved. Communication is more costly due to the complexity of AXI4-stream (3−4x vs lite), e.g., for DMA and multiplexing of read/write channels. We mention here that in rendering, the frame initialization was implemented with 31K LUTs and 312 DSPs to gain only 0.1 msec, and hence, was omitted from the PL (Section V-A). We also note that, when considering smaller FPGAs, i.e., the Zynq 7045, the RAMBs and DSPs are over-utilized by ∼7%; however, the 55% free CLBs make plausible the tuning/rearrangement of resources in order to fit in a single xc7z045 device.

To analyze execution time, the last column of Table II reports the milliseconds spent per function for a 1 Mpixel image. The FPGA components operate at 200MHz, communication at 100MHz, whereas ARM operates at 667MHz. Depending on the apparent size and the number of control points detected on ENVISAT, we get the reported fluctuation of time during rendering and pose estimation. In total, considering that edge detection is executed twice, the average time to complete an entire frame is 83msec. The average time increases from 71msec at 50m away from ENVISAT to 90msec at 20m (in terms of complexity, the amount of computations increases by up to 87% during ENVISAT’s approach). Therefore, our system achieves a processing rate of 10−14 FPS. We note that it is straightforward to increase this rate by another 2−3 FPS by decreasing SW time, i.e., by configuring the CPU to operate at 800MHz and/or by further limiting the number of control points in pose estimation (now 1300 per frame, could be suppressed down to 1000). We also note that, when assuming higher resolution images input to our HW/SW system, the SW execution time would remain roughly the same (with the number of control points at 1300), whereas the HW time would increase almost proportionally to the number of pixels. Hence, our system would achieve a processing rate of more than 5 FPS even with 4 Mpixel images.

Further analysis for the 1 Mpixel images shows that, despite having mapped ∼96% of the computation on FPGA, due to the relatively slow speed of ARM, time is almost fairly distributed between PS and PL with a 2:1 ratio. The average pipeline utilization on FPGA is 2/3 when PL is engaged in processing. Owing to the efficient scheduling of FPGA operations and on-the-fly processing of data while being transferred from/to the CPU, we achieve 92% masking of the PS-PL communication time. As a result, we decrease the communication overhead to only 1msec, i.e., almost 1% of the total time, and we gain up to 11msec, increasing our FPS rate by one extra frame. In total, when comparing to the all-SW execution on Zynq’s ARM, we achieve a speedup factor of up to 19x for the entire HW/SW system. Individually for each kernel, we accelerated rendering by 50−62x and edge detection by 54x, or even 75x when assuming the maximum clock frequency of 278 MHz. Such acceleration factors are critical and, as shown here, can be achieved via parallelization on FPGA when targeting one order of magnitude faster execution than the 1 Mpixel frame-per-second capabilities of latest rad-hard CPUs [6].

To evaluate energy consumption, we combined the power estimator tool of Xilinx and real HW measurements via the UCD9248PFC controller chip hosted on a Zynq ZC702 board. When PS operates at 667MHz and PL at 200MHz, the ARM processor consumes 1.6W, whereas the static power on xc7z100’s PL is 0.4W. The dynamic power on PL ranges between 6−7W. Therefore, given the PS-PL utilization time of our algorithms, the total power consumption ranges between 2−9W on MMP and is averaged over multiple frames at 4.3 Watts. Such power and size, i.e., only 5.7 × 10.2 cm² and 65 gr, make the proposed embedded system highly suitable for many cost-effective LEO applications in space.

Comparison to previous works

Compared to similar works in the literature, our VBN provides state-of-the-art HW efficiency (i.e., throughput per resources) at component level, high accuracy at algorithmic level, and improved miniaturization/readiness at system level. We resort here to this distinction due to the practical difficulties of performing fair/complete comparisons between diverse and scarce publications about the examined VBN scenario. First, at component level, even with recursive hysteresis, our 8-bit Canny utilizes half the FPGA resources compared to [31] (it achieves 4x higher throughput, which however is not necessary in our case). Even in comparison with FPGA Canny components specifically targeting low-cost/power implementation such as [32], we achieve almost the same HW efficiency (20% faster execution with 40% extra LUTs, when normalized to the same frequency, image size and Xilinx technology). Similar conclusions can be drawn for our VHDL renderer, despite performing more coarse comparisons due to the differences in algorithms and object models. That is, by extrapolating to the same clock frequency and number of triangles/data, we get almost equal throughput per LUT with [33] and [34]. Second, at algorithmic level, the results presented above show that our 0.53% relative position and 0.51° rotation errors are similar to or lower than the 1%-5% and 1°-5° errors reported in the literature (cf. Sec. II). Moreover, our pose tracking algorithm was shown in [20] to perform much better compared to two state-of-the-art trackers. Third, at system level, in contrast to [17], we estimate the full 6D pose of the satellite, we accelerate a much larger part of the computation on FPGA, and we deliver a self-contained embedded system on a compact board, which is more representative of space applications. This self-containment advantage is also evident when compared with the VIBANASS system [35], which requires additional ground-segment equipment to perform image processing with slightly higher error (i.e., ∼1%). Likewise, compared to the ADR demonstration mission of [36] that proclaims an error of ∼0.4% at 25m, we achieve similar accuracy in 6D pose estimation with edge-based tracking, but we already accelerate the algorithm on FPGA and provide a high-performance, space-deployable VBN engine/solution on an embedded SoC ([36] does not perform image processing on-board the satellite).

VII. CONCLUSION

The paper has presented the development on SoC FPGA of a complete HW/SW embedded system for novel VBN applications in space. Targeting future ADR mission requirements, in a top-down co-design approach, we have proposed a high-level architecture exploiting the structure of Xilinx Zynq7000, we refined our computer vision algorithm for tracking the pose of ENVISAT, and we devised a HW/SW implementation methodology to customize and efficiently map all functions to the SoC FPGA. Among others, at the expense of a 3−16x complexity increase, we introduced 3D mesh rendering to enable accurate pose tracking with arbitrary models without line visibility processing. We tackled the computational challenges via a combination of techniques at the architecture and VHDL level. To decrease on-chip memory by at least 6x, we relied on resource reuse among distinct image stripes and windows, word-length minimization, as well as minimal buffering during stream data processing. To decrease time by up to 19x, we employed deep pipelining, module replication, parallel memory organization, tight coupling of PS and PL functions, as well as task-level parallelization and balanced scheduling. The resulting VBN system achieves a throughput of 10−14 FPS for 1 Mpixel images, with only 4.3 Watts mean power and 1U size, tracking its target in real-time and with a mean error of 0.5% of the distance between the camera and the target. Therefore, via efficient utilization of the underlying HW and state-of-the-art competency at function level, the proposed low-power/size but high-rate/definition VBN engine improves conventional space avionics by an order of magnitude in terms of throughput and performance per Watt.

ACKNOWLEDGMENTS

The authors thank Marcos Aviles R. and David Gonzalez-Arjona from GMV, Spain, for providing the ENVISAT test data and feedback on the system design approach, as well as Gianluca Furano from ESA/ESTEC. This work was partially supported by ESA via project “HIPNOS” (High Performance Avionics Solution for Advanced and Complex GNC Systems, ref. 4000117700/16/NL/LF) and NTUA via project “I-CoVo”.

REFERENCES

[1] A. Long, M. Richards, and D. E. Hastings, “On-orbit servicing: A new value proposition for satellite design and operation,” J. Spacecr. Rockets, vol. 44, pp. 964–976, 2007.

[2] A. Flores-Abad et al., “A review of space robotics technologies for on-orbit servicing,” Prog. Aerosp. Sci., vol. 68, pp. 1–26, Jul. 2014.

[3] R. Biesbroek, L. Innocenti, A. Wolahan, and S. M. Serrano, “e.Deorbit – ESA’s active debris removal mission,” in Eur. Conf. Space Debris, 2017.

[4] J. Kelsey et al., “Vision-based relative pose estimation for autonomous rendezvous and docking,” in IEEE Aerosp. Conf., 2006, pp. 1–20.

[5] C. English et al., “Real-time dynamic pose estimation systems in space: Lessons learned for system design and performance evaluation,” Intl. J. Intell. Contr. Syst., vol. 16, no. 2, pp. 79–96, 2011.

[6] G. Lentaris et al., “High-performance embedded computing in space: Evaluation of platforms for vision-based navigation,” J. Aerosp. Inf. Syst., vol. 15, no. 4, pp. 178–192, 2018.


[7] V. Lepetit and P. Fua, “Monocular model-based 3D tracking of rigid objects: A survey,” Found. & Trends Comp. Graph. Vision, vol. 1, 2005.

[8] C. Harris, “Tracking with rigid objects,” in Active Vision, A. Blake and A. Yuille, Eds. MIT Press, 1992, pp. 59–73.

[9] M. Armstrong and A. Zisserman, “Robust object tracking,” in Asian Conf. on Computer Vision, vol. I, 1995, pp. 58–61.

[10] T. Drummond and R. Cipolla, “Real-time visual tracking of complex structures,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7, pp. 932–946, 2002.

[11] A. I. Comport et al., “Real-time markerless tracking for augmented reality: The virtual visual servoing framework,” IEEE Trans. Vis. Comput. Graph., vol. 12, no. 4, pp. 615–628, 2006.

[12] K. Kanani et al., “Vision based navigation for debris removal missions,” in 63rd Intl. Astronaut. Congr., 2012, IAC-12.

[13] A. Petit, “Robust visual detection and tracking of complex objects: applications to space autonomous rendez-vous and proximity operations,” Ph.D. dissertation, Universite Rennes 1, Dec. 2013.

[14] Y. Zhang et al., “Comparative study of visual tracking method: A probabilistic approach for pose estimation using lines,” IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 6, pp. 1222–1234, 2017.

[15] G. Zhang et al., “Cooperative relative navigation for space rendezvous and proximity operations using controlled active vision,” J. Field Robot., vol. 33, no. 2, pp. 205–228, 2016.

[16] X. Du et al., “Pose measurement of large non-cooperative satellite based on collaborative cameras,” Acta Astronaut., vol. 68, no. 11, pp. 2047–2065, 2011.

[17] J. Peng, W. Xu, and H. Yuan, “An efficient pose measurement method of a space non-cooperative target based on stereo vision,” IEEE Access, vol. 5, pp. 22344–22362, 2017.

[18] G. Lentaris et al., “HW/SW co-design and FPGA acceleration of visual odometry algorithms for rover navigation on Mars,” IEEE Trans. Circuits Syst. Video Technol., vol. 26, no. 8, pp. 1563–1577, 2016.

[19] B. Naasz et al., “The HST SM4 relative navigation sensor system: Overview and preliminary testing results from the flight robotics lab,” J. Astronaut. Sci., vol. 57, no. 1, pp. 457–483, Jan. 2009.

[20] M. Lourakis and X. Zabulis, “Model-based visual tracking of orbiting satellites using edges,” in Proc. IEEE/RSJ Intl. Conf. on Intell. Robot. Syst., 2017, pp. 3791–3796.

[21] R. Opromolla et al., “A review of cooperative and uncooperative spacecraft pose determination techniques for close-proximity operations,” Prog. Aerosp. Sci., vol. 93, pp. 53–72, 2017.

[22] S. Sharma and S. D’Amico, “Comparative assessment of techniques for initial pose estimation using monocular vision,” Acta Astronaut., vol. 123, pp. 435–445, 2016.

[23] L. Zhong, M. Lu, and L. Zhang, “A direct 3D object tracking method based on dynamic textured model rendering and extended dense feature fields,” IEEE Trans. Circuits Syst. Video Technol., 2017.

[24] X. Iturbe et al., “An integrated SoC for science data processing in next-generation space flight instruments avionics,” in 2015 IFIP/IEEE Intl. Conf. on Very Large Scale Integration (VLSI-SoC), 2015, pp. 134–141.

[25] D. Rudolph et al., “CSP: A multifaceted hybrid architecture for space computing,” in SSC14-III-3, 28th Annual AIAA/USU Conference on Small Satellites, 2014.

[26] Z. Zhang et al., “Visual-inertial odometry on chip: An algorithm-and-hardware co-design approach,” in Robotics: Science and Systems, 2017.

[27] J. Canny, “A computational approach to edge detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 8, no. 6, pp. 679–698, Nov. 1986.

[28] T. Moller and B. Trumbore, “Fast, minimum storage ray-triangle intersection,” J. Graph. Tools, vol. 2, no. 1, pp. 21–28, 1997.

[29] P. J. Rousseeuw, “Least median of squares regression,” J. Amer. Stat. Assoc., vol. 79, no. 388, pp. 871–880, 1984.

[30] Xillybus. (2018, Jul.) Xillybus FPGA designers guide. [Online]. Available: http://xillybus.com/downloads/doc/

[31] C.-L. Sotiropoulou et al., “Real-time machine vision FPGA implementation for microfluidic monitoring on lab-on-chips,” IEEE Trans. Biomed. Circuits Syst., vol. 8, no. 2, pp. 268–277, 2014.

[32] J. Lee, H. Tang, and J. Park, “Energy efficient Canny edge detector for advanced mobile vision applications,” IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 4, pp. 1037–1046, 2018.

[33] X. Wang, F. Guo, and M. Zhu, “A more efficient triangle rasterization algorithm implemented in FPGA,” in Intl. Conf. Audio, Language and Image Processing (ICALIP). IEEE, 2012, pp. 1108–1113.

[34] Z. Safarzik, E. Gervais, and M. Vucic, “Implementation of division-free perspective-correct rendering optimized for FPGA devices,” in Intl. Convention MIPRO, 2010, pp. 177–182.

[35] G. Hausmann et al., “VIBANASS (vision based navigation sensor system) system test results,” in Eur. Conf. Space Debris, vol. 723, 2013.

[36] T. Chabot et al., “Vision-based navigation experiment onboard the RemoveDebris mission,” in Intl. ESA Conf. on GNC Systems, 2017, pp. 1–23.

George Lentaris holds a Ph.D. in Computing from the National & Kapodistrian University of Athens (NKUA), Greece, as well as two M.Sc. degrees in “Logic, Algorithms, and Computation” and in “Electronic Automation”, with a B.Sc. in Physics. His research interests include parallel architectures and algorithms for DSP, digital circuit design, image processing, computer vision, video compression, and reliability of FPGA chips. He is currently a research associate at the National Technical University of Athens (NTUA), working on HW/SW co-design with single-, multi-, and SoC-FPGA platforms, for space applications for ESA.

Ioannis Stratakos received his M.Sc. degree in Control and Computing and B.Sc. degree in Physics from the National & Kapodistrian University of Athens (NKUA) in 2016 and 2013, respectively. Currently, he is working towards a PhD degree in the Microprocessors and Digital Systems Lab at the National Technical University of Athens (NTUA). His research interests include hardware/software co-design, implementation of hardware accelerators, reconfigurable computing and embedded systems design for applications in signal and image processing.

Ioannis Stamoulias received a M.Sc. degree in Microelectronics, specialized in Design of Integrated Circuits, from the National and Kapodistrian University of Athens (NKUA), Greece, and a B.Sc. degree in Computer Science and Telecommunications, in 2010 and 2007 respectively. Currently, he is a Ph.D. student at the Department of Informatics and Telecommunications of NKUA. His research interests include Digital Signal Processing Systems, Embedded Systems, Network-on-Chip, Hardware/Software co-design, Computer Vision, etc.

Dimitrios Soudris is an Associate Professor in the School of Electrical and Computer Engineering, National Technical University of Athens, Greece. He received the Ph.D. degree in Electrical Engineering from the University of Patras, Greece. His research interests include embedded systems design, reconfigurable architectures, reliability and low power VLSI design. He is leader and principal investigator in numerous research projects funded by the Greek Government and Industry, European Commission, and European Space Agency.

Manolis Lourakis is a principal researcher at the Institute of Computer Science of the Foundation for Research and Technology – Hellas (FORTH) in Heraklion, Greece. He holds a Ph.D. degree in Computer Science and has been a researcher with FORTH since 2002. His interests include geometric computer vision and topics related to robotics, motion tracking and analysis, registration and matching, multiple and single view geometry, camera(s) self-calibration, 3D reconstruction, 3D shape extraction and analysis, and continuous optimization.

Xenophon Zabulis is a principal researcher at the Institute of Computer Science, FORTH. He received his Ph.D. in Computer Science from the University of Crete, Greece, in 2001. From 2001 until 2003 he was a Postdoctoral Fellow at the GRASP and IRCS laboratories at the University of Pennsylvania, USA. During 2004 to 2007, he was a Research Fellow at the Institute of Informatics and Telematics – CERTH, Greece. His research interests include 3D reconstruction, pose estimation, medical image analysis and visual estimation of human motion.