Xcell Journal, Second Quarter 2013

XCELLENCE IN EMBEDDED VISION

Using OpenCV and Vivado HLS to Accelerate Embedded Vision Applications in the Zynq SoC



Computer vision has been a well-established discipline in academic circles for several years; many vision algorithms today hail from decades of research. But only recently have we seen the proliferation of computer vision in many aspects of our lives. We now have self-driving cars, game consoles that react to our every move, autonomous vacuum cleaners and mobile phones that respond to our gestures, among other vision products.

The challenge today is how to implement these and future vision systems efficiently while meeting strict power and time-to-market constraints. The Zynq™ All Programmable SoC can be the foundation for such products, in tandem with the widely used computer vision library OpenCV and the high-level synthesis (HLS) tools that accelerate critical functions in hardware. Together, this combination makes a powerful platform for designing and implementing Smarter Vision systems.

Embedded systems are ubiquitous in the market today. However, limitations in computing capabilities, especially when dealing with large picture sizes and high frame rates, have restricted their use in practical implementations for computer/machine vision. Advances in image sensor technologies have been essential in opening the eyes of embedded devices to the world so they can interact with their environment using computer vision algorithms. The combination of embedded systems and computer/machine vision constitutes embedded vision, a discipline that is fast becoming the basis for designing machines that see and understand their environments.

DEVELOPMENT OF EMBEDDED VISION SYSTEMS

Embedded vision involves running intelligent computer vision algorithms on a computing platform. For many users, a standard desktop-computing processing platform provides a conveniently accessible target. However, a general computing platform may not meet the requirements for producing highly embedded products that are compact, efficient and low in power when processing large image data sets, such as multiple streams of real-time HD video at 60 frames/second.

Figure 1 illustrates the flow that designers commonly employ to create embedded vision applications. The algorithm design is the most important step in this process, since the algorithm will determine whether we meet the processing and quality criteria for any particular computer vision task. At first, designers explore algorithm choices in a numerical-computing environment like MATLAB® in order to work out high-level processing options. Once they have determined the proper algorithm, they typically model it in a high-level language, generally C/C++, for fast execution and adherence to the final bit-accurate implementation.

Pairing Vivado HLS with the OpenCV libraries enables rapid prototyping and development of Smarter Vision systems targeting the Zynq All Programmable SoC.

by Fernando Martinez Vallina, HLS Design Methodology Engineer, Xilinx ([email protected])

José Roberto Alvarez, Engineering Director for Video Technology, Xilinx ([email protected])

System partitioning is an important step in the development process. Here, through algorithm performance analysis, designers can determine what portions of the algorithm they will need to accelerate in hardware given the real-time requirements for processing representative input data sets. It is also important to prototype the entire system on the target platform, so as to realistically measure performance expectations. Once the prototyping process indicates that a design has met all performance and quality targets, designers can then start implementing the final system in the actual targeted device. Finally, the last step is to test the design running on the chip in all use-case scenarios. When everything checks out, the team can release the final product.

ZYNQ SOC: SMARTEST CHOICE FOR EMBEDDED VISION

In the development of machine vision applications, it is very important for design teams to choose a highly flexible device. They need a computing platform that includes powerful general-purpose processing capabilities supporting a wide software ecosystem, along with robust digital signal-processing capabilities for implementing computationally demanding and memory-efficient computer vision algorithms. Tight integration at the silicon level is essential for implementing efficient and complete systems.

Xilinx® All Programmable SoCs are processor-centric devices that offer software, hardware and I/O programmability in a single chip. The Zynq SoC features an ARM® dual-core Cortex™-A9 MPCore™ processing system coupled with FPGA logic and key peripherals on a single device. As such, the device enables designers to implement extremely efficient embedded vision systems.

This level of integration between the processing subsystem, FPGA logic and peripherals in the Zynq SoC ensures faster data transfer speeds and a much lower power requirement and BOM cost than a system designed with individual components. It is feasible to implement systems in the Zynq SoC that require real-time processing for 1080p60 video sequences (1,920 x 1,080 RGB pictures at 60 frames/s) with processing capabilities in the hundreds of giga-operations per second.
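The arithmetic behind that bandwidth figure is easy to check. The short C++ sketch below is not from the article; the per-pixel operation count is an illustrative assumption (a few hundred operations per pixel for filters, gradients and comparisons), not a published number.

```cpp
#include <cstdint>

// Pixel rate of one video stream: width x height x frame rate.
uint64_t pixels_per_second(uint64_t w, uint64_t h, uint64_t fps) {
    return w * h * fps;
}

// Total compute requirement under an assumed per-pixel workload.
uint64_t ops_per_second(uint64_t w, uint64_t h, uint64_t fps,
                        uint64_t ops_per_pixel) {
    return pixels_per_second(w, h, fps) * ops_per_pixel;
}
```

For 1080p60, pixels_per_second(1920, 1080, 60) is 124,416,000; at an assumed 800 operations per pixel, that is roughly 100 giga-operations per second, consistent with the "hundreds of giga-operations" figure above once several pipeline stages or streams are involved.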

To take full advantage of the many features of the Zynq SoC, Xilinx provides the Vivado™ Design Suite, an IP- and system-centric design environment that increases designer productivity with the fast integration and implementation cycles needed to dynamically develop smarter embedded products. A component of that suite, Vivado HLS, allows you to take algorithms you’ve developed in C/C++ and compile them into RTL to run in the FPGA logic.

The Vivado HLS tool is particularly well-suited to embedded vision design. In this flow, you create your algorithms in C/C++; compile the algorithm or parts of the algorithm into RTL using Vivado HLS; and determine which functions are better suited to run in FPGA logic and which are better suited to run on the ARM processor. In this way, your design team can home in on the optimal performance for their vision systems running in the Zynq SoC.

Figure 1 – Embedded vision system development process (stages: Algorithm Design, Modeling, System Partitioning, Prototyping, Implementation, Release)

Figure 2 – High-level synthesis design flow: an algorithmic specification in C, C++ or SystemC passes through microarchitecture exploration to an RTL implementation in VHDL or Verilog, and then system IP integration; Vivado HLS accelerates algorithmic C-to-IP integration and is comprehensively integrated with the Xilinx design environment

To further help embedded vision developers create Smarter Vision systems, Xilinx has added to Vivado support for the OpenCV libraries of computer vision algorithms. Xilinx has also introduced the new IP Integrator tools and SmartCORE™ IP to support these kinds of designs (see cover story, page 8).

OPENCV MAKES COMPUTER VISION ACCESSIBLE

OpenCV provides a path to the development of intelligent computer vision algorithms that are predicated on real-time performance. The libraries provide designers with an environment for experimentation and fast prototyping of algorithms.

The design framework of OpenCV is supported on multiple platforms. But in many cases, making the libraries efficient for embedded products requires implementation on an embedded platform that is capable of accelerating demanding routines for real-time performance.

While OpenCV was designed with computational efficiency in mind, it originated in traditional computing environments with support for multicore processing. These computing platforms may not be optimal for an embedded application where efficiency, cost and power consumption are paramount.

FEATURES OF OPENCV

OpenCV is an open-source computer vision library released under a BSD license. This means that it is free to use in academic and commercial applications. It was originally designed to be computationally efficient on general-purpose multiprocessing systems, with a strong focus on real-time applications. In addition, OpenCV provides access to multiple programming interfaces like C/C++ and Python.

The advantage of an open-source project is that the user community is constantly improving algorithms and extending them for use in a wide variety of application domains. There are now more than 2,500 functions implemented in OpenCV; here are some examples:

• Matrix math

• Utilities and data structures

• General image-processing functions

• Image transforms

• Image pyramids

• Geometric descriptor functions

• Feature recognition, extraction and tracking

• Image segmentation and fitting

• Camera calibration, stereo and 3D processing

• Machine learning: detection, recognition

For a more detailed description of OpenCV, please go to opencv.org and opencv.willowgarage.com.

ACCELERATING OPENCV FUNCTIONS USING HLS

Once you have partitioned the architecture of an embedded vision system to single out computationally demanding portions, HLS tools can help you accelerate these functions while they are still written in C++. Vivado HLS makes use of C, C++ or SystemC code to produce an efficient RTL implementation.

Furthermore, the Vivado IP-centric design environment provides a wide range of processing IP SmartCOREs that simplify connections to imaging sensors, networks and other necessary I/O interfaces, easing the process of implementing those functions in the OpenCV libraries. This is a distinct advantage over other implementation alternatives, where there is a need to accelerate even the most fundamental OpenCV I/O functionality.

WHY HIGH-LEVEL SYNTHESIS?

The Vivado HLS compiler from Xilinx is a software compiler designed to transform algorithms implemented in C, C++ or SystemC into optimized RTL for a user-defined clock frequency and device in the Xilinx product portfolio. It has the same core technology underpinnings as a compiler for an x86 processor in terms of interpretation, analysis and optimization of C/C++ programs. This similarity enables a rapid migration from a desktop development environment into an FPGA implementation.

Figure 3 – Motion-detection example from the OpenCV library of algorithms: the current and previous frames feed an OpenCV motion-detection algorithm, which marks a newly detected car in the output frame

The default behavior of Vivado HLS will generate an RTL implementation, without the need for user input, after you have selected the target clock frequency and device. In addition, Vivado HLS, like any other compiler, has optimization levels. Since the final execution target of the algorithm is a tailor-made microarchitecture, the level of optimization possible in Vivado HLS is finer-grained than in a traditional compiler. The concept of O1 through O3 optimizations typical in software design for a processor is replaced with architecture-exploration directives. These directives draw on the expertise of the user to guide Vivado HLS in creating the best possible implementation for a particular algorithm in terms of power, area and performance.

The user design flow for the HLS compiler is shown in Figure 2. At a conceptual level, the user provides a C/C++/SystemC algorithmic description and the compiler generates an RTL implementation. The transformation of program code into RTL is divided into four major stages: algorithm specification, microarchitecture exploration, RTL implementation and IP packaging.

The algorithmic-specification stage refers to the development of the software application that will be targeted to the FPGA fabric. This specification can be developed in a standard desktop software-development environment and can make complete use of Xilinx-provided software libraries such as OpenCV. In addition to enabling a software-centric development flow, Vivado HLS elevates the verification abstraction from RTL to C/C++. The user can carry out complete functional verification of the algorithm using the original software. After RTL generation through Vivado HLS, the generated design is analogous to processor assembly code generated by a traditional software compiler. The user can debug at this assembly-code level, but is not required to do so.

While Vivado HLS can handle almost all C/C++ code targeted at other software compilers, there is one restriction placed on the code. For compilation into an FPGA using Vivado HLS, the user code cannot include any run-time dynamic memory allocations. Unlike a processor, where the algorithm is bounded by a single memory architecture, an FPGA implementation has an algorithm-specific memory architecture. By analyzing the usage patterns of arrays and variables, Vivado HLS can determine the physical-memory layout and memory types that will best fit the storage and bandwidth requirements of an algorithm. The only requirement for this analysis to work is that you explicitly describe all memory that an algorithm consumes in the C/C++ code in the form of arrays.

Figure 4 – Motion detection on the Zynq SoC using the ARM processor
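As a concrete illustration of this restriction, the hypothetical sketch below (not from the article) shows the HLS-friendly pattern: working storage is declared as a statically sized array with a compile-time bound, rather than allocated at run time, so the tool can map it onto a physical memory such as block RAM.

```cpp
#include <cassert>
#include <cstdint>

const int MAX_WIDTH = 1920; // worst-case line width, known at compile time

// HLS-unfriendly version (not synthesizable): the memory requirement is
// hidden behind a run-time allocation.
//     uint8_t *line = new uint8_t[width];
//
// HLS-friendly version: the line buffer is an array with a static bound,
// so the compiler can size and place a physical memory for it.
void blur3(const uint8_t in[MAX_WIDTH], uint8_t out[MAX_WIDTH], int width) {
    uint8_t line[MAX_WIDTH]; // algorithm-specific, statically sized storage
    for (int x = 0; x < width; ++x)
        line[x] = in[x];
    // simple 1 x 3 averaging filter over the buffered line
    for (int x = 1; x < width - 1; ++x)
        out[x] = (line[x - 1] + line[x] + line[x + 1]) / 3;
    // border pixels are passed through unchanged
    out[0] = line[0];
    out[width - 1] = line[width - 1];
}
```

The filter itself is arbitrary; the point is that every byte the function touches is visible to the compiler as an array of known size.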

The second step in the transformation from a C/C++ implementation into an optimized FPGA implementation is microarchitecture exploration. At this stage, you apply Vivado HLS compiler optimizations to test out different designs for the right mix of area and performance. You can implement the same C/C++ code at different performance points without having to modify the source code. The Vivado HLS compiler optimizations, or directives, are the means by which you state the desired performance of different portions of an algorithm.
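As a hypothetical illustration of that idea, the loop below could be pushed to different performance points purely through directives, with the C code untouched. PIPELINE and UNROLL are real Vivado HLS directive names; they are shown here as comments so the sketch also compiles as ordinary C++.

```cpp
#include <cstdint>

// One loop body, two candidate microarchitectures chosen by directives:
//   Option A: #pragma HLS PIPELINE II=1   -> one pixel per clock cycle
//   Option B: #pragma HLS UNROLL factor=8 -> eight pixels per clock,
//             at the cost of eight multipliers instead of one
void scale_pixels(const uint8_t in[64], uint8_t out[64], uint8_t gain) {
    for (int i = 0; i < 64; ++i) {
        int v = in[i] * gain;               // multiply by a gain factor
        out[i] = v > 255 ? 255 : static_cast<uint8_t>(v); // saturate to 8 bits
    }
}
```

Either directive leaves the functional behavior identical; only the area/performance trade-off of the generated RTL changes.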

The final stage in the Vivado HLS compiler flow is RTL implementation and IP packaging. These are automatic stages inside the Vivado HLS compiler that do not require RTL knowledge from users. The details of optimized RTL creation for different devices in the Xilinx product portfolio are built into the compiler. At this stage, the tool is, for all intents and purposes, a pushbutton utility that has been thoroughly tested and verified to produce timing-driven and FPGA fabric-driven RTL. The output from the Vivado HLS compiler is automatically packaged in formats accepted by other Xilinx tools, such as IP-XACT. Therefore, there are no additional steps in using an HLS-generated IP core in Vivado.

The OpenCV libraries from Xilinx provide a shortcut in the process of design optimization using Vivado HLS. These libraries have been precharacterized to yield functions capable of pixel-rate processing at 1080p resolutions. The optimization knowledge required to guide the Vivado HLS compiler is embedded into these libraries. Thus, you are free to iterate quickly from an OpenCV concept application in a desktop environment to a running OpenCV application on the Zynq SoC, which enables operations on the ARM processor and the FPGA fabric.

Figure 3 shows an overview of a motion-detection application developed in OpenCV. The goal of this design is to detect moving objects in a video stream by comparing the current image frame with the previous one. The first stage in the algorithm is to detect the edges in both frames. This data-reduction operation makes it easier to analyze the relative change between consecutive frames. Once the edge information has been extracted, the edges are compared to detect those that appear in the current image but not in the previous one. These newly detected edges create a movement-detection mask image. Before the results of the new edge detection can be highlighted on the current image, you must take into account the effects of image-sensor noise. This noise, which can vary from frame to frame, can lead to random false edges in the motion-detection mask image. Therefore, you must filter this image to reduce the impact of noise on the quality of the algorithm.

Figure 5 – Motion detection on the Zynq SoC using the programmable fabric
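The edge-comparison steps just described can be sketched in plain C++. This is a hypothetical scalar version for illustration only: the article's actual design uses OpenCV functions and a proper edge filter, whereas this sketch substitutes a crude horizontal-gradient threshold.

```cpp
#include <array>
#include <cstdint>
#include <cstdlib>

const int W = 8, H = 8;               // tiny frame for illustration
using Frame = std::array<uint8_t, W * H>;

// Stand-in edge detector: mark a pixel as an edge (255) when the
// horizontal gradient exceeds a threshold.
Frame edges(const Frame &f, int thresh) {
    Frame e{};
    for (int y = 0; y < H; ++y)
        for (int x = 1; x < W; ++x) {
            int g = std::abs(int(f[y * W + x]) - int(f[y * W + x - 1]));
            e[y * W + x] = g > thresh ? 255 : 0;
        }
    return e;
}

// Movement-detection mask: edge pixels present in the current frame
// but absent from the previous frame.
Frame motion_mask(const Frame &cur, const Frame &prev, int thresh) {
    Frame ec = edges(cur, thresh), ep = edges(prev, thresh), m{};
    for (int i = 0; i < W * H; ++i)
        m[i] = (ec[i] == 255 && ep[i] == 0) ? 255 : 0;
    return m;
}
```

An object that appears between frames produces fresh edges in the current frame only, so those pixels survive into the mask; static background edges cancel out.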

Noise reduction for this application is accomplished through the use of a 7x7 median filter on the movement-detection mask image. The central idea of a median filter is to take the middle value in a 7x7 window of neighboring pixels. The filter then reports back the median value as the final value for the center pixel of the window. After noise reduction, the movement-detection mask image is combined with the live input image to highlight moving edges in red.
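As a reference for what the filter computes, here is a straightforward software sketch in plain C++. It is not the article's HLS-optimized library version; it simply gathers each K x K neighborhood (K = 7 in the article), clamps at the image borders, and keeps the middle value, which discards isolated false-edge pixels.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Median filter over a k x k window (k odd), border pixels clamped.
std::vector<uint8_t> median_filter(const std::vector<uint8_t> &img,
                                   int w, int h, int k) {
    const int r = k / 2;
    std::vector<uint8_t> out(img.size());
    std::vector<uint8_t> window;
    window.reserve(k * k);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            window.clear();
            for (int dy = -r; dy <= r; ++dy)
                for (int dx = -r; dx <= r; ++dx) {
                    int yy = y + dy, xx = x + dx;          // neighbor coords
                    if (yy < 0) yy = 0; if (yy >= h) yy = h - 1; // clamp rows
                    if (xx < 0) xx = 0; if (xx >= w) xx = w - 1; // clamp cols
                    window.push_back(img[yy * w + xx]);
                }
            // middle element of the sorted k*k window is the median
            std::nth_element(window.begin(),
                             window.begin() + window.size() / 2, window.end());
            out[y * w + x] = window[window.size() / 2];
        }
    return out;
}
```

A single noise pixel cannot outvote its neighborhood: with one hot pixel in a window of 49 (or even 9) values, the median stays at the background level, which is exactly why the filter suppresses random false edges.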

You can fully implement the application to run on the ARM processing subsystem, with the source-code-to-Zynq SoC mapping shown in Figure 4. The only hardware elements in this implementation are for the “cvgetframe” and “showimage” functions. These video I/O functions are implemented using Xilinx video I/O subsystems in the FPGA fabric. At the time of the cvgetframe function call, the input side of the video I/O subsystem handles all the details, grabbing and decoding a video stream from an HDMI interface and placing the pixel data into DDR memory. At the time of showimage, this subsystem handles the transfer of pixel data from DDR memory into a video display controller in order to drive a TV or other HDMI-compliant video display device.

Vivado HLS-optimized OpenCV libraries for hardware acceleration enable the porting of the code in Figure 4 into a real-time 60-frame/s pixel-processing pipeline on the FPGA fabric. These OpenCV libraries provide foundational functions for the elements in OpenCV that require hardware acceleration. Without hardware acceleration (that is, if you were to run all the code in the ARM processors only), this algorithm has a throughput of merely one frame every 13 seconds (0.076 frames/s); the accelerated pipeline is therefore roughly 790 times faster. Figure 5 shows the new mapping of the application after Vivado HLS compilation. Note that the video I/O mapping of the original system is reused. The computational core of the algorithm, which was previously executing on the ARM processor, is compiled into multiple Vivado HLS-generated IP blocks. These blocks, which are connected to the video I/O subsystem in Vivado IP Integrator, are optimized for 1080p-resolution video processing at 60 frames/s.

The All Programmable environment provided by the Zynq SoC and the Vivado Design Suite is well-suited for the design, prototyping and testing of embedded vision systems at the high data-processing rates required by the latest high-definition video technologies. Using the open-source set of libraries included in OpenCV is the best way to implement demanding computer vision applications in short development times. Since the OpenCV libraries are written in C++, we use Vivado HLS to create source code that can be efficiently translated to hardware RTL in the Zynq SoC FPGA fabric and used as convenient processing accelerators, without sacrificing the flexibility of the design environment originally envisioned by OpenCV.

For more information on creating Smarter Vision designs with the Vivado Design Suite, visit http://www.xilinx.com/products/design-tools/vivado/index.htm.
