SDSoC Programmers Guide UG1278 (v2018.2) July 2, 2018



SDSoC Programmers Guide

UG1278 (v2018.2) July 2, 2018


Revision History

The following table shows the revision history for this document.

Date        Version  Revision Summary
07/02/2018  2018.2   Initial Xilinx release.


UG1278 (v2018.2) July 2, 2018 · www.xilinx.com · SDx Programmers Guide


Table of Contents

Revision History

Chapter 1: Introduction to Programming with SDSoC
  Software Acceleration with SDSoC
  Execution Model of an SDSoC Application
  SDSoC Build Process
  SDSoC Programming Flow Overview

Chapter 2: Programming the Embedded Processor Application
  Memory Allocation
  Sequential/Parallel Accelerator Execution
  Validating the Software to Hardware Conversion
  Performance Estimation

Chapter 3: Programming Hardware Functions
  Coding Guidelines
  Data Movers
  The Function Body
  Datatypes
  Array Configuration
  Loops
  Dataflow Optimization

Chapter 4: Using External I/O
  Accessing External I/O using Memory Buffers
  Accessing External I/O using Direct Hardware Connections

Appendix A: Exporting a Library for GCC
  Building a Shared Library
  Compiling and Linking Against a Library
  Exporting a Shared Library

Appendix B: SDSoC Environment API

Appendix C: Additional Resources and Legal Notices
  Xilinx Resources
  Documentation Navigator and Design Hubs
  References
  Training Resources
  Please Read: Important Legal Notices


Chapter 1

Introduction to Programming with SDSoC

The SDSoC™ environment provides tools for developing embedded systems targeting the Xilinx® Zynq®-7000 SoC and Zynq® UltraScale+™ MPSoC, or MicroBlaze™ embedded processors on Xilinx devices.

It includes:

• An Eclipse-based integrated development environment (IDE) with compilers, debuggers, and profilers for Arm® and MicroBlaze processors.

• A hardware emulator.

• A hardware compiler that synthesizes C/C++ functions into optimized hardware functions to be used in the programmable logic (PL).

• A system compiler that generates complete hardware/software systems, including custom hardware accelerators and data mover hardware blocks (e.g., DMA engines), from application code written in the C/C++ programming languages.

The sdscc/sds++ (referred to as sds++) system compiler provides options to perform hardware/software event tracing, which provides detailed timeline visibility into accelerator tasks running in hardware, data transfers between accelerators and memory, and application code running on the CPUs.

Xilinx FPGAs and SoC devices offer many advantages, including a programmable hardware architecture for implementing custom data paths, multi-level distributed memory architectures, and interfacing to custom input/output devices, with full customizability of the hardware/software interface to high-performance embedded CPUs. By building a custom hardware system, you can achieve higher performance and lower power dissipation for your embedded applications. What is unique about the SDSoC environment is that it lets the programmer create hardware and software while working within familiar software development workflows, including cross-compiling, linking, profiling, debugging, and running application binaries on target hardware and in an emulator. Using the sds++ system compiler, you can target parts of your application to be implemented as hardware accelerators running many times faster than optimized code running on a processor.


The programmer's view of the target device is heterogeneous computing, where code written in C/C++ runs on multi-core Arm CPUs as well as in custom hardware accelerators, typically with a non-uniform memory architecture and custom interfaces to input/output devices. Paying more attention to where code will run, how data is mapped into memory, and how hardware and software interact allows for better performance of your application. In general, application code should reflect the heterogeneity of the target system. Take into consideration that C/C++ code compiled into hardware accelerators benefits from programming idioms that reflect microarchitecture details, while code running on CPUs benefits from idioms that reflect the instruction set, cache, and memory architecture. When working in the SDSoC environment, the hardware/software interface between the CPU and hardware accelerators is described through function calls and APIs specific to the underlying devices. The majority of the code accesses accelerators through function calls rather than device driver APIs, with the sds++ system compiler generating highly efficient access from userspace, automatically managing low-level considerations like cache management through custom drivers provided by the system compiler.


Software Acceleration with SDSoC

When compared with processor architectures, the structures that comprise the programmable logic (PL) fabric in a Xilinx® device enable a high degree of parallelism in application execution. The custom processing architecture generated by the sdscc/sds++ (referred to as sds++) for a hardware function in an accelerator presents a different execution paradigm from CPU execution, and provides an opportunity for significant performance gains. While you can retarget an existing embedded processor application for acceleration on programmable logic, writing your application to use libraries of existing hardware functions, such as modifying your code to better use the device architecture or using the Xilinx xfOpenCV library, yields significant performance gains and power reduction.

CPUs have fixed resources and offer limited opportunities for parallelization of tasks or operations. A processor, regardless of its type, executes a program as a sequence of instructions generated by processor compiler tools, which transform an algorithm expressed in C/C++ into assembly language constructs that are native to the target processor. Even a simple operation, like the addition of two values, results in multiple assembly instructions that must be executed across multiple clock cycles. This is why software engineers restructure their algorithms to increase the cache hit rate and decrease the processor cycles used per instruction.

An FPGA is an inherently parallel processing device capable of implementing any function that can run on a processor. Xilinx SoC devices have an abundance of resources that can be programmed and configured to implement any custom architecture and achieve virtually any level of parallelism. Unlike a processor, where all computations share the same ALU, the FPGA programming fabric acts as a blank canvas to define and implement your acceleration functions. The FPGA compiler creates a unique circuit optimized for each application or algorithm; for example, only implementing multiply and accumulate hardware for a neural net, not a whole ALU.

The sds++ system compiler exercises the capabilities of the FPGA fabric through the automatic insertion of data movers and accelerator control IP, creation of accelerator pipelines, and invoking the Vivado® HLS tool, which employs the processes of scheduling, pipelining, and dataflow:

• Scheduling: The process of identifying the data and control dependencies between different operations to determine when each will execute. The compiler analyzes dependencies between adjacent operations as well as across time, and groups operations to execute in the same clock cycle when possible, or to overlap the function calls as permitted by the dataflow dependencies.


• Pipelining: A technique to increase instruction-level parallelism in the hardware implementation of an algorithm by overlapping independent stages of operations or functions. The data dependence in the original software implementation is preserved for functional equivalence, but the required circuit is divided into a chain of independent stages. All stages in the chain run in parallel on the same clock cycle. Pipelining is a fine-grain optimization that eliminates CPU restrictions requiring the current function call or operation to fully complete before the next can begin.

• Dataflow: Enables multiple functions implemented in the FPGA to execute in a parallel and pipelined manner instead of sequentially, implementing task-level parallelism. The compiler extracts this level of parallelism by evaluating the interactions between different functions of a program based on their inputs and outputs.
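The pipelining and dataflow concepts above can be seen in code like the following minimal sketch (the function, names, and sizes are illustrative, not from the guide). The HLS pragmas are directives consumed by the Vivado HLS tool; a standard C++ compiler ignores them, so the function also executes as ordinary software:

```cpp
#include <cstddef>

// Illustrative two-stage computation. In hardware, PIPELINE lets each
// loop start one iteration per clock (II=1), and DATAFLOW lets the two
// loops run as concurrent streaming tasks. Assumes n <= 64.
void scale_then_add(const int *a, const int *b, int *out, std::size_t n) {
#pragma HLS DATAFLOW
    static int tmp[64];
    // Stage 1: scale -- a candidate for its own pipelined hardware stage.
    for (std::size_t i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        tmp[i] = a[i] * 2;
    }
    // Stage 2: add -- with DATAFLOW, may overlap stage 1 in hardware.
    for (std::size_t i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = tmp[i] + b[i];
    }
}
```

In software the stages simply run back to back; the pragmas only change how Vivado HLS schedules the generated circuit.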

Execution Model of an SDSoC Application

The execution model for an SDSoC™ application can be understood in terms of the normal execution of a C++ program running on the target CPU after the platform has booted. It is useful for the programmer to be aware of how a C++ binary executable interfaces to hardware.

The set of declared hardware functions within a program is compiled into hardware accelerators that are accessed with the standard C runtime through calls into these functions. Each hardware function call in effect invokes the accelerator as a task, and each of the arguments to the function is transferred between the CPU and the accelerator, accessible by the program after accelerator task completion. Data transfers between memory and accelerators are accomplished through data movers: either a direct memory access (DMA) engine automatically inserted into the system by the sds++ system compiler, or the hardware accelerator itself (such as with the zero_copy data mover).


Figure 1: Architecture of an SDSoC System

To ensure program correctness, the system compiler intercepts each call to a hardware function and replaces it with a call to a generated stub function that has an identical signature, but with a derived name. The stub function orchestrates all data movement and accelerator operation, synchronizing software and accelerator hardware at the exit of the hardware function call. Within the stub, all accelerator and data mover control is realized through a set of send/receive APIs provided by the sds_lib library.

When program dataflow between hardware function calls involves array arguments that are not accessed after the function calls have been invoked within the program (other than destructors or free() calls), and when the hardware accelerators can be connected via streams, the system compiler will transfer data from one hardware accelerator to the next through direct hardware stream connections rather than implementing a round trip to and from memory. This optimization can result in significant performance gains and a reduction in hardware resources.
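The coding pattern that enables this optimization can be sketched as follows (a hypothetical pair of hardware functions; names and sizes are illustrative). Because the intermediate array is produced by one call and consumed only by the next, never touched by the CPU in between, the system compiler can connect the two accelerators with a direct hardware stream instead of routing the data through memory:

```cpp
// Hypothetical hardware functions -- stand-in bodies for illustration.
void stage1(const int in[32], int out[32]) {
    for (int i = 0; i < 32; ++i) out[i] = in[i] + 1;
}
void stage2(const int in[32], int out[32]) {
    for (int i = 0; i < 32; ++i) out[i] = in[i] * 3;
}

void run_chain(const int *src, int *dst) {
    int tmp[32];
    stage1(src, tmp);  // tmp is never read by the CPU...
    stage2(tmp, dst);  // ...only forwarded to the next hardware function
}
```

If the program later read or wrote tmp between the two calls, the compiler would have to fall back to a round trip through memory.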

At a high level, the SDSoC execution model of a program includes the following steps.

1. Initialization of the sds_lib library occurs during the program's constructor before entering main().


2. Within a program, every call to a hardware function is intercepted by a function call into a stub function with the same function signature (other than name) as the original function. Within the stub function, the following steps occur:

a. A synchronous accelerator task control command is sent to the hardware.

b. For each argument to the hardware function, an asynchronous data transfer request is sent to the appropriate data mover, with an associated wait() handle. A non-void return value is treated as an implicit output scalar argument.

c. A barrier wait() is issued for each transfer request. If a data transfer between accelerators is implemented as a direct hardware stream, the barrier wait() for this transfer occurs in the stub function for the last in the chain of accelerator functions for this argument.

3. Cleanup of the sds_lib library occurs during the program's destructor upon exiting main().

TIP: Steps 2a-c ensure that program correctness is preserved at the entrance and exit of accelerator pipelines, while enabling concurrent execution within the pipelines.

Sometimes the programmer has insight into potential concurrent execution of accelerator tasks that cannot be automatically inferred by the system compiler. In this case, the sds++ system compiler supports a #pragma SDS async(ID) that can be inserted immediately preceding a call to a hardware function. This pragma instructs the compiler to generate a stub function without any barrier wait() calls for data transfers. As a result, after issuing all data transfer requests, control returns to the program, enabling concurrent execution of the program while the accelerator is running. In this case, it is the programmer's responsibility to insert a #pragma SDS wait(ID) within the program at appropriate synchronization points, which are resolved into sds_wait(ID) API calls to correctly synchronize hardware accelerators, their implicit data movers, and the CPU.

IMPORTANT!: Every async(ID) pragma requires a matching wait(ID) pragma.
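The async/wait pattern looks like the following sketch. The accelerator function name (mmult_accel) and its stand-in body are made up for illustration; when compiled with a standard C++ compiler the SDS pragmas are ignored and the calls simply execute sequentially in software, while sds++ would run the two tasks concurrently in hardware:

```cpp
// Stand-in body for a hypothetical hardware function.
void mmult_accel(const int a[16], const int b[16], int c[16]) {
    for (int i = 0; i < 16; ++i) c[i] = a[i] * b[i];
}

void run_two_tasks(const int *a, const int *b, int *c0, int *c1) {
#pragma SDS async(1)
    mmult_accel(a, b, c0);  // launch task 1; control returns immediately
#pragma SDS async(2)
    mmult_accel(b, a, c1);  // launch task 2; both accelerators now run

    // ...unrelated CPU work could overlap with the accelerators here...

#pragma SDS wait(1)         // block until task 1's transfers complete
#pragma SDS wait(2)         // block until task 2's transfers complete
}
```

Note that each async(ID) is placed immediately before its hardware function call, and each has a matching wait(ID), as the note above requires.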


SDSoC Build Process

The SDSoC™ environment offers all of the features of a standard software development environment: optimized cross-compilers for the embedded processor application and the hardware functions, a robust debugging environment to help you identify and resolve issues in the code, and performance profilers to let you identify the bottlenecks and optimize your code. Within this environment the SDSoC build process uses a standard compilation and linking process. Similar to g++, the sds++ system compiler invokes sub-processes to accomplish compilation and linking.

As shown in the image below, compilation is extended not only to object code that runs on the CPU, but also includes compilation and linking of hardware functions into IP blocks using the Vivado® HLS tool, and creating standard object files (.o) using the target CPU toolchain. System linking consists of program analysis of caller/callee relationships for all hardware functions, and generation of an application-specific hardware/software network to implement every hardware function call. The sds++ system compiler invokes all necessary tools, including Vivado HLS (function compiler), the Vivado® Design Suite to implement the generated hardware system, and the Arm® compiler and linker to create the application binaries that run on the CPU, invoking the accelerator stubs for each hardware function, and outputs a complete bootable system for an SD card.

Figure 2: SDSoC Build Process


• The compilation process includes the following tasks:

○ Analyze the code and run a compilation for the main application running on the Arm processor, and a separate compilation for each of the hardware accelerators.


○ The application code is compiled through standard GNU Arm compilation tools with an object (.o) file produced as final output.

○ The hardware accelerated functions are run through the Vivado® HLS tools, to start the process of custom hardware creation, with an object (.o) file as output.

• After compilation, the linking process includes the following tasks:

○ Analyze the data movement through the design, and modify the hardware platform to accept the accelerators.

○ Implement the hardware accelerators into the programmable logic (PL) region using the Vivado® Design Suite to run synthesis and implementation, and generate the bitstream for the device.

○ Update the software images with hardware access APIs, to call the hardware functions from the embedded processor application.

○ Produce an integrated SD card image that can boot the board with the application in an ELF file.

Build Targets

As an alternative to building a complete system, you can create an emulation model that consists of the same platform and application binaries. In this target flow, the sds++ system compiler creates a simulation model using the source files for the accelerator functions.

The SDSoC environment provides two different build targets: an emulation target used for debug and validation purposes, and the system hardware target used to generate the actual FPGA binary:

• System Emulation: With system emulation you can debug RTL-level transactions in the entire system (PS/PL). Running your application on the SDSoC emulator (sdsoc_emulator) gives you visibility of data transfers with a debugger. You can debug system hangs, and can inspect associated data transfers in the simulation waveform view, which gives you visibility into signals on the hardware blocks associated with the data transfer.

• Hardware: During hardware execution, you use the actual hardware platform to run the accelerated hardware functions. The difference between a debug system configuration and the final build of the application code and hardware functions is the inclusion of special debug logic in the platform, such as System ILAs, VIO debug cores, and AXI performance monitors for debug purposes.


SDSoC Programming Flow Overview

Developing an embedded system follows the typical steps of code development, compiling and linking for the platform/device, profiling the system for performance, and measuring the actual performance. The SDSoC™ environment follows this standard software-centric flow, but also supports a more hardware-centric flow for defining hardware functions first, and then integrating those into the embedded application. What is unique about these two flows is that they form a heterogeneous programming model, meaning that writing code for the CPU side of the system is going to be different from writing the code for the programmable logic.

The software-centric approach focuses on the embedded processor application, and the acceleration of specific software functions into hardware functions running in the programmable logic (PL) region of the Xilinx® device. This requires converting the C or C++ code of the software function into a Hardware Description Language (HDL) that can be compiled for the programmable logic using Vivado® HLS.

A typical accelerated hardware function would be processor intensive (e.g., complex computations that take a long time), processing lots of data. This code should be written so that data transfers are limited to streaming data into and from the accelerator, and should leverage instruction-level parallelism and task-level parallelism to take advantage of the massively parallel architecture of the programmable logic region of the device. The goal is to use parallelism to achieve the desired performance for accelerated functions: the accelerator should consume input data, process it, and output data as quickly as possible.

The performance of the accelerator is usually measured against the initiation interval (II), or the number of clock cycles before the function can accept new input data. The ideal would be one clock cycle. However, if the accelerator requires multiple clock cycles to input and process data, then achieving II=1 may not be possible, and it would also require more hardware resources for either buffering the input data, or for partitioning the data.
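The effect of II on throughput follows standard pipeline arithmetic: for a loop of T iterations through a pipeline of depth D with initiation interval II, the total latency is approximately D + II × (T − 1) cycles. A small helper (illustrative arithmetic, not an SDSoC API) makes the trade-off concrete:

```cpp
// Approximate cycle count for a pipelined loop: the first result takes
// the full pipeline depth, then one new result completes every II cycles.
long pipelined_cycles(long depth, long ii, long trip_count) {
    return depth + ii * (trip_count - 1);
}
```

For example, a 1024-iteration loop through a 5-stage pipeline takes roughly 1028 cycles at II=1, but about 5120 cycles at II=5, which is why lowering II is usually the first optimization target.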

Once the processor and accelerator code is written, it can be compiled for emulation, or compiled and linked for the hardware platform. For emulation, the code compiles faster, allowing for quick design iterations where it can be used to estimate performance as well as to check data integrity, but it runs slower than on actual hardware. Emulation is very accurate to what will execute on the hardware, since the same CPU code will run both on QEMU and the target device. During emulation, the communication to/from accelerators, and accelerator workloads, can be measured and monitored using specialized APIs in sds_lib.h, or using the tracing feature of the SDSoC environment, which will measure transactional information between the CPU and accelerator(s) and provide a visual representation of these transactions. A build for the hardware platform runs exactly what is written for the processor and for the hardware accelerators. The benefits of running on hardware are measuring actual runtime, as well as being able to adjust the builds later for in-circuit debugging or performance analysis.


The hardware-centric approach is used by designers experienced with developing on an FPGA. This approach lets users control what functionality will be in the accelerator, and how data/commands will be transported between the logic and the CPU. This flow uses the Vivado Design Suite to create customized IP (Intellectual Property) containing AXI interfaces that are used to communicate between the PL region and the processor (PS). This IP can then be packaged with the sdx_pack command, mapping the IP's AXI interfaces to a header file to create a static library. Using the resulting include file and static library is then as simple as calling a typical library function. The key is making sure that the header file's data widths match what is expected by the IP. Refer to the SDSoC Environment User Guide (UG1027) for more information on creating and using C-Callable IP.


Chapter 2

Programming the Embedded Processor Application

Creating an application for an embedded processor in the SDSoC™ environment is similar to creating an application for any other SoC or embedded platform. However, there are some added considerations when accelerating an embedded processor application brought over from SDK, for example. Programming for the SDSoC environment should include the following tasks:

• Identifying the appropriate function(s) in the processor application for acceleration in the programmable logic (PL) region.

• Allocating memory for the embedded processor application code and software functions running on the processing system (PS), and for the accelerated function(s) running on the PL regions of the device.

• Enabling task-level parallelism with multiple accelerators running concurrently to optimize system performance.

• Validating the software-to-hardware function code conversion, to ensure things work as intended.

To identify the functions that should be accelerators, or turned into hardware functions, you should determine what kind of computational load would be required. For example, functions where large amounts of data are computed, or modified, would be good candidates for hardware functions. However, functions written for a typical processor application may not benefit from hardware acceleration in the SDSoC environment, and might need to be restructured to get real performance improvements from acceleration.

Although you might not typically allocate memory for standard processor applications, leaving it to the compiler to define, when programming for hardware acceleration you should manually define the memory allocation in the code. The hardware functions require physically contiguous memory to meet the performance requirements of the hardware.

In addition, when you have defined more than one software function for acceleration, you can also manage the scheduling of these accelerators, deciding if they can run concurrently, or need to be sequential. Understanding and managing the dataflow between the hardware functions and the processor application is a key element of this process.


Finally, while converting software functions into hardware functions, you should test early and test often by checking the results of the hardware function's algorithm. Validate the data returned by the hardware function against the expected results returned by the original software function. Restructuring code into hardware functions may yield better performance, but you will want to check it for equivalent results.
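The test-early approach above amounts to running the restructured (hardware-bound) function and the original software version on the same inputs and comparing their outputs. A minimal sketch, with illustrative names and a stand-in body for the hardware-bound version (assumes n <= 64):

```cpp
#include <cstddef>

// Original software version: the golden reference.
void vadd_sw(const int *a, const int *b, int *out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// Restructured version intended for hardware (stand-in body here).
void vadd_hw(const int *a, const int *b, int *out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// Returns the number of mismatches between the two implementations.
std::size_t validate(const int *a, const int *b, std::size_t n) {
    int golden[64], result[64];
    vadd_sw(a, b, golden, n);
    vadd_hw(a, b, result, n);
    std::size_t errors = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (golden[i] != result[i]) ++errors;
    return errors;
}
```

Keeping a harness like this in the project lets you re-check equivalence after every restructuring step, in both emulation and hardware builds.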

As for the remaining application functions, it is a matter of determining whether the application will be running on Linux, FreeRTOS, or standalone (bare-metal). Each type has its own pros and cons; for example, standalone is the easiest to use, since only the Arm® processor runs the application host, but features that are only available on Linux or FreeRTOS cannot be used.

Memory Allocation

Knowing what data is going to be processed by the accelerators can help you write the application code to better allocate the memory being used. Generally, malloc/free will work in the main function, but allocating memory specific to an accelerator using sds_alloc/sds_free will yield better performance, because the data is allocated in physically contiguous memory that yields faster reads and writes. Although you can use malloc, or simply leave allocation to the compiler, you will experience performance problems from the resulting scatter-gather approach to memory access. It is strongly recommended that you allocate memory using sds_alloc for data going to the hardware functions.

The types of memory used are classified as contiguous/non-contiguous, and cacheable/non-cacheable. In contiguous memory, all the elements of an array are allocated physically next to each other, allowing for faster access times (think sequential reads/writes). Using non-cacheable memory means that the data being transferred is not intended to be used by the PS, allowing for higher transaction speeds. When using a cached memory allocation, there is a performance hit for flushing the cache, as well as CPU access latencies.

You must allocate the data before calling the hardware function. The runtime sets up data movers for you, with consideration for how memory is allocated. For example, in a matrix multiplication design that contains 1024 elements (32x32), you must explicitly allocate memory for the hardware function in the main function. The following code directs the compiler to allocate memory on the heap in a physically contiguous, cacheable fashion:

int *MatA = (int*)sds_alloc(1024*sizeof(int));
int *MatB = (int*)sds_alloc(1024*sizeof(int));
int *MatC = (int*)sds_alloc(1024*sizeof(int));

Allocating the memory on the heap allows a lot more data to be processed, and with better performance. Once execution of this code is complete, you can release the memory using sds_free.
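Because sds_alloc is only available when building with sds++, a common portability trick (a sketch, not part of the SDSoC API; the ACCEL_ALLOC/ACCEL_FREE macro names and alloc_matrix helper are hypothetical) is to key the allocator on the __SDSCC__ predefined macro, so the same application code also builds with an ordinary host compiler for software-only testing:

```cpp
#include <cstdlib>

// Sketch: fall back to malloc/free when not compiled by sds++, so the
// same application code builds on a host compiler for testing.
#ifdef __SDSCC__
#include "sds_lib.h"
#define ACCEL_ALLOC(size) sds_alloc(size)
#define ACCEL_FREE(ptr)   sds_free(ptr)
#else
#define ACCEL_ALLOC(size) std::malloc(size)
#define ACCEL_FREE(ptr)   std::free(ptr)
#endif

int *alloc_matrix(int elems) {
    // Physically contiguous under sds++; plain heap memory on the host.
    return static_cast<int *>(ACCEL_ALLOC(elems * sizeof(int)));
}
```

On the host this returns ordinary heap memory; only under sds++ does the allocation become physically contiguous.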

Chapter 2: Programming the Embedded Processor Application

Examples of memory allocation can be found in the SDSoC Examples available on the Xilinx® GitHub repository. The following code is from the mmultadd example available in the <install_dir>/SDx/<version>/samples folder. The code shows allocating the memory in the main function, performing a quick check to make sure it was properly allocated, and releasing the allocated memory if there was a problem:

int main(int argc, char* argv[]) {
    int test_passed = 0;
    float *A, *B, *C, *D, *D_sw;

    A = (float *)sds_alloc(N * N * sizeof(float));
    B = (float *)sds_alloc(N * N * sizeof(float));
    C = (float *)sds_alloc(N * N * sizeof(float));
    D = (float *)sds_alloc(N * N * sizeof(float));
    D_sw = (float *)malloc(N * N * sizeof(float));

    if (!A || !B || !C || !D || !D_sw) {
        if (A) sds_free(A);
        if (B) sds_free(B);
        if (C) sds_free(C);
        if (D) sds_free(D);
        if (D_sw) free(D_sw);
        return 2;
    }
    ...
}

In the example above, the variables used by the hardware functions are allocated using the sds_alloc function to ensure physically contiguous memory is allocated, while the software-only variable (D_sw) is allocated using malloc.

At the end of the main function, all of the allocated memory is released using sds_free or free as appropriate:

sds_free(A);
sds_free(B);
sds_free(C);
sds_free(D);
free(D_sw);

The sds_alloc function, and other SDSoC™-specific functions for memory allocation/deallocation, can be found in sds_lib.h. More information on these APIs can be found in Appendix B: SDSoC Environment API.

Sequential/Parallel Accelerator Execution

After defining the memory allocations needed for the accelerators, you should determine how to call the accelerators from the application code. There are multiple ways for the accelerators to operate in the context of the main application. For example, in an application with only one accelerator, calling the hardware function like any other function will achieve the desired results of a sequential dataflow. However, for multiple accelerators, knowing whether and how the data is shared between the accelerators will let you choose between two distinct flows: sequential (synchronous) dataflow, or parallel (asynchronous) dataflow. In a sequential dataflow, the accelerators operate in sequence: one executes followed by the next, providing some benefit of acceleration in the hardware implementation. In parallel dataflow, both accelerators can operate concurrently, granting your application task-level parallelism for significant performance improvement. You can also use pipelining (pragma HLS PIPELINE), which allows both accelerators to operate concurrently on slightly skewed parts of the data set. See the SDx Pragma Reference Guide (UG1253) for more information on the pragmas discussed here.

To implement asynchronous dataflow, you must specify #pragma SDS async(id) and #pragma SDS wait(id) in your embedded processor application. You must place these pragmas in the application code, before and after the hardware function call, as shown in the following example:

#pragma SDS async(1)
mmult(A, B, C);
#pragma SDS async(2)
madd(D, E, F);

// Do other SW functions

#pragma SDS wait(1)
#pragma SDS wait(2)

TIP: The advantage of using async/wait is that it lets the application perform other operations while the hardware functions are running, and lets you hold the application at the appropriate point to wait for a hardware function to return.

The preceding code example demonstrates a typical asynchronous method. Here, the provided IDs correspond to their respective functions (id = 1 for mmult, id = 2 for madd). The mmult function will be loaded with the input values of A and B, and processed. Notice that in this case the accelerators are data independent (data is not shared between the accelerators), so asynchronous execution is beneficial. If you determine that the data for an accelerator is not needed by other functions on either the CPU or another accelerator, then async execution with non-cacheable physically contiguous data may provide the best performance.

IMPORTANT!: In cases where the data from one accelerator is required by a second accelerator, you should not use async/wait. The async pragma forgoes compiler-driven syncing, and thus you could end up with incorrect results if one accelerator requires syncing prior to the start of another.

An example of direct connection is provided with the SDSoC Examples available on the Xilinx® GitHub repository. The parallel_accel code offers a simple example of two hardware functions, matrix addition and matrix multiplication, to demonstrate async and wait, which help to achieve greater performance through system parallelism and concurrency.

The parallel_accel example provides both a sequential dataflow form of the two accelerators and a parallel dataflow form, and uses performance monitor counters (seq_hw_ctr, par_hw_ctr) from the included sds_utils.h to measure the performance difference. The relevant code is provided below for examination:

// Two hw functions are called back to back. First the
// vadd_accel is executed, then vmul_accel is executed.
// The execution of both accelerators is sequential here.
// To prevent automatic dataflow between calls to the two
// hw functions, the async and wait pragmas are used here to
// ensure that the two hw functions will run sequentially.
seq_hw_ctr.start();
// Launch Hardware Solution
for (int itr = 0; itr < MAX_NUM_TIMES; itr++) {
    #pragma SDS async(1)
    vadd_accel(source_in1, source_in2, source_vadd_hw_results, size);
    #pragma SDS wait(1)
    #pragma SDS async(2)
    vmul_accel(source_in1, source_in2, source_vmul_hw_results, size);
    #pragma SDS wait(2)
}
seq_hw_ctr.stop();

// Two hw functions are called back to back.
// The program running on the hardware first transfers in1 and in2
// to the vadd_accel hardware and returns immediately. Then the program
// transfers in1 and in2 to the vmul_accel hardware and returns
// immediately. When the program later executes to the point of
// #pragma SDS wait(id), it waits for the particular output to be ready.
par_hw_ctr.start();
// Launch Hardware Solution
#pragma SDS async(1)
vadd_accel(source_in1, source_in2, source_vadd_hw_results, size);
#pragma SDS async(2)
vmul_accel(source_in1, source_in2, source_vmul_hw_results, size);
for (int itr = 0; itr < MAX_NUM_TIMES; itr++) {
    #pragma SDS wait(1)
    #pragma SDS async(1)
    vadd_accel(source_in1, source_in2, source_vadd_hw_results, size);
    #pragma SDS wait(2)
    #pragma SDS async(2)
    vmul_accel(source_in1, source_in2, source_vmul_hw_results, size);
}
#pragma SDS wait(1)
#pragma SDS wait(2)
par_hw_ctr.stop();

In the sequential dataflow example, the async and wait pragmas are used to ensure that the two hardware functions run sequentially. The key is the use of the wait pragma before the call to the multiplier function, vmul_accel, which ensures that the addition function, vadd_accel, completes before matrix multiplication begins. Notice also the use of the async(2) and wait(2) pragmas to ensure that the application waits for the completion of the vmul_accel hardware function before proceeding.

TIP: The async/wait pragmas are not actually needed in the preceding example, as the compiler will automatically synchronize these functions in the manner described.

In the parallel dataflow example, the vadd_accel and vmul_accel functions are started in sequence, not waiting for one to complete before calling the next. This results in nearly parallel execution of the two hardware functions. These function calls are labeled async(1) and async(2). Then the for loop is called to repeat the functions a number of times (MAX_NUM_TIMES), but wait(1) and wait(2) are used to wait for the prior executions to complete before calling the functions again.

As with parallel code, you must explicitly synchronize the function calls so that the data is available for the application to complete the function. Failure to program this properly can result in deadlocks or non-deterministic behavior. However, in some instances running concurrent accelerators may not provide the best performance compared to other means. An example of this is pipelining concurrent accelerators that are data dependent on each other. This would require the data to be synced on a pipeline stage before the next accelerator can begin to process the data.
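The launch-then-synchronize discipline that async/wait imposes can be pictured with standard C++ futures. The following host-side sketch is an analogue only, not SDSoC code: std::async stands in for #pragma SDS async, future::get stands in for #pragma SDS wait, and the vadd and run_parallel names are hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Sum of element-wise additions; plays the role of a data-independent
// "accelerator" task in this host-side analogue.
int vadd(const std::vector<int>& a, const std::vector<int>& b) {
    int sum = 0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] + b[i];
    return sum;
}

int run_parallel(const std::vector<int>& a, const std::vector<int>& b) {
    // Launch both tasks without waiting: the analogue of async(1)/async(2).
    auto f1 = std::async(std::launch::async, vadd, std::cref(a), std::cref(b));
    auto f2 = std::async(std::launch::async, vadd, std::cref(b), std::cref(a));
    // Block until each result is ready: the analogue of wait(1)/wait(2).
    return f1.get() + f2.get();
}
```

As in the pragma version, correctness depends on both tasks being data independent; sharing mutable state between them would require explicit synchronization.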

Validating the Software to Hardware Conversion

Testing accelerators in the SDSoC™ environment is similar to testing any other function on a software platform. Generally, you can write a test bench to exercise and validate the application code, or this testing can be implemented as a function call from the main function with a golden dataset, comparing the outputs. Converting the C/C++ code of the software function to the HDL code of the hardware function may cause the behavior of the hardware function to change. It is a good idea to always run a verification test between the converted hardware code and the known-good software code to make sure the algorithm is maintained through the complete build process.

TIP: For an application that has multiple accelerators, it is best to use a bottom-up testing approach, testing each accelerator individually, and then testing all accelerators together. This should shorten debug time. Refer to the SDSoC Environment Debugging Guide (UG1282) for more information.

Examples of verification code can be found in the SDSoC Examples available on the Xilinx® GitHub repository. The following code is from the mmultadd example available in the <install_dir>/SDx/<version>/samples folder. The main.cpp file defines methods to calculate golden data for the matrix addition (madd_golden) and multiplication (mmult_golden). The code for mmult_golden is provided below:

void mmult_golden(float *A, float *B, float *C)
{
    for (int row = 0; row < N; row++) {
        for (int col = 0; col < N; col++) {
            float result = 0.0;
            for (int k = 0; k < N; k++) {
                result += A[row*N+k] * B[k*N+col];
            }
            C[row*N+col] = result;
        }
    }
}

Notice the function is essentially the same as the hardware function, mmult, which accelerates the matrix multiplication in the PL region of the device while adding a few techniques, such as array partitioning and pipelining, to achieve optimal performance. The mmult_golden function simply calculates the expected value as golden data to be compared against the results returned by the accelerated function.
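The comparison against golden data is typically an element-wise check with a small tolerance, since an accelerated floating-point pipeline may not be bit-identical to the software version. A minimal sketch follows; the check_results name and the tolerance value are assumptions, standing in for the example's result_check:

```cpp
#include <cmath>
#include <cstddef>

// Sketch of an element-wise checker: returns the number of elements
// where the hardware result diverges from the golden result by more
// than a small tolerance (0 means the results match).
int check_results(const float *hw, const float *golden, std::size_t n,
                  float tol = 1e-4f) {
    int errors = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (std::fabs(hw[i] - golden[i]) > tol)
            ++errors;
    return errors;
}
```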

Finally, within the mmult_test function, the verification process is called to generate the golden data and compare it to the results generated by the accelerated hardware functions. This section of the code is provided below:

int mmult_test(float *A, float *B, float *C, float *D, float *D_sw)
{
    std::cout << "Testing " << NUM_TESTS << " iterations of " << N << "x"
              << N << " floating point mmultadd..." << std::endl;

    perf_counter hw_ctr, sw_ctr;

    for (int i = 0; i < NUM_TESTS; i++) {
        init_arrays(A, B, C, D, D_sw);

        float tmp[N*N], tmp1[N*N];
        sw_ctr.start();
        mmult_golden(A, B, tmp);
        madd_golden(tmp, C, D_sw);
        sw_ctr.stop();

        hw_ctr.start();
        mmult(A, B, tmp1);
        madd(tmp1, C, D);
        hw_ctr.stop();

        if (result_check(D, D_sw))
            return 1;
    }
    ...
}

Performance Estimation

In some cases, knowing the wall-clock time of the functions that may be turned into hardware functions is necessary. You can accurately measure the execution time of functions by using special SDSoC API calls that measure activity based on the free-running clock of the Arm processor. The API functions include sds_clock_counter() and sds_clock_frequency(). These functions can be used to log the start and end times of a function. The function sds_clock_counter() returns the value of the free-running clock register, while the function sds_clock_frequency() returns the speed in ticks/second of the Arm processor. Refer to Appendix B: SDSoC Environment API for more information on these functions.

Note: sds_clock_counter() reads a high-performance counter and offers a fine-grained measurement of events.

A performance counter class is provided in the sds_utils.h available with the SDSoC Examples on the Xilinx® GitHub repository. The perf_counter class includes methods for capturing the start and stop clock times, and the number of function calls, as shown below:

#include "sds_lib.h"

class perf_counter
{
public:
    uint64_t tot, cnt, calls;
    perf_counter() : tot(0), cnt(0), calls(0) {};
    inline void reset() { tot = cnt = calls = 0; }
    inline void start() { cnt = sds_clock_counter(); calls++; };
    inline void stop() { tot += (sds_clock_counter() - cnt); };
    inline uint64_t avg_cpu_cycles() { return ((tot + (calls >> 1)) / calls); };
};

You can also use the avg_cpu_cycles() method to return the average number of CPU cycles the task took across all calls.

Chapter 3

Programming Hardware Functions

Programming a function for hardware acceleration in the SDSoC™ environment is as simple as writing a standard C/C++ function. However, to get the significant performance advantages of hardware acceleration through the sds++/sdscc (referred to as sds++) system compiler, there are a few considerations to keep in mind when writing the function, or modifying existing code to be implemented in programmable logic:

• Defining the function interface: the datatypes of inputs and outputs, and data transfers.

• What kind of memory access the function will have: DMA (interfacing with DDR), or FIFOs.

• How will the data be accessed: contiguous or non-contiguous.

• How will the data be processed: loops, arrays.

Determining what data is going to be processed in and out of an accelerator is the first step in creating a hardware function. Knowing the inputs and outputs of the hardware function, you can get an idea of what parallelism can be achieved. A critical element of writing a function for acceleration in programmable logic is that the data used by the function cannot change in size: arrays are static and cannot be resized. By being strict on the size of the array, the compiler can fully implement the function in programmable logic, which cannot change during run-time.
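As a minimal illustration of this constraint, the hypothetical function below fixes its array dimension with a compile-time constant, so the compiler knows the exact data size to implement (scale_accel and N here are assumptions, not from the SDSoC examples):

```cpp
// Hypothetical hardware-function candidate: the array dimension N is a
// compile-time constant, so the transfer size is fixed and knowable.
const int N = 32;

void scale_accel(const int in[N * N], int out[N * N], int factor) {
    for (int i = 0; i < N * N; i++)
        out[i] = in[i] * factor;
}
```

A signature taking a runtime-sized buffer with no bound would leave the compiler unable to fix the hardware resources at build time.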

Exporting Hardware Functions as Libraries

Once a hardware function, or a library of functions, is written and optimized as needed, you can create an exported library for reuse in other projects. A general flow for exporting a library is to make sure that all the function definitions are grouped appropriately, and to use the sds++/sdscc command with the -shared option (which is interpreted as -fPIC for gcc) to build a shared library when compiling the functions. More detailed information can be found in Appendix A: Exporting a Library for GCC.

C-Callable IP

An accelerator can also be created using RTL and provided as an IP core through the Vivado® Design Suite, called a C-Callable IP. In this case, to use the C-Callable IP in the application code, you add the static library compiled for the appropriate Arm® processor, typically an archive (.a) file, and the header file (.h/.hpp) as source files in the SDSoC application project. These files need to be added to the source directory of the application or specified for library search (-L) and include search (-I), and properly included and linked with the main function. Then, like any other hardware function, it can be called as a typical C/C++ function, making sure the data sizes match the function parameters. More information on creating and using C-Callable IP can be found in the SDSoC Environment User Guide (UG1027).

Coding Guidelines

This section contains general coding guidelines for application programming using the SDSoC™ system compilers, with the assumption of starting from application code that has already been cross-compiled for the Arm® CPU within the Zynq® device, using the GNU toolchain included as part of the SDSoC environment.

General Hardware Function Guidelines

• Hardware functions can execute concurrently under the control of a master thread. Multiple master threads are supported.

• A top-level hardware function must be a global function, not a class method, and it cannot be an overloaded function.

• There is no support for exception handling in hardware functions.

• It is an error to refer to a global variable within a hardware function or any of its sub-functions when this global variable is also referenced by other functions running in software.

• Hardware functions support scalar types up to 1024 bits, including double, long long, packedstructs, etc.

• A hardware function must have at least one argument.

• An output or inout scalar argument to a hardware function can be assigned multiple times, but only the last written value will be read upon function exit.

• Use predefined macros to guard code with #ifdef and #ifndef preprocessor statements; the macro names begin and end with two underscore characters '_'. For examples, see "SDSCC/SDS++ Compiler Commands" in the SDx Command and Utility Reference Guide (UG1279).

○ The __SDSCC__ macro is defined and passed as a -D option to sub-tools whenever sds++ is used to compile source files, and can be used to guard code dependent on whether it is compiled by sds++ or by another compiler, for example a GNU host compiler.

○ When sds++ compiles source files targeted for hardware acceleration using Vivado HLS, the __SDSVHLS__ macro is defined and passed as a -D option, and can be used to guard code dependent on whether high-level synthesis is run or not.

○ Vivado® HLS employs some 32-bit libraries irrespective of the host machine. Furthermore, the tool does not provide true cross-compilation.
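A sketch of guarding code on these macros (the macro names are from this guide; the active_toolchain helper itself is hypothetical). When neither macro is defined, for example under a GNU host compiler, the final branch is taken:

```cpp
#include <string>

// Sketch: select behavior based on which toolchain is compiling the
// file. __SDSVHLS__ is defined during high-level synthesis, __SDSCC__
// whenever sds++ compiles the source; a host compiler defines neither.
std::string active_toolchain() {
#ifdef __SDSVHLS__
    return "vivado-hls";   // high-level synthesis pass
#elif defined(__SDSCC__)
    return "sds++";        // sds++ compiling for the CPU
#else
    return "host";         // e.g. a GNU host compiler
#endif
}
```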

All object code for the Arm® CPUs is generated with the GNU toolchains, but the sds++ compiler, built upon Clang/LLVM frameworks, is generally less forgiving of C/C++ language violations than the GNU compilers. As a result, you might find that some libraries needed for your application cause front-end compiler errors when using sds++. In such cases, compile the source files directly through the GNU toolchain rather than through sds++, either in your makefiles or by setting the compiler to arm-linux-gnueabihf-g++ by right-clicking on the file (or folder) in the Project Explorer and selecting C/C++ Build → Settings → SDSCC/SDS++ Compiler. Refer to "Compiling and Running Applications on an Arm® Processor" in the SDSoC Environment User Guide (UG1027) for more information.

Hardware Function Argument Types

The SDSoC™ environment sds++ system compiler supports hardware function arguments with types that resolve to a single or array of C99 basic arithmetic type (scalar), a struct or class whose members flatten to a single or array of C99 basic arithmetic type (hierarchical structs are supported), or an array of struct whose members flatten to a single C99 basic arithmetic type. Scalar arguments must fit in a 1024-bit container. The SDSoC environment automatically infers hardware interface types for each hardware function argument based on the argument type and the following pragmas:

#pragma SDS data copy|zero_copy
#pragma SDS data access_pattern

To avoid interface incompatibilities, you should only incorporate Vivado® HLS interface type directives and pragmas in your source code when sds++ fails to generate a suitable hardware interface directive.

• Vivado HLS provides arbitrary precision types ap_fixed<int>, ap_int<int>, and an hls::stream class. In the SDSoC environment, ap_fixed<int> types must be specified as having widths greater than 7 but less than 1025 (7 < width < 1025). The hls::stream data type is not supported as the function argument to any hardware function.

• By default, an array argument to a hardware function is transferred by copying the data; that is, it is equivalent to using #pragma SDS data copy. As a consequence, an array argument must be either used as an input or produced as an output, but not both. For an array that is both read and written by the hardware function, you must use #pragma SDS data zero_copy to tell the compiler that the array should be kept in the shared memory and not copied.

• To ensure alignment across the hardware/software interface, do not use hardware function arguments that are an array of bool.

Pointers

Pointer arguments for a hardware function require special consideration. Hardware functions operate on physical addresses, which typically are not available to userspace programs, so pointers cannot be embedded in data structures passed to hardware functions.

In the absence of any pragmas, a pointer argument is taken to be a scalar parameter by default, even though in C/C++ a pointer might denote a one-dimensional array type. The following are the permitted pragmas:

• The DATA ZERO_COPY pragma provides pointer semantics using shared memory.

#pragma SDS data zero_copy

• The DATA COPY and DATA ACCESS_PATTERN pragma pair maps the argument onto a stream, and requires that the array elements are accessed in sequential index order.

#pragma SDS data copy(p[0:<p_size>])
#pragma SDS data access_pattern(p:SEQUENTIAL)

The DATA COPY pragma is only required when the sds++ system compiler is unable to determine the data transfer size and issues an error.

TIP: If you require non-sequential access to the array in the hardware function, you should change the pointer argument to an array with an explicit declaration of its dimensions, for example A[1024].

Hardware Function Call Guidelines

• Stub functions generated in the SDSoC™ environment transfer the exact number of bytes according to the compile-time determinable array bound of the corresponding argument in the hardware function declaration. If a hardware function admits a variable data size, you can use the following pragmas to direct the SDSoC environment to generate code to transfer data whose size is defined by an arithmetic expression:

#pragma SDS data copy(arg[0:<C_size_expr>])
#pragma SDS data zero_copy(arg[0:<C_size_expr>])

Where the <C_size_expr> must compile in the scope of the function declaration.

The zero_copy pragma directs the SDSoC environment to map the argument into shared memory.

IMPORTANT!: Be aware that mismatches between intended and actual data transfer sizes can cause the system to hang at runtime, requiring laborious hardware debugging. Refer to the SDSoC Environment Debugging Guide (UG1282).

• Align arrays transferred by DMAs on cache-line boundaries (for L1 and L2 caches). Use the sds_alloc() API provided with the SDSoC environment instead of malloc() to allocate these arrays.

• Align arrays to page boundaries to minimize the number of pages transferred with the scatter-gather DMA, for example, for arrays allocated with malloc.

• You must use sds_alloc to allocate an array for the following two cases:

1. You are using zero-copy pragma for the array.

2. You are using pragmas to explicitly direct the system compiler to use Simple-DMA.

Note that in order to use sds_alloc() from sds_lib.h, it is necessary to include stdlib.h before including sds_lib.h; stdlib.h is included to provide the size_t type.

Data Movers

A data mover is added by the sds++/sdscc compiler to move data into and out of the hardware accelerator. The data mover is inferred by the compiler based on the volume of data being transferred, the characteristics of the memory being transferred, and the access pattern expectations of the accelerator consuming or producing the data in the hardware function. The data mover is implemented in the PL region to support data transfers to and from the hardware function. The system compiler will implement one or more accelerators in the programmable region, including the data movers, automating control signals, and interrupts.

You can specify the SDS DATA COPY or DATA ZERO_COPY pragmas in the source code of the hardware function to influence the behavior of the data movers. The DATA COPY pragma indicates that data is explicitly copied between memory and the hardware function, using a suitable data mover for the transfer. The DATA ZERO_COPY pragma means that the hardware function accesses the data directly from shared memory through an AXI master bus interface.

With the data size boundaries known, methods of transferring data can be optimized. As an example, if the data is a large image (1920x1080), accessing the data is easier when it is stored in DDR in a physically contiguous way. Retrieving the data can be done through direct interfacing to the DDR, and then determining whether it needs to be sequentially accessed or not. By default, the sds++ compiler will infer the data movers, but in cases where the data being transferred is large, you will need to identify the appropriate data mover to use. Refer to the SDSoC Environment Profiling and Optimization Guide (UG1235) for information on setting up and optimizing the data motion network.

To incorporate these types of data movers, pragmas are used to tell the compiler how to interface the accelerator to the rest of the system. For example, for storing and retrieving an image in DDR, the following SDS pragmas can be used:

#pragma SDS data copy(data[0:size])
#pragma SDS data copy(out[0:size])
#pragma SDS data access_pattern(data:SEQUENTIAL, out:SEQUENTIAL)
void accelerator(float *data, int *out, int size) { ... }

IMPORTANT!: Using the COPY pragma also requires the ACCESS_PATTERN pragma to be used.

The DATA ACCESS_PATTERN pragma specifies how the array data is transferred. It is needed when the data size is determined at runtime but is unbounded at compile time. For SEQUENTIAL data access, SDSoC will create a streaming interface like a FIFO, while a RANDOM data access pattern will create a RAM interface, which is the default. A key point is that SEQUENTIAL will only access each element of the array one time, while RANDOM can access any array value in any order. Note, though, that using a random access pattern will still transfer the complete volume of data, regardless of the volume being accessed.

You can also use the SDS pragma DATA MEM_ATTRIBUTE as a hint to the compiler stating whether memory is physically contiguous or not, and cacheable or not.

Knowing how the data is being allocated can help tune the system performance depending on the accelerator. For example, the compiler can use a simple DMA transfer for physically contiguous memory, which is smaller and faster than AXI_DMA_SG. For physically contiguous memory, you must use sds_alloc, while for non-physically contiguous memory use malloc. Specifying the DATA MEM_ATTRIBUTE helps determine what kind of data mover the compiler will use.
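A sketch of such a hint follows; the pragma spelling is taken to follow the DATA MEM_ATTRIBUTE form covered in the SDx Pragma Reference Guide (UG1253), and sum_accel is a hypothetical function. A host compiler simply warns about and ignores the unknown pragma, so the file still builds for software-only testing:

```cpp
// Sketch: hint that the array A will be allocated with sds_alloc, so
// the compiler can choose a simpler, faster data mover. A host
// compiler ignores the unrecognized pragma.
#pragma SDS data mem_attribute(A:PHYSICAL_CONTIGUOUS)
void sum_accel(int A[1024], int *result) {
    int acc = 0;
    for (int i = 0; i < 1024; i++)
        acc += A[i];
    *result = acc;
}
```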

See the SDx Pragma Reference Guide (UG1253) for more information on SDS pragmas andexamples.

You can also direct the creation of data movers as elements of a packaged C-Callable IP by applying pragmas to the software function signature defined in the header file associated with the C-Callable IP. Refer to the SDSoC Environment User Guide (UG1027) for more information.

Chapter 3: Programming Hardware Functions

UG1278 (v2018.2) July 2, 2018 www.xilinx.com SDx Programmers Guide


The Function Body

After determining the function interfaces and the data transfer mechanism, writing the function body is all that remains. The body of a hardware function should not be all that different from a function written for the processor. However, a key point to remember is that there are opportunities to improve the performance of the accelerator, and of the overall application. To this end, you can examine and rewrite the structure of hardware functions to increase instruction-level or task-level parallelism, use bit-accurate datatypes, manage loop unrolling and pipelining, and improve overall dataflow.

Datatypes

Because it is faster to write and verify code using native C datatypes such as int, float, or double, it is common practice to use these datatypes when coding for the first time. However, the code is implemented in hardware, and all the operator sizes used in the hardware depend on the datatypes used in the accelerator code. The default native C/C++ datatypes can result in larger and slower hardware resources that limit the performance of the accelerator. Instead, you should consider using bit-accurate datatypes to ensure the code is optimized for implementation in hardware. Using bit-accurate, or arbitrary precision, datatypes results in hardware operators that are smaller and faster. This allows more logic to be placed in the programmable logic, and also allows the logic to execute at higher clock frequencies while using less power.

RECOMMENDED: Consider using bit-accurate datatypes instead of native C/C++ datatypes in your code.

Arbitrary Precision Integer Types

Arbitrary precision integer datatypes are defined by ap_int or ap_uint for signed and unsigned integers respectively, inside the header file ap_int.h. To use an arbitrary precision integer datatype:

• Add header file ap_int.h to the source code.

• Change the bit types to ap_int<N> or ap_uint<N>, where N is a bit-size from 1 to 1024.


The following example shows how the header file is added and two variables are declared to use a 9-bit signed integer and a 10-bit unsigned integer.

#include "ap_int.h"

ap_int<9>   var1; // 9-bit signed integer
ap_uint<10> var2; // 10-bit unsigned integer

Arbitrary Precision Fixed-Point Data Types

Some existing applications use floating-point datatypes because they were written for other hardware architectures. However, fixed-point datatypes are a useful replacement for floating-point types, which require many clock cycles to complete. You should carefully evaluate trade-offs in power, cost, productivity, and precision when choosing between floating-point and fixed-point arithmetic for your application and accelerators.

As discussed in Deep Learning with INT8 Optimization on Xilinx Devices (WP486), using fixed-point arithmetic instead of floating point for applications like machine learning can increase power efficiency and lower the total power required. Unless the entire range of the floating-point type is needed, the same accuracy can often be achieved with a fixed-point type, resulting in smaller and faster hardware. The paper Reduce Power and Cost by Converting from Floating Point to Fixed Point (WP491) provides some examples of this conversion.
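To make the float-to-fixed conversion concrete, the following sketch models ap_fixed<W,I>-style values using plain integers, so it compiles without the Vivado HLS headers. The helper names to_fixed, to_double, and fixed_mul are illustrative only, not part of any Xilinx API; truncation stands in for the default AP_TRN quantization mode.

```cpp
#include <cstdint>

// Model a signed fixed-point value with I integer bits out of W total bits.
// A value x is stored as round-toward-zero of x * 2^(W - I), mimicking the
// AP_TRN (truncation) quantization mode described above.
int32_t to_fixed(double x, int w, int i) {
    const int frac_bits = w - i;
    return static_cast<int32_t>(x * (1 << frac_bits)); // truncates toward zero
}

double to_double(int32_t v, int w, int i) {
    const int frac_bits = w - i;
    return static_cast<double>(v) / (1 << frac_bits);
}

// Fixed-point multiply: the raw product carries twice the fraction bits,
// so shift right by frac_bits to return to the original format.
int32_t fixed_mul(int32_t a, int32_t b, int w, int i) {
    const int frac_bits = w - i;
    return static_cast<int32_t>((static_cast<int64_t>(a) * b) >> frac_bits);
}
```

For the ap_fixed<18,6> format used later in this section, 1.5 is stored as 1.5 * 2^12 = 6144, and multiplying the stored forms of 1.5 and 2.0 recovers 3.0 exactly.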

Fixed-point datatypes model the data as integer and fraction bits. The fixed-point datatype requires the ap_fixed header, and supports both a signed and an unsigned form as follows:

• Header file: ap_fixed.h

• Signed fixed point: ap_fixed<W,I,Q,O,N>

• Unsigned fixed point: ap_ufixed<W,I,Q,O,N>

○ W = Total width (less than 1024 bits).

○ I = Integer bit width. The value of I must be less than or equal to the width (W). The number of bits to represent the fractional part is W minus I. Only a constant integer expression can be used to specify the integer width.

○ Q = Quantization mode. Only predefined enumerated values can be used to specify Q. The accepted values are:

- AP_RND: Rounding to plus infinity.

- AP_RND_ZERO: Rounding to zero.

- AP_RND_MIN_INF: Rounding to minus infinity.

- AP_RND_INF: Rounding to infinity.

- AP_RND_CONV: Convergent rounding.

- AP_TRN: Truncation. This is the default value when Q is not specified.


- AP_TRN_ZERO: Truncation to zero.

○ O = Overflow mode. Only predefined enumerated values can be used to specify O. The accepted values are:

- AP_SAT: Saturation.

- AP_SAT_ZERO: Saturation to zero.

- AP_SAT_SYM: Symmetrical saturation.

- AP_WRAP: Wrap-around. This is the default value when O is not specified.

- AP_WRAP_SM: Sign magnitude wrap-around.

○ N = The number of saturation bits in the overflow WRAP modes. Only a constant integer expression can be used as the parameter value. The default value is zero.

TIP: The ap_fixed and ap_ufixed datatypes permit shorthand definition, with only W and I being required, and the other parameters assigned default values. However, to define Q or N, you must also specify the parameters before them, even if you just specify the default values.

In the example code below, the ap_fixed type is used to define a signed 18-bit variable with 6 bits representing the integer value above the binary point and, by implication, 12 bits representing the fractional value below the binary point. The quantization mode is set to round to plus infinity (AP_RND). Because the overflow mode and saturation bits are not specified, the defaults AP_WRAP and 0 are used.

#include <ap_fixed.h>
...
ap_fixed<18,6,AP_RND> my_type;
...

When performing calculations where the variables have different numbers of bits (W) or different precision (I), the binary point is automatically aligned. Refer to "C++ Arbitrary Precision Fixed-Point Types" in the Vivado Design Suite User Guide: High-Level Synthesis (UG902) for more information on using fixed-point datatypes.

Array Configuration

The compiler maps large arrays to the memory (known as block RAM or BRAM) in the programmable region. Each BRAM has a maximum of two access ports. This can limit the performance of the application because not all the elements of an array can be accessed in parallel when implemented in hardware.


Depending on the performance requirements, you may need to access some or all of the elements of an array in the same clock cycle. To achieve this, #pragma HLS ARRAY_PARTITION can be used to instruct the compiler to split the elements of an array and map them to smaller arrays, or to individual registers. The compiler provides three types of array partitioning, as shown in the following figure:

• block: The original array is split into equally sized blocks of consecutive elements of the original array.

• cyclic: The original array is split into equally sized blocks interleaving the elements of the original array.

• complete: The array is split into its individual elements. This corresponds to resolving a memory into individual registers. This is the default for the ARRAY_PARTITION pragma.

Figure 3: Partitioning Arrays

[Figure: an N-element array 0 ... N-1 shown split by block into {0 ... N/2-1} and {N/2 ... N-1}, by cyclic into the interleaved sets {0, 2, ..., N-2} and {1, 3, ..., N-1}, and by complete into N individual registers.]

For block and cyclic partitioning, the factor option specifies the number of arrays that are created. In the preceding figure, a factor of 2 is used to split the array into two smaller arrays. If the number of elements in the array is not an integer multiple of the factor, the last array will have fewer elements.
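As an illustration of how the two factor=2 schemes map an original array index to a partition, here is a minimal software model. The bank/offset helper names are hypothetical, chosen for this sketch, and N = 8 with FACTOR = 2 are assumed values:

```cpp
// Model of factor=2 partitioning of an 8-element array into two banks.
// block keeps consecutive halves together; cyclic interleaves elements
// (even indices to bank 0, odd indices to bank 1).
constexpr int N = 8;
constexpr int FACTOR = 2;

int block_bank(int idx)    { return idx / (N / FACTOR); } // which sub-array
int block_offset(int idx)  { return idx % (N / FACTOR); } // position within it

int cyclic_bank(int idx)   { return idx % FACTOR; }
int cyclic_offset(int idx) { return idx / FACTOR; }
```

Under block partitioning, elements 0-3 land in bank 0 and elements 4-7 in bank 1; under cyclic partitioning, neighboring elements land in different banks, which is why neighbors can be read in the same cycle.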

When partitioning multi-dimensional arrays, the dimension option is used to specify which dimension is partitioned. The following figure shows how the dimension option is used to partition the following example code in three different ways.

void foo (...) {
  // my_array[dim=1][dim=2][dim=3]
  // The results of the following three pragmas are shown in the figure below:
  // #pragma HLS ARRAY_PARTITION variable=my_array dim=3 <block|cyclic> factor=2
  // #pragma HLS ARRAY_PARTITION variable=my_array dim=1 <block|cyclic> factor=2
  // #pragma HLS ARRAY_PARTITION variable=my_array dim=0 complete
  int my_array[10][6][4];
  ...
}

Figure 4: Partitioning the Dimensions of an Array

[Figure: my_array[10][6][4] partitioned on dimension 3 yields four arrays, my_array_0[10][6] through my_array_3[10][6]; partitioned on dimension 1 yields ten arrays, my_array_0[6][4] through my_array_9[6][4]; partitioned on dimension 0 yields 10x6x4 = 240 individual registers.]

The examples in the figure demonstrate how partitioning dimension 3 results in four separate arrays, and partitioning dimension 1 results in ten separate arrays. If 0 is specified as the dimension, all dimensions are partitioned.

The Importance of Careful Partitioning

A complete partition of an array maps all the array elements to individual registers. This helps improve kernel performance because all of these registers can be accessed concurrently in the same cycle. However, a word of caution: complete partitioning of large arrays consumes a lot of FPGA resources. It can even cause the compilation process to slow down and run into capacity issues. You should partition an array only when it is needed. Consider selectively partitioning a particular dimension, or using block or cyclic partitioning. The following is an example of array partitioning using a common matrix multiplication algorithm.

Choosing a Specific Dimension to Partition

Suppose A and B are two-dimensional arrays representing two matrices. Consider the matrix multiplication algorithm below:

int A[64][64];
int B[64][64];
int C[64][64];

ROW_WISE: for (int i = 0; i < 64; i++) {
  COL_WISE: for (int j = 0; j < 64; j++) {
    #pragma HLS PIPELINE
    int result = 0;
    COMPUTE_LOOP: for (int k = 0; k < 64; k++) {
      result += A[i][k] * B[k][j];
    }
    C[i][j] = result;
  }
}

Due to the PIPELINE pragma, the ROW_WISE and COL_WISE loops are flattened together and COMPUTE_LOOP is fully unrolled. To concurrently execute each iteration (k) of COMPUTE_LOOP, the code must access each column of matrix A and each row of matrix B in parallel. Therefore, matrix A should be split in the second dimension, and matrix B should be split in the first dimension.

#pragma HLS ARRAY_PARTITION variable=A dim=2 complete#pragma HLS ARRAY_PARTITION variable=B dim=1 complete

Choosing Between Cyclic and Block Partitions

Here the same matrix multiplication algorithm is used to demonstrate choosing between cyclic and block partitioning, and determining the appropriate factor, by understanding the array access pattern of the underlying algorithm.

int A[64 * 64];
int B[64 * 64];
int C[64 * 64];
#pragma HLS ARRAY_PARTITION variable=A dim=1 cyclic factor=64
#pragma HLS ARRAY_PARTITION variable=B dim=1 block factor=64

ROW_WISE: for (int i = 0; i < 64; i++) {
  COL_WISE: for (int j = 0; j < 64; j++) {
    #pragma HLS PIPELINE
    int result = 0;
    COMPUTE_LOOP: for (int k = 0; k < 64; k++) {
      result += A[i * 64 + k] * B[k * 64 + j];
    }
    C[i * 64 + j] = result;
  }
}

In this version of the code, A and B are now one-dimensional arrays. To access each column of matrix A and each row of matrix B in parallel, cyclic and block partitions are used as shown in the above example. To access each column of matrix A in parallel, cyclic partitioning is applied with the factor specified as the row size, in this case 64. Similarly, to access each row of matrix B in parallel, block partitioning is applied with the factor specified as the column size, or 64.
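The partitioning pragmas change only how the arrays are implemented, not what the loop computes, so the flattened 1-D multiply can be checked functionally in plain C++. The sketch below shrinks the matrices to DIM = 4 (an assumed size standing in for 64) so the result is easy to verify; the pragmas are omitted because they do not affect functional behavior:

```cpp
#include <vector>

// Software model of the flattened row-by-column matrix multiply above,
// reduced to 4x4. Matrices are stored as one-dimensional vectors indexed
// as [i * DIM + j], matching the code in the text.
constexpr int DIM = 4;

std::vector<int> matmul(const std::vector<int>& A, const std::vector<int>& B) {
    std::vector<int> C(DIM * DIM, 0);
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            int result = 0;
            for (int k = 0; k < DIM; k++)
                result += A[i * DIM + k] * B[k * DIM + j];
            C[i * DIM + j] = result;
        }
    return C;
}

// Helpers for checking: the identity matrix and an all-ones matrix.
std::vector<int> identity() {
    std::vector<int> I(DIM * DIM, 0);
    for (int i = 0; i < DIM; i++) I[i * DIM + i] = 1;
    return I;
}

std::vector<int> ones() { return std::vector<int>(DIM * DIM, 1); }
```

Multiplying by the identity returns the other operand unchanged, and each element of the product of two all-ones matrices equals DIM.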

Minimizing Array Accesses with Caching

Because arrays are mapped to BRAM with a limited number of access ports, repeated array accesses can limit the performance of the accelerator. You should have a good understanding of the array access pattern of the algorithm, and limit the array accesses by locally caching the data to improve the performance of the accelerator.


The following code example shows a case in which accesses to an array can limit performance in the final implementation. In this example, there are three accesses to the array mem[N] to create a summed result.

#include "array_mem_bottleneck.h"

dout_t array_mem_bottleneck(din_t mem[N]) {
  dout_t sum = 0;
  int i;
  SUM_LOOP: for (i = 2; i < N; ++i)
    sum += mem[i] + mem[i-1] + mem[i-2];
  return sum;
}

The code in the preceding example can be rewritten as shown in the following example to allow the code to be pipelined with II=1. By performing pre-reads and manually pipelining the data accesses, there is only one array read specified inside each iteration of the loop. This ensures that only a single-port BRAM is needed to achieve the performance.

#include "array_mem_perform.h"

dout_t array_mem_perform(din_t mem[N]) {
  din_t tmp0, tmp1, tmp2;
  dout_t sum = 0;
  int i;
  tmp0 = mem[0];
  tmp1 = mem[1];
  SUM_LOOP: for (i = 2; i < N; i++) {
    tmp2 = mem[i];
    sum += tmp2 + tmp1 + tmp0;
    tmp0 = tmp1;
    tmp1 = tmp2;
  }
  return sum;
}

RECOMMENDED: Depending on the algorithm, consider minimizing array accesses by caching data in local registers to improve pipelining performance.
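The cached rewrite must compute exactly the same sum as the three-access version; only the access pattern differs. A quick software check of that equivalence, using std::vector<int> in place of the din_t/dout_t typedefs from the headers in the text:

```cpp
#include <vector>

// Three reads of mem per iteration, as in array_mem_bottleneck.
int sum_three_reads(const std::vector<int>& mem) {
    int sum = 0;
    for (int i = 2; i < static_cast<int>(mem.size()); ++i)
        sum += mem[i] + mem[i-1] + mem[i-2];
    return sum;
}

// One read of mem per iteration; tmp0..tmp2 cache the sliding window,
// as in array_mem_perform.
int sum_cached(const std::vector<int>& mem) {
    int tmp0 = mem[0], tmp1 = mem[1], tmp2;
    int sum = 0;
    for (int i = 2; i < static_cast<int>(mem.size()); ++i) {
        tmp2 = mem[i];
        sum += tmp2 + tmp1 + tmp0;
        tmp0 = tmp1;
        tmp1 = tmp2;
    }
    return sum;
}
```

For {1, 2, 3, 4}, both versions compute (3+2+1) + (4+3+2) = 15.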

Loops

Loops are an important aspect of a high-performance accelerator. Generally, loops are either pipelined or unrolled to take advantage of the highly distributed and parallel FPGA architecture, providing a performance boost compared to running on a CPU.

By default, loops are neither pipelined nor unrolled. Each iteration of the loop takes at least one clock cycle to execute in hardware. Thinking from the hardware perspective, there is an implicit "wait until clock" for the loop body. The next iteration of a loop only starts when the previous iteration is finished.


Loop Pipelining

By default, every iteration of a loop only starts when the previous iteration has finished. In the loop example below, a single iteration of the loop adds two variables and stores the result in a third variable. Assume that in hardware this loop takes three cycles to finish one iteration. Also assume that the loop variable len is 20; that is, the vadd loop runs for 20 iterations in the hardware. Therefore, it requires a total of 60 clock cycles (20 iterations * 3 cycles) to complete all the operations of this loop.

vadd: for (int i = 0; i < len; i++) {
  c[i] = a[i] + b[i];
}

TIP: It is good practice to always label a loop as shown in the above code example (vadd:...). This practice helps with debugging when working in the environment. Note that the labels generate warnings during compilation, which can be safely ignored.

Pipelining the loop executes subsequent iterations in a pipelined manner. This means that subsequent iterations of the loop overlap and run concurrently, executing at different sections of the loop body. Pipelining a loop is enabled by the pragma HLS PIPELINE. Note that the pragma is placed inside the body of the loop.

vadd: for (int i = 0; i < len; i++) {
  #pragma HLS PIPELINE
  c[i] = a[i] + b[i];
}

In the example above, it is assumed that every iteration of the loop takes three cycles: read, add, and write. Without pipelining, each successive iteration of the loop starts every third cycle. With pipelining, the loop can start subsequent iterations in fewer than three cycles, such as every second cycle, or every cycle.

The number of cycles it takes to start the next iteration of a loop is called the initiation interval (II) of the pipelined loop. So II=2 means each successive iteration of the loop starts every two cycles. II=1 is the ideal case, where each iteration of the loop starts in the very next cycle. When you use the pragma HLS PIPELINE, the compiler always tries to achieve II=1 performance.
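A common first-order latency model follows from this definition: the last iteration starts after (trips - 1) * II cycles and then takes the full loop-body depth to complete. The helper below is a sketch of that formula (the name loop_latency is ours, not a tool API), and it reproduces the cycle counts used in this section, such as the 60 cycles for the 20-iteration, 3-cycle, non-pipelined vadd loop:

```cpp
// First-order latency model for a pipelined loop:
//   trips = number of loop iterations
//   depth = cycles for one iteration of the loop body
//   ii    = initiation interval (cycles between iteration starts)
// The final iteration starts at cycle (trips - 1) * ii and finishes
// depth cycles later.
int loop_latency(int trips, int depth, int ii) {
    return (trips - 1) * ii + depth;
}
```

With 20 trips, depth 3, and II=3 (no pipelining) the model gives 60 cycles; dropping the II to 1 gives 22 cycles for the same loop.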

The following figure illustrates the difference in execution between pipelined and non-pipelined loops. In this figure, (A) shows the default sequential operation where there are three clock cycles between each input read (II=3), and it requires eight clock cycles before the last output write is performed.


Figure 5: Loop Pipelining

[Figure: a three-iteration loop whose body performs op_Read, op_Compute, and op_Write. (A) Without loop pipelining, each iteration takes 3 cycles and 8 cycles elapse before the last output write. (B) With loop pipelining, a new iteration starts every cycle and the loop completes in 4 cycles.]

In the pipelined version of the loop shown in (B), a new input sample is read every cycle (II=1) and the final output is written after only four clock cycles, substantially improving both the II and the latency while using the same hardware resources.

IMPORTANT!: Pipelining a loop causes any loops nested inside the pipelined loop to be unrolled.

If there are data dependencies inside a loop, it may not be possible to achieve II=1, and a larger initiation interval may be the result. Loop dependencies are discussed in Loop Dependencies.

Loop Unrolling

The compiler can also unroll a loop, either partially or completely, to perform multiple loop iterations in parallel. This is done using the HLS UNROLL pragma. Unrolling a loop can lead to a very fast design with significant parallelism. However, because all the operations of the loop iterations are executed in parallel, a large amount of programmable logic resources is required to implement the hardware. As a result, the compiler can face challenges dealing with such a large number of resources, and can hit capacity problems that slow down the compilation process. It is a good guideline to unroll loops that have a small loop body or a small number of iterations.

vadd: for (int i = 0; i < 20; i++) {
  #pragma HLS UNROLL
  c[i] = a[i] + b[i];
}


In the preceding example, you can see that the pragma HLS UNROLL has been inserted into the body of the loop to instruct the compiler to unroll the loop completely. In other words, all 20 iterations of the loop are executed in parallel, if permitted by any data dependency.

Completely unrolling a loop can consume significant device resources, while partially unrolling the loop provides some performance improvement without causing a significant impact on hardware resources.

Partially Unrolled Loops

To completely unroll a loop, the loop must have a constant bound (20 in the example above). However, partial unrolling is possible for loops with a variable bound. A partially unrolled loop means that only a certain number of loop iterations can be executed in parallel.

The following code example illustrates how a partially unrolled loop works:

array_sum: for (int i = 0; i < 4; i++) {
  #pragma HLS UNROLL factor=2
  sum += arr[i];
}

In the above example, the UNROLL pragma is given a factor of 2. This is the equivalent of manually duplicating the loop body and running the two copies concurrently for half as many iterations. The code below shows how this would be written. This transformation allows two iterations of the above loop to execute in parallel.

array_sum_unrolled: for (int i = 0; i < 4; i += 2) {
  // Manual unroll by a factor of 2
  sum += arr[i];
  sum += arr[i+1];
}

Just as data dependencies inside a loop impact the initiation interval of a pipelined loop, an unrolled loop performs operations in parallel only if data dependencies allow it. If operations in one iteration of the loop require the result from a previous iteration, they cannot execute in parallel, but will execute as soon as the data from one iteration is available to the next.
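The rolled and manually unrolled forms above must accumulate the same sum. The sketch below makes that equivalence checkable in software; the function names are illustrative, and std::vector<int> stands in for the arr array (assumed to have an even length for the factor-2 version):

```cpp
#include <vector>

// Original rolled loop: one addition per iteration.
int sum_rolled(const std::vector<int>& arr) {
    int sum = 0;
    for (int i = 0; i < static_cast<int>(arr.size()); i++)
        sum += arr[i];
    return sum;
}

// Factor-2 manual unroll: two additions per iteration, half the trips.
// Requires an even number of elements.
int sum_unrolled(const std::vector<int>& arr) {
    int sum = 0;
    for (int i = 0; i < static_cast<int>(arr.size()); i += 2) {
        sum += arr[i];
        sum += arr[i + 1];
    }
    return sum;
}
```

For {5, 6, 7, 8}, both forms return 26; in hardware, the unrolled form needs two adders instead of one but halves the trip count.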

RECOMMENDED: A good methodology is to PIPELINE loops first, and then UNROLL loops with small loop bodies and limited iterations to improve performance further.


Loop Dependencies

Data dependencies in loops can impact the results of loop pipelining or unrolling. These loop dependencies can be within a single iteration of a loop, or between different iterations of a loop. The easiest way to understand loop dependencies is to examine an extreme example. In the following code example, the result of the loop is used as the loop continuation or exit condition. Each iteration of the loop must finish before the next can start.

Minim_Loop: while (a != b) {
  if (a > b)
    a -= b;
  else
    b -= a;
}

This loop cannot be pipelined. The next iteration of the loop cannot begin until the previous iteration ends.

Dealing with various types of dependencies with the compiler is an extensive topic requiring a detailed understanding of the high-level synthesis procedures underlying the compiler. Refer to the Vivado Design Suite User Guide: High-Level Synthesis (UG902) for more information on "Dependencies with Vivado HLS."

Nested Loops

Coding with nested loops is a common practice. Understanding how loops are pipelined in a nested loop structure is key to achieving the desired performance.

If the pragma HLS PIPELINE is applied to a loop nested inside another loop, the compiler attempts to flatten the loops to create a single loop, and applies the PIPELINE pragma to the constructed loop. Loop flattening helps improve the performance of the accelerator.

The compiler is able to flatten the following types of nested loops:

1. Perfect nested loop:

• Only the inner loop has a loop body.

• There is no logic or operations specified between the loop declarations.

• All the loop bounds are constant.

2. Semi-perfect nested loop:

• Only the inner loop has a loop body.

• There is no logic or operations specified between the loop declarations.

• The inner loop bound must be constant, but the outer loop bound can be variable.


The following code example illustrates the structure of a perfect nested loop:

ROW_LOOP: for (int i = 0; i < MAX_HEIGHT; i++) {
  COL_LOOP: for (int j = 0; j < MAX_WIDTH; j++) {
    #pragma HLS PIPELINE
    // Main computation per pixel
  }
}

The above example shows a nested loop structure with two loops that perform some computation on incoming pixel data. In most cases, you will want to process a pixel in every cycle, hence PIPELINE is applied to the nested loop body structure. The compiler is able to flatten the nested loop structure in the example because it is a perfect nested loop.

The nested loop in the preceding example contains no logic between the two loop declarations. No logic is placed between ROW_LOOP and COL_LOOP; all of the processing logic is inside COL_LOOP. Also, both loops have a fixed number of iterations. These two criteria help the compiler flatten the loops and apply the PIPELINE constraint.

RECOMMENDED: If the outer loop has a variable boundary, the compiler can still flatten the loop. You should always try to have a constant boundary for the inner loop.

Sequential Loops

If there are multiple loops in the design, by default they do not overlap, and execute sequentially. This section introduces the concept of dataflow optimization for sequential loops. Consider the following code example:

void adder(unsigned int *in, unsigned int *out, int inc, int size) {
  unsigned int in_internal[MAX_SIZE];
  unsigned int out_internal[MAX_SIZE];

  mem_rd: for (int i = 0; i < size; i++) {
    #pragma HLS PIPELINE
    // Read from the input vector "in" and save to an internal variable
    in_internal[i] = in[i];
  }

  compute: for (int i = 0; i < size; i++) {
    #pragma HLS PIPELINE
    out_internal[i] = in_internal[i] + inc;
  }

  mem_wr: for (int i = 0; i < size; i++) {
    #pragma HLS PIPELINE
    out[i] = out_internal[i];
  }
}

In the above example, three sequential loops are shown: mem_rd, compute, and mem_wr.

Chapter 3: Programming Hardware Functions

UG1278 (v2018.2) July 2, 2018 www.xilinx.com [placeholder text]SDx Programmers Guide 40Send Feedback

Page 41: SDSoC Programmers Guide - china.xilinx.com · Appendix C: Additional Resources and Legal Notices ... • Pipelining: A technique to increase instruction-level parallelism in the hardware

• The mem_rd loop reads input vector data from the memory interface and stores it in internal storage.

• The main compute loop reads from the internal storage, performs an increment operation, and saves the result to another internal storage.

• The mem_wr loop writes the data back to memory from the internal storage.

By default, these loops execute sequentially without any overlap. First, the mem_rd loop finishes reading all the input data before the compute loop starts its operation. Similarly, the compute loop finishes processing the data before the mem_wr loop starts to write the data. However, the execution of these loops can be overlapped, allowing the compute (or mem_wr) loop to start as soon as there is enough data available to feed its operation, before the mem_rd (or compute) loop has finished processing its data.

The loop execution can be overlapped using dataflow optimization as described in the Dataflow Optimization section.

Dataflow Optimization

Dataflow optimization is a powerful technique for improving performance by enabling task-level pipelining and parallelism. It allows the compiler to schedule multiple functions to run concurrently to achieve higher throughput and lower latency. This is also known as task-level parallelism.

The following figure shows a conceptual view of dataflow pipelining. The default behavior is to execute and complete func_A, then func_B, and finally func_C. With the HLS DATAFLOW pragma enabled, the compiler can schedule each function to execute as soon as data is available. In this example, the original top function has a latency and interval of eight clock cycles. With DATAFLOW optimization, the interval is reduced to only three clock cycles.


Figure 6: Dataflow Optimization

[Figure: void top(a,b,c,d) calls func_A(a,b,i1), func_B(c,i1,i2), and func_C(i2,d). (A) Without dataflow pipelining, the three functions run back-to-back with a latency and interval of 8 cycles. (B) With dataflow pipelining, the functions overlap, reducing the latency to 5 cycles and the interval to 3 cycles.]

Dataflow Coding Example

In the following dataflow coding example you should notice the following:

1. The HLS DATAFLOW pragma is applied to instruct the compiler to enable dataflow optimization.

2. The stream class is used as a data transfer channel between each of the functions in the dataflow region.

TIP: The stream class infers a first-in first-out (FIFO) memory circuit in the programmable logic. This memory circuit, which acts as a queue in software programming, provides data-level synchronization between the functions and achieves better performance. For additional details on the hls::stream class, refer to the Vivado Design Suite User Guide: High-Level Synthesis (UG902).

void compute_kernel(ap_int<256> *inx, ap_int<256> *outx, DTYPE alpha) {
  hls::stream<unsigned int> inFifo;
  #pragma HLS STREAM variable=inFifo depth=32
  hls::stream<unsigned int> outFifo;
  #pragma HLS STREAM variable=outFifo depth=32

  #pragma HLS DATAFLOW
  read_data(inx, inFifo);
  // Do computation with the acquired data
  compute(inFifo, outFifo, alpha);
  write_data(outx, outFifo);
  return;
}
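Outside the HLS tool flow, the structure of this dataflow region can be modeled with a standard FIFO. The sketch below uses std::queue in place of hls::stream and folds the three stages into one hypothetical driver that returns its output as a vector; it models only the producer-consumer ordering through the FIFOs, not the cycle-level concurrency the hardware achieves:

```cpp
#include <queue>
#include <vector>

// Software stand-in for the stream-based pipeline above: stage 1 fills a
// FIFO, stage 2 consumes it and fills a second FIFO, stage 3 drains it.
std::vector<int> compute_kernel(const std::vector<int>& in, int alpha) {
    std::queue<int> inFifo, outFifo;

    // read_data: push every input element into the first FIFO
    for (int v : in) inFifo.push(v);

    // compute: scale each value, forwarding results through the second FIFO
    while (!inFifo.empty()) {
        outFifo.push(inFifo.front() * alpha);
        inFifo.pop();
    }

    // write_data: drain the second FIFO into the output vector
    std::vector<int> out;
    while (!outFifo.empty()) {
        out.push_back(outFifo.front());
        outFifo.pop();
    }
    return out;
}
```

Because each stage only pops what the previous stage pushed, data flows strictly forward, which is the property the canonical dataflow forms below formalize.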


Canonical Forms of Dataflow Optimization

Xilinx recommends writing the code inside a dataflow region using canonical forms. There are canonical forms for dataflow optimization for both functions and loops.

• Functions: The canonical form coding guideline for dataflow inside a function specifies:

1. Use only the following types of variables inside the dataflow region:

a. Local non-static scalar/array/pointer variables.

b. Local static hls::stream variables.

2. Function calls transfer data only in the forward direction.

3. An array or hls::stream should have only one producer function and one consumer function.

4. The function arguments (variables coming from outside the dataflow region) should only be read, or written, not both. If performing both a read and a write on the same function argument, the read should happen before the write.

5. The local variables (those that are transferring data in the forward direction) should be written before being read.

The following code example illustrates the canonical form for dataflow within a function. Note that the first function (func1) reads the inputs and the last function (func3) writes the outputs. Also note that one function creates output values that are passed to the next function as input parameters.

void dataflow(Input0, Input1, Output0, Output1) {
   UserDataType C0, C1, C2;
#pragma HLS DATAFLOW
   func1(read Input0, read Input1, write C0, write C1);
   func2(read C0, read C1, write C2);
   func3(read C2, write Output0, write Output1);
}

• Loop: The canonical form coding guideline for dataflow inside a loop body includes the coding guidelines for a function defined above, and also specifies the following:

1. The loop variable has an initial value of 0.

2. The loop condition is formed by a comparison of the loop variable with a numerical constant or variable that does not vary inside the loop body.

3. The loop variable is incremented by 1.


The following code example illustrates the canonical form for dataflow within a loop.

void dataflow(Input0, Input1, Output0, Output1) {
   UserDataType C0, C1, C2;
   for (int i = 0; i < N; ++i) {
#pragma HLS DATAFLOW
      func1(read Input0, read Input1, write C0, write C1);
      func2(read C0, read C1, write C2);
      func3(read C2, write Output0, write Output1);
   }
}

Troubleshooting Dataflow

The following behaviors can prevent the compiler from performing DATAFLOW optimizations:

1. Single producer-consumer violations.

2. Bypassing tasks.

3. Feedback between tasks.

4. Conditional execution of tasks.

5. Loops with multiple exit conditions or conditions defined within the loop.

If any of the above conditions occur inside the dataflow region, you may need to re-architect the code to successfully achieve dataflow optimization.
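As an illustration of the first item, a single producer-consumer violation can often be removed by adding a stage that explicitly duplicates the shared data, so that every array or stream in the region keeps exactly one producer and one consumer. The sketch below is illustrative only; the stage names (produce, duplicate, scale, offset) are hypothetical and not part of any Xilinx API.

```cpp
#define N 8

// Hypothetical stages, for illustration only. Without 'duplicate', both
// scale() and offset() would read 'raw' directly -- two consumers of one
// array, which violates the single producer-consumer rule and blocks
// the DATAFLOW optimization.
static void produce(int out[N]) {
    for (int i = 0; i < N; ++i) out[i] = i;
}

static void duplicate(const int in[N], int out_a[N], int out_b[N]) {
    for (int i = 0; i < N; ++i) { out_a[i] = in[i]; out_b[i] = in[i]; }
}

static void scale(const int in[N], int out[N]) {
    for (int i = 0; i < N; ++i) out[i] = 2 * in[i];
}

static void offset(const int in[N], int out[N]) {
    for (int i = 0; i < N; ++i) out[i] = in[i] + 1;
}

void dataflow_region(int scaled[N], int offsets[N]) {
#pragma HLS DATAFLOW
    int raw[N], copy_a[N], copy_b[N];
    produce(raw);                    // sole producer of raw
    duplicate(raw, copy_a, copy_b);  // sole consumer of raw
    scale(copy_a, scaled);           // sole consumer of copy_a
    offset(copy_b, offsets);         // sole consumer of copy_b
}
```

Run through the HLS compiler, each stage can then execute as a concurrent task; compiled with a plain C++ compiler, the pragma is simply ignored.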


Chapter 4

Using External I/O

Hardware accelerators generated in the SDSoC™ environment can communicate with system inputs and outputs either directly through hardware connections, or through memory buffers (e.g., a frame buffer). Examples of system I/O include analog-to-digital and digital-to-analog converters; image, radar, LiDAR, and ultrasonic sensors; and HDMI™ multimedia streams.

A platform exports stream connections in hardware that can be accessed in software by calling platform library functions, as described in Accessing External I/O using Memory Buffers and Accessing External I/O using Direct Hardware Connections. Direct hardware connections are implemented over AXI4-Stream channels, and connections to memory buffers are realized through function calls implemented by the standard data movers supported in the SDSoC environment.

For information and examples that show how to create SDSoC platforms, refer to the SDSoC Environment Platform Development Guide (UG1146).

Accessing External I/O using Memory Buffers

This section uses the motion-detect ZC702 + HDMI IO FMC or the ZC706 + HDMI IO FMC platform found under the "Boards, Kits, & Modules" page of the SDSoC Development Environment. The preconfigured SDSoC™ platform is responsible for the HDMI data transfer to external memory. The application must call the platform interfaces to process the data from the frame buffer in DDR memory. The following figure shows an example of how the design is configured.


Figure 7: Motion Detect Design Configuration

[Figure: Block diagram of the motion-detect design. Platform IPs in the programmable logic bring HDMI video into a frame buffer in DDR memory through the processing system; the SDSoC functions (two Sobel Filters, Diff, Median Filter, and Combiner) process the frames, and the platform IPs drive the HDMI output.]

The SDSoC environment accesses the external frame buffer through an accelerator interface to the platform. The zc702_hdmi platform provides a software interface to access the video frame buffer through the Video4Linux2 (V4L2) API. The V4L2 framework provides an API for accessing a collection of device drivers supporting real-time video capture in Linux. This API is the application development platform I/O entry point.

The motion_demo_processing function in the following code snippet from m2m_sw_pipeline.c demonstrates the function call interface.

extern void motion_demo_processing(unsigned short int *prev_buffer,
                                   unsigned short int *in_buffer,
                                   unsigned short int *out_buffer,
                                   int fps_enable,
                                   int height, int width, int stride);
...
unsigned short *out_ptr = v_pipe->drm.d_buff[buf_next->index].drm_buff;
unsigned short *in_ptr1 = buf_prev->v4l2_buff;
unsigned short *in_ptr2 = buf_next->v4l2_buff;
v_pipe->events[PROCESS_IN].counter_val++;

motion_demo_processing(in_ptr1, in_ptr2, out_ptr,
                       v_pipe->fps_enable,
                       (int)m2m_sw_stream_handle.video_in.format.height,
                       (int)m2m_sw_stream_handle.video_in.format.width,
                       (int)m2m_sw_stream_handle.video_in.format.bytesperline/2);


The application accesses this API in motion_detect.c, where motion_demo_processing is defined and calls the img_process function.

void motion_demo_processing(unsigned short int *prev_buffer,
                            unsigned short int *in_buffer,
                            unsigned short int *out_buffer,
                            int fps_enable,
                            int height, int width, int stride)
{
   int param0 = 0, param1 = 1, param2 = 2;
   img_process(prev_buffer, in_buffer, out_buffer, height, width, stride);
}

Finally, img_process calls the various filters and transforms to process the data.

void img_process(unsigned short int *frame_prev,
                 unsigned short int *frame_curr,
                 unsigned short int *frame_out,
                 int param0, int param1, int param2)
{
   ...
}

By using a platform API to access the frame buffers, the application developer does not program at the driver level to process the video frames.

You can find the platform used for the code snippets on the SDSoC Downloads Page with the name ZC702 + HDMI IO FMC or ZC706 + HDMI IO FMC. To access the project in the SDSoC environment, use the following steps:

1. Download and extract the platform to your system.

2. Open SDx™ and create a new application project.

3. From the Platform dialog box, select Add Custom Platform.

4. From the Specify Custom Platform Location dialog box, select the location of the downloaded platform, and click OK.

5. From the Platform dialog box, select the custom platform named zc702_trd or zc706_trd, and click Next.

6. From the System Configuration dialog box, leave the default settings and click Next.

7. From the Templates dialog box, select the Dense Optical Flow (1PPC) template, and click Finish.


Accessing External I/O using Direct Hardware Connections

Whereas the example in Accessing External I/O using Memory Buffers demonstrated how applications can access system I/O through memory buffers, a platform can also provide direct hardware connectivity to hardware accelerators within an application generated in the SDSoC™ environment. The following figure shows how a function s2mm_data_copy communicates with the platform using an AXI4-Stream channel, and writes to DDR memory using the zero_copy data mover (implemented as an AXI4 master interface). This design template is called the aximm design example in the samples/platforms/zc702_axis_io platform.

Figure 8: AXI4 Data Mover Design

[Figure: Block diagram of the AXI4 data mover design. An input platform IP in the programmable logic feeds the s2mm_data_copy SDSoC function over an AXI4-Stream connection (pf_read_stream); the function writes the data to DDR memory through the processing system.]

In this example, the zc702_axis_io platform proxies actual I/O by providing a free-running binary counter (labeled Platform IP in the diagram) running at 50 MHz, connected to an AXI4-Stream Data FIFO IP block that exports an AXI4-Stream master interface to the platform, clocked at the data motion clock (which might differ from the 50 MHz input clock).


The direct I/O interface is specified using the sys_port pragma for a stream port. In the code snippet below, the direct connection is specified for stream_fifo_M_AXIS.

#pragma SDS data sys_port(fifo:stream_fifo_M_AXIS)
#pragma SDS data zero_copy(buf)
int s2mm_data_copy(unsigned *fifo, unsigned buf[BUF_SIZE]) {
#pragma HLS interface axis port=fifo
   for (int i = 0; i < BUF_SIZE; i++) {
#pragma HLS pipeline
      buf[i] = *fifo;
   }
   return 0;
}

In the main application code below, the rbuf0 variable maps the fifo input stream to the s2mm_data_copy function, so the sds++ compiler creates a direct hardware connection over an AXI4-Stream channel. Because the s2mm_data_copy function transfers buf using the zero_copy data mover, the buffer must be allocated in physically contiguous memory using sds_alloc, and released using sds_free, as shown below.

int main() {
   unsigned *bufs[NUM_BUFFERS];
   bool error = false;
   unsigned *rbuf0;

   for (int i = 0; i < NUM_BUFFERS; i++) {
      bufs[i] = (unsigned*) sds_alloc(BUF_SIZE * sizeof(unsigned));
   }

   // Flush the platform FIFO of start-up garbage
   s2mm_data_copy(rbuf0, bufs[0]);
   s2mm_data_copy(rbuf0, bufs[0]);
   s2mm_data_copy(rbuf0, bufs[0]);

   for (int i = 0; i < NUM_BUFFERS; i++) {
      s2mm_data_copy(rbuf0, bufs[i]);
   }

   error = check(bufs);

   printf("TEST %s\n\r", (error ? "FAILED" : "PASSED"));

   for (int i = 0; i < NUM_BUFFERS; i++) {
      sds_free(bufs[i]);
   }
   return 0;
}

Information on creating a platform using AXI4-Stream to write to memory directly can be found in the "SDSoC Platform Examples" appendix of the SDSoC Environment Platform Development Guide (UG1146).


Appendix A

Exporting a Library for GCC

This chapter demonstrates how to use the sds++ compiler to build a library of software functions with entry points into hardware functions implemented in programmable logic. This library can be linked into applications using the standard GCC linker for Zynq®-7000 SoC and Zynq UltraScale+ MPSoC devices.

In addition to the library, sds++ generates a complete boot image for the hardware functions and data motion network, including an FPGA bitstream. Using this library, you can develop software applications that call the hardware functions (and fixed hardware) using standard GCC toolchains. You are still targeting the same hardware system and using the sds++-generated boot environment, but you are free to develop your software application using the GNU toolchain in the software development environment of your choice.

IMPORTANT!: In the current SDSoC™ release, libraries are not thread-safe, so they must be called into from a single thread within an application, which could consist of many threads and processes.
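One way to respect this constraint in a multithreaded application is to funnel every call into the generated library through a single lock. This is a sketch under stated assumptions: hw_entry_point is a hypothetical stand-in for a hardware entry point exported by the library, not a real sds++ symbol.

```cpp
#include <mutex>

// Stand-in for a non-thread-safe hardware entry point exported by the
// sds++-generated library (hypothetical; e.g., a matrix-multiply call).
static int hw_entry_point(int a, int b) { return a * b; }

// One mutex serializes all calls into the library, so only a single
// thread is ever inside it at a time.
static std::mutex hw_lock;

int safe_hw_call(int a, int b) {
    std::lock_guard<std::mutex> guard(hw_lock);
    return hw_entry_point(a, b);
}
```

Worker threads then call safe_hw_call instead of the library function directly, which preserves the single-caller requirement without restructuring the rest of the application.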

Building a Shared Library

TIP: Shared libraries can only be created for application projects targeting Linux OS.

To build a shared library, sds++ requires at least one accelerator. The following example provides three entry points into two hardware accelerators: a matrix multiplier and a matrix adder. You can find these files in the samples/libmatrix/build directory.

• mmult_accel.cpp – Accelerator code for the matrix multiplier

• mmult_accel.h – Header file for the matrix multiplier

• madd_accel.cpp – Accelerator code for the matrix adder

• madd_accel.h – Header file for the matrix adder

• matrix.cpp – Code that calls the accelerators and determines the data motion network

• matrix.h – Header file for the library


The matrix.cpp file contains functions that define the accelerator interfaces as well as how the hardware functions communicate with the platform (i.e., the data motion networks between platform and accelerators). The function madd calls a single matrix adder accelerator, and the function mmult calls a single matrix multiplier accelerator. Another function, mmultadd, is implemented using two hardware functions, with the output of the matrix multiplier connected directly to the input of the matrix adder.

/* matrix.cpp */
#include "madd_accel.h"
#include "mmult_accel.h"

void madd(float in_A[MSIZE*MSIZE], float in_B[MSIZE*MSIZE],
          float out_C[MSIZE*MSIZE])
{
   madd_accel(in_A, in_B, out_C);
}

void mmult(float in_A[MSIZE*MSIZE], float in_B[MSIZE*MSIZE],
           float out_C[MSIZE*MSIZE])
{
   mmult_accel(in_A, in_B, out_C);
}

void mmultadd(float in_A[MSIZE*MSIZE], float in_B[MSIZE*MSIZE],
              float in_C[MSIZE*MSIZE], float out_D[MSIZE*MSIZE])
{
   float tmp[MSIZE * MSIZE];
   mmult_accel(in_A, in_B, tmp);
   madd_accel(tmp, in_C, out_D);
}

The matrix.h file defines the function interfaces to the shared library, and will be included in the application source code.

/* matrix.h */
#ifndef MATRIX_H_
#define MATRIX_H_

#define MSIZE 16

void madd(float in_A[MSIZE*MSIZE], float in_B[MSIZE*MSIZE],
          float out_C[MSIZE*MSIZE]);

void mmult(float in_A[MSIZE*MSIZE], float in_B[MSIZE*MSIZE],
           float out_C[MSIZE*MSIZE]);

void mmultadd(float in_A[MSIZE*MSIZE], float in_B[MSIZE*MSIZE],
              float in_C[MSIZE*MSIZE], float out_D[MSIZE*MSIZE]);

#endif /* MATRIX_H_ */


In the samples/libmatrix/build/shared folder, the Makefile shows how the project is built by specifying that the functions mmult_accel and madd_accel must be implemented in programmable logic.

SDSFLAGS = \
    -sds-pf ${PLATFORM} \
    -sds-hw mmult_accel mmult_accel.cpp -sds-end \
    -sds-hw madd_accel madd_accel.cpp -sds-end

Object files are generated with position-independent code (PIC), like standard shared libraries, using the GCC standard -fpic option:

sds++ ${SDSFLAGS} -c -fpic -o mmult_accel.o mmult_accel.cpp
sds++ ${SDSFLAGS} -c -fpic -o madd_accel.o madd_accel.cpp
sds++ ${SDSFLAGS} -c -fpic -o matrix.o matrix.cpp

Link the object files using the GCC standard -shared switch:

sds++ ${SDSFLAGS} -shared -o libmatrix.so mmult_accel.o madd_accel.o matrix.o

After building the project, the following files will be generated:

• libmatrix.so – Shared library suitable for linking using GCC and for runtime use

• sd_card – Directory containing an SD card image for booting the board

Delivering a Library

The following structure allows compiling and linking into applications using GCC in standard ways.

<path_to_library>/include/matrix.h
<path_to_library>/lib/libmatrix.so
<path_to_library>/sd_card

IMPORTANT!: The sd_card folder is to be copied onto an SD card and used to boot the board. This image includes a copy of the libmatrix.so file that is used at runtime.


Compiling and Linking Against a Library

The following is an example of using the library with a GCC compiler. The library is used by including the header file matrix.h and then calling the necessary library functions.

/* main.cpp (pseudocode) */
#include "matrix.h"

int main(int argc, char* argv[])
{
   float *A, *B, *C, *D;
   float *J, *K, *L;
   float *X, *Y, *Z;
   ...
   mmultadd(A, B, C, D);
   ...
   mmult(J, K, L);
   ...
   madd(X, Y, Z);
   ...
}

To compile against a library, the compiler needs the header file. The path to the header file is specified using the -I switch. You can find example files in the samples/libmatrix/use directory.

TIP: For explanation purposes, the code above is only pseudocode and not the same as the main.cpp file in the directory. The file has more code that allows full compilation and execution.

gcc -c -I <path_to_library>/include -o main.o main.c

To link against the library, the linker needs the library. The path to the library is specified using the -L switch. Also, ask the linker to link against the library using the -l switch.

gcc -L <path_to_library>/lib -o main.elf main.o -lmatrix

For detailed information on using the GCC compiler and linker switches, refer to the GCC documentation.

Using a Library at Runtime

At runtime, the loader looks for the shared library when loading the executable. After booting the board into a Linux prompt and before executing the ELF file, add the path to the library to the LD_LIBRARY_PATH environment variable. The sd_card created when building the library already contains the library, so the path to the mount point for the sd_card must be specified.


For example, if the sd_card is mounted at /mnt, use this command:

export LD_LIBRARY_PATH=/mnt

Exporting a Shared Library

The following steps demonstrate how to export an SDSoC environment shared library with the corresponding SD card boot image using the SDSoC environment GUI.

1. Select File → New → SDSoC Project to bring up the New Project dialog box.

2. Create a new SDSoC project.

a. Type libmatrix in the Project name field.

b. Select Platform to be zc702.

c. Check the Shared Library checkbox.

d. Click Next.

3. Choose the application template.

a. Select Matrix Shared Library from the Available Templates.

b. Click Finish.

A new SDSoC shared library application project called libmatrix is created in the Project Explorer view. The project includes two hardware functions, mmult_accel and madd_accel, that are visible in the SDSoC Project Overview.

4. Build the library.

a. In the Project Explorer view, select the libmatrix project.

b. Select Project → Build Project.

After the build completes, there will be a boot SD card image under the Debug (or current configuration) folder.


Appendix B

SDSoC Environment API

This section describes functions in sds_lib available for applications developed in the SDSoC environment.

Note: To use the library, #include "sds_lib.h" in source files. You must include stdlib.h before including sds_lib.h to provide the size_t type declaration.

The SDSoC™ environment API provides functions to map memory spaces, and to wait for asynchronous accelerator calls to complete.

• void sds_wait(unsigned int id): Wait for the first accelerator in the queue identified by id to complete. The recommended alternative is to use #pragma SDS wait(id), as described in pragma SDS async in the SDx Pragma Reference Guide (UG1253).

• void *sds_alloc(size_t size): Allocate a physically contiguous array of size bytes.

• void *sds_alloc_non_cacheable(size_t size): Allocate a physically contiguous array of size bytes that is marked as non-cacheable. Memory allocated by this function is not cached in the processing system. Pointers to this memory should be passed to a hardware function in conjunction with:

#pragma SDS data mem_attribute (p:NON_CACHEABLE)

• void sds_free(void *memptr): Free an array allocated through sds_alloc().

• void *sds_mmap(void *physical_addr, size_t size, void *virtual_addr): Create a virtual address mapping to access a memory region of size bytes located at physical address physical_addr.

• physical_addr: physical address to be mapped.

• size: size of physical address to be mapped.

• virtual_addr:

○ If not null, it is considered to be the virtual address already mapped to physical_addr, and sds_mmap keeps track of the mapping.

○ If null, sds_mmap invokes mmap() to generate the virtual address, and virtual_addr is assigned this value.


• void *sds_munmap(void *virtual_addr): Unmaps a virtual address associated with a physical address created using sds_mmap().

• unsigned long long sds_clock_counter(void): Returns the value associated with a free-running counter used for fine-grain time interval measurements.

• unsigned long long sds_clock_frequency(void): Returns the frequency (in ticks/second) associated with the free-running counter that is read by calls to sds_clock_counter. This is used to translate counter ticks to seconds.
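The last two calls are typically combined to time an accelerator invocation. The sketch below shows the pattern; the std::chrono stand-ins exist only so the example is self-contained off-target, and on the board you would delete them and link against sds_lib to use the real sds_clock_counter() and sds_clock_frequency().

```cpp
#include <chrono>

// Off-target stand-ins for the sds_lib calls, for illustration only.
// On the board, remove these and link against sds_lib instead.
static unsigned long long sds_clock_counter(void) {
    using namespace std::chrono;
    return (unsigned long long)duration_cast<nanoseconds>(
        steady_clock::now().time_since_epoch()).count();
}
static unsigned long long sds_clock_frequency(void) {
    return 1000000000ULL; // the stand-in counter ticks in nanoseconds
}

// Translate counter ticks into seconds, per the API descriptions above.
double time_call(void (*fn)(void)) {
    unsigned long long start = sds_clock_counter();
    fn();                                        // code under measurement
    unsigned long long stop = sds_clock_counter();
    return (double)(stop - start) / (double)sds_clock_frequency();
}
```

Calling time_call(my_hw_func) returns the elapsed time of the wrapped call in seconds.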


Appendix C

Additional Resources and Legal Notices

Xilinx Resources

For support resources such as Answers, Documentation, Downloads, and Forums, see Xilinx Support.

Documentation Navigator and Design Hubs

Xilinx® Documentation Navigator provides access to Xilinx documents, videos, and support resources, which you can filter and search to find information. The Xilinx Documentation Navigator (DocNav) is installed with SDSoC™ and SDAccel™. To open it:

• On Windows, select Start → All Programs → Xilinx Design Tools → DocNav.

• At the Linux command prompt, enter docnav.

Xilinx Design Hubs provide links to documentation organized by design tasks and other topics, which you can use to learn key concepts and address frequently asked questions. To access the Design Hubs:

• In the Xilinx Documentation Navigator, click the Design Hubs View tab.

• On the Xilinx website, see the Design Hubs page.

Note: For more information on Documentation Navigator, see the Documentation Navigator page on the Xilinx website.


References

These documents provide supplemental material useful with this guide:

1. SDSoC Environments Release Notes, Installation, and Licensing Guide (UG1294)

2. SDSoC Environment User Guide (UG1027)

3. SDSoC Environment Getting Started Tutorial (UG1028)

4. SDSoC Environment Platform Creation Tutorial (UG1236)

5. SDSoC Environment Platform Development Guide (UG1146)

6. SDSoC Environment Profiling and Optimization Guide (UG1235)

7. SDx Command and Utility Reference Guide (UG1279)

8. SDSoC Environment Programmers Guide (UG1278)

9. SDSoC Environment Debugging Guide (UG1282)

10. SDx Pragma Reference Guide (UG1253)

11. UltraFast Embedded Design Methodology Guide (UG1046)

12. Zynq-7000 SoC Software Developers Guide (UG821)

13. Zynq UltraScale+ MPSoC: Software Developers Guide (UG1137)

14. ZC702 Evaluation Board for the Zynq-7000 XC7Z020 SoC User Guide (UG850)

15. ZCU102 Evaluation Board User Guide (UG1182)

16. Vivado Design Suite User Guide: High-Level Synthesis (UG902)

17. Vivado Design Suite: Creating and Packaging Custom IP (UG1118)

18. SDSoC Development Environment web page

19. Vivado® Design Suite Documentation

Training Resources

1. SDSoC Development Environment and Methodology

2. Advanced SDSoC Development Environment and Methodology


Please Read: Important Legal Notices

The information disclosed to you hereunder (the "Materials") is provided solely for the selection and use of Xilinx products. To the maximum extent permitted by applicable law: (1) Materials are made available "AS IS" and with all faults, Xilinx hereby DISCLAIMS ALL WARRANTIES AND CONDITIONS, EXPRESS, IMPLIED, OR STATUTORY, INCLUDING BUT NOT LIMITED TO WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT, OR FITNESS FOR ANY PARTICULAR PURPOSE; and (2) Xilinx shall not be liable (whether in contract or tort, including negligence, or under any other theory of liability) for any loss or damage of any kind or nature related to, arising under, or in connection with, the Materials (including your use of the Materials), including for any direct, indirect, special, incidental, or consequential loss or damage (including loss of data, profits, goodwill, or any type of loss or damage suffered as a result of any action brought by a third party) even if such damage or loss was reasonably foreseeable or Xilinx had been advised of the possibility of the same. Xilinx assumes no obligation to correct any errors contained in the Materials or to notify you of updates to the Materials or to product specifications. You may not reproduce, modify, distribute, or publicly display the Materials without prior written consent. Certain products are subject to the terms and conditions of Xilinx's limited warranty, please refer to Xilinx's Terms of Sale which can be viewed at https://www.xilinx.com/legal.htm#tos; IP cores may be subject to warranty and support terms contained in a license issued to you by Xilinx. Xilinx products are not designed or intended to be fail-safe or for use in any application requiring fail-safe performance; you assume sole risk and liability for use of Xilinx products in such critical applications, please refer to Xilinx's Terms of Sale which can be viewed at https://www.xilinx.com/legal.htm#tos.

AUTOMOTIVE APPLICATIONS DISCLAIMER

AUTOMOTIVE PRODUCTS (IDENTIFIED AS "XA" IN THE PART NUMBER) ARE NOT WARRANTED FOR USE IN THE DEPLOYMENT OF AIRBAGS OR FOR USE IN APPLICATIONS THAT AFFECT CONTROL OF A VEHICLE ("SAFETY APPLICATION") UNLESS THERE IS A SAFETY CONCEPT OR REDUNDANCY FEATURE CONSISTENT WITH THE ISO 26262 AUTOMOTIVE SAFETY STANDARD ("SAFETY DESIGN"). CUSTOMER SHALL, PRIOR TO USING OR DISTRIBUTING ANY SYSTEMS THAT INCORPORATE PRODUCTS, THOROUGHLY TEST SUCH SYSTEMS FOR SAFETY PURPOSES. USE OF PRODUCTS IN A SAFETY APPLICATION WITHOUT A SAFETY DESIGN IS FULLY AT THE RISK OF CUSTOMER, SUBJECT ONLY TO APPLICABLE LAWS AND REGULATIONS GOVERNING LIMITATIONS ON PRODUCT LIABILITY.

Copyright

© Copyright 2018 Xilinx, Inc. Xilinx, the Xilinx logo, Artix, ISE, Kintex, Spartan, Virtex, Vivado, Zynq, and other designated brands included herein are trademarks of Xilinx in the United States and other countries. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos. All other trademarks are the property of their respective owners.
