High-Level Synthesis (HLS) and SDSoC Development Environments

Simplified Programming Experience for Software, Hardware and Systems Engineers

Vivado HLS

Comprehensive Integration with the SDSoC Environment

Algorithmic Specification: C, C++ or SystemC

Micro Architecture Exploration

RTL Implementation: VHDL or Verilog

Rapid RTL architecture exploration via Directives

Co-optimization with RTL synthesis for optimal QoR

Generates AXI4-based IP for Vivado IP Integrator

Over 1,000 customers

Leveraged by LogiCORE IP developers (mainly Video IPs)

Libraries:

Arbitrary Precision

Video, OpenCV

Linear algebra

DSP: FFT and FIR

Vivado HLS System IP Integration Flow

C-based IP Creation

Libraries: Arbitrary Precision, Linear algebra, IP (FFT and FIR)

IP Catalog

System Integration: Vivado IP Integrator

Vivado HLS Integrates into System Flows

C, C++, SystemC

VHDL or Verilog

System Generator for DSP

Vivado RTL

Vivado High-Level Synthesis Serves a Wide Range of Applications across Markets

Communications: LTE MIMO receiver, advanced wireless antenna positioning

Audio, Video, Broadcast: 3D cameras, video transport

Consumer: 3D television, eReaders

Aerospace and Defense: radar, sonar, signals intelligence

Industrial, Scientific, Medical: ultrasound systems, motor controllers

Automotive: infotainment, driver assistance

Computing & Storage: high performance computing, database acceleration

Test & Measurement: communications instruments, semiconductor ATE

Design Exploration with Directives

The most important HLS compiler directives are familiar to performance-oriented software programmers

Use hardware buffers to improve communication bandwidth between accelerator and external memory

– Use copy loops at the function boundary when multiple accesses are required, and to burst data into local buffers
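
A minimal sketch of the copy-loop pattern, written so it also runs on a host compiler. The function name `scale_burst`, the block size `N` and the multiply are assumptions made for illustration. The `#pragma HLS` line uses Vivado HLS syntax and is ignored (possibly with a warning) by ordinary compilers.

```cpp
#include <cassert>
#include <cstring>

const int N = 64;  // assumed block size, chosen only for illustration

// Hypothetical accelerator function: burst the input into a local buffer,
// compute on the local copy, then burst the result back out.
void scale_burst(const int *in, int *out, int factor) {
    assert(in != nullptr && out != nullptr);
    int local_in[N];
    int local_out[N];

    // Copy at the function boundary: one sequential pass over external memory.
    std::memcpy(local_in, in, sizeof(local_in));

    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        local_out[i] = local_in[i] * factor;  // every access hits local memory
    }

    // Single sequential write-back of the whole block.
    std::memcpy(out, local_out, sizeof(local_out));
}
```

The design point is that external memory is touched only in the two sequential copies, so repeated accesses during the compute loop stay in on-chip buffers.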

Microarchitecture Optimizations

Directives and Configurations: Description

PIPELINE: Reduces the initiation interval by allowing the concurrent execution of operations within a loop or function.

DATAFLOW: Enables functions and loops to execute concurrently. Avoid at the top-level hardware function.

INLINE: Inlines a function to remove function hierarchy, enable logic optimization across function boundaries and reduce function call overhead.

UNROLL: Unrolls for-loops to create multiple independent operations rather than a single collection of operations.

ARRAY_PARTITION: Partitions an array into smaller arrays or individual registers to increase concurrent access to data and remove block RAM bottlenecks.
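
A short sketch showing where four of these directives would typically be placed. The function names `mac` and `dot_product` and the length `TAPS` are assumptions for illustration. On a host compiler the pragmas are ignored, so the code only demonstrates placement and functional behaviour, not the resulting hardware.

```cpp
#include <cassert>

const int TAPS = 4;  // assumed vector length

// Hypothetical helper; INLINE asks the tool to dissolve this call hierarchy.
static int mac(int acc, int x, int y) {
#pragma HLS INLINE
    return acc + x * y;
}

int dot_product(const int a[TAPS], const int b[TAPS]) {
    assert(a != nullptr && b != nullptr);
    int la[TAPS];
    int lb[TAPS];
    // Split the local arrays into registers so all elements are readable at once.
#pragma HLS ARRAY_PARTITION variable=la complete dim=1
#pragma HLS ARRAY_PARTITION variable=lb complete dim=1

    // Pipelined copy loop: a new iteration is requested every cycle.
    for (int i = 0; i < TAPS; i++) {
#pragma HLS PIPELINE II=1
        la[i] = a[i];
        lb[i] = b[i];
    }

    // Unrolled compute loop: the body is replicated TAPS times.
    int acc = 0;
    for (int i = 0; i < TAPS; i++) {
#pragma HLS UNROLL
        acc = mac(acc, la[i], lb[i]);
    }
    return acc;
}
```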

Verification Productivity

Input: 10 frames of video data

RTL Simulation Time: ~2 days

C Simulation Time: 10 seconds

Acceleration: ~12,000X

Verified RTL

HDL-based Design

C-based Design

HLS Library

Floating point math

– Declare variables as float, double or half precision type

– Uses Xilinx floating point cores

Fixed point math (ap_fixed.h)

– Arbitrary precision fixed point with saturation and rounding options

Math functions (hls_math.h)

Video functions (hls_video.h)

– Memory line buffer, memory windows

– Video algorithm functions

Linear algebra library (hls_linear_algebra.h)
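
To make "saturation and rounding options" concrete, here is a host-only sketch of fixed point arithmetic in plain C++. It does not use ap_fixed.h. The Q8.8 format and the helper names `saturate16`, `fixed_add` and `fixed_mul` are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

const int FRAC_BITS = 8;  // assumed Q8.8 format: 8 integer bits, 8 fraction bits

// Clamp a wide intermediate result to the 16-bit range (saturation).
int16_t saturate16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return static_cast<int16_t>(v);
}

// Saturating add of two Q8.8 values.
int16_t fixed_add(int16_t a, int16_t b) {
    return saturate16(static_cast<int32_t>(a) + b);
}

// Multiply two Q8.8 values, round half up, then saturate.
int16_t fixed_mul(int16_t a, int16_t b) {
    int32_t wide = static_cast<int32_t>(a) * b;  // Q16.16 intermediate
    wide += 1 << (FRAC_BITS - 1);                // add half an LSB before truncating
    // Right shift of a negative value is arithmetic on mainstream compilers.
    return saturate16(wide >> FRAC_BITS);
}
```

With this encoding 1.5 is stored as 384 and 2.0 as 512, and their product comes out as 768, which is 3.0.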

Data Types and Bit Accuracy: Example

C/C++ standard types

– char (8-bit), short (16-bit), int (32-bit), long long (64-bit)

– May result in significantly larger area, slower speed

Arbitrary precision data types

– Use arbitrary precision data types for smaller and faster hardware

ap_uint<8> X;

ap_int<25> Y;
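
The real `ap_uint` and `ap_int` types come from the headers shipped with Vivado HLS. As a rough host-side illustration of what an 8-bit unsigned and a 25-bit signed value can hold, the sketch below emulates wrap-around with masks. The helper names `wrap_unsigned` and `wrap_signed` are hypothetical and are not part of any Xilinx library.

```cpp
#include <cassert>
#include <cstdint>

// Keep only the low W bits, as a W-bit unsigned value would.
template <unsigned W>
uint64_t wrap_unsigned(uint64_t v) {
    static_assert(W >= 1 && W < 64, "this sketch supports widths 1..63");
    return v & ((1ULL << W) - 1);
}

// Keep the low W bits and reinterpret them as a two's complement number.
template <unsigned W>
int64_t wrap_signed(int64_t v) {
    static_assert(W >= 1 && W < 64, "this sketch supports widths 1..63");
    const uint64_t sign = 1ULL << (W - 1);
    const uint64_t low = static_cast<uint64_t>(v) & ((1ULL << W) - 1);
    // Flipping the sign bit and subtracting it back performs sign extension.
    return static_cast<int64_t>(low ^ sign) - static_cast<int64_t>(sign);
}
```

A 25-bit signed value therefore spans -16,777,216 to 16,777,215, which needs 7 fewer bits of datapath than a 32-bit int.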

Loop and Function Pipelining

Pipelined loops

– Combined with array partitioning to achieve II=1

Loop:for(i=1;i<3;i++) {
  op_Read;
  op_Compute;
  op_Write;
}

Without Pipelining

RD CMP WR RD CMP WR

Latency = 3 cycles

Initiation Interval = 3 cycles

Loop Latency = 6 cycles

With Pipelining

RD CMP WR (the second iteration starts one cycle after the first and overlaps it)

Latency = 3 cycles

Initiation Interval = 1 cycle

Loop Latency = 4 cycles

void foo(...) {
  op_Read;
  op_Compute;
  op_Write;
}
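
The loop latency figures above follow from a simple relation, sketched here under the assumption that every iteration has the same latency. The function names are hypothetical.

```cpp
#include <cassert>

// Rolled loop: each iteration waits for the previous one to finish.
int rolled_loop_latency(int iterations, int iteration_latency) {
    assert(iterations >= 1 && iteration_latency >= 1);
    return iterations * iteration_latency;
}

// Pipelined loop: the first iteration takes the full latency, and every
// further iteration adds one initiation interval (II).
int pipelined_loop_latency(int iterations, int iteration_latency, int ii) {
    assert(iterations >= 1 && iteration_latency >= 1 && ii >= 1);
    return iteration_latency + (iterations - 1) * ii;
}
```

For the two-iteration loop with a 3-cycle body this gives 6 cycles rolled and 4 cycles with II=1, matching the figures above.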

// index_a comes from an enclosing loop over the rows of in_A (not shown)
for (index_b = 0; index_b < B_NCOLS; index_b++) {
#pragma HLS PIPELINE II=1
    float result = 0;
    for (index_d = 0; index_d < A_NCOLS; index_d++) {
        float product_term = in_A[index_a][index_d] * in_B[index_d][index_b];
        result += product_term;
    }
    out_C[index_a * B_NCOLS + index_b] = result;
}

Minimizing Latency: Unrolled Loops

Loops can be fully or partially unrolled

– No manual code changes!

– Fully unrolled: best performance (may result in more area)

– Partially unrolled: explore performance / area tradeoff

void foo_top (…) {
  ...
  Add: for (i=3;i>=0;i--) {
    b = a[i] + b;
  }
  ...
}

Option 1: Latency = 4, #Add = 1 (rolled)

Option 2: Latency = 2, #Add = 2 (partially unrolled: factor = 2)

Option 3: Latency = 1, #Add = 4 (fully unrolled)
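
The three options compute the same result. The sketch below writes the unrolling out by hand purely to show what the UNROLL directive does to the loop body; with HLS the rolled source stays unchanged. The function names are hypothetical.

```cpp
#include <cassert>

// Option 1: rolled loop, one adder reused over four iterations.
int sum_rolled(const int a[4], int b) {
    for (int i = 3; i >= 0; i--) {
        b = a[i] + b;
    }
    return b;
}

// Option 2: unrolled by a factor of 2, two additions per iteration.
int sum_unrolled_by_2(const int a[4], int b) {
    for (int i = 3; i >= 0; i -= 2) {
        b = a[i] + b;
        b = a[i - 1] + b;
    }
    return b;
}

// Option 3: fully unrolled, four additions and no loop control left.
int sum_fully_unrolled(const int a[4], int b) {
    assert(a != nullptr);
    return a[3] + a[2] + a[1] + a[0] + b;
}
```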

High-Level Synthesis (HLS) - Sobel Filter Optimization Example

Reference Application – Sobel Filter in HLS

Get an input frame from HDMI

Convert RGB to YC

Apply a Sobel filter

Convert YC to RGB

Send to HDMI output

void img_process(unsigned int *rgb_data_in, unsigned int *rgb_data_out,
                 unsigned short *yc_data_in, unsigned short *yc_sobel_out)
{
    rgb_pad2ycbcr(rgb_data_in, yc_data_in);       // RGB to YC
    sobel_filter(yc_data_in, yc_sobel_out);       // edge detection on YC data
    ycbcr2rgb_pad(yc_sobel_out, rgb_data_out);    // YC back to RGB
}

Naïve Sobel Filter Implementation

Iterate over an input video image

Apply a 3x3 Sobel filter

Write to the output image

void sobel_filter(unsigned short *yc_in, unsigned short *yc_out, short *x_op, short *y_op)
{
    int row, col;
    for (row = 0; row < NUMROWS; row++) {
        for (col = 0; col < NUMCOLS; col++) {
            unsigned short input_data;
            unsigned char edge;
            if ((col < NUMCOLS) && (row < NUMROWS))
                input_data = yc_in[row * NUMCOLS + col];
            if (isEdge(row, col)) {
                edge = 0;   // border pixels have no full 3x3 neighbourhood
            } else {
                short x_weight = 0, y_weight = 0;
                for (char i = 0; i < 3; i++) {
                    for (char j = 0; j < 3; j++) {
                        // x_op and y_op are passed as flat 3x3 arrays
                        unsigned short temp = yc_in[index(row, col, i, j)];
                        x_weight += temp * x_op[i * 3 + j];
                        y_weight += temp * y_op[i * 3 + j];
                    }
                }
                edge = ABS(x_weight) + ABS(y_weight);
            }
            yc_out[index(row, col)] = edge;
        }
    }
}

0.1 FPS @ 1080p
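
For readers who want to run the naïve version on a host, here is a self-contained sketch. It makes several assumptions that are not in the slides: a tiny 6x8 frame, 8-bit luma samples only, the standard Sobel coefficients, clamping of the result to 255, and the helper name `is_edge`.

```cpp
#include <cassert>
#include <cstdlib>

const int NUMROWS = 6;  // assumed tiny frame so the example runs instantly
const int NUMCOLS = 8;

// Standard 3x3 Sobel operators (assumed; the slides do not list coefficients).
const short X_OP[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
const short Y_OP[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};

// Border pixels lack a full 3x3 neighbourhood.
bool is_edge(int row, int col) {
    return row == 0 || col == 0 || row == NUMROWS - 1 || col == NUMCOLS - 1;
}

void sobel_naive(const unsigned char *in, unsigned char *out) {
    assert(in != nullptr && out != nullptr);
    for (int row = 0; row < NUMROWS; row++) {
        for (int col = 0; col < NUMCOLS; col++) {
            int edge = 0;
            if (!is_edge(row, col)) {
                int x_weight = 0, y_weight = 0;
                for (int i = 0; i < 3; i++) {
                    for (int j = 0; j < 3; j++) {
                        // Nine reads from the input frame for every output pixel.
                        int p = in[(row + i - 1) * NUMCOLS + (col + j - 1)];
                        x_weight += p * X_OP[i][j];
                        y_weight += p * Y_OP[i][j];
                    }
                }
                edge = std::abs(x_weight) + std::abs(y_weight);
                if (edge > 255) edge = 255;  // clamp to the 8-bit output range
            }
            out[row * NUMCOLS + col] = static_cast<unsigned char>(edge);
        }
    }
}
```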

Video 1: Run the Naïve Sobel on a Zynq board

Problem 1: Non-Sequential Overlapped Memory Access

void sobel_filter(unsigned short *yc_in, unsigned short *yc_out, short *x_op, short *y_op)
{
    int row, col;
    for (row = 0; row < NUMROWS; row++) {
        for (col = 0; col < NUMCOLS; col++) {
            unsigned short input_data;
            unsigned char edge;
            if ((col < NUMCOLS) && (row < NUMROWS))
            ...

9 memory accesses per iteration

Memory Access: 9 * 48 = 432 pixels

For 1080p: 18,608,436 pixels

Solution 1: Direct Stream using Line Buffer and Window

Read 1 pixel

Shift line buffer

Update processing window

Memory Access: 1 * 80 = 80 pixels

With 1080p, 2,073,600 pixels (9x less!)
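
The access counts can be reproduced with the small sketch below. It assumes the 48 and 80 in the illustration refer to a 10x8 example frame with a 8x6 interior, which is an inference from the numbers rather than something the slides state. The function names are hypothetical.

```cpp
#include <cassert>

// Naive filter: nine reads for every interior pixel (border pixels are skipped).
long long naive_reads(long long rows, long long cols) {
    assert(rows >= 3 && cols >= 3);
    return 9 * (rows - 2) * (cols - 2);
}

// Line buffer and window: every input pixel is read from memory exactly once.
long long streamed_reads(long long rows, long long cols) {
    assert(rows >= 1 && cols >= 1);
    return rows * cols;
}
```

For 1920x1080 the two functions return 18,608,436 and 2,073,600, a ratio of about 9.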

Solution 1: Use Line Buffer and Window Classes

int row, col;
ap_linebuffer<unsigned char, 3, NUMCOLS> buff_A;
ap_window<unsigned char, 3, 3> buff_C;
for (row = 0; row < NUMROWS; row++) {
    for (col = 0; col < NUMCOLS; col++) {
        unsigned short input_data;
        unsigned char edge;
        update_linebuffer(buff_A, yc_in[index(row, col)]);   // read 1 pixel
        update_window(buff_C);                               // refresh the 3x3 window
        if ((col < NUMCOLS) && (row < NUMROWS))
        ...
                // the window replaces the nine reads from yc_in
                unsigned short temp = buff_C[index(row, col, i, j)];
                x_weight += temp * x_op[i * 3 + j];
                y_weight += temp * y_op[i * 3 + j];
        ...

1 FPS @ 1080p

10x speedup
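
The line buffer and window scheme can be tried on a host with plain arrays. This sketch does not use the `ap_linebuffer` or `ap_window` classes. The frame size, 8-bit luma input, Sobel coefficients, clamping to 255 and the name `sobel_stream` are all assumptions for illustration. The function returns the number of reads from the input frame so the single-read property can be checked.

```cpp
#include <cassert>
#include <cstdlib>

const int NUMROWS = 6;  // assumed tiny frame
const int NUMCOLS = 8;

// Standard 3x3 Sobel operators (assumed).
const short X_OP[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
const short Y_OP[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};

int sobel_stream(const unsigned char *in, unsigned char *out) {
    assert(in != nullptr && out != nullptr);
    unsigned char line_buf[3][NUMCOLS] = {};  // the three most recent rows
    unsigned char window[3][3] = {};          // current 3x3 neighbourhood
    int reads = 0;

    for (int i = 0; i < NUMROWS * NUMCOLS; i++) out[i] = 0;  // borders stay 0

    for (int row = 0; row < NUMROWS; row++) {
        for (int col = 0; col < NUMCOLS; col++) {
            unsigned char pix = in[row * NUMCOLS + col];  // the only frame read
            reads++;

            // Shift this column of the line buffer up and insert the new pixel.
            line_buf[0][col] = line_buf[1][col];
            line_buf[1][col] = line_buf[2][col];
            line_buf[2][col] = pix;

            // Shift the window left and load the new right-hand column.
            for (int r = 0; r < 3; r++) {
                window[r][0] = window[r][1];
                window[r][1] = window[r][2];
                window[r][2] = line_buf[r][col];
            }

            // The window is now centred on (row - 1, col - 1).
            if (row >= 2 && col >= 2) {
                int x_weight = 0, y_weight = 0;
                for (int i = 0; i < 3; i++) {
                    for (int j = 0; j < 3; j++) {
                        x_weight += window[i][j] * X_OP[i][j];
                        y_weight += window[i][j] * Y_OP[i][j];
                    }
                }
                int edge = std::abs(x_weight) + std::abs(y_weight);
                if (edge > 255) edge = 255;
                out[(row - 1) * NUMCOLS + (col - 1)] = static_cast<unsigned char>(edge);
            }
        }
    }
    return reads;
}
```

Because the output for a pixel becomes available one row and one column after it is read, results are written with an offset of (-1, -1) relative to the read position.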

Video 2: Run the Solution 1 on a Zynq board

Problem 2: Loop Iterations Are Sequential

Executing sequentially

void sobel_filter(unsigned short *yc_in, unsigned short *yc_out, short *x_op, short *y_op)
{
    int row, col;
    for (row = 0; row < NUMROWS; row++) {
        for (col = 0; col < NUMCOLS; col++) {
            unsigned short input_data;
            unsigned char edge;
            if ((col < NUMCOLS) && (row < NUMROWS))
            ...

Solution 2: Pipeline Loop Iterations

Assuming 60 cycles / loop iteration

Sequential: 60*1920*1080 = 124,416,000 cycles

Pipelined: 60+(1920*1080) = 2,073,660 cycles (60x speedup)
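
The same estimate as a small sketch, following the slide's approximation of one fill latency plus one cycle per pixel. The function names are hypothetical, and the 60-cycle iteration latency is the slide's own assumption.

```cpp
#include <cassert>

// Every pixel waits for the previous one: latency times pixel count.
long long sequential_cycles(long long iter_latency, long long width, long long height) {
    assert(iter_latency >= 1 && width >= 1 && height >= 1);
    return iter_latency * width * height;
}

// Pipelined with II=1: pay the latency once, then one cycle per pixel.
long long pipelined_cycles(long long iter_latency, long long width, long long height) {
    assert(iter_latency >= 1 && width >= 1 && height >= 1);
    return iter_latency + width * height;
}

// Ratio of the two estimates.
double pipeline_speedup(long long iter_latency, long long width, long long height) {
    return static_cast<double>(sequential_cycles(iter_latency, width, height)) /
           static_cast<double>(pipelined_cycles(iter_latency, width, height));
}
```

For large frames the speedup approaches the iteration latency itself, which is why a 60-cycle body yields roughly 60x.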

Solution 2: Add PIPELINE Pragma

int row, col;
ap_linebuffer<unsigned char, 3, NUMCOLS> buff_A;
ap_window<unsigned char, 3, 3> buff_C;
for (row = 0; row < NUMROWS; row++) {
    for (col = 0; col < NUMCOLS; col++) {
#pragma AP PIPELINE II=1
        unsigned short input_data;
        unsigned char edge;
        update_linebuffer(buff_A, yc_in[index(row, col)]);
        update_window(buff_C);
        if ((col < NUMCOLS) && (row < NUMROWS))
        ...
                unsigned short temp = buff_C[index(row, col, i, j)];
                x_weight += temp * x_op[i * 3 + j];
                y_weight += temp * y_op[i * 3 + j];
        ...

60 FPS @ 1080p

60x speedup

Video 3: Run the Solution 2 on a Zynq board

Sobel Edge Detection – Demo

HDMI cable

1080p 60fps Video source

ZC702 Board (ZYNQ 7020)

SDSoC Sobel Filter Demo - Summary

A naïve Sobel filter implementation: 0.1 FPS @ 1080p

Problem 1: Non-sequential overlapped memory accesses

Solution 1: Reducing memory access by 9x using line buffer and window

Problem 2: Sequential loop iteration

Solution 2: Pipeline loop iteration using the pipeline pragma

600x speedup with only ~15 lines of code change

Let SDSoC generate an optimized HW/SW system so you can focus on your algorithm optimization
