High-Level Synthesis (HLS) and SDSoC Development Environments

Simplified Programming Experience for Software, Hardware and Systems Engineers


© Copyright 2017 Xilinx.

Vivado HLS

Comprehensive integration with the SDSoC Environment

[Diagram: C, C++ or SystemC (algorithmic specification), through micro-architecture exploration, to VHDL or Verilog (RTL implementation)]

Rapid RTL architecture exploration via directives

Co-optimization with RTL synthesis for optimal QoR

Generates AXI4-based IP for Vivado IP Integrator

Over 1,000 customers

Leveraged by LogiCORE IP developers (mainly video IPs)

Libraries:

– Arbitrary precision

– Video, OpenCV

– Math

– Linear algebra

– DSP: FFT and FIR

Vivado HLS System IP Integration Flow

Vivado HLS Integrates into System Flows

[Diagram: C-based IP creation (C, C++, SystemC to VHDL or Verilog) with libraries (arbitrary precision, video, math, linear algebra, FFT and FIR IP); IP Catalog; system integration through Vivado IP Integrator, System Generator for DSP, and Vivado RTL]

Vivado High-Level Synthesis Serves a Wide Range of Applications across Markets

Communications: LTE MIMO receiver; advanced wireless antenna positioning

Audio, Video, Broadcast: 3D cameras; video transport

Consumer: 3D television; eReaders

Aerospace and Defense: radar, sonar; signals intelligence

Industrial, Scientific, Medical: ultrasound systems; motor controllers

Automotive: infotainment; driver assistance

Computing & Storage: high performance computing; database acceleration

Test & Measurement: communications instruments; semiconductor ATE


Design Exploration with Directives


The most important HLS compiler directives are familiar to performance-oriented software programmers

Use hardware buffers to improve communication bandwidth between the accelerator and external memory

– Use copy loops at the function boundary, when multiple accesses are required, to burst data into local buffers
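The copy-loop idea can be sketched as follows. This is a minimal illustration, not code from the deck: the function `pair_products`, the constant `BLOCK`, and the buffer names are hypothetical. A standard C++ compiler ignores the `#pragma HLS` lines, so the sketch also runs as plain software.

```cpp
#include <cstddef>

constexpr std::size_t BLOCK = 64;  // hypothetical block size

// Hypothetical accelerator function: out[i] = in[i] * in[i + 1].
// Each input element is needed twice, so it is copied on chip once
// and then reused from the local buffer.
void pair_products(const int *in, int *out) {
    int local_in[BLOCK];
    int local_out[BLOCK - 1];

    // Copy loop at the function boundary: strictly sequential reads,
    // which an HLS tool can map to a burst from external memory.
    for (std::size_t i = 0; i < BLOCK; i++) {
#pragma HLS PIPELINE II=1
        local_in[i] = in[i];
    }

    // The computation touches only the local buffers.
    for (std::size_t i = 0; i + 1 < BLOCK; i++) {
#pragma HLS PIPELINE II=1
        local_out[i] = local_in[i] * local_in[i + 1];
    }

    // Copy loop for the results, again strictly sequential.
    for (std::size_t i = 0; i + 1 < BLOCK; i++) {
#pragma HLS PIPELINE II=1
        out[i] = local_out[i];
    }
}
```

Separating the copy loops from the compute loop keeps the external accesses sequential, at the cost of on-chip storage for the two local arrays.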


Microarchitecture Optimizations

Directives and Configurations: Description

PIPELINE: Reduces the initiation interval by allowing the concurrent execution of operations within a loop or function.

DATAFLOW: Enables functions and loops to execute concurrently. Avoid at the top-level hardware function.

INLINE: Inlines a function to remove function hierarchy, enable logic optimization across function boundaries, and reduce function call overhead.

UNROLL: Unrolls for-loops to create multiple independent operations rather than a single collection of operations.

ARRAY_PARTITION: Partitions an array into smaller arrays or individual registers to increase concurrent access to data and remove block RAM bottlenecks.
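As a rough sketch of how three of these directives might appear together, consider a small dot product. The function `dot8` and the size `N` are hypothetical and chosen only for illustration. The pragmas are hints to the HLS tool, and a standard C++ compiler ignores them.

```cpp
#include <cstdint>

constexpr int N = 8;  // hypothetical vector length

// Hypothetical kernel: dot product of two 8-element vectors.
std::int32_t dot8(const std::int16_t a[N], const std::int16_t b[N]) {
// Split both arrays into individual registers so that all eight
// elements can be read in the same cycle.
#pragma HLS ARRAY_PARTITION variable=a complete dim=1
#pragma HLS ARRAY_PARTITION variable=b complete dim=1
// Allow a new call to start every cycle.
#pragma HLS PIPELINE II=1
    std::int32_t acc = 0;
    for (int i = 0; i < N; i++) {
// Create eight multipliers instead of reusing one.
#pragma HLS UNROLL
        acc += a[i] * b[i];
    }
    return acc;
}
```

Without the array partitioning, the unrolled multipliers would compete for the limited ports of a block RAM, which is why these directives are often applied together.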

Verification Productivity

Input: 10 frames of video data

RTL simulation time: ~2 days

C simulation time: 10 seconds

Acceleration: ~12,000X

[Diagram: HDL-based design (RTL to verified RTL) compared with C-based design (C to RTL to verified RTL)]

HLS Library

Floating point math

– Declare variables as float, double, or half precision type

– Uses Xilinx floating point cores

Fixed point math (ap_fixed.h)

– Arbitrary precision fixed point with saturation and rounding options

Math functions (hls_math.h)

Video functions (hls_video.h)

– Memory line buffer, memory windows

– Video algorithm functions

Linear algebra library (hls_linear_algebra.h)

Data Types and Bit Accuracy: Example

C/C++ standard types

– char (8-bit), short (16-bit), int (32-bit), long long (64-bit)

– May result in significantly larger area and slower speed

Arbitrary precision data types

– Use arbitrary precision data types

– Smaller and faster hardware

ap_uint<8> X;

ap_int<25> Y;
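The arbitrary precision headers ship with the Xilinx tools, so the sketch below emulates the value range of `ap_uint<8>` and `ap_int<25>` in plain C++ instead of using them. The helpers `wrap_unsigned` and `wrap_signed` are hypothetical, and the sketch assumes the default behavior in which out-of-range values wrap by dropping the upper bits.

```cpp
#include <cstdint>

// Hypothetical helper: keep only the low W bits, as a W-bit
// unsigned register would.
template <unsigned W>
constexpr std::uint64_t wrap_unsigned(std::uint64_t v) {
    static_assert(W >= 1 && W <= 63, "width must be 1..63 bits");
    return v & ((std::uint64_t{1} << W) - 1);
}

// Hypothetical helper: keep the low W bits and sign-extend bit W-1,
// as a W-bit two's-complement register would.
template <unsigned W>
constexpr std::int64_t wrap_signed(std::int64_t v) {
    static_assert(W >= 2 && W <= 63, "width must be 2..63 bits");
    return static_cast<std::int64_t>(
               (static_cast<std::uint64_t>(v) &
                ((std::uint64_t{1} << W) - 1)) ^
               (std::uint64_t{1} << (W - 1))) -
           static_cast<std::int64_t>(std::uint64_t{1} << (W - 1));
}
```

A 25-bit signed value covers -16,777,216 to 16,777,215, so a result known to fit that range needs only 25 bits of datapath rather than the 32 that `int` would imply.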

Loop and Function Pipelining

Pipelined loops

– Combined with array partitioning to achieve II=1

Loop example:

Loop: for (i = 1; i < 3; i++) {
    op_Read;
    op_Compute;
    op_Write;
}

Function example:

void foo(...) {
    op_Read;
    op_Compute;
    op_Write;
}

Without Pipelining (RD CMP WR, then RD CMP WR)

– Latency = 3 cycles

– Initiation Interval = 3 cycles

– Loop Latency = 6 cycles

With Pipelining (the second RD CMP WR starts one cycle after the first)

– Latency = 3 cycles

– Initiation Interval = 1 cycle

– Loop Latency = 4 cycles

for (index_b = 0; index_b < B_NCOLS; index_b++) {
#pragma HLS PIPELINE II=1
    float result = 0;
    for (index_d = 0; index_d < A_NCOLS; index_d++) {
        float product_term = in_A[index_a][index_d] * in_B[index_d][index_b];
        result += product_term;
    }
    out_C[index_a * B_NCOLS + index_b] = result;
}
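The loop latency figures on this slide follow from a simple model: the first iteration takes the full iteration latency, and each further iteration starts one initiation interval after the previous one. The helper `loop_latency` below is a hypothetical illustration of that model, not a tool API, and a synthesis report may count loop entry and exit cycles differently.

```cpp
// Hypothetical helper: cycles needed to complete trip_count iterations
// when one iteration takes iteration_latency cycles and a new iteration
// can start every ii cycles.
constexpr long loop_latency(long iteration_latency, long ii, long trip_count) {
    return trip_count <= 0 ? 0 : iteration_latency + ii * (trip_count - 1);
}
```

For the two-iteration loop on the slide, `loop_latency(3, 3, 2)` gives 6 cycles without pipelining and `loop_latency(3, 1, 2)` gives 4 cycles with II = 1.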

Minimizing Latency: Unrolled Loops

Loops can be fully or partially unrolled

– No manual code changes!

– Fully unrolled: best performance (may result in more area)

– Partially unrolled: explore performance / area tradeoff

void foo_top (…) {
    ...
    Add: for (i = 3; i >= 0; i--) {
        b = a[i] + b;
    }
    ...
}

[Diagram: foo_top fully unrolled, with four adders combining a[3], a[2], a[1], a[0] into b; timing charts for the three options]

Option 1: Latency = 4, #Add = 1 (rolled)

Option 2: Latency = 2, #Add = 2 (partially unrolled: factor = 2)

Option 3: Latency = 1, #Add = 4 (fully unrolled)
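To make Option 2 concrete, the sketch below shows the rolled loop with an UNROLL directive next to a hand-written version of roughly what a factor-2 unroll produces. Both function names are hypothetical. The second function is only an illustration of the transformation, since with the directive the tool performs it and the source stays unchanged.

```cpp
constexpr int N = 4;  // loop trip count from the slide

// Source as the designer writes it. The pragma asks the HLS tool to
// unroll by 2; a standard C++ compiler ignores it.
int sum_rolled(const int a[N], int b) {
    for (int i = N - 1; i >= 0; i--) {
#pragma HLS UNROLL factor=2
        b = a[i] + b;
    }
    return b;
}

// Roughly what factor=2 unrolling produces: half as many iterations,
// each containing two additions (two adders in hardware).
int sum_unrolled_by_2(const int a[N], int b) {
    for (int i = N - 1; i >= 1; i -= 2) {
        b = a[i] + b;
        b = a[i - 1] + b;
    }
    return b;
}
```

Both functions return the same value for any input, which is the point of the "no manual code changes" bullet: only the hardware schedule differs.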

High-Level Synthesis (HLS) - Sobel Filter Optimization Example

Reference Application – Sobel Filter in HLS

Get an input frame from HDMI

Convert RGB to YC

Apply a Sobel filter

Convert YC to RGB

Send to HDMI output

void img_process(unsigned int *rgb_data_in, unsigned int *rgb_data_out,
                 unsigned short *yc_data_in, unsigned short *yc_sobel_out)
{
    rgb_pad2ycbcr(rgb_data_in, yc_data_in);
    sobel_filter(yc_data_in, yc_sobel_out);
    ycbcr2rgb_pad(yc_sobel_out, rgb_data_out);
}

Naïve Sobel Filter Implementation

Iterate over an input video image

Apply a 3x3 Sobel filter

Write to the output image

void sobel_filter(unsigned short *yc_in, unsigned short *yc_out, short *x_op, short *y_op)
{
    int row, col;
    for (row = 0; row < NUMROWS; row++) {
        for (col = 0; col < NUMCOLS; col++) {
            unsigned short input_data;
            unsigned char edge;
            if ((col < NUMCOLS) && (row < NUMROWS))
                input_data = yc_in[row * NUMCOLS + col];
            if (isEdge(row, col))
                edge = 0;
            else {
                short x_weight = 0, y_weight = 0;
                for (char i = 0; i < 3; i++) {
                    for (char j = 0; j < 3; j++) {
                        // x_op and y_op hold the 3x3 operators in row-major order
                        unsigned short temp = yc_in[index(row, col, i, j)];
                        x_weight += temp * x_op[i * 3 + j];
                        y_weight += temp * y_op[i * 3 + j];
                    }
                }
                edge = ABS(x_weight) + ABS(y_weight);
            }
            yc_out[index(row, col)] = edge;
        }
    }
}

Video 1: Run the Naïve Sobel on a Zynq board

0.1 FPS @ 1080p

Problem 1: Non-Sequential Overlapped Memory Access

In the naïve sobel_filter, the inner 3x3 loop reads yc_in from external memory for every window position:

unsigned short temp = yc_in[index(row, col, i, j)];

9 memory accesses per iteration

Problem 1: Non-Sequential Overlapped Memory

[Animation: the 3x3 window moves across the image; every window position reads 9 pixels, and neighboring positions overlap]

Memory Access: 9 * 48 = 432 pixels

For 1080p: 18,608,436 pixels
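Both figures are consistent with counting nine reads for every pixel that is not on the image border. The slide does not state the size of its example image. The sketch below assumes a 10x8 image, which has 48 interior pixels. The helper names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: external reads made by the naive filter, assuming
// border pixels are skipped and every interior pixel reads a full 3x3
// neighborhood.
constexpr std::int64_t naive_reads(std::int64_t cols, std::int64_t rows) {
    return 9 * (cols - 2) * (rows - 2);
}

// Hypothetical helper: external reads when every pixel is fetched
// exactly once.
constexpr std::int64_t single_pass_reads(std::int64_t cols, std::int64_t rows) {
    return cols * rows;
}
```

`naive_reads(10, 8)` gives 432 and `naive_reads(1920, 1080)` gives 18,608,436, while `single_pass_reads(1920, 1080)` gives 2,073,600, roughly nine times fewer.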

Solution 1: Direct Stream using Line Buffer and Window

[Animation: for each incoming pixel, shift the line buffer, read 1 pixel, and update the processing window]

Memory Access: 1 * 80 = 80 pixels

With 1080p, 2,073,600 pixels (9x less!)

Solution 1: Use Line Buffer and Window Classes

void sobel_filter(unsigned short *yc_in, unsigned short *yc_out, short *x_op, short *y_op)
{
    int row, col;
    ap_linebuffer<unsigned char, 3, NUMCOLS> buff_A;  // three image rows held on chip
    ap_window<unsigned char, 3, 3> buff_C;            // current 3x3 neighborhood
    for (row = 0; row < NUMROWS; row++) {
        for (col = 0; col < NUMCOLS; col++) {
            unsigned short input_data;
            unsigned char edge;
            update_linebuffer(buff_A, yc_in[index(row, col)]);
            update_window(buff_C);
            if ((col < NUMCOLS) && (row < NUMROWS))
                input_data = yc_in[row * NUMCOLS + col];
            if (isEdge(row, col))
                edge = 0;
            else {
                short x_weight = 0, y_weight = 0;
                for (char i = 0; i < 3; i++) {
                    for (char j = 0; j < 3; j++) {
                        // read from the on-chip window instead of external memory
                        unsigned short temp = buff_C.getval(i, j);
                        x_weight += temp * x_op[i * 3 + j];
                        y_weight += temp * y_op[i * 3 + j];
                    }
                }
                edge = ABS(x_weight) + ABS(y_weight);
            }
            yc_out[index(row, col)] = edge;
        }
    }
}
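The slide code depends on Xilinx classes and on helpers that are not shown. As a self-contained illustration of the same idea, the sketch below implements the line buffer and window with plain arrays and checks the result against a naive version. All names here are hypothetical, the standard Sobel kernels are assumed, and the edge value is clamped to 8 bits, which the slide code does not do.

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

using Image = std::vector<std::uint8_t>;  // row-major, width * height

static const int KX[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
static const int KY[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};

// |Gx| + |Gy| for one 3x3 neighborhood, clamped to 8 bits.
static std::uint8_t sobel3x3(const std::uint8_t w[3][3]) {
    int gx = 0, gy = 0;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            gx += w[i][j] * KX[i][j];
            gy += w[i][j] * KY[i][j];
        }
    const int mag = std::abs(gx) + std::abs(gy);
    return static_cast<std::uint8_t>(mag > 255 ? 255 : mag);
}

// Reference version: 9 reads of `in` per interior pixel.
Image sobel_naive(const Image &in, int width, int height) {
    Image out(in.size(), 0);
    for (int r = 1; r < height - 1; r++)
        for (int c = 1; c < width - 1; c++) {
            std::uint8_t w[3][3];
            for (int i = 0; i < 3; i++)
                for (int j = 0; j < 3; j++)
                    w[i][j] = in[(r + i - 1) * width + (c + j - 1)];
            out[r * width + c] = sobel3x3(w);
        }
    return out;
}

// Streaming version: exactly 1 read of `in` per pixel.
Image sobel_streamed(const Image &in, int width, int height) {
    Image out(in.size(), 0);
    Image lines(in.size() == 0 ? 0 : 3 * in.size() / height, 0);  // 3 rows
    std::uint8_t win[3][3] = {};
    for (int r = 0; r < height; r++)
        for (int c = 0; c < width; c++) {
            // Shift column c of the line buffer up, insert the new pixel.
            lines[0 * width + c] = lines[1 * width + c];
            lines[1 * width + c] = lines[2 * width + c];
            lines[2 * width + c] = in[r * width + c];
            // Shift the window left, load the updated column on the right.
            for (int i = 0; i < 3; i++) {
                win[i][0] = win[i][1];
                win[i][1] = win[i][2];
                win[i][2] = lines[i * width + c];
            }
            // The window is now centered on pixel (r - 1, c - 1).
            if (r >= 2 && c >= 2)
                out[(r - 1) * width + (c - 1)] = sobel3x3(win);
        }
    return out;
}
```

The streamed version produces each output one row and one column after the corresponding input pixel arrives, because the window is complete only once the pixel below and to the right of its center has been read.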

Video 2: Run Solution 1 on a Zynq board

1 FPS @ 1080p

10x speedup

Problem 2: Loop Iterations Are Sequential

The row and col loops of sobel_filter execute sequentially: each iteration runs to completion before the next one starts.

Solution 2: Pipeline Loop Iterations

Assuming 60 cycles / loop iteration

Sequential: 60 * 1920 * 1080 = 124,416,000 cycles

Pipelined: 60 + (1920 * 1080) = 2,073,660 cycles (60x speedup)
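The two cycle counts can be reproduced with the slide's own estimate: a sequential loop pays the full iteration cost every time, while a pipelined loop with II = 1 pays it once and then finishes one iteration per cycle. The helper names below are hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: every iteration runs to completion before the
// next one starts.
constexpr std::int64_t sequential_cycles(std::int64_t cycles_per_iter,
                                         std::int64_t iterations) {
    return cycles_per_iter * iterations;
}

// Hypothetical helper: the slide's estimate for a pipelined loop with
// II = 1, which is the fill latency plus one cycle per iteration.
constexpr std::int64_t pipelined_cycles(std::int64_t cycles_per_iter,
                                        std::int64_t iterations) {
    return cycles_per_iter + iterations;
}
```

For a 1920x1080 frame at 60 cycles per iteration, the ratio of the two results is just under 60, which is where the slide's rounded 60x comes from.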

Solution 2: Add PIPELINE Pragma

void sobel_filter(unsigned short *yc_in, unsigned short *yc_out, short *x_op, short *y_op)
{
    int row, col;
    ap_linebuffer<unsigned char, 3, NUMCOLS> buff_A;
    ap_window<unsigned char, 3, 3> buff_C;
    for (row = 0; row < NUMROWS; row++) {
        for (col = 0; col < NUMCOLS; col++) {
#pragma HLS PIPELINE II=1
            // the only change from Solution 1: start a new pixel every cycle
            unsigned short input_data;
            unsigned char edge;
            update_linebuffer(buff_A, yc_in[index(row, col)]);
            update_window(buff_C);
            if ((col < NUMCOLS) && (row < NUMROWS))
                input_data = yc_in[row * NUMCOLS + col];
            if (isEdge(row, col))
                edge = 0;
            else {
                short x_weight = 0, y_weight = 0;
                for (char i = 0; i < 3; i++) {
                    for (char j = 0; j < 3; j++) {
                        unsigned short temp = buff_C.getval(i, j);
                        x_weight += temp * x_op[i * 3 + j];
                        y_weight += temp * y_op[i * 3 + j];
                    }
                }
                edge = ABS(x_weight) + ABS(y_weight);
            }
            yc_out[index(row, col)] = edge;
        }
    }
}

Video 3: Run Solution 2 on a Zynq board

60 FPS @ 1080p

60x speedup


Sobel Edge Detection – Demo

HDMI cable

1080p 60fps Video source

ZC702 Board (ZYNQ 7020)

SDSoC Sobel Filter Demo - Summary

A naïve Sobel filter implementation (0.1 FPS)

Problem 1: Non-sequential overlapped memory accesses

Solution 1: Reducing memory access by 9x using line buffer and window (1.0 FPS)

Problem 2: Sequential loop iteration

Solution 2: Pipeline loop iteration using the pipeline pragma (60.0 FPS)

600x speedup with only ~15 lines of code change

Let SDSoC generate an optimized HW/SW system so you can focus on your algorithm optimization