High-Level Synthesis (HLS) and SDSoC Development Environments
Simplified Programming Experience for Software, Hardware and Systems Engineers
© Copyright 2017 Xilinx.
Vivado HLS
Comprehensive Integration with
the SDSoC Environment
VHDL or Verilog
C, C++ or SystemC
RTL Implementation
Micro Architecture Exploration
Algorithmic Specification
Rapid RTL architecture exploration via Directives
Co-optimization with RTL synthesis for optimal QoR
Generates AXI4-based IP for Vivado IP Integrator
Over 1,000 customers
Leveraged by LogiCORE IP developers (mainly Video IPs)
Libraries:
Arbitrary Precision
Video, OpenCV
Math
Linear algebra
DSP: FFT and FIR
Vivado HLS System IP Integration Flow
IP Catalog
C-based IP Creation
Libraries
Arbitrary Precision
Video
Math
Linear algebra
IP: FFT and FIR
System Integration
Vivado IP Integrator
Vivado HLS Integrates into System Flows
C, C++, SystemC
VHDL or Verilog
System Generator for DSP
Vivado RTL
Vivado High-Level Synthesis Serves a Wide Range of Applications across Markets
Communications: LTE MIMO receiver, advanced wireless antenna positioning
Audio, Video, Broadcast: 3D cameras, video transport
Consumer: 3D television, eReaders
Aerospace and Defense: radar, sonar, signals intelligence
Industrial, Scientific, Medical: ultrasound systems, motor controllers
Automotive: infotainment, driver assistance
Computing & Storage: high performance computing, database acceleration
Test & Measurement: communications instruments, semiconductor ATE
Design Exploration with Directives
The most important HLS compiler directives are familiar to
performance-oriented software programmers
Use hardware buffers to improve communication bandwidth
between accelerator and external memory
– Use copy loops at the function boundary when multiple accesses are required, to burst data into local buffers
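The copy-loop idea can be sketched as follows. This is a minimal illustration, not code from the slides: the function name `moving_sum3`, the block size `N`, and the 3-tap computation are all hypothetical. The point is that external memory is touched once per element, in order, while the repeated accesses go to a local buffer.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t N = 64;  // hypothetical block size

// Hypothetical accelerator: 3-tap moving sum over a block of N samples.
void moving_sum3(const int *in, int *out) {
    assert(in != nullptr && out != nullptr);
    int local_in[N];   // local buffer (maps to on-chip memory in hardware)
    int local_out[N];

    // Copy loop at the function boundary: one sequential pass over `in`,
    // which an HLS tool can typically implement as a burst read.
    for (std::size_t i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        local_in[i] = in[i];
    }

    // Each sample is read up to three times here, but only from the
    // local buffer, never from external memory.
    for (std::size_t i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        int left  = (i > 0)     ? local_in[i - 1] : 0;
        int right = (i + 1 < N) ? local_in[i + 1] : 0;
        local_out[i] = left + local_in[i] + right;
    }

    // Copy loop on the way out: one sequential pass over `out`.
    for (std::size_t i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        out[i] = local_out[i];
    }
}
```

A standard C++ compiler ignores the `#pragma HLS` lines, so the same source can be checked in C simulation before synthesis.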
Microarchitecture Optimizations
Directives and Configurations | Description
PIPELINE | Reduces the initiation interval by allowing the concurrent execution of operations within a loop or function.
DATAFLOW | Enables functions and loops to execute concurrently. Avoid at the top-level hardware function.
INLINE | Inlines a function to remove function hierarchy, enable logic optimization across function boundaries, and reduce function call overhead.
UNROLL | Unrolls for-loops to create multiple independent operations rather than a single collection of operations.
ARRAY_PARTITION | Partitions an array into smaller arrays or individual registers to increase concurrent access to data and remove block RAM bottlenecks.
Verification Productivity
Input | RTL Simulation Time | C Simulation Time | Acceleration
10 frames of video data | ~2 days | 10 seconds | ~12,000X
HDL-based Design: RTL -> Verified RTL
C-based Design: C -> RTL -> Verified RTL
Floating point math
–Declare variables as float or double
or half precision type
–Uses Xilinx floating point cores
Fixed point math (ap_fixed.h)
– Arbitrary precision fixed point with
saturation, rounding options
Math functions (hls_math.h)
Video functions (hls_video.h)
–Memory line buffer, memory
windows
– Video algorithm function
Linear algebra library
(hls_linear_algebra.h)
HLS Library
C/C++ - standard types
– char (8-bit), short (16-bit), int (32-bit), long long (64-bit)
–May result in significantly larger area, slower speed
Arbitrary precision data types
–Use Arbitrary precision data types
– Smaller and Faster Hardware
Data Types and Bit Accuracy: Example
ap_uint<8> X;
ap_int<25> Y;
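`ap_uint<8>` and `ap_int<25>` come from the Vivado HLS arbitrary precision headers, which are not reproduced here. As a rough plain C++ sketch of what those bit widths mean, the hypothetical helpers `wrap_unsigned` and `wrap_signed` below emulate a value held in an N-bit register, assuming wraparound on overflow. They are stand-ins for illustration, not the library's API.

```cpp
#include <cassert>
#include <cstdint>

// Keep only the low `bits` bits, as an N-bit unsigned register would.
inline std::uint64_t wrap_unsigned(std::uint64_t v, unsigned bits) {
    assert(bits >= 1 && bits < 64);
    return v & ((std::uint64_t{1} << bits) - 1);
}

// Keep the low `bits` bits and interpret them as two's complement.
inline std::int64_t wrap_signed(std::int64_t v, unsigned bits) {
    assert(bits >= 1 && bits < 64);
    std::uint64_t u = wrap_unsigned(static_cast<std::uint64_t>(v), bits);
    std::uint64_t sign = std::uint64_t{1} << (bits - 1);
    // Sign-extend with the xor/subtract trick.
    return static_cast<std::int64_t>(u ^ sign) - static_cast<std::int64_t>(sign);
}
```

An 8-bit value needs 8 wires and 8 flip-flops where an `int` would need 32, which is where the area and speed savings on the slide come from.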
Pipelined loops
–Combined with array partitioning to achieve II=1
Loop and Function Pipelining
Loop:for(i=1;i<3;i++) {
  op_Read;      (RD)
  op_Compute;   (CMP)
  op_Write;     (WR)
}
Without Pipelining: Latency = 3 cycles, Initiation Interval = 3 cycles, Loop Latency = 6 cycles
RD CMP WR RD CMP WR
With Pipelining: Latency = 3 cycles, Initiation Interval = 1 cycle, Loop Latency = 4 cycles
RD CMP WR
   RD CMP WR
void foo(...) {
  op_Read;      (RD)
  op_Compute;   (CMP)
  op_Write;     (WR)
}
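The cycle counts on this slide follow from a simple relation between iteration latency, initiation interval (II), and trip count. The helper names below are hypothetical and the model ignores loop entry and exit overhead, so treat it as a sketch of the arithmetic rather than a tool-accurate estimate.

```cpp
// Cycles for a loop whose iterations run back to back with no overlap.
constexpr long sequential_cycles(long iteration_latency, long trip_count) {
    return iteration_latency * trip_count;
}

// Cycles for a pipelined loop: the first iteration takes the full latency,
// and each later iteration starts `ii` cycles after the previous one.
constexpr long pipelined_cycles(long iteration_latency, long ii, long trip_count) {
    return trip_count == 0 ? 0 : iteration_latency + ii * (trip_count - 1);
}
```

The example loop runs for i = 1 and i = 2, so the trip count is 2: three cycles per iteration gives 6 cycles without pipelining and 3 + 1 = 4 cycles with II = 1.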
for (index_b = 0; index_b < B_NCOLS; index_b++) {
#pragma HLS PIPELINE II=1
float result = 0;
for (index_d = 0; index_d < A_NCOLS; index_d++) {
float product_term = in_A[index_a][index_d] * in_B[index_d][index_b];
result += product_term;
}
out_C[index_a * B_NCOLS + index_b] = result;
}
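The fragment above omits the outer `index_a` loop and the array declarations. A self-contained version might look like the sketch below. The matrix sizes, the function name `mmult`, and the local copies `a` and `b` are assumptions added for illustration. The `ARRAY_PARTITION` pragmas reflect the slide's note that pipelining is combined with array partitioning to reach II=1: once the inner loop is unrolled by the pipeline directive, every `index_d` element must be readable in the same cycle.

```cpp
constexpr int A_NROWS = 4, A_NCOLS = 4, B_NCOLS = 4;  // hypothetical sizes

void mmult(const float in_A[A_NROWS][A_NCOLS],
           const float in_B[A_NCOLS][B_NCOLS],
           float out_C[A_NROWS * B_NCOLS]) {
    float a[A_NROWS][A_NCOLS];
    float b[A_NCOLS][B_NCOLS];
    // Split along the index_d dimension so one row of `a` and one column
    // of `b` can be read in parallel.
#pragma HLS ARRAY_PARTITION variable=a complete dim=2
#pragma HLS ARRAY_PARTITION variable=b complete dim=1

    // Copy the inputs into the partitioned local arrays.
    for (int i = 0; i < A_NROWS; i++)
        for (int j = 0; j < A_NCOLS; j++)
            a[i][j] = in_A[i][j];
    for (int i = 0; i < A_NCOLS; i++)
        for (int j = 0; j < B_NCOLS; j++)
            b[i][j] = in_B[i][j];

    for (int index_a = 0; index_a < A_NROWS; index_a++) {
        for (int index_b = 0; index_b < B_NCOLS; index_b++) {
#pragma HLS PIPELINE II=1
            float result = 0;
            for (int index_d = 0; index_d < A_NCOLS; index_d++) {
                result += a[index_a][index_d] * b[index_d][index_b];
            }
            out_C[index_a * B_NCOLS + index_b] = result;
        }
    }
}
```

Whether the tool actually reaches II=1 depends on the target device and clock. The pragmas express the request, not a guarantee.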
Loops can be Fully or Partially unrolled
–No manual code changes!
– Fully unrolled – best performance (may result in more area)
– Partially unrolled – explore performance / area tradeoff
Minimizing Latency: Unrolled Loops
void foo_top (…) {
  ...
  Add: for (i=3;i>=0;i--) {
    b = a[i] + b;
  ...
}
[Figure: foo_top adds a[3], a[2], a[1], a[0] into b; per-clock schedules of iterations 3, 2, 1, 0 are shown for each option]
Option 1: Latency = 4, #Add = 1 (rolled)
Option 2: Latency = 2, #Add = 2 (partially unrolled: factor = 2)
Option 3: Latency = 1, #Add = 4 (fully unrolled)
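To make the "no manual code changes" point concrete, the sketch below compares the rolled loop, a hand-written factor-2 unrolling (what the tool effectively builds for the partially unrolled option), and the same loop with an `UNROLL` directive. The function names are hypothetical. All three compute the same sum; only the hardware schedule differs.

```cpp
// Rolled: one adder reused over four iterations.
int sum_rolled(const int a[4], int b) {
    for (int i = 3; i >= 0; i--) {
        b = a[i] + b;
    }
    return b;
}

// Factor-2 unrolling written out by hand: two additions per iteration,
// two iterations in total.
int sum_unrolled2(const int a[4], int b) {
    for (int i = 3; i >= 0; i -= 2) {
        b = a[i] + b;
        b = a[i - 1] + b;
    }
    return b;
}

// The same trade-off requested through a directive; the source loop is
// unchanged and a standard C++ compiler simply ignores the pragma.
int sum_directive(const int a[4], int b) {
    for (int i = 3; i >= 0; i--) {
#pragma HLS UNROLL factor=2
        b = a[i] + b;
    }
    return b;
}
```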
High-Level Synthesis (HLS) - Sobel Filter Optimization Example
Reference Application – Sobel Filter in HLS
Get an input frame from HDMI
Convert RGB to YC
Apply a Sobel filter
Convert YC to RGB
Send to HDMI output
void img_process(unsigned int *rgb_data_in, unsigned int *rgb_data_out,
                 unsigned short *yc_data_in, unsigned short *yc_sobel_out)
{
  rgb_pad2ycbcr(rgb_data_in, yc_data_in);
  sobel_filter(yc_data_in, yc_sobel_out);
  ycbcr2rgb_pad(yc_sobel_out, rgb_data_out);
}
Naïve Sobel Filter Implementation
Iterate over an input video image
Applying 3x3 sobel filter
Writing to output image
void sobel_filter(unsigned short *yc_in, unsigned short *yc_out,
                  short x_op[3][3], short y_op[3][3]) {
  int row, col;
  for (row = 0; row < NUMROWS; row++) {
    for (col = 0; col < NUMCOLS; col++) {
      unsigned short input_data;
      unsigned char edge;
      if ((col < NUMCOLS) && (row < NUMROWS))
        input_data = yc_in[row * NUMCOLS + col];
      if (isEdge(row, col)) {
        edge = 0;
      } else {
        short x_weight = 0, y_weight = 0;
        for (char i = 0; i < 3; i++) {
          for (char j = 0; j < 3; j++) {
            unsigned short temp = yc_in[index(row, col, i, j)];
            x_weight += temp * x_op[i][j];
            y_weight += temp * y_op[i][j];
          }
        }
        edge = ABS(x_weight) + ABS(y_weight);
      }
      yc_out[index(row, col)] = edge;
    }
  }
}
0.1 FPS @ 1080p
Video 1: Run the Naïve Sobel on a Zynq board
Problem 1: Non-Sequential Overlapped Memory Access
In the naïve sobel_filter, the inner 3x3 loops read yc_in[index(row, col, i, j)] for every output pixel, so neighboring windows re-read the same pixels.
9 memory accesses per iteration
[Animation: the 3x3 window slides across the image, reading 9 pixels at each position]
Memory Access: 9 * 48 = 432 pixels
For 1080p: 18,608,436 pixels
Solution 1: Direct Stream using Line Buffer and Window
[Animation: each step reads 1 pixel, shifts the line buffer, and updates the 3x3 processing window]
Memory Access: 1 * 80 = 80 pixels
With 1080p, 2,073,600 pixels (9x less!)
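The pixel counts on these slides can be reproduced with two small formulas. The function names are hypothetical. The small animated example appears consistent with a 10x8 image (48 interior positions, 80 pixels in total), which is an inference from the numbers rather than something the slides state.

```cpp
// Naive version: 9 reads for every interior pixel, which is every
// position where a full 3x3 window fits.
constexpr long naive_reads(long width, long height) {
    return 9 * (width - 2) * (height - 2);
}

// Streaming version: every input pixel is read exactly once.
constexpr long streamed_reads(long width, long height) {
    return width * height;
}
```

For 1080p this gives 9 * 1918 * 1078 = 18,608,436 reads against 1920 * 1080 = 2,073,600, a ratio just under 9.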
Solution 1: Use Line Buffer and Window Classes
void sobel_filter(unsigned short *yc_in, unsigned short *yc_out,
                  short x_op[3][3], short y_op[3][3]) {
  int row, col;
  ap_linebuffer<unsigned char, 3, NUMCOLS> buff_A;
  ap_window<unsigned char, 3, 3> buff_C;
  for (row = 0; row < NUMROWS; row++) {
    for (col = 0; col < NUMCOLS; col++) {
      unsigned short input_data;
      unsigned char edge;
      update_linebuffer(buff_A, yc_in[index(row, col)]);
      update_window(buff_C);
      if ((col < NUMCOLS) && (row < NUMROWS))
        input_data = yc_in[row * NUMCOLS + col];
      if (isEdge(row, col)) {
        edge = 0;
      } else {
        short x_weight = 0, y_weight = 0;
        for (char i = 0; i < 3; i++) {
          for (char j = 0; j < 3; j++) {
            unsigned short temp = buff_C[index(row, col, i, j)];
            x_weight += temp * x_op[i][j];
            y_weight += temp * y_op[i][j];
          }
        }
        edge = ABS(x_weight) + ABS(y_weight);
      }
      yc_out[index(row, col)] = edge;
    }
  }
}
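The slide code relies on the `ap_linebuffer` and `ap_window` classes and on helper functions that are not shown. To illustrate the mechanism itself, the sketch below uses plain arrays as stand-ins for those classes. The frame size, the function names, the standard Sobel coefficients, and the `int` output type are assumptions made for this example. It checks that the streaming version, which reads each input pixel once, produces the same result as the nine-reads-per-pixel version.

```cpp
#include <cstdlib>
#include <vector>

constexpr int W = 8, H = 6;  // hypothetical tiny frame
const int GX[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
const int GY[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};

// Naive: nine reads of `in` for every interior pixel; border pixels stay 0.
std::vector<int> sobel_naive(const std::vector<unsigned char> &in) {
    std::vector<int> out(W * H, 0);
    for (int r = 1; r < H - 1; r++)
        for (int c = 1; c < W - 1; c++) {
            int gx = 0, gy = 0;
            for (int i = 0; i < 3; i++)
                for (int j = 0; j < 3; j++) {
                    int p = in[(r + i - 1) * W + (c + j - 1)];
                    gx += p * GX[i][j];
                    gy += p * GY[i][j];
                }
            out[r * W + c] = std::abs(gx) + std::abs(gy);
        }
    return out;
}

// Streaming: one read of `in` per pixel, fed through a 3-row line buffer
// and a 3x3 window.
std::vector<int> sobel_streamed(const std::vector<unsigned char> &in) {
    std::vector<int> out(W * H, 0);
    unsigned char line[3][W] = {};  // last three rows seen, per column
    unsigned char win[3][3] = {};   // current 3x3 neighborhood
    for (int r = 0; r < H; r++)
        for (int c = 0; c < W; c++) {
            unsigned char pix = in[r * W + c];  // the only external read
            // Shift column c of the line buffer up and insert the new pixel.
            line[0][c] = line[1][c];
            line[1][c] = line[2][c];
            line[2][c] = pix;
            // Slide the window one column left and load the new right column.
            for (int i = 0; i < 3; i++) {
                win[i][0] = win[i][1];
                win[i][1] = win[i][2];
                win[i][2] = line[i][c];
            }
            // From row 2, column 2 onward the window is fully valid and
            // centered on pixel (r - 1, c - 1).
            if (r >= 2 && c >= 2) {
                int gx = 0, gy = 0;
                for (int i = 0; i < 3; i++)
                    for (int j = 0; j < 3; j++) {
                        gx += win[i][j] * GX[i][j];
                        gy += win[i][j] * GY[i][j];
                    }
                out[(r - 1) * W + (c - 1)] = std::abs(gx) + std::abs(gy);
            }
        }
    return out;
}
```

The output lags the input by one row and one column, which is the usual cost of a streaming 3x3 filter.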
1 FPS @ 1080p
10x speedup
Video 2: Run the Solution 1 on a Zynq board
Problem 2: Loop Iterations Are Sequential
Executing sequentially: in sobel_filter, each iteration of the row and col loops completes before the next one starts.
Solution 2: Pipeline Loop Iterations
Assuming 60 cycles per loop iteration:
Sequential: 60 * 1920 * 1080 = 124,416,000 cycles
Pipelined: 60 + (1920 * 1080) = 2,073,660 cycles (60x speedup)
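As a quick check of these figures, take $L = 60$ cycles per iteration and $N = 1920 \times 1080 = 2{,}073{,}600$ iterations. The pipelined estimate assumes an initiation interval of 1, which the slide implies but does not state, and it uses the slide's approximation $L + N$ rather than $L + (N - 1)$.

```latex
\begin{align*}
T_{\text{seq}}  &= L \cdot N = 60 \cdot 2{,}073{,}600 = 124{,}416{,}000 \\
T_{\text{pipe}} &\approx L + N = 60 + 2{,}073{,}600 = 2{,}073{,}660 \\
\frac{T_{\text{seq}}}{T_{\text{pipe}}} &\approx 59.998 \approx 60
\end{align*}
```

Because $N$ is so much larger than $L$, the speedup is almost exactly $L$ itself.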
Solution 2: Add PIPELINE Pragma
void sobel_filter(unsigned short *yc_in, unsigned short *yc_out,
                  short x_op[3][3], short y_op[3][3]) {
  int row, col;
  ap_linebuffer<unsigned char, 3, NUMCOLS> buff_A;
  ap_window<unsigned char, 3, 3> buff_C;
  for (row = 0; row < NUMROWS; row++) {
    for (col = 0; col < NUMCOLS; col++) {
#pragma HLS PIPELINE II=1
      unsigned short input_data;
      unsigned char edge;
      update_linebuffer(buff_A, yc_in[index(row, col)]);
      update_window(buff_C);
      if ((col < NUMCOLS) && (row < NUMROWS))
        input_data = yc_in[row * NUMCOLS + col];
      if (isEdge(row, col)) {
        edge = 0;
      } else {
        short x_weight = 0, y_weight = 0;
        for (char i = 0; i < 3; i++) {
          for (char j = 0; j < 3; j++) {
            unsigned short temp = buff_C[index(row, col, i, j)];
            x_weight += temp * x_op[i][j];
            y_weight += temp * y_op[i][j];
          }
        }
        edge = ABS(x_weight) + ABS(y_weight);
      }
      yc_out[index(row, col)] = edge;
    }
  }
}
60 FPS @ 1080p
60x speedup
Video 3: Run the Solution 2 on a Zynq board
Sobel Edge Detection – Demo
HDMI cable
1080p 60fps Video source
ZC702 Board (ZYNQ 7020)
SDSoC Sobel Filter Demo - Summary
Stage | FPS
A naïve Sobel filter implementation | 0.1
Problem 1: Non-sequential overlapped memory accesses |
Solution 1: Reducing memory access by 9x using line buffer and window | 1.0
Problem 2: Sequential loop iteration |
Solution 2: Pipeline loop iteration using the pipeline pragma | 60.0
600x Speedup with only ~15 lines of code change
Let SDSoC generate an optimized HW/SW system
so you can focus on your algorithm optimization