Synthesis of Platform Architectures from OpenCL Programs

Muhsen Owaida

Konstantis Daloukas

Nikolaos Bellas

Christos D. Antonopoulos

Department of Computer and Communication Engineering, University of Thessaly

Volos, Greece

05/02/23 FCCM 2011 2

Introduction

• High-Level Synthesis (HLS) has been at the research forefront in the last few years.
• A variety of programming models have been introduced: C/C++, C-like languages, MATLAB, CUDA.
• Obstacles:
– Parallelism expression.
– Extensive compiler transformations & optimizations.

Motivation

• Lack of a parallel programming language for reconfigurable platforms.
• A major shift of the computing industry toward many-core computing systems.
• Reconfigurable fabrics bear a strong resemblance to many-core systems.

Contribution

• Silicon-OpenCL “SOpenCL”.
• A tool flow to convert an unmodified OpenCL application into a SoC design with HW/SW components.
• Template-based hardware accelerator generation.
• Decoupling of data movement and computations.

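[Figure: SOpenCL tool flow. The front end takes an OpenCL kernel and produces a C function. The back end uses an architectural template (streaming unit and datapath, with input data and output data) to produce HW accelerators. The resulting system-on-chip (SoC) contains an on-chip CPU, an on-chip bus, HW accelerators and off-chip memory, together with drivers & runtime. Simulation & verification is also part of the flow.]

Example OpenCL kernel from the figure:

    __kernel void VecAdd2D(__global int* A, __global int* B, __global int* C)
    {
        int i = get_local_id(0);
        int j = get_local_id(1);
        /* width (row length) is assumed to be defined elsewhere */
        C[i*width + j] = A[i*width + j] + B[i*width + j];
    }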

Outline

• High-Level Synthesis
• OpenCL Programming Model
• SOpenCL
– Front-End
– Back-End
– Run-Time
• Experimental Evaluation
• Conclusion

OpenCL Programming Language

• Open Computing Language.
• OpenCL expresses parallelism at its finest granularity.
• The computation grid is partitioned in a 3-dimensional space of work groups.

[Figure: computation grid with dimensions Gx and Gy, divided into work groups with dimensions Sx and Sy. Inside work group (idx, idy), local positions run from (x = 0, y = 0) to (x = Sx - 1, y = Sy - 1). The work item at local position (x, y) sits at grid position (idx*Sx + x, idy*Sy + y). Each work item is one thread executing the kernel.]

Work-item thread (OpenCL kernel):

    __kernel void chromaMotionCompensation(__global char* refF, __global char* outF, int FWidth)
    {
        int i = get_local_id(0);
        int j = get_local_id(1);
        int refX = get_group_id(0);
        int refY = get_group_id(1);
        int PixX = get_global_id(0);
        int PixY = get_global_id(1);
        /* DXDY, dxDY, DXdy, dxdy: interpolation weights, assumed to be defined elsewhere */
        int Pval = ( DXDY * refF[ (refY + j) * FWidth + (refX + i) ]
                   + dxDY * refF[ (refY + j) * FWidth + (refX + i + 1) ]
                   + DXdy * refF[ (refY + j + 1) * FWidth + (refX + i) ]
                   + dxdy * refF[ (refY + j + 1) * FWidth + (refX + i + 1) ]
                   + 32 ) >> 6;
        if (Pval < 0) Pval = 0;
        else if (Pval > 255) Pval = 255;
        outF[ PixY * FWidth + PixX ] = Pval;
    }
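The index arithmetic in the figure can be sketched in plain C. This is a hypothetical helper, not part of SOpenCL: the names `WorkItemId` and `global_id` are invented for illustration. It mirrors the relation global position = work-group id × work-group size + local position, which is what OpenCL's `get_global_id` returns when the global offset is zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: position of one work item in a 2-D grid. */
typedef struct {
    size_t x;
    size_t y;
} WorkItemId;

/* Global position of the work item at local position (x, y) inside work
 * group (idx, idy), for work groups of Sx by Sy work items.
 * Assumes a zero global offset. */
WorkItemId global_id(size_t idx, size_t idy, size_t Sx, size_t Sy,
                     size_t x, size_t y)
{
    WorkItemId id;

    assert(x < Sx && y < Sy);  /* local position must lie inside the group */
    id.x = idx * Sx + x;
    id.y = idy * Sy + y;
    return id;
}
```

For example, with 8 × 4 work groups, local position (3, 2) in work group (2, 1) maps to grid position (19, 6).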

Data Movement

• Explicit data movement: local buffers and global buffers.

SOpenCL Front-End (I): Granularity Coarsening

• A work item represents a light computational load.
• Coarsen the granularity because of limited resources and memory bandwidth.

OpenCL kernel:

    __kernel void chromaMotionCompensation(__global char* refF, __global char* outF, int FWidth)
    {
        int i = get_local_id(0);
        int j = get_local_id(1);
        int refX = get_group_id(0);
        int refY = get_group_id(1);
        int PixX = get_global_id(0);
        int PixY = get_global_id(1);
        /* DXDY, dxDY, DXdy, dxdy: interpolation weights, assumed to be defined elsewhere */
        int Pval = ( DXDY * refF[ (refY + j) * FWidth + (refX + i) ]
                   + dxDY * refF[ (refY + j) * FWidth + (refX + i + 1) ]
                   + DXdy * refF[ (refY + j + 1) * FWidth + (refX + i) ]
                   + dxdy * refF[ (refY + j + 1) * FWidth + (refX + i + 1) ]
                   + 32 ) >> 6;
        if (Pval < 0) Pval = 0;
        else if (Pval > 255) Pval = 255;
        outF[ PixY * FWidth + PixX ] = Pval;
    }

C function:

    void chromaMotionCompensation(char* refF, char* outF, int* local_size, int FWidth,
                                  int refX, int refY, int PX_init, int PY_init)
    {
        int kernel_i, kernel_j, kernel_k, Pval, i, j, PixX, PixY;
        /* One loop nest iterates over all work items of a work group. */
        for (kernel_k = 0; kernel_k < local_size[2]; kernel_k++) {
            for (kernel_j = 0; kernel_j < local_size[1]; kernel_j++) {
                for (kernel_i = 0; kernel_i < local_size[0]; kernel_i++) {
                    PixX = PX_init + kernel_i;
                    PixY = PY_init + kernel_j;
                    i = kernel_i;
                    j = kernel_j;
                    Pval = ( DXDY * refF[ (refY + j) * FWidth + (refX + i) ]
                           + dxDY * refF[ (refY + j) * FWidth + (refX + i + 1) ]
                           + DXdy * refF[ (refY + j + 1) * FWidth + (refX + i) ]
                           + dxdy * refF[ (refY + j + 1) * FWidth + (refX + i + 1) ]
                           + 32 ) >> 6;
                    if (Pval < 0) Pval = 0;
                    else if (Pval > 255) Pval = 255;
                    outF[ PixY * FWidth + PixX ] = Pval;
                }
            }
        }
    }

SOpenCL Front-End (II): Barrier Elimination

OpenCL code:

    Statements_block1
    barrier();
    Statements_block2

C code:

    triple_nested_loop {
        Statements_block1
    }
    // barrier();
    triple_nested_loop {
        Statements_block2
    }
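A small C sketch may help show why splitting the loop at the barrier preserves the barrier's meaning. The function `reverse_scaled`, the constant `GROUP_SIZE`, and the statements in the two blocks are all hypothetical, chosen only so that block 2 reads values written by other work items in block 1. A single loop stands in for the triple nested loop to keep the example short.

```c
#include <assert.h>

#define GROUP_SIZE 8  /* hypothetical work-group size */

/* C version of a work-group function whose (hypothetical) OpenCL source was:
 *     tmp[i] = in[i] * 2;                 // Statements_block1
 *     barrier();
 *     out[i] = tmp[GROUP_SIZE - 1 - i];   // Statements_block2
 */
void reverse_scaled(const int *in, int *out)
{
    int tmp[GROUP_SIZE];  /* stands in for an OpenCL local buffer */

    assert(in != out);    /* the sketch assumes distinct buffers */

    /* Loop 1: every work item completes block 1. */
    for (int i = 0; i < GROUP_SIZE; i++)
        tmp[i] = in[i] * 2;

    /* The barrier is implied here: loop 1 has finished for all work items. */

    /* Loop 2: only now does any work item execute block 2. */
    for (int i = 0; i < GROUP_SIZE; i++)
        out[i] = tmp[GROUP_SIZE - 1 - i];
}
```

Fusing the two loops into one would let work item 0 read `tmp[7]` before it is written, which is exactly the hazard the barrier guards against.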

Hardware Generation

• Perform a series of optimizations and transformations.
– Uses the LLVM compiler infrastructure.
• Generate synthesizable Verilog.
• Generate test bench and simulation files.

[Figure: back-end flow. C code (nested loop) → LLVM compiler → optimized LLVM-IR → predication → code slicing → SMS modulo scheduling → Verilog generation. Other inputs: accelerator template, user performance requirements. Outputs: synthesizable Verilog and test bench, used for simulation and for synthesis of the final bitstream.]

If-Conversion

• Predication: if-conversion is necessary before the modulo scheduler can be applied.

Innermost loop body (LLVM assembly) before if-conversion:

    bb0:
      r0 = cmp eq t, 0
      br r0, bb1, bb2
    bb1:
      r1 = load A
      br bb3
    bb2:
      r2 = add a, 1
      br bb3
    bb3:
      r4 = phi r1, bb1, r2, bb2
      br bb4

After if-conversion (predicates in parentheses):

    bb0:
      r0 = cmp eq t, 0
      p0 = xor r0, true
      (r0) r1 = load A
      (p0) r2 = add a, 1
      r4 = select r0, r1, r2
      br bb4
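The same transformation can be sketched at the C level. The two functions below are illustrative only (their names are invented); they mirror the register names of the LLVM fragment. In this software sketch both sides are evaluated unconditionally and the result is picked with a select, which assumes the pointer `A` is always valid; in hardware, the predicates guard the operations instead.

```c
#include <assert.h>

/* Control-flow form, mirroring basic blocks bb0..bb3. */
int with_branches(int t, const int *A, int a)
{
    int r4;
    if (t == 0)
        r4 = *A;       /* bb1: r1 = load A   */
    else
        r4 = a + 1;    /* bb2: r2 = add a, 1 */
    return r4;         /* bb3: phi           */
}

/* If-converted form: a single basic block without branches. */
int if_converted(int t, const int *A, int a)
{
    assert(A != 0);        /* the load below is executed unconditionally */
    int r0 = (t == 0);     /* r0 = cmp eq t, 0                           */
    int r1 = *A;           /* guarded by predicate r0 in the slide       */
    int r2 = a + 1;        /* guarded by predicate p0 = !r0 in the slide */
    return r0 ? r1 : r2;   /* r4 = select r0, r1, r2                     */
}
```

With a single basic block, the loop body becomes a straight-line sequence of operations that a modulo scheduler can place into pipeline stages.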

Code Slicing

• Decouple data movement and computations.
• Input streaming kernel (Sin).
• Output streaming kernel (Sout).
• Computational kernel.

Predicated LLVM loop (part of the chroma interpolation LLVM code):

    body:
      i46 = phi [true, preh], [i41, body]
      ind = phi [0, preh], [i2, body]
      i0 = add a0, ind
      i2 = add ind, 1
      i3 = add a0, i2
      gep0 = getelementptr i8* x0, i0
      gep1 = getelementptr i8* x0, i3
      i7 = load i8* gep0
      i10 = load i8* gep1
      i6 = add a2, ind
      gep4 = getelementptr i8* x1, i6
      i9 = mul i7, a3
      i12 = mul i10, a4
      i19 = add i9, 32
      i20 = add i19, i12
      i23 = ashr i22, 6
      store i23, i8* gep4
      i40 = icmpeq i2, 8
      i41 = xor i40, true
      br i40, exit, body

Sin kernel:

    ind = phi [0, preh], [i2, body]
    i0 = add a0, ind
    i2 = add ind, 1
    i3 = add a0, i2
    gep0 = getelementptr i8* x0, i0
    gep1 = getelementptr i8* x0, i3
    i7 = load i8* gep0
    i10 = load i8* gep1

Computational kernel (the figure marks its computation and termination parts):

    i46 = phi [true, preh], [i41, body]
    ind = phi [0, preh], [i2, body]
    i2 = add ind, 1
    i7 = pop i8* gep0
    i10 = pop i8* gep1
    i9 = mul i7, a3
    i12 = mul i10, a4
    i19 = add i9, 32
    i20 = add i19, i12
    i23 = ashr i22, 6
    push i23, i8* gep4
    i40 = icmpeq i2, 8
    i41 = xor i40, true
    br i40, exit, body

Sout kernel:

    ind = phi [0, preh], [i2, body]
    i2 = add ind, 1
    i6 = add a2, ind
    gep4 = getelementptr i8* x1, i6
    store i23, i8* gep4
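The slicing can be mimicked in C with three functions that communicate only through FIFOs. This is a rough sketch under several assumptions that are not in the slides: the `Fifo` type and all function names are hypothetical, `i8` data is treated as `uint8_t`, the undefined `i22` in the excerpt is taken to be `i20`, and the three slices run one after another here, whereas in hardware they would run concurrently.

```c
#include <assert.h>
#include <stdint.h>

#define N_ITER   8   /* trip count of the loop on the slide        */
#define FIFO_CAP 16  /* hypothetical FIFO depth, at least N_ITER   */

/* Minimal FIFO standing in for the hardware data queues. */
typedef struct {
    int32_t data[FIFO_CAP];
    int head, tail;
} Fifo;

static void push(Fifo *f, int32_t v) { assert(f->tail < FIFO_CAP); f->data[f->tail++] = v; }
static int32_t pop(Fifo *f) { assert(f->head < f->tail); return f->data[f->head++]; }

/* Sin kernel: address generation and loads only. */
static void sin_kernel(const uint8_t *x0, int a0, Fifo *in0, Fifo *in1)
{
    for (int ind = 0; ind < N_ITER; ind++) {
        push(in0, x0[a0 + ind]);      /* i7  = load gep0 */
        push(in1, x0[a0 + ind + 1]);  /* i10 = load gep1 */
    }
}

/* Computational kernel: no memory accesses, only pops, arithmetic, pushes. */
static void compute_kernel(Fifo *in0, Fifo *in1, int a3, int a4, Fifo *out0)
{
    for (int ind = 0; ind < N_ITER; ind++) {
        int32_t i7 = pop(in0);
        int32_t i10 = pop(in1);
        push(out0, (i7 * a3 + 32 + i10 * a4) >> 6);
    }
}

/* Sout kernel: address generation and stores only. */
static void sout_kernel(Fifo *out0, uint8_t *x1, int a2)
{
    for (int ind = 0; ind < N_ITER; ind++)
        x1[a2 + ind] = (uint8_t)pop(out0);
}

/* Runs the three slices back to back. */
void sliced_loop(const uint8_t *x0, uint8_t *x1, int a0, int a2, int a3, int a4)
{
    Fifo in0 = {{0}, 0, 0}, in1 = {{0}, 0, 0}, out0 = {{0}, 0, 0};
    sin_kernel(x0, a0, &in0, &in1);
    compute_kernel(&in0, &in1, a3, a4, &out0);
    sout_kernel(&out0, x1, a2);
}

/* The unsliced loop, for comparison. */
void original_loop(const uint8_t *x0, uint8_t *x1, int a0, int a2, int a3, int a4)
{
    for (int ind = 0; ind < N_ITER; ind++)
        x1[a2 + ind] = (uint8_t)((x0[a0 + ind] * a3 + 32 + x0[a0 + ind + 1] * a4) >> 6);
}
```

Because the computational slice never touches memory, its schedule does not depend on memory latency; the FIFOs absorb the difference in pace between the slices.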

Modulo Scheduling

• Software pipelining:
– II: initiation interval.
• Swing Modulo Scheduling (SMS).
• Valid bits are used to implement the prologue and epilogue.

[Figure: software-pipelined loop with stages A to E. Iterations 1 to N start II cycles apart. Prologue: fill pipeline. Kernel: steady state. Epilogue: drain pipeline.]
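The role of the valid bits can be illustrated with a small C simulation. It is a sketch rather than the SOpenCL implementation: `simulate_pipeline`, `total_cycles` and `PipelineStats` are hypothetical names, and each pipeline stage is assumed to take exactly II cycles. One valid bit per stage is shifted along every II cycles; a stage does work only when its bit is set, so the same kernel code covers the prologue, the steady state and the epilogue.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_STAGES 16  /* hypothetical upper bound on pipeline depth */

/* Result of simulating a software-pipelined loop. */
typedef struct {
    int kernel_steps;      /* number of II-cycle steps executed         */
    int stage_executions;  /* total number of (stage, iteration) pairs  */
} PipelineStats;

PipelineStats simulate_pipeline(int n_iters, int n_stages)
{
    bool valid[MAX_STAGES] = { false };
    PipelineStats stats = { 0, 0 };
    int issued = 0;

    assert(n_stages > 0 && n_stages <= MAX_STAGES);

    for (;;) {
        /* Shift the valid bits one stage further down the pipeline. */
        for (int s = n_stages - 1; s > 0; s--)
            valid[s] = valid[s - 1];

        /* Stage A receives a new iteration while any remain. */
        valid[0] = (issued < n_iters);
        if (valid[0])
            issued++;

        /* Only stages whose valid bit is set do work in this step. */
        bool any = false;
        for (int s = 0; s < n_stages; s++) {
            if (valid[s]) {
                stats.stage_executions++;
                any = true;
            }
        }
        if (!any)
            break;  /* pipeline fully drained */
        stats.kernel_steps++;
    }
    return stats;
}

/* Total cycles when every step takes ii cycles. */
int total_cycles(int n_iters, int n_stages, int ii)
{
    return simulate_pipeline(n_iters, n_stages).kernel_steps * ii;
}
```

For N iterations and S stages the simulation runs N + S - 1 steps: S - 1 of them fill the pipeline and S - 1 drain it, with the two phases overlapping when N is smaller than S.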

Verilog Generation

[Figure: the architectural template populated from the sliced kernels.
Streaming unit: arbiter, Sin align unit, Sout align unit, Sin requests generator, cache unit, Sin AGU, Sout AGU. It connects to the system interconnect (Address, Data_in, Data_out, Data_line) and serves local requests.
Data path: functional units (FU) fed through multiplexers, named registers, memory-mapped registers.
A tunnel between the streaming unit and the data path carries Sin0, Sin1, Sout0 and Terminate.
The Sin kernel, computational kernel and Sout kernel from the code slicing step are shown next to the template, with the notes "feed data in order" and "write data in order".
Parameters shown: FU types, bitwidths, I/O bandwidth, requests/data FIFO size.]

Run-Time

• The OpenCL main program is executed as the main thread on the host processor of the platform (e.g. PowerPC).
• Work tasks are created by the helper thread.

[Figure: run-time system. Components: host main thread, host helper thread, command queue, work queue, accelerator, work thread (PowerPC). Interactions, numbered 1 to 5 in the figure: enqueue OpenCL command, enqueue new work tasks, initialize accelerator, finish signal.]
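One plausible way for a helper thread to create work tasks is to split the computation grid into work groups and enqueue one task per group. The sketch below is an assumption about how this could look, not the SOpenCL run-time itself: `WorkTask`, `WorkQueue` and `enqueue_work_tasks` are hypothetical names, the grid is restricted to two dimensions, and the task fields simply reuse the extra parameters of the coarsened C function (`refX`, `refY`, `PX_init`, `PY_init`).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS 64  /* hypothetical work-queue capacity */

/* One work task corresponds to one work group. */
typedef struct {
    int refX, refY;        /* work-group id in x and y         */
    int PX_init, PY_init;  /* global id of its first work item */
} WorkTask;

typedef struct {
    WorkTask tasks[MAX_TASKS];
    size_t count;
} WorkQueue;

/* Splits a grid of global_x by global_y work items into work groups of
 * local_x by local_y and enqueues one task per group.
 * Returns the number of tasks enqueued. */
size_t enqueue_work_tasks(WorkQueue *q, int global_x, int global_y,
                          int local_x, int local_y)
{
    size_t added = 0;

    assert(local_x > 0 && local_y > 0);
    assert(global_x % local_x == 0 && global_y % local_y == 0);

    for (int gy = 0; gy < global_y / local_y; gy++) {
        for (int gx = 0; gx < global_x / local_x; gx++) {
            WorkTask t = { gx, gy, gx * local_x, gy * local_y };
            assert(q->count < MAX_TASKS);
            q->tasks[q->count++] = t;
            added++;
        }
    }
    return added;
}
```

Each dequeued task would then supply the arguments for one call of the coarsened C function or one run of the accelerator.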

Experimental Evaluation

• We tested the SOpenCL methodology on six OpenCL and C applications.
• We evaluated our designs on a Xilinx Virtex-5 FX70 FPGA.
• We used the Xilinx ISE 11.4 toolset for synthesis, placement and routing.
• Evaluation methodology:
– Three levels of resource availability {Ca, Cb, Cc}.
– Three requests/data FIFO sizes.
– Cache usage.

Results

[Figure: two charts, MatMul and VAdd. Each plots execution time (ms) and reads issued (×1000) against data/request FIFO size (2, 4, 8) for the configurations Ca, Cb and Cc.]

Results

• The cache is useful for applications with temporal locality.

[Figure: three charts, LMC, 1-D DCT and CMC. Each plots execution time (ms) and reads issued (×1000) for the datapath configurations Ca, Cb and Cc, with cache and without cache.]

Conclusion

• SOpenCL: a tool flow to produce the hardware and software architecture of accelerator-based SoCs.
• OpenCL serves as a unified programming model for:
– Heterogeneous many-core platforms.
– Reconfigurable platforms (like FPGAs).
• Future work:
– Support for multiple accelerators.
– Automatic selection of hardware configurations.

Questions

Thank you for your attention
