Introduction to circuit design using Celoxica’s Handel-C
Presenter: Mr. David Sanders
Co-Sponsored by: The Internet Innovation Centre and IEEE Computer Society
Agenda
FPGA overview
Purpose of Handel-C
Comparison with ANSI C
Handel-C data types, parallelism, special Handel-C constructs and data types
Hardware implementation of Handel-C constructs (some of them)
Optimization and retiming features of Celoxica’s design tool
FPGAs
Field Programmable Gate Array: a user-programmable logic device with a collection of Look-Up Tables (LUTs), routing resources, and Input/Output Blocks (IOBs).
The number of LUT inputs varies with the vendor and technology used. Usually there are at least 4 inputs, but state-of-the-art LUTs can have up to 8 (Altera’s 8-input fracturable LUT).
Modern FPGAs also include dedicated RAM blocks, ALUs, multipliers, and hard or soft-core processors (e.g. ARM, NIOS II, MicroBlaze, PPC).
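A LUT can be thought of as a small truth table indexed by its inputs. The sketch below is a plain C software model of a 4-input LUT, not Handel-C and not any vendor's primitive; the name lut4 and the bit ordering of the table are assumptions made for illustration.

```c
#include <stdint.h>

/* Software model of a 4-input LUT (illustrative only).
   The 16-entry truth table is stored in a 16-bit word:
   bit i of `table` is the output for the input pattern whose value is i. */
unsigned lut4(uint16_t table, unsigned in3, unsigned in2, unsigned in1, unsigned in0)
{
    unsigned index = ((in3 & 1u) << 3) | ((in2 & 1u) << 2)
                   | ((in1 & 1u) << 1) |  (in0 & 1u);
    return ((unsigned)table >> index) & 1u;
}
```

With this ordering, the table 0x8000 behaves as a 4-input AND and 0x6996 as a 4-input XOR (parity); changing the function only changes the table contents, not the structure.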
So what does Handel-C do for us?
Those who have programmed in VHDL/Verilog know that you must think in terms of a state machine, and write the code accordingly.
Handel-C is one level of abstraction higher than an HDL.
The compiler deals with the state machine generation automatically.
It will create a netlist, but not an FPGA programming file. The FPGA vendor tools are still required.
It does, however, provide scripts for automating place and route and bitstream creation with the Xilinx and Altera tools.
Still no free lunch!!
Just as in Professor McLeod’s talk last time, there is no free lunch.
The state machine generated by the Handel-C compiler uses One-Hot Encoding.
Not necessarily optimal for every design, but still gives good results in practice.
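To make the trade-off concrete, here is a small C sketch (a software model, not compiler output) comparing one-hot and binary state encoding. The function names are hypothetical. One-hot uses one flip-flop per state, while binary encoding uses fewer flip-flops but generally needs more decode logic.

```c
/* One-hot encoding: state k is represented by a word with only bit k set. */
unsigned onehot_encode(unsigned state)
{
    return 1u << state;
}

/* One-hot needs one flip-flop per state. */
unsigned onehot_flipflops(unsigned num_states)
{
    return num_states;
}

/* Binary encoding needs ceil(log2(num_states)) flip-flops. */
unsigned binary_flipflops(unsigned num_states)
{
    unsigned bits = 0;
    while ((1u << bits) < num_states)
        bits++;
    return bits;
}
```

For a 5-state machine this gives 5 flip-flops for one-hot against 3 for binary, which is the kind of cost the slide refers to.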
Other Capabilities of Handel-C Design Suite
Provides support for Altera, Xilinx, and Actel FPGAs.
The Handel-C compiler can take advantage of technology available in a particular device (RAMs, ALUs, multipliers, etc.).
The compiler can provide output in different formats:
  Netlist (EDIF)
  VHDL code from C code
  Debug (can be used with a SystemC or ANSI C front-end for verification)
Provides a Platform Abstraction Layer (PAL):
  A set of common utilities for hardware devices commonly found on development boards: video, keyboard, mouse, Ethernet, RS-232, LED output, general I/O, etc.
Provides support for integration with company-specific tools and/or intellectual property:
  Quartus II, SOPC Builder, NIOS II processor, MicroBlaze processor
Handel-C Data Types
Handel-C supports all of the primitive integral types provided by ANSI C (signed and unsigned): char, int, short, long.
Variables are implemented as registers.
The depth of an array must be specified at compile time.
Variables of arbitrary width, from 1 to 128 bits, can also be declared, e.g.
  unsigned 8 myVariable;
  signed 25 myVariable2[15];
No native floating point types or calculations in the current version. The course instructor claims they will be included in the next release.
Operators
All the operators from ANSI C, plus a few others:
Relational: !=, ==, <, >, <=, >= (greater-than and less-than are expensive to evaluate with combinational logic). Operands must have the same width. The result is a 1-bit value.
Logical: &&, ||, !. These take 1-bit unsigned operands; however, for wider operands the compiler will take x || y as: x!=0 || y!=0
Bitwise: ^, |, &, ~. Operands must have equal width.
Shift: <<, >>. For a << b, b must have a width of ceil(log2(width(a)+1)). Macros provided by the Platform Developers Kit (PDK).
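The width rule for the shift amount can be checked with a few lines of C. This is only an arithmetic helper written for illustration (the name shift_amount_width is hypothetical and is not one of the PDK macros): it returns the smallest number of bits that can represent every shift distance from 0 to width(a).

```c
/* Smallest number of bits n such that 2^n >= width_a + 1,
   i.e. ceil(log2(width_a + 1)). */
unsigned shift_amount_width(unsigned width_a)
{
    unsigned bits = 0;
    while ((1ul << bits) < (unsigned long)width_a + 1ul)
        bits++;
    return bits;
}
```

For an 8-bit a, the shift amount b therefore needs 4 bits; for a 7-bit a, 3 bits are enough.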
…the others
Bit manipulation
  Take: <-
  Drop: \\
  a          = 1 0 0 1 1
  b = a <- 3 = 0 1 1
  c = a \\ 3 = 1 0
Very cheap in hardware since these operators are implemented as wires.
Range selection: expression[n:m] (bits n to m)
  a[3:1] = 0 0 1
Concatenation: expression1 @ expression2
  d = a @ a[3:1] = 1 0 0 1 1 0 0 1
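For readers more used to ANSI C, the same four operations can be modelled with shifts and masks. The helpers below are a sketch with hypothetical names; C has no bit-width types, so the width of the right-hand operand of the concatenation is passed explicitly.

```c
/* take: keep the n least significant bits (Handel-C: v <- n) */
unsigned take_bits(unsigned v, unsigned n)
{
    return v & ((1u << n) - 1u);
}

/* drop: discard the n least significant bits (Handel-C: v \\ n) */
unsigned drop_bits(unsigned v, unsigned n)
{
    return v >> n;
}

/* range selection: bits hi down to lo (Handel-C: v[hi:lo]) */
unsigned range_bits(unsigned v, unsigned hi, unsigned lo)
{
    return (v >> lo) & ((1u << (hi - lo + 1u)) - 1u);
}

/* concatenation: left @ right, where right is right_width bits wide */
unsigned concat_bits(unsigned left, unsigned right, unsigned right_width)
{
    return (left << right_width) | right;
}
```

With a = 10011 (binary), take_bits(a, 3) gives 011, drop_bits(a, 3) gives 10, range_bits(a, 3, 1) gives 001, and concat_bits(a, range_bits(a, 3, 1), 3) gives 10011001, matching the slide. In hardware none of this costs logic; the C shifts and masks only imitate the wiring.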
Parallelism
Since logic circuit operation is highly parallel by nature, it is necessary for a design tool to support parallelism.
Accomplished in Handel-C by using a par statement, as opposed to a seq statement, where the code is executed sequentially.
static unsigned 8 a = 2;
static unsigned 8 b = 1;
par
{
    a++;
    b = a + 10;
}
Each Handel-C assignment takes 1 clock cycle. Both statements begin execution at the same time, so together they take only 1 clock cycle. Operations are performed on the values that the variables held at the start of the clock cycle.
Results: a = 3, b = 12
static unsigned 8 a = 2;
static unsigned 8 b = 1;
seq
{
    a++;
    b = a + 10;
}
The seq block operates in the same manner as you would expect from an ANSI C program.
Results: a = 3, b = 13
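The difference between the two blocks can be reproduced in ordinary C by modelling the registers explicitly. This is a software sketch of the semantics described above, with hypothetical names: in the par model every right-hand side reads the values from the start of the cycle, while the seq model updates the registers one statement at a time.

```c
typedef struct {
    unsigned char a;
    unsigned char b;
} regs;

/* par { a++; b = a + 10; } : one clock cycle, both statements
   read the register values as they were at the start of the cycle. */
regs run_par(regs r)
{
    regs next = r;
    next.a = (unsigned char)(r.a + 1);
    next.b = (unsigned char)(r.a + 10);
    return next;
}

/* seq { a++; b = a + 10; } : two clock cycles, the second
   statement sees the result of the first. */
regs run_seq(regs r)
{
    r.a = (unsigned char)(r.a + 1);
    r.b = (unsigned char)(r.a + 10);
    return r;
}
```

Starting from a = 2, b = 1, run_par yields a = 3, b = 12 and run_seq yields a = 3, b = 13.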
Signals
However, occasionally we need to use a value immediately after assigning it in a par block.
This can be done by declaring a variable as a signal.
The value of a signal lasts only for the duration of the current clock cycle.
signal unsigned 8 a;
static unsigned 8 b;
par
{
    a = 7;
    b = a;
}
Results: a = 0, b = 7
signal unsigned 8 a;
static unsigned 8 b;
seq
{
    a = 7;
    b = a;
}
Results: a = 0, b = 0
Nesting seq and par
Can be nested as in the following example:
par
{
    seq
    {
        /* some statements to be executed sequentially */
    }
    seq
    {
        /* these statements are executed sequentially, but in parallel with the previous seq block */
    }
}
par will not return until all of the statements/sub-blocks have completed.
Special Data Types
Input/Output
Obviously there must be a mechanism for performing I/O with the FPGA.
Handel-C has data types for buses or interfaces (input, output, tri-state).
It also supports ports: I/O between modules/components in a design, not a physical pin.
I/O Declaration Examples
Input interface prototype:
interface bus_in(type portName) Name() with {data = {Pin List}};
Input interface usage:
interface bus_in(unsigned 2 val) myInput() with {data = {"P1","P2"}};
unsigned 2 inData;
inData = myInput.val; //read the value {P1 P2}
Examples cont’d
Output interface prototype:
interface bus_out() Name(type portName=Expression) with {data = {Pin List}};
Output interface usage:
static unsigned 8 counter = 0;
interface bus_out() CountOut(unsigned 8 outVal=counter+1);
while(1)
{
    counter++;
}
RAM and ROM
No such thing as malloc() on an FPGA.
Instead, Handel-C allows you to store variables in the FPGA’s dedicated RAM blocks.
ram int 9 myRam[256]; /* a RAM block that holds 256 9-bit integers */
static rom int 9 myRom[3] = {100,200,300}; /* must be static or global */
RAMs are different from arrays because declaring an array is the same as declaring multiple variables.
This means that all of an array’s elements can be accessed simultaneously. A RAM’s entries cannot, because a RAM only has 1 or 2 ports.
myRam[25]++; /* read-modify-write in one cycle = undefined results */
par /* 2 writes during the same cycle -> this also won’t work */
{
    myRam[0] = 100;
    myRam[2] = 498;
}
If/Else
Handel-C if/else syntax is almost the same as in ANSI C.
The exception: the condition of the if() must take 0 clock cycles to evaluate. Since every assignment takes 1 clock cycle, there cannot be any variable assignment in the condition expression.
if( (z = x + y) == 6) //legal in ANSI C, but not in Handel-C
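The usual way around this restriction is to hoist the assignment out of the condition. The sketch below shows the restructuring in plain C with a hypothetical helper name; in Handel-C the same shape would presumably cost one clock cycle for the assignment, leaving a condition that is a pure comparison.

```c
/* Equivalent of: if ((z = x + y) == 6) ...
   with the assignment moved out of the condition. */
int sum_is_six(int x, int y, int *z)
{
    *z = x + y;      /* the assignment becomes its own statement */
    return *z == 6;  /* the condition no longer assigns anything */
}
```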
Loops
while(), for(), do…while()
All have the same syntax as in ANSI C.
The same limitation applies to the conditions as with if/else.
When programming a PC, it is good practice to use a for loop when the context calls for it.
When writing C code for circuits, it’s almost never good practice to use for() loops at all: there is one clock cycle of overhead per iteration.
While Loop Optimization
The limitations of a for() loop can be avoided by updating a counter variable in parallel with the body of a while() loop.
static unsigned 4 x = 15;
do
{
    par
    {
        //do something
        x--;
    }
} while(x != 0);
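A rough cycle count shows why this matters. The C helpers below are a back-of-the-envelope model, not a measurement. They assume that a for() loop behaves like "init; while(cond) { body; step; }" with the init and each step costing one clock cycle, and that in the parallel version the counter update is hidden behind a body of at least one cycle. The function names are hypothetical.

```c
/* for(init; cond; step) body
   assumed cost: 1 cycle for init, then (body + 1 step cycle) per iteration */
unsigned for_loop_cycles(unsigned iterations, unsigned body_cycles)
{
    return 1u + iterations * (body_cycles + 1u);
}

/* loop whose counter is updated in parallel with the body:
   assumed cost: body cycles only (body_cycles >= 1) */
unsigned par_loop_cycles(unsigned iterations, unsigned body_cycles)
{
    return iterations * body_cycles;
}
```

Under these assumptions, 15 iterations of a one-cycle body take 31 cycles as a for() loop and 15 cycles with the counter updated in parallel.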
Macros, Channels, Prialt, and Semaphores
Scenario: Suppose you need to design a circuit that calculates pixel values in a frame buffer, and that each calculation takes 4 or 5 clock cycles. However, you need to calculate one pixel every clock cycle to meet a display timing constraint.
Possible solution: Duplicate the calculation code 5 times, and have each block store values in the proper place in the frame buffer.
Macros
Macros can be used to implement parameterizable code, or to provide code re-use.
Like a regular function without parameter types.
For the solution to our scenario, declaring a macro would look like:
macro proc myCalculation(dataSource)
{
//receive data from source
//Perform 3-5 clock cycles worth of calculations
}
Channels
Handel-C provides a channel type to allow for synchronization or communication between parallel processes.
Declaration: chan <type> <channelName>
Data can then be sent over the channel, or received from it, but only in one direction.
Two parallel blocks of code:
//sending block
chan unsigned 8 dataPipe;
static unsigned 8 someData = 5;
…
dataPipe ! someData;
…
//receiving block
static unsigned 8 recvData;
…
dataPipe ? recvData;
…
Must be declared with global scope. Each channel operation will block if the other party is not ready.
Prialt
Now suppose we have 5 of our ‘worker’ processes running in parallel. How do we use them to achieve our goal?
Each operation will complete in 3-5 cycles, so we don’t know which of the 5 will be free to perform the next pixel calculation.
But if we send data down a channel sequentially to each of the 5 processes, we might block on one of them, when another is not doing anything…wasted clock cycles.
Prialt is the solution for this.
Prialt (cont’d)
Similar to a case statement that chooses the first channel able to receive data.
In other words, it gives a priority to each channel.
prialt
{
    case channel1 ! data:
        break;
    case channel2 ! data:
        break;
    default:
        break;
}
If default is not used, then prialt will block on the last case statement if a prior one was not taken.
Need to be careful that processes aren’t starved (wasted resources).
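The selection rule can be sketched in C as a priority scan over "ready" flags. This is a software model with hypothetical names, not how the compiler implements prialt: the lowest-numbered ready channel wins, and -1 stands for the default branch.

```c
#include <stddef.h>

/* Returns the index of the first ready channel (highest priority first),
   or -1 if none is ready, which corresponds to taking the default case. */
int prialt_select(const int ready[], size_t num_channels)
{
    for (size_t i = 0; i < num_channels; i++) {
        if (ready[i])
            return (int)i;
    }
    return -1;
}
```

Because the scan always starts at index 0, a channel late in the list is only served when every earlier one is busy, which is where the starvation concern comes from.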
Semaphores
Once a process has finished its computation we need to update the frame buffer (FB), which is typically implemented in a RAM block for FPGA area efficiency.
Recall that a RAM block typically only has one write port, therefore we can’t have each process write to the frame buffer because we can’t guarantee that simultaneous access will not happen.
One solution is to have each process send the result down a separate channel to another process that deals with FB access.
But this is a section on semaphores, so we’ll go with them instead.
Semaphores (cont’d)
Semaphores can be used to guard critical sections of code against parallel access.
More like a mutex from POSIX threads.
trysema() is used to check whether the critical section is free, and releasesema() to release it, e.g.
sema fbGuard;
…
while(trysema(fbGuard)==0) delay; /* loop until semaphore is free */
/* critical section of code, i.e. frame buffer access */
releasesema(fbGuard); /* skipping this step could result in deadlock */
…
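For comparison, the same try/release pattern can be written in C11 with an atomic flag. This is an analogy only, using hypothetical wrapper names; it assumes the convention used on the slide, where a non-zero result from the try operation means the semaphore was taken.

```c
#include <stdatomic.h>

/* Returns 1 if the guard was free and is now held by the caller,
   0 if somebody else already holds it. */
int try_sema(atomic_flag *guard)
{
    return !atomic_flag_test_and_set(guard);
}

/* Releases the guard so that another process can take it. */
void release_sema(atomic_flag *guard)
{
    atomic_flag_clear(guard);
}
```

A caller would spin on try_sema() until it returns 1, perform the frame buffer access, and then call release_sema(); forgetting the release leaves every later try_sema() returning 0.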
Putting it all together…
#define NUM_CHANNELS 5
#define SCR_WIDTH 4
#define SCR_HEIGHT 4
set clock = external;

typedef struct point //just as in ANSI C
{
    unsigned 2 x;
    unsigned 2 y;
} point;

sema fbGuard;

//you can even send structures over channels
chan point dataChannels[NUM_CHANNELS];

ram unsigned 8 frameBuffer[SCR_WIDTH*SCR_HEIGHT];

macro proc increment(p)
{
    if(p.x==SCR_WIDTH-1)
    {
        par
        {
            p.x=0;
            p.y++;
        }
    }
    else
        p.x++;
}

macro proc coordGen()
{
    point pGen;
    pGen.x = 0;
    pGen.y = 0;
    while(1)
    {
        prialt
        {
            case dataChannels[0] ! pGen: increment(pGen); break;
            case dataChannels[1] ! pGen: increment(pGen); break;
            case dataChannels[2] ! pGen: increment(pGen); break;
            case dataChannels[3] ! pGen: increment(pGen); break;
            case dataChannels[4] ! pGen: increment(pGen); break;
            default: delay; break;
        }
    }
}

void main()
{
    par
    {
        //create the coord generator and the worker processes
        coordGen();
        worker(dataChannels[0]);
        worker(dataChannels[1]);
        worker(dataChannels[2]);
        worker(dataChannels[3]);
        worker(dataChannels[4]);
        //will never return because at
        //least 1 process has an infinite loop
    }
}

macro proc worker(channel)
{
    point p;
    static unsigned 8 pixel = 0;
    //loop forever waiting for data to compute pixels with
    while(1)
    {
        channel ? p;
        if((p.x <- 1) == 0 && (p.y <- 1) == 0) //x, y are even
        {
            pixel = 2;
            delay;
            delay;
        }
        else if((p.x <- 1) == 1 && (p.y <- 1) == 1) //both odd
        {
            pixel = 1;
            delay;
        }
        else //one of x, y is even and the other is odd
            pixel = 3;
        //critical section
        while(trysema(fbGuard) == 0)
            delay;
        frameBuffer[p.y @ p.x] = pixel; //y concatenated with x gives the 4-bit address
        releasesema(fbGuard);
    }
}
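To see what ends up in the frame buffer, the worker's pixel rule and the addressing can be modelled in a few lines of sequential C. This is a sketch with hypothetical function names; it assumes the frame buffer address is the 2-bit y concatenated with the 2-bit x, and it ignores all timing and parallelism.

```c
/* Pixel rule used by the worker: test the least significant bit of x and y. */
unsigned char pixel_value(unsigned x, unsigned y)
{
    if ((x & 1u) == 0 && (y & 1u) == 0)
        return 2;   /* both even */
    if ((x & 1u) == 1 && (y & 1u) == 1)
        return 1;   /* both odd */
    return 3;       /* one even, one odd */
}

/* Frame buffer address: 2-bit y concatenated with 2-bit x (assumed layout). */
unsigned fb_index(unsigned x, unsigned y)
{
    return ((y & 3u) << 2) | (x & 3u);
}

/* Fill a 4x4 frame buffer the way the five workers eventually would. */
void fill_frame_buffer(unsigned char fb[16])
{
    for (unsigned y = 0; y < 4; y++)
        for (unsigned x = 0; x < 4; x++)
            fb[fb_index(x, y)] = pixel_value(x, y);
}
```

The result is a repeating 2x2 pattern: rows with even y alternate 2, 3 and rows with odd y alternate 3, 1.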
Mapping Handel-C to Logic
Ultimately, the statements you write in Handel-C must be mapped to logic by the compiler.
The following slides show the mapping for some of the constructs discussed so far: assignment, seq and par, if, while, do…while.
The following logic circuits are taken from the course notes from Celoxica’s DK training course.
Why Retime?
Many designs will require the use of a multiplier, divider, or other large combinational logic circuit. The propagation delay through deep logic can be quite long.
Having even one path in the design with a long delay could cause the maximum clock rate to drop significantly to the point where timing constraints cannot be met.
Retiming involves moving/adding flip-flops around the data path to reduce the depth of logic, and ultimately reduce the critical path delay.
Simple Example (1)
x = a+b+c+d;
The result is calculated through two adder stages. However, we can pipeline the result by inserting registers at intermediate locations.
The adder stages are split with two registers. This reduces the propagation delay of each stage, allowing a higher clock frequency.
The consequence is that the result is delayed by one cycle.
1: Example adapted from Celoxica’s Handel-C and DK training course notes.
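The one-cycle delay can be seen in a small C model of the pipelined version. This is a software sketch with hypothetical names: the two intermediate registers hold a+b and c+d, and the output register adds them on the following clock edge.

```c
typedef struct {
    int r1;  /* intermediate register holding a + b */
    int r2;  /* intermediate register holding c + d */
    int x;   /* output register */
} adder_pipeline;

/* Models one clock edge: all registers update from the values
   present before the edge. */
void adder_pipeline_clock(adder_pipeline *p, int a, int b, int c, int d)
{
    p->x  = p->r1 + p->r2;  /* second stage uses the previous cycle's partial sums */
    p->r1 = a + b;          /* first stage, left adder */
    p->r2 = c + d;          /* first stage, right adder */
}
```

Feeding in 1, 2, 3, 4 and then 5, 6, 7, 8 produces x = 10 only after the second clock edge: each stage now contains a single adder, at the price of the extra cycle of latency.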
Programming for Retiming
Retiming is not a trivial task; it is extremely time consuming to do by hand, especially for large designs.
Handel-C design tools can perform retiming automatically if the code is written properly.
The compiler will add/remove/move flip-flops as necessary, but will not alter the timing of the design.
Therefore to use retiming, the design must be pipelined, or have extra pipelining stages built-in.
The compiler can then shift logic and flip-flops around without altering the timing of the design.
Programming Example
Example: x = a*b+c*d;
unsigned 8 x[3]; //3 retiming stages
interface bus_out() sumOut(unsigned 8 out = x[2]) with {data = {"P2","P3","P4","P5","P6","P7","P8","P9"}};
interface bus_clock_in(unsigned 8 in) input() with {data = {"P10","P11","P12","P13","P14","P15","P16","P17"}};
void main()
{
    unsigned 8 data[4];
    while(1)
    {
        par
        {
            //get the input and shift the previous inputs
            data[0] = input.in;
            data[1] = data[0];
            data[2] = data[1];
            data[3] = data[2];
            x[0] = data[0]*data[1] + data[2]*data[3];
            x[1] = x[0]; //extra stages
            x[2] = x[1];
        }
    }
}
Output is the last of the retiming stages.
Coded like you would without retiming.
Result is shifted through the retiming registers.
FIR Example
One of the exercises at the training course was to code a nine-tap FIR filter that was pipelined and retimed automatically.
Nine multiplications of data and coefficients, followed by summation of the nine products: very deep logic.
A Xilinx Spartan™ 3 chip was targeted. The fmax results were recorded for various numbers of extra retiming stages.
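For reference, the arithmetic of a nine-tap FIR filter looks like this in plain C. This is not the training-course exercise: it is a sequential software model with hypothetical names and caller-supplied coefficients, shown only to make the "nine multiplications followed by a nine-way sum" structure visible.

```c
#define FIR_TAPS 9

typedef struct {
    int delay[FIR_TAPS];  /* delay line: delay[0] is the newest sample */
} fir9;

/* Shifts in one sample and returns sum(coeff[i] * delay[i]) over the nine taps. */
int fir9_step(fir9 *f, const int coeff[FIR_TAPS], int sample)
{
    int acc = 0;

    for (int i = FIR_TAPS - 1; i > 0; i--)
        f->delay[i] = f->delay[i - 1];
    f->delay[0] = sample;

    for (int i = 0; i < FIR_TAPS; i++)
        acc += coeff[i] * f->delay[i];

    return acc;
}
```

In hardware the nine products and the adder tree are computed combinationally in one clock cycle unless registers are inserted, which is why extra retiming stages can help the maximum clock rate.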
[Figure: Fmax (MHz) versus number of retiming stages (1 to 6); y-axis: Frequency (MHz), 0 to 160]
[Figure: Flip-flop usage before and after retiming; y-axis: number of flip-flops, 0 to 1000; x-axis: number of retiming stages (1 to 6); series: FF Before, FF After]
Final Notes
Not enough time to cover everything Handel-C has to offer, e.g. pointers and macro expressions.
There are ways to create parameterizable code. This allows the designer to easily vary the number of worker processes, or pipeline/retiming stages, for example.
More information available at www.celoxica.com