
    High Level Programming for FPGA Based Image and Video Processing

    using Hardware Skeletons

    K Benkrid, D Crookes, J Smith and A Benkrid

    School of Computer Science, The Queen's University of Belfast, Belfast BT7 1NN, UK

    VisiCom division of Titan corp., 10052 Mesa Ridge Court San Diego, CA, 92121 USA

    Abstract

    In this paper, we present a new approach to developing a

    general framework for efficient FPGA based Image

    Processing algorithms. This approach is based on the

    new concept of Hardware Skeletons. A hardware

    skeleton is a parameterised description of a task-specific

    architecture, to which the user can supply parameters

    such as values, functions or even other skeletons. A skeleton contains built-in rules that will apply

    optimisations specific to the target hardware at the

    implementation phase. The framework contains a library

    of reusable skeletons for a range of common Image

    Processing operations. The library also contains high

    level skeletons for common combinations of basic image

    operations. Given a complete algorithm description in

    terms of skeletons, an efficient hardware configuration is

    generated automatically. We have developed a library of

    hardware skeletons for common image processing tasks,

    with optimised implementations specifically for Xilinx

    XC4000 FPGAs. This paper presents and illustrates our

    hardware skeleton approach in the context of some

    common image processing tasks, based on an

    implementation on VISICOM's VigraVision™ FPGA

    based video board.

    1. Introduction

    Many modern Image Processing (IP) applications (such

    as processing video and very large images) are so

    computationally demanding that special purpose

    hardware solutions need to be considered.

    Reconfigurable hardware in the form of FPGAs can offer

    the performance advantages of a custom hardware

    solution, while their inherent reprogrammability feature

    makes them multi-purpose and reusable. However, a big

    disadvantage is the low level, hardware-oriented programming model needed to fully exploit the FPGA's

    potential performance.

    Despite the great amount of research done on

    FPGAs, many FPGA-based applications have been

    algorithm specific. An environment for developing

    applications needs more than just a library of static

    FPGA configurations, perhaps parameterisable (e.g. in

    terms of input data wordlength), since it should allow

    the user to experiment with alternative algorithms and

    develop his/her own algorithms. There is a need for

    bridging the gap between high level application-oriented

    software and low level FPGA hardware. Many

    behavioural synthesis tools have been developed to

    satisfy this requirement [1][2][3]. These tools allow the

    user to program FPGAs at a high level (e.g. in a C-like

    syntax) without having to deal with low level hardware

    details (e.g. scheduling, allocation, pipelining etc.).

    However, although behavioural synthesis tools have

    developed enormously, structural design techniques

    often still result in circuits that are substantially smaller

    and faster than those developed using only behavioural

    synthesis tools [4][5].

    This paper presents a framework for developing

    efficient hardware solutions specifically for image

    processing applications. This framework gives the

    benefits of an application-oriented, high level

    programming model, but does not sacrifice significantly

    the performance of the solution. Our approach to this is

    to use a concept which has proved relatively successful

    in developing software for parallel machines, namely

    skeletons [6][7][8]. Skeletons are reusable, parameterised fragments or frameworks to which the

    user can supply components (e.g. functions). It is

    common for skeletons to include functions as

    parameters which are applied by the skeleton to a data

    set. The implementation of a skeleton is normally

    optimised for a specific target machine.

    In this paper we introduce the concept of hardware

    skeletons. A hardware skeleton is a parameterised

    description of a task-specific architecture, to which the

    user can supply parameters such as values, functions

    (parameterised functional blocks) or even other

    skeletons. Certain combinations of basic skeletons can

    form the basis of additional, higher level skeletons. To

    present the concept, the paper first identifies a useful high level model for describing image processing

    operations. The common basic tasks, which we identify,

    will form the basis of a Hardware Skeleton Library.

    Next, we outline the strategy which the system employs

    to generate efficient FPGA configurations from a given

    operation description. A layered implementation of the

    hardware skeleton library is then presented. Finally, the implementation of our system on a commercial FPGA video board is presented.

    Proceedings of the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'01), 0-7695-2667-5 $20.00 © 2001 IEEE

    2. A high level model for IP operations

    Many image processing algorithms can be described in

    terms of a Directed Acyclic Graph (DAG), where

    vertices represent IP tasks, and the directed edges

    represent the data flow.

    [Figure 1. A hypothetical image processing algorithm modelled as a DAG: Input1 and Input2 (e.g. images) flow through IP task nodes to an Output (e.g. image, histogram etc.).]
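To make the DAG model concrete, the following is a minimal software sketch, not part of the framework described in this paper. The node names (`input`, `conv_a`, `conv_b`, `add`, `histogram`) are hypothetical, and Python's standard `graphlib` module (Python 3.9 or later) is used only to show that a valid evaluation order follows from the edges.

```python
from graphlib import TopologicalSorter

# Hypothetical algorithm: two neighbourhood operations on one input image,
# combined by a point operation, then reduced to a histogram.
# Each key is a node; its value is the set of nodes it reads from.
dag = {
    "conv_a": {"input"},
    "conv_b": {"input"},
    "add": {"conv_a", "conv_b"},
    "histogram": {"add"},
}


def schedule(graph):
    """Return one valid evaluation order for the tasks of the DAG."""
    return list(TopologicalSorter(graph).static_order())


order = schedule(dag)
print(order)
```

Any order returned places a node after all of the nodes it reads from, which is the property the data flow edges express.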

    Nodes are typically simple tasks such as adding two

    input images, or an image convolution. Common IP tasks

    can be classified in terms of the locality of their data

    access requirements into three categories:

    Point operations: The same operation is applied to

    each individual pixel of one or many source images to

    produce a corresponding result pixel in the new image.

    These include: relational operations (e.g. ≥, ≤, =),

    arithmetic operations (e.g. +, -, *, ÷), logical operations (e.g. AND, OR) and Look-Up tables.

    The operation could either be between two images or

    between an image and a scalar value.

    Neighbourhood operations: In neighbourhood

    operations, a new pixel value is calculated using only

    the pixel values in the neighbourhood of the original

    pixel and the weights in a window (e.g. convolution).

    This is done for all image pixels, and results in a new

    image. Neighbourhood operations are completely

    defined by a local operation between corresponding

    pixels and window values (e.g. multiplication), a

    global operation (e.g. accumulation) which reduces the

    window of intermediate results to a single result pixel,

    and a window (with given shape and coefficients) [9].

    Global operations: These operations operate globally

    on the whole image. We can distinguish two common

    types of simple global operations:

    - Reduction to Scalar (RS): These operate on the

    whole image to produce a scalar as a result.

    Examples include count, global maximum, global

    minimum and global accumulation (Σ).

    - Reduction to Vector (RV): This operation operates

    on the whole image to produce a vector as a result.

    Examples include histogramming and cumulative

    histogramming.
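As a software illustration of the point and global categories above, the sketch below models an image as a list of rows of integers. The function names are assumptions made for this example and do not correspond to skeleton names in the library.

```python
def point_op(image, fn):
    """Point operation: apply fn independently to every pixel."""
    return [[fn(p) for p in row] for row in image]


def reduce_to_scalar(image, fn, initial):
    """Global operation (RS): fold the whole image into one scalar."""
    acc = initial
    for row in image:
        for p in row:
            acc = fn(acc, p)
    return acc


def reduce_to_vector(image, levels):
    """Global operation (RV): histogram with one bin per grey level."""
    hist = [0] * levels
    for row in image:
        for p in row:
            hist[p] += 1
    return hist


image = [[0, 1, 2], [3, 2, 1]]
binary = point_op(image, lambda p: p >= 2)               # image vs scalar comparison
total = reduce_to_scalar(image, lambda a, b: a + b, 0)   # global accumulation
peak = reduce_to_scalar(image, max, image[0][0])         # global maximum
hist = reduce_to_vector(image, 4)
```

Here the comparison produces a boolean (binary) image, while the two reductions and the histogram produce a scalar and a vector respectively, matching the data structures named in the text.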

    The properties of an item of data (represented by an

    edge in the DAG) are of two kinds:

    Data type
    This is defined by two properties:
    - Structure: could be an image, a vector or a scalar.
    - Pixel type: which, for the purpose of this work, could be either an integer or a boolean.

    Data representation
    A data representation is defined by three properties:
    - The data could be in bit serial, in bit parallel with an associated word size, or in digit serial representation with particular digit and word sizes.
    - If data is in bit serial (or digit serial), it can then be processed either MSB (or MSD) First or LSB (or LSD) First.
    - Number System which, for the purpose of this work, could be one of unsigned integer, 2's complement, or Signed Digit (SD) number representation [10].
    Note that Binary representation corresponds to bit parallel with a word size of one (denoted as parallel(1)).

    A node with a particular set of logical Inputs/Outputs

    could be implemented by a range of different possible

    implementations as illustrated, for example, for the

    Absolute value operation in Figure 2. It is normal (but

    not compulsory) for the input and output representations

    to be the same.

    [Figure 2. A DAG node (a) with several possible implementations (b), (c) and (d). Node (a) is an Absolute value operation from Int to Int; the implementations shown use bit parallel 2's complement, bit serial 2's complement MSBF, and bit serial SD MSDF representations on both input and output.]

    The Hardware Skeleton Library contains parameterised

    descriptions of architectures not only for the full range

    of basic operations (nodes), but possibly with different

    versions for different data representation combinations.

    3. Implementation strategy

    The user's first task will be to represent the algorithm in

    terms of a DAG, without initially being concerned with

    data type or data representation considerations. Once

    this is done, an analysis of the properties of the input

    and output data formats of the nodes will identify a

    range of possible implementations of each node. For


    instance, the result of an N-bit integer image comparison

    operation could be either an N-bit integer image or a (1-

    bit) binary image. The choice will depend on subsequent

    processing of the result image, and on what skeletons are

    available. As a first step, the set of all possible

    implementations should first be considered by the user.

    The library of Hardware Skeletons (e.g. neighbourhood

    operations, point operations, etc.), in which each component has a set of different implementations (e.g.

    bit serial, bit parallel), is the basis of this phase. The

    implementations of the library components are optimised

    for specific target architectures (e.g. bit parallel adder

    units based on dedicated fast carry logic on Xilinx 4000).

    The range of possible implementations generated for a

    particular algorithm depends on the extent of this library.

    To select the optimum skeleton from the set of

    possible choices, the cost of each choice needs to be

    found, in terms of speed and area. This involves

    estimating the expected performance or effectively

    generating the FPGA configuration for each option,

    including the application of the optimisations for each

    skeleton. This cost based analysis enables the user to

    settle on a final DAG with all attributes (data type and

    representation) defined. The corresponding FPGA

    implementation is finally generated, in the form of an EDIF
    netlist, for the chosen solution. This is performed by a
    Prolog-based [11] Hardware Description Environment,
    called HIDE4k, developed at Queen's University Belfast
    [5, 12, 13]. This enables highly scaleable and parameterised
    component descriptions to be written, and generates
    pre-placed configurations in EDIF format for the Xilinx
    XC4000 series [14]. The resulting EDIF file is finally fed

    to Xilinx Placement and Routing (PAR) tools to generate

    the FPGA configuration bitstream (see Figure 3).

    Note that during the process of implementing a

    DAG, the following issues arise:

    Data representation conversion
    Since many data representations might be used within the DAG, data representation converters may be needed to convert between different representations (e.g. from bit serial to bit parallel, or from Signed Digit to two's complement etc.).

    Data synchronisation
    When there are two or more inputs to a DAG node

    (vertex), any branch that arrives earlier than the others

    should be forced to wait for the slowest branches by

    adding appropriate delays to the fastest branches. This

    is performed automatically by our system so that the

    user does not have to deal with low level data

    synchronisation issues.
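The data synchronisation step can be sketched in software as a latency-balancing pass over the DAG. This is only an illustration under stated assumptions: the function name `balance_delays`, the node names and the latency figures are all hypothetical, and the paper does not describe how the system implements this step internally.

```python
def balance_delays(inputs_of, latency):
    """Compute the delay to insert on each edge so that all inputs of a
    node arrive in the same clock cycle.

    inputs_of: node -> list of predecessor nodes
    latency:   node -> latency of the node in clock cycles
    Returns (arrival, delays), where arrival[node] is the cycle at which
    the node's output is valid and delays[(src, dst)] is the number of
    delay cycles to add on that edge.
    """
    arrival = {}
    delays = {}

    def visit(node):
        if node in arrival:
            return arrival[node]
        times = {p: visit(p) for p in inputs_of.get(node, [])}
        start = max(times.values(), default=0)  # wait for the slowest branch
        for pred, t in times.items():
            delays[(pred, node)] = start - t    # pad the faster branches
        arrival[node] = start + latency[node]
        return arrival[node]

    for node in latency:
        visit(node)
    return arrival, delays


# Hypothetical latencies (cycles): one slow branch and one fast branch.
inputs_of = {"conv": ["input"], "abs": ["input"], "add": ["conv", "abs"]}
latency = {"input": 0, "conv": 5, "abs": 1, "add": 1}
arrival, delays = balance_delays(inputs_of, latency)
```

In this example the `abs` branch reaches the adder four cycles before the `conv` branch, so four delay cycles are assigned to the edge from `abs` to `add`.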

    As a result, the user's programming model is essentially

    the set of hardware skeletons provided by the Hardware Skeleton Library. These skeletons can be accessed

    either textually or, even more conveniently, via a GUI.

    4. Implementing the Skeleton Library

    We have implemented our Hardware Skeleton Library

    as a hierarchy of three levels of hardware block

    descriptions. At the bottom level is the arithmetic

    library (see figure 4). This provides basic arithmetic

    units (e.g. adders, multipliers) parameterised for

    different number representations (e.g. bit serial, bit

    parallel, 2's complement, unsigned etc.). Immediately

    on top of this level, we find the basic image operations

    library. The latter provides implementations for the basic image operations presented in section 2 above

    (e.g. basic neighbourhood operations). Finally, the top

    level provides implementations for high level

    (compound) skeletons.

    [Figure 4. Hierarchical implementation of the Hardware Skeleton Library: the Arithmetic Cores Library at the bottom, the Basic Image Operations Library (e.g. neighbourhood operations) above it, and the High Level (compound) Skeletons library at the top, facing the Image Processing Application Developer.]

    The following section considers each of these three

    levels in more detail.

    [Figure 3. Implementation strategy. Blocks shown: DAG with logical data types; Solution generation (using the Hardware Skeleton Library); a DAG set of available implementations; Optimisation; a DAG set of optimised implementations; Cost Based Analysis; DAG with specific data representation choices; Code Generator (HIDE4k System); EDIF; Xilinx PAR tools; Bitstream; Xilinx XC4000 FPGA.]


    4.1 Arithmetic cores library

    This library provides the basic building blocks required for image processing operations (and signal processing in general). It includes adders, multipliers, dividers, shifts and delays. The basic functions required for nearly any signal processing operation include addition/subtraction, shifts and delays. These blocks can then be used to construct the more complicated structures such as multipliers, dividers and maximum/minimum selectors.

    Versions of these cores are provided for different number representations. At the time of writing, the following number representations are supported:

    - Bit parallel (N bits) 2's complement
    - Bit serial 2's complement MSBF
    - Bit serial 2's complement LSBF
    - Bit serial Signed Digit MSDF

    The implementation of these cores is optimised for a

    specific target architecture (Xilinx XC4000 FPGAs for

    our particular case study). This should take advantage of

    the particular features of the target architecture (e.g. 4

    input LUTs, synchronous RAMs, dedicated fast carry

    logic for XC4000). The core descriptions are held in

    HIDE4k with rules for core-specific optimisations as part

    of the core. For instance, a constant coefficient

    multiplication will apply CSD coding of the multiplier

    coefficient to reduce the consumed hardware [15][16].

    Such optimisations are often not performed by

    behavioural synthesis tools.
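To illustrate why CSD coding reduces hardware, the sketch below recodes an integer coefficient into canonical signed digit form, in which no two adjacent digits are non-zero. It is a generic textbook recoding written for this example, not the HIDE4k rule itself; the function names are assumptions.

```python
def csd(n):
    """Return the canonical signed digit form of integer n as a list of
    digits in {-1, 0, 1}, least significant digit first."""
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)  # n mod 4 == 1 gives +1, n mod 4 == 3 gives -1
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def value(digits):
    """Evaluate a signed digit list back to an integer."""
    return sum(d << i for i, d in enumerate(digits))


def nonzero(digits):
    """Count the non-zero digits."""
    return sum(1 for d in digits if d != 0)


coefficient = 119                      # binary 1110111 has six non-zero bits
recoded = csd(coefficient)             # 128 - 8 - 1
print(recoded, nonzero(recoded), bin(coefficient).count("1"))
```

In a shift-and-add constant multiplier each non-zero digit costs one shifted addition or subtraction, so recoding 119 from six non-zero bits to three non-zero signed digits removes three add/subtract terms.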

    4.2 Basic image operations library

    This library provides implementations of the basic image

    operations presented in section 2.

    Consider the case of basic neighbourhood

    operations. As mentioned in section 2, a neighbourhood

    operation is defined in terms of a local and global

    operation. Local operations include multiplication and

    addition. Global operations include accumulation,

    maximum and minimum. These form the five basic

    Image Algebra neighbourhood operations as shown in

    Table 1 [9].

    Neighbourhood Operation    Local Op.    Global Op.
    Convolution                *            Σ
    Multiplicative maximum     *            Max
    Multiplicative minimum     *            Min
    Additive maximum           +            Max
    Additive minimum           +            Min

    Table 1. Image Algebra core operation set
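The operation set of Table 1 can be expressed as one generic routine parameterised by a local and a global operation. The following is a software reference sketch only, assuming an image stored as a list of rows and a window applied without flipping and only where it fits entirely inside the image; the names `neighbourhood_op` and `OPERATIONS` are chosen for this example.

```python
import operator


def neighbourhood_op(image, window, local_op, global_op):
    """Apply a window (Q rows by P columns) at every position where it
    fits entirely inside the image. No border handling is attempted."""
    q, p = len(window), len(window[0])
    rows, cols = len(image), len(image[0])
    result = []
    for r in range(rows - q + 1):
        out_row = []
        for c in range(cols - p + 1):
            # Local operation between corresponding pixels and window values.
            values = [local_op(image[r + i][c + j], window[i][j])
                      for i in range(q) for j in range(p)]
            # Global operation reduces the intermediate results to one pixel.
            out_row.append(global_op(values))
        result.append(out_row)
    return result


# (local operation, global operation) pairs for the five core operations.
OPERATIONS = {
    "convolution": (operator.mul, sum),
    "multiplicative_maximum": (operator.mul, max),
    "multiplicative_minimum": (operator.mul, min),
    "additive_maximum": (operator.add, max),
    "additive_minimum": (operator.add, min),
}

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
zeros = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
dilated = neighbourhood_op(image, zeros, *OPERATIONS["additive_maximum"])
eroded = neighbourhood_op(image, zeros, *OPERATIONS["additive_minimum"])
summed = neighbourhood_op(image, ones, *OPERATIONS["convolution"])
```

With an all-zero window, the additive maximum and additive minimum reduce to the local maximum and minimum of the neighbourhood.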

    The architecture of a generic PxQ neighbourhood operation (with a local operation L and a global one G) requires (Q-1) line buffers, PxQ replicated local operation blocks, and a single PxQ-input global operation block, implemented as a tree of two-input reduction blocks when bit parallel arithmetic is used, as shown in Figure 5.
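The block counts stated above can be tabulated for a given window size. The sketch below is simple bookkeeping, assuming the global operation is built as a balanced binary tree (so a PxQ-input reduction uses PxQ-1 two-input blocks); the function and key names are illustrative.

```python
def neighbourhood_resources(p, q):
    """Count the blocks needed by a generic P x Q neighbourhood operation."""
    n = p * q
    return {
        "line_buffers": q - 1,
        "local_op_blocks": n,
        "two_input_global_blocks": n - 1,       # a binary tree with n leaves
        "tree_depth": (n - 1).bit_length(),     # ceil(log2(n)), balanced tree assumed
    }


print(neighbourhood_resources(3, 3))
```

For a 3x3 window this gives two line buffers, nine local operation blocks and eight two-input global blocks arranged in four tree levels.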

    [Figure 5. A general 2D PxQ neighbourhood operation: line buffers 1 to Q-1 and pixel buffers feed the local operation blocks (L), whose outputs are combined by a tree of global operation blocks (G).]

    This architecture is parameterisable or scaleable in

    terms of [17]:

    - The window size (PxQ)

    - The window coefficients

    - The image size (line buffer size LB)

    - The pixel wordlength

    - The local and global operations (L and G)

    - The number representation (arithmetic type)

    A generic description of a neighbourhood operation

    would then be given by:

    neighbourhood_op(Arithmetic_type, Local_op, Global_op, Window, Pixel_wordlength, Image_Size)

    Our HIDE4k system is capable of generating pre-placed

    FPGA architectures in EDIF format from such a generic

    description. A ~30K line EDIF description is generated

    in 1 to 2 seconds. The resulting architectures are tailored to the

    particular neighbourhood operation in hand (e.g.

    specific window coefficients). Their performance (speed


    and area) rivals that obtained with careful hand design

    [5].

    A common skeleton used at this level is the reduce

    skeleton which reduces a set of N inputs into one result

    using a tree of 2-operand operations (Op) as shown in

    Figure 6.

    [Figure 6. Reduce skeleton: a tree of 2-input Op blocks.]

    At the time of writing, the supported 2-input reduction

    operations (Op) are addition (+), subtraction (-),

    maximum (max) and minimum (min).
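A software analogue of the reduce skeleton is a level-by-level pairwise reduction. The sketch below is an illustration only; `tree_reduce` is a name chosen here, and the choice to pass an unpaired element straight through to the next level is an assumption about how odd input counts could be handled.

```python
import operator


def tree_reduce(values, op):
    """Reduce values with a tree of 2-input operations.

    Adjacent pairs are combined at each level; an unpaired last element
    is passed through unchanged. Returns (result, number_of_levels).
    For a non-associative op such as subtraction, the result depends on
    this pairing.
    """
    if not values:
        raise ValueError("tree_reduce needs at least one input")
    level = list(values)
    depth = 0
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth


total, levels = tree_reduce(range(1, 10), operator.add)
largest, _ = tree_reduce([3, 9, 2, 7], max)
smallest, _ = tree_reduce([3, 9, 2, 7], min)
```

Nine inputs, as produced by a 3x3 window, are reduced in four levels.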

    4.3 High level (compound) skeletons library

    This library contains efficient implementations of a set of

    compound skeletons. These compound skeletons result

    from the process of identifying, by experience, common

    ways of assembling primitive operations and providing

    optimised implementations of these. To demonstrate this

    concept, we will present two examples of such

    compound skeletons.

    Pipeline skeleton
    In this type of operation, two or more IP operations are

    cascaded in a pipeline as shown in Figure 7. The input of

    each pipeline stage is provided by the output of the

    previous pipeline stage.

    [Figure 7. Pipeline skeleton: Operation 1, Operation 2, ..., Operation N in cascade.]

    This structure is described by:

    Op_description = pipeline([Op1_desc, Op2_desc, ..., OpN_desc])    (1)

    where OpI_desc {I = 1, 2, ..., N} is the high level

    description of each operation in the pipeline.
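In software terms, the pipeline skeleton behaves like function composition over a list of stages. The sketch below is a loose analogy rather than the HIDE4k notation: the stages are hypothetical point operations on a flat list of 8-bit pixels.

```python
from functools import reduce


def pipeline(stages):
    """Compose stages so that each stage's output feeds the next stage."""
    def run(data):
        return reduce(lambda value, stage: stage(value), stages, data)
    return run


# Hypothetical stages, chosen only to show the data flow.
def invert(img):
    return [255 - p for p in img]


def threshold(img):
    return [1 if p >= 128 else 0 for p in img]


invert_then_threshold = pipeline([invert, threshold])
result = invert_then_threshold([0, 100, 200, 255])
```

The stage list plays the role of the list argument in the textual description: its order is the order in which the operations are cascaded.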

    For instance, an Open operation (see Figure 8)

    applied to 256x256 images of 8-bits/pixel, using 2's
    complement bit parallel arithmetic, would be described by:

    Open = pipeline([neighbourhood(tc_par, add, min, [[0,0,0],[0,0,0],[0,0,0]], 8, 256),
                     neighbourhood(tc_par, add, max, [[0,0,0],[0,0,0],[0,0,0]], 8, 256)])

    [Figure 8. Open operation: Additive Minimum and Additive Maximum stages, each with an all-zero 3x3 window.]

    Parallel skeleton
    A number of common image processing algorithms comprise several concurrent neighbourhood operations (simple or compound) which share the same input image, and whose templates have the same size and shape. The results of these parallel operations are then used in a reduce operation. Sobel, Prewitt, Roberts and Kirsch edge detectors [18] are examples of such operations.

    [Figure 9. Parallel skeleton: parallel operations Parallel_op1, Parallel_op2, ..., Parallel_opN-1, Parallel_opN whose results are combined by Op blocks, with sub-descriptions Par_desc1 and Par_desc2.]

    The high level description (Par_desc) of this operation would be defined as follows:

    Par_desc = Op(Par_desc1, Par_desc2)    (2)

    Par_desc1 and Par_desc2 are defined either recursively as compound operations of the form (2) itself, or as (terminal) pipeline skeletons of the form (1). Note that the prefix operators + and - can, for readability, be written in infix form. For instance, a Sobel operation (see Figure 10) will be described by:


    pipeline([neighbourhood(tc_par, mult, accum, [[1,2,1],[0,0,0],[-1,-2,-1]], 8, Buf_size), abs])
    +
    pipeline([neighbourhood(tc_par, mult, accum, [[1,0,-1],[2,0,-2],[1,0,-1]], 8, Buf_size), abs])

    [Figure 10. Sobel operation: two 3x3 convolutions, with windows (1 2 1 / ~ ~ ~ / -1 -2 -1) and (-1 ~ 1 / -2 ~ 2 / -1 ~ 1), each followed by an Absolute operation, with the two results added (+).]

    In such operations, only one set of line and pixel buffers

    is needed to synchronise the supply of pixels for all

    parallel operations instead of allocating separate line

    buffers for each neighbourhood operation. This is

    because all neighbourhood operations are applied to the

    same image.
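The buffer sharing can be mirrored in a software reference model of the Sobel description: each 3x3 neighbourhood is fetched once and then used by both branches. This is a sketch for checking results on small images, using the two windows from the description above; the names `WINDOW_A`, `WINDOW_B` and `sobel` are chosen for this example, and border pixels are simply not produced.

```python
WINDOW_A = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]
WINDOW_B = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]


def weighted_sum(block, window):
    """Multiply corresponding values and accumulate (mult, accum)."""
    return sum(block[i][j] * window[i][j] for i in range(3) for j in range(3))


def sobel(image):
    """Return |A| + |B| at every position where a 3x3 window fits."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - 2):
        out_row = []
        for c in range(cols - 2):
            # One neighbourhood fetch is shared by both parallel branches.
            block = [image[r + i][c:c + 3] for i in range(3)]
            out_row.append(abs(weighted_sum(block, WINDOW_A)) +
                           abs(weighted_sum(block, WINDOW_B)))
        out.append(out_row)
    return out


step_edge = [[0, 0, 10, 10] for _ in range(4)]
flat = [[7, 7, 7] for _ in range(3)]
edges = sobel(step_edge)
nothing = sobel(flat)
```

The windows are applied without flipping; because of the absolute value stage, flipping either Sobel window (which only negates it) would give the same output.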

    Note that the skeletons (1) and (2) can be nested to any depth and interchangeably (FPGA area permitting).

    5. Hardware implementation on a

    commercial FPGA based video board

    To assess the effectiveness of our skeleton-based

    approach, we have implemented our system on a

    commercial FPGA based video processing board. The

    latter is a single slot PCI card which combines video

    acquisition, FPGA based real-time processing, and

    display [19]. A functional block diagram of the

    VigraVision board is given by figure 11.

    [Figure 11. Block diagram of the VigraVision video board. Blocks shown: Camera, Image acquisition block, FPGA based Video Processor, DRAM, Image display block, RGB Monitor, Host PCI bus.]

    Bit parallel arithmetic has been chosen to implement the IP operations on the onboard FPGA (XC4013E-3). This

    choice is motivated by the fact that bit parallel

    architectures often lead to a better time-hardware product

    than bit serial ones. This is mainly due to the existence of

    dedicated fast carry logic on Xilinx FPGAs [5].

    However, in the context of processing real time video,

    the VigraVision board influences the choice of the

    arithmetic. If bit serial arithmetic is to be used, there is a

    need to generate a bit clock from the pixel clock. The bit

    clock frequency is N times the pixel clock frequency (for an

    N-bit pixel). For practical real time video processing,

    the luminance pixel sampling rate is 13.5 MHz. This

    implies a bit clock frequency of 108 MHz for 8-bit

    length pixel processing, and 216 MHz for 16-bit length

    pixel processing. The XC4013E-3 cannot operate at

    these frequencies. Thus the architectures used will be implemented from bit parallel-based skeletons. Note

    that a trade-off in the form of digit serial arithmetic is

    still possible. However, this implies additional hardware

    for the digit clock frequency generation, and extra care

    for data synchronisation. A parallel implementation is

    easier to implement and can be efficiently implemented

    using dedicated fast carry logic [16].
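The clock figures quoted above follow directly from the 13.5 MHz pixel rate. The short sketch below reproduces the arithmetic; the `digit_bits` parameter is an assumption added here to show the digit serial trade-off, on the premise that the digit clock scales with the number of digits per pixel.

```python
PIXEL_CLOCK_MHZ = 13.5  # luminance pixel sampling rate quoted in the text


def serial_clock_mhz(pixel_bits, digit_bits=1):
    """Clock needed to stream one pixel per pixel period, digit_bits at a
    time. digit_bits=1 is bit serial; digit_bits=pixel_bits is bit parallel."""
    if pixel_bits % digit_bits:
        raise ValueError("pixel width must be a multiple of the digit width")
    return PIXEL_CLOCK_MHZ * pixel_bits / digit_bits


print(serial_clock_mhz(8))       # bit serial, 8-bit pixels
print(serial_clock_mhz(16))      # bit serial, 16-bit pixels
print(serial_clock_mhz(8, 4))    # digit serial, 4-bit digits (assumed example)
print(serial_clock_mhz(8, 8))    # bit parallel runs at the pixel clock
```

Under this model, bit parallel processing needs only the 13.5 MHz pixel clock, while bit serial processing needs the 108 MHz and 216 MHz clocks stated above.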

    Due to the limited memory resources on the FPGA

    chip, the line buffers have been implemented using the

    off-chip DRAM. Part of the FPGA is configured as an

    interface to the onboard DRAM (FIFOs), while the

    other part is configured to perform the required image

    processing operation as shown below:

    [Figure 12. Block diagram of the FPGA chip configuration. Blocks shown: Input video stream, FPGA chip with DRAM interface, Line buffers (DRAM), Output video stream.]

    If the user wants to generate a complete configuration,

    including all the low level hardware details, he or she

    merely has to provide the required high level description. The latter description must conform to the

    format in (1) and (2) and can be input textually or even

    graphically. Based on the skeleton library presented

    above, our HIDE4k system is capable of generating the

    corresponding efficient FPGA configuration in seconds

    in the form of an EDIF netlist.

    Due to the irregularity of the resulting architectures,

    the generated EDIF netlist is only partially placed. Once

    the EDIF description is generated, it is then fed to the

    Xilinx PAR tools to generate the FPGA configuration

    bitstream. This may take a long time (~1hr on a Pentium

    233 running Windows 95 with 32 MB of RAM). This is

    partly due to the fact that the EDIF netlist is only

    partially placed. Another reason is the small area of the target FPGA (24x24 CLBs only). Nonetheless, the whole process is transparent to the user.

    At the application level, the user interfaces to the

    VigraVision board through a C-callable library

    (VigraVision ToolBox, VTB DLL). The ToolBox

    includes hardware initialisation and register control

    functions, image acquisition functions and image

    processing functions. For instance, the application


    developer downloads a particular FPGA configuration by

    invoking the following function:

    u_loadXilinx(Xilinx_Chip_ID, Configuration_filename)

    He or she can then copy the processed image from the

    video buffer to the host processor, for further analysis,

    by:

    u_readRect(LPRECT lpImgRect, LPVOID imgData)

    where lpImgRect specifies the rectangle in the frame

    buffer where data is to be read from, and imgData is a

    pointer to the image transferred to the host memory.

    The ToolBox is supported in Microsoft Visual

    C/C++ 5.0. Its functions can be accessed by application

    software which has been linked with the VTB import

    library and accessed via Windows as a DLL library. The

    resulting video coprocessor (see Figure 13) is based on a

    library of bitstream configurations ready to use

    (download to the FPGA) from a high level language

    (VC++ in our case). This library is extensible over time

    using our HIDE4k system. Thanks to our skeleton

    oriented approach, this task is relatively easy to perform and requires little FPGA hardware knowledge. This has

    been illustrated in this paper by two high level skeletons.

    Other skeletons can be designed using a similar approach

    and added to the library.

    6. Conclusion

    In this paper, we have presented a framework for FPGA

    based Image Processing. Central to this framework is the

    Hardware Skeleton Library which contains a set of high

    level descriptions of task-specific architectures

    specifically optimised for Xilinx XC4000 FPGAs.

    Although extensible, the library is based on a core level

    containing the operations of Image Algebra. The library also contains high level skeletons for compound

    operations, whose implementations include task-specific

    optimisations. Skeletons are parameterisable, and

    different skeletons for the same operation can be

    provided, for instance for different arithmetic

    representations. This gives the user a range of

    implementation choices. This in turn supports

    experimentation with different implementations and

    choosing the most suitable one for the particular

    constraints in hand (e.g. speed and area). We are

    investigating the possibility of doing some of this

    experimentation automatically, but for now we do it

    manually. Given a complete algorithm description in

    terms of skeletons, an efficient hardware configuration is generated automatically by our system.

    Our approach was shown to be successful by a real

    hardware implementation of a video coprocessor on a

    commercial FPGA based video board giving real time

    processing of video data. This video coprocessor allows

    for rapid generation of FPGA architectures from very

    high level, algorithmic, descriptions and opens the way

    to enabling image processing application developers to

    exploit the high performance capability of a direct

    hardware solution, while programming in an

    application-oriented model.

    Note that the skeleton oriented approach is not tied

    to a particular FPGA chip. Moreover, it may have some

    applicability for VLSI design. Furthermore, other

    application domains where there is an established

    algebra such as numerical processing can also benefit

    from the skeleton approach.

    Full system development will in practice inevitably

    hit the problem that some particular task is not readily

    expressed in terms of the skeletons currently in the

    library. It will always be necessary to have an ongoing

    process of skeleton development. This will of course

    require a skilled architecture designer, although less

    efficient solutions might be possible using existing

    skeletons; but the advantage of our approach is that

    system builders themselves do not require detailed

    hardware description skills.

    Future directions include upgrading the system to

    handle other FPGA series (particularly Xilinx Virtex

    chips). The extension of the hardware skeleton library,

    both in supporting more arithmetic types and providing

    other skeletons for more sophisticated image processing

    operations (wavelet transform in particular), is being

    investigated.

    [Figure 13. Overall view of the VigraVision based video coprocessor. Blocks shown: High Level IP operations descriptions (text), Hardware Skeleton Library, HIDE4k system, EDIF Netlist, Xilinx PAR tools, Extensible bitstream configurations library, VTB library, VC++ program, VigraVision PCI video board.]


    7. References

    [1] Synopsys Inc., 'Behavioural Compiler', Software documentation, 1998. http://www.synopsys.com/products/beh_syn/

    [2] C Level Design Inc, 'C/C++ Synthesis System Compiler', Product overview, 1998. http://www.cleveldesign.com/products/

    [3] The Embedded Solutions Limited, 'Handel C information sheets', 1999. http://www.embeddedsol.com

    [4] Hutchings B, Bellows P, Hawkins J, Hemmert S, Nelson B and Rytting M, 'A CAD suite for High-Performance FPGA design', FCCM'99, Preliminary Proceedings.

    [5] Benkrid K, 'Design and Implementation of a High Level FPGA Based Coprocessor for Image and Video Processing', PhD Thesis, Department of Computer Science, The Queen's University of Belfast, 2000.

    [6] Cole M, 'Algorithmic Skeletons: structured management of parallel computation', MIT Press, 1989.

    [7] Darlington J, Ghanem M, and To H W, 'Structured Parallel Programming', In Programming Models for Massively Parallel Programming Computers, IEEE Computer Society Press, pp. 160-169, Sept 1993.

    [8] Michaelson G J, Scaife N R, and Wallace A M, 'Prototyping parallel algorithms in Standard ML', Proceedings of British Vision Conference, Sep 1995. ftp://ftp.cee.hw.ac.uk/pub/funcprog/msw.bmvc95.ps.Z

    [9] Ritter G X, Wilson J N and Davidson J L, 'Image Algebra: an overview', Computer Vision, Graphics and Image Processing, No 49, pp 297-331, 1990.

    [10] Avizienis A, 'Signed Digit Number Representation for Fast Parallel Arithmetic', IRE Transactions on Electronic Computer, Vol. 10, pp 389-400, 1961.

    [11] Clocksin W F and Mellish C S, 'Programming in Prolog', Springer-Verlag, 1994.

    [12] Crookes D, Alotaibi K, Bouridane A, Donachy P and Benkrid A, 'An Environment for Generating FPGA Architectures for Image Algebra-based Algorithms', ICIP'98, Vol. 3, pp. 990-994, 1998.

    [13] Benkrid K, Crookes D, Bouridane A, Corr P and Alotaibi K, 'A High Level Software Environment for FPGA Based Image Processing', Proc. IPA'99, IEE Seventh International Conference on Image Processing and its Applications, Manchester, July 1999, pp. 112-116.

    [14] Xilinx Ltd, 'XC4000E and XC4000X Series Field Programmable Gate Arrays - Product Specification', 1999. http://www.xilinx.com/partinfo/4000.pdf

    [15] Koren I, 'Computer arithmetic algorithms', Prentice-Hall, Inc, pp. 99-126, 1993.

    [16] Benkrid K, Crookes D, Smith J, Benkrid A, 'High Level Programming for Real Time FPGA Based Video Programming', Proc. ICASSP'2000, IEEE International Conference on Acoustic, Speech and Signal Processing, Istanbul, June 2000, Volume VI, pp. 3227-3231.

    [17] Crookes D, Benkrid K, Bouridane A, Alotaibi K and Benkrid A, 'Design and Implementation of a High Level Programming Environment for FPGA Based Image Processing', IEE proceedings Vision, Image and Signal Processing, Vol. 147, No. 7, pp. 377-384.

    [18] Russ J C, 'The Image Processing Handbook', CRC Press, 1995.

    [19] Visicom Laboratories, 'The VigraVision PCI video board: user's manual', 1998. http://www.visicom.com
