Vforce: An Extensible Framework for Reconfigurable Supercomputing

  • Published on

  • View

  • Download


  • 0018-9162/07/$25.00 2007 IEEE March 2007 39P u b l i s h e d b y t h e I E E E C o m p u t e r S o c i e t y

    C O V E R F E A T U R E

    grammer codes coarse-grained parallelism at the appli-cation level, then optimizes the application for the archi-tecture by identifying those parts that can benefit fromhardware acceleration, and finally recoding the partswith hardware-specific code.

    Supercomputing architectures vary in the level of programming support they offer, but in most cases code particular to the targeted architecture and field- programmable gate array hardware is needed for bothprocessing data and passing data between the applica-tion and the FPGA, and such code is intertwined withapplication code.

    Reconfigurable supercomputing is a volatile field, withvendors introducing new architectures and retiring previ-ous ones within short time frames. Consequently, appli-cations with hardware-specific FPGA optimizationsembedded in the code are not portable across differentreconfigurable computing architectures; designers mustrecode to port their application to a distinct platform. Anovel programming model is therefore required to bringportability to reconfigurable supercomputing applications.


    The Vector, Signal, and Image Processing Library(VSIPL++) for Reconfigurable Computing is an exten-

    Reconfigurable supercomputer architectures require new programming tools that

    support application portability.The Vforce framework, based on the object-oriented

    VSIPL++ standard, encapsulates hardware-specific implementations behind a standard

    API, thus insulating application-level code from hardware-specific details.

    Nicholas Moore, Albert Conti, and Miriam Leeser, Northeastern UniversityLaurie Smith King, College of the Holy Cross

    R econfigurable supercomputing is an emergingresearch area that combines the power ofreconfigurable hardware with traditionalsupercomputing architectures consisting of sev-eral processing nodes with communicationfacilities between nodes. As the High-PerformanceReconfigurable Supercomputing Architectures sidebardescribes, numerous such architectures have appearedin recent years including SGIs RASC RC100, the CrayXD1, and the SRC-7 from SRC Computers.

    Many applications that achieve orders-of-magnitudeprocessing-speed improvements on reconfigurable hard-ware, such as image- and signal-processing applications,are also amenable to parallelization on supercomput-ers. For example, coarse-grained parallelism in image-processing applications can often be achieved bypartitioning the image and letting each node in the super-computer operate over a piece of it or on a differentimage. Reconfigurable hardware on each node can thenbe used to speed up fine-grained parallelism such as pixelprocessing.

    Unfortunately, writing application code that isportable across different reconfigurable supercomput-ing architectures is challenging. The current state of theart is to write high-level application code for a specificreconfigurable supercomputing platform. The pro-

    Vforce: An ExtensibleFramework forReconfigurableSupercomputing

    r3.less.QXP 23/2/07 12:11 PM Page 39

  • 40 Computer

    High-Performance Reconfigurable Supercomputing ArchitecturesSeveral high-performance computing vendors have

    incorporated FPGAs into their parallel architectures.

    SGI RASC RC100Figure A shows the Reconfigurable Application-

    Specific Computing (RASC) RC100, a third-generationreconfigurable compute blade from SGI (www.sgi.com)that slides directly into the Altix NUMAlink 4-basedserver architecture.

    The RASC blade includes a pair of Xilinx Virtex-4LX200 FPGAs, an interface to the NUMAlink backplane,and onboard static RAM (SRAM). The direct coupling ofthe RC100 into the server architecture allows bandwidthof up to 6.4 Gbytes per second into the computersmain memory.

    The RC100 has local expandable memory up to 5SRAM dual inline memory modules (DIMMs) for a totalof 40 Mbytes. The included API supports interfacing

    between the CPUs and FPGAs using a master-slave paradigm.

    Software features include the RASCAbstraction Layer API that allows programmeraccess to the FPGAs and provides features suchas wide and deep scaling. Programmers canuse the VHSIC Hardware DescriptionLanguage, Verilog, Mitrionics Mitrion-C, orCeloxicas Handel-C to program FPGAs.

    Finally, for debugging purposes, the RC100has a RASC-aware debugger based on the GNUProject debugger with FPGA-specific extensions.

    CRAY XD1 AND XT4Figure B1 shows the XD1, Crays first super-

    computer with FPGA coprocessors (www.cray.com). It is a scalable cluster of AMD Opteronprocessors and Xilinx Virtex-II Pro FPGAs. Thesystem provides a low-latency high-bandwidthHyperTransport connection between theOpteron and the FPGA, yielding a peak 3.2-

    GBps bidirectionalbandwidth.

    In the second half of2007, Cray will beginadding FPGAs to the XTline of supercomputersstarting with the CrayXT4. As Figure B2shows, this architectureis similar to the CrayXD1 in that the FPGA isattached to the Opteronvia HyperTransporttechnology; however,two FPGAs are attachedto each Opteron socketwith local dynamic RAM(DRAM) directly associ-ated with each FPGA.This XT4 architecturewill take full advantageof the reconfigurable

    Figure B. Cray supercomputer architectures. (1) XD1 supercomputer architecture with attached FPGAs. (2)

    XT4 supercomputer architecture with attached FPGAs.
















    Figure A. SGI RASC RC100 reconfigurable compute blade.

    TIO NL4






















    (1) (2)

    r3.less.QXP 23/2/07 12:11 PM Page 40

  • March 2007 41

    processor unit modules from DRC Computer Corp.(www.drccomputer.com).

    The API provided with the XD1 uses a master-slaveprogramming paradigm: Each FPGA acts as a slave toone of the CPUs on the motherboard. Several third-party tools can be used to program XD1 FPGAs.

    SRC-7The latest supercomputer from SRC Computers Inc.

    (www.srccomp.com), the SRC-7, provides a flexibleconfiguration of reconfigurable processors, called MAPprocessors, and microprocessor boards, all intercon-nected via a Hi-Bar switch. Each I/O port of the switchand on all modules connected to the switch sustains ayielded data payload of 3.6 GBps with a port-to-port latency of 180 nanoseconds.

    The SRC-7 uses motherboards with dual IntelXeon EM64T microprocessors with a proprietarySNAP interface card that plugs into the mother-board DIMM slots on one side and the Hi-Bar switchon the other side. A simpler, direct-connect mother-board-SNAP-MAP configuration is also supported.

    Figure C shows the MAP processor, which usestwo Altera Stratix II EP2S180 FPGAs to accommo-date user-defined compute functionality. EightSRAM onboard memory banks deliver 16 64-bitwords of data per clock cycle, and two synchronousDRAM banks deliver two words per clock cycle. EachMAP also has two general-purpose I/O ports sustain-ing an additional data payload of 4.8 GBps for directMAP-to-MAP connections or data source input.

    SRCs Carte programming environment supportssource code written in Fortran or C and produces aunified executable containing the code to run onboth the CPU and the FPGAs.

    LINUX NETWORX SS1200 ACCELERATORFigure D shows the SS1200

    accelerator from Linux Networx(www.linuxnetworx.com), anew product released in January2007.

    The SS1200 interfacebetween the system and theaccelerator module is a 16-bit,600-MHz HyperTransport con-nection providing a peak band-width of 4.8 GBps. The systeminterface support includes fourindependent direct memoryaccess channels, bidirectionalbus-mastering, support for

    scatter/gather memory operations and Dword address-ing, and performance counters. To provide the maxi-mum amount of programmable logic, the acceleratorcore, an Altera Stratix II EP2S180 FPGA chip, is separatefrom the HyperTransport bridge.

    In addition to a strong system interface and acceler-ator core, the SS1200 is augmented with a localmemory system consisting of six 64-bit SRAM chan-nels and two 64-bit SDRAM channels that provide theaccelerator with an aggregate of more than 27 GBpsof local memory bandwidth to a single FPGA. Severalthird-party software development tools are expectedto support code development for the SS1200 acceler-ator FPGA.

    Figure D. Linux Networx SS1200 accelerator.

    AlteraStratix IIEP2S180




    533 MHz DDR2 SDRAM





    533 MHz DDR2 SDRAM




    Figure C. SRC-7 (MAP-H series) processor.

    ControllerStratix II

    16 banks onboard memory(64 Mbyte SRAM)

    SDRAM512 MB

    SDRAM512 MB

    User logic 155 MgatesEP2S180

    User logic 155 MgatesEP2S180

    19.2 GB/s (2.4 x 8)

    14.4 GB/s

    4.2 GB/s 4.2 GB/s

    14.4 GB/s sustained payload(7.2 GB/s per pair)

    GPIO4.8 GBps each

    r3.less.QXP 23/2/07 12:11 PM Page 41

  • 42 Computer

    thereby achieving portability. The framework can sup-port any vendors hardware by incorporating the ven-dors API in a generic hardware object.

    Vforce is lightweight compared to other approaches.Invoking the framework to set up a computation imposesa small amount of extra overhead, but there is no over-head during actual computation or data movement.Vforces library calls make it possible to overlap SPP ini-tialization and programming with software execution.Data movement is direct, from the processor hosting theVSIPL++ user program to the SPPs executing the pro-grammed functions. Vforce enables fine-grained paral-

    lelism in its use of custom hardwareconfigurations, task-level parallelismat the user level, and data parallelismat the VSIPL++ level.

    Extensibility to new hardwareTo use Vforce on a new hardware

    architecture, a designer must providea library of supported hardwareimplementations. Vforce includes alibrary lookup mechanism that binds

    hardware transparently to functions the application pro-grammer uses. In this way, the framework cleanly sep-arates SPP programming expertise from applicationprogramming.

    The application programmers code is portable andhardware-independent. Domain experts, such as FPGAdesigners, write SPP implementations for Vforces hard-ware implementation library or compile hardware librarycomponents from higher-level language specificationsusing third-party tools. This approach supports new hard-ware implementations as straightforward additions to theVforce library that can be done once and then leveragedby existing applications without recoding the application.Vforce itself does not prescribe any hardware specifics. It uses hardware implementations made available in alibrary and defaults to software if no hardware imple-mentation of a specific function is available.

    Runtime resource managementEnabling cross-platform code portability requires

    abstracting machine differences. The Vforce RTRMencapsulates hardware-specific information, manageshardware resources, and brokers the binding of an appli-cation task to particular hardware. In addition, theRTRM lets programs run whether or not SPPs are available.

    The RTRM can operate in a simple static bindingmode in which application tasks run on predeterminedhardware. However, it can also incorporate runtimebinding decisions of tasks to hardware to provide faulttolerance, optimization, or load balancing. For exam-ple, the RTRM can dynamically assign a task to partic-ular hardware based on its availability at runtime or use

    sible framework for creating portable applications forreconfigurable supercomputer architectures. Vforceoffers three main features:

    application-level transparent access to special hard-ware,

    framework-level extensibility to new hardware, and system-level runtime resource management.

    Vforce extends the C++ version of the VSIPL++1 APIto make reconfigurable hardware implementations avail-able by encapsulating them beneath a standard API so that the application itself needs no hardware-specific implementa-tion code. Because the programmerdoesnt need detailed knowledge of the hardware, development timedecreases while code portabilityincreases.

    In addition, new reconfigurablecomputing hardware plugs intoVforce via a generic hardware object.The framework provides infrastruc-ture to support hardware-specific bitstreams that are inte-grated beneath the layer of application-visible APIs anduser functions. This permits easy porting of existingVforce applications to new reconfigurable hardware.

    Since the reconfigurable supercomputer platformsVforce targets vary in their number of processors andFPGAs, the framework includes a runtime resource man-ager (RTRM) to enable dynamic binding of applicationcode to specific reconfigurable hardware, load balanc-ing, and resource sharing. Finally, Vforce offers optionalapplication-level support for concurrency so that hard-ware and software functions can run in parallel.

    Transparent access to special hardwareVSIPL++ is a library of commonly used image- and

    signal-processing algorithms designed to increase per-formance, code portability, and end-user productivity.VSIPL++ programs are portable; they do not requirerecoding to move from one platform to another.VSIPL++ implementations can be tailored to specificplatforms to maximize performance and exploit opti-mized libraries.

    The specification offers object-oriented interfaces toboth data and processing objects, which makes it easyto support hardware replacements for functions and tointerpose middleware classes for abstraction. Some func-tions available through VSIPL++, such as fast Fouriertransform (FFT) and finite-length impulse response (FIR)filters, are proven candidates for acceleration withFPGAs or other special-purpose processors. Vforcemakes SPP implementations seamlessly available to theVSIPL++ programmer. It insulates application code fromSPP implementations and encapsulates vendor APIs,

    Additions to the Vforce library can be done

    once and then leveraged by existing applications

    without recoding.

    r3.less.QXP 23/2/07 12:11 PM Page 42

  • a software-only implemen-tation if no reconfigurablehardware is available. Theimplementation of otherdetails, such as the schedul-ing algorithm, is left to eachmanager.


    In VSIPL++, the applica-tion programmer invokes aprocessing object, such asan FFT, that realizes thealgorithm in software. WithVforce, the programmerstill invokes an FFT, but inaddition to running in soft-ware, the processing objectcan run the computationon an FPGA or other SPP.

    For example, the following code snippet invokes a 16-point FFT object that is part of the Vforce framework:

    #include vforce_fft.hppusing namespace vsip;int main(int argc, char* argv[]){vsipl lib;

    Vector inData(16);Vector outData(16);Fft

    fft_obj(Domain(16), 1.0);outData = fft_obj(inData);return(0); }

    The code declares the input and output data types, bothof which are complex floating-point 16-element vectors,then declares and uses an FFT object. The only differ-ence between this application code using a hardware-

    aware FFT and the same code written in rawVSIPL++ is the inclusion of the Vforce header file(vforce_fft.hpp) instead of the VSIPL++ header file.

    In Vforce, processing objects communicate with hard-ware via a generic, hardware-independent internal pro-gramming interface that is neither visible to nor used byapplication code. This IPI, which is implemented by thegeneric hardware object, encapsulates vendor-specificfunctionalityfor example, it includes a genericput_data() method that encapsulates vendor-specificmethods of pushing data to vendor-specific hardware.A library of vendor-specific hardware bitstreams imple-ments the actual algorithms that processing objects compute.

    Figure 1 shows the object-oriented class hierarchyVforce uses to specify a common IPI between process-ing objects and hardware objects. A given ProcessingObjcontrols an SPP through the standard IPI provided bythe HardwareObj. Table 1 lists several functions

    March 2007 43

    Table 1. IPI interface for hardware objects.

    Function Purpose

    Bitstream controlvoid kernel_init(char *kid) Initializes processing element with a specific kernelvoid kernel_dest() Relinquishes ownership of the processing elementvoid kernel_run(bool blocking) Starts the kernelbool poll int() Checks whether the kernel has completedData transfervoid put_data(const View data) Sends data to the processing elementvoid get_data(View data) Retrieves data from the processing elementvoid put_const(unsigned long *data, unsigned int num) Sends data to the board; designed for a small amount of data; used to send

    bitstream configurationvoid get_const(unsigned long *data, unsigned int num) Retrieves data from the board; designed for a small amount of data

    Figure 1. Vforce UML diagram. A given processing object controls a special-purpose processorthrough the standard IPI provided by the hardware object.

    Cray XD1

    SRC -7

    IBM Cell

    Annapolis WC II

    Mercury MCJ6

    VSIPL++ data




    Fft()~Fft()void operator()()

    int scalebool direction*HardwareObj hw


    Fir()~Fir()void operator()()

    int channelsint taps*HardwareObj hw


    HardwareBase()~HardwareBase()void PutData()void GetData()void KernelInit()void KernelDest()void kernelRun()

    FPGAcompute node



    r3.less.QXP 23/2/07 12:11 PM Page 43

  • 44 Computer

    included in the IPI that are common to SPPs, omittingsome details such as error handling.

    Vforce also adds optional application-visible methodsto the processing objects to support concurrency. Thesemethods go beyond the current VSIPL++ standard, so aprogrammer can choose to take advantage of the performance increase that can be obtained when theCPU continues to work while an FPGA computation is running.

    Finally, the processing/hardware class hierarchy con-tains an exception-handling mechanism that eliminatesthe need for hardware-specific exception handling inapplication code. The processing class throws excep-tions to the application programmer in the same situa-tions VSIPL++ would. Vforce does not notify the user ifhardware-specific errors occur, but transparentlydefaults to running the given algorithm in softwarethrough the matching VSIPL++ function.

    VFORCE AT RUNTIMEFigure 2 shows Vforce operation at runtime. The

    RTRM employs an extensible processing kernel libraryof supported hardware implementations that, combinedwith the hardware objects, offers a mechanism to transparently bind standard APIs to hardware-specificimplementations.

    The generic hardware processing object contains nohardware-specific information. Upon instantiation, thehardware object asks the RTRM to provide the hard-ware specifics. The RTRM can interact with one or moreFPGAs to perform initialization activities, as indicatedby the line labeled API between the manager and theprocessing kernel. The RTRMs response contains all of

    the vendor- and model-specific information needed tocommunicate with a corresponding FPGA, including theAPIs and bitstream locations and characteristics. Thehardware object then uses the RTRMs response to inter-act with the particular target hardware.

    The RTRM exists as a separate program that runseither on a separate CPU or as a distinct process in amultitasking environment. To minimize overhead, it isinvolved only during the request of processing resourcesand initialization of the reconfigurable hardware, notduring computation or data transfer. If no reconfigurablehardware is available that can perform the requestedalgorithm, the RTRM will respond appropriately so thatthe processing object can transparently default to per-forming the computation in software.

    The RTRM can operate in a minimum overhead modeand statically bind to predetermined hardware.Alternatively, it can more actively monitor availablehardware and make binding decisions to provide faulttolerance or load balancing. In addition, the RTRM can use knowledge of the environment, profiling, orother means to make optimal decisions about hardwareallocation.

    Consider, for example, the FFT function call. At run-time, the FFT function will invoke the resource managervia the hardware processing object. The manager con-firms that the processing library contains an FFT kerneland that an FPGA capable of running the kernel is avail-able. If both these conditions are met, the manager loadsthe FFT bitstream to the FPGA hardware and binds thehardware processing object with the hardware-specificlibrary functions. The FFT processing object is notifiedand communicates with the hardware by calling the hard-ware objects put_data and get_data methods to transmitdata directly to and from the FPGA. The entire interac-tion between processing object, hardware object, andRTRM is transparent to the application programmer.

    ADDING ARCHITECTURES TO VFORCEAdding a new hardware architecture to Vforce is a

    three-step process of

    creating a hardware object, populating the library with hardware designs, and providing a manager.

    The hardware object encapsulates the vendor-specifichardware APIs behind the Vforce standard IPI. Theframework does not specify any hardware-leveldetailsas long as the standard IPI is implemented inthe hardware object, a vendor can work to the hard-wares strengths, and the application programmer neednot worry about it.

    Adding a function to the library requires both an algo-rithm object and a hardware bitstream. The algorithmobject provides two implementations of the function.

    Figure 2. Vforce runtime operation.The runtime resource man-ager transparently binds standard APIs to hardware-specificimplementations.


    Runtime resource manager

    Processing kernel library

    VSIPL++ data

    VSIPL++ user program

    Processing object FPGA


    Hardware object



    r3.less.QXP 23/2/07 12:11 PM Page 44

  • The first uses the Vforce standard IPI to interact withthe hardware bitstream, and the second is a softwareimplementation that the system calls when there is ahardware error or no SPP is available. If the softwareprocessing object for a function already exists, there isno need to reimplement it. A hardware specialist cancreate a new hardware bitstream using core libraries orthird-party tools.

    We have implemented a defaultRTRM based only on Posix calls andthe C language that should run inmost Unix-like environments. Thismanager provides the minimum sup-port that Vforce requires, and inmany cases it will offer the necessaryfunctionality for a given machine.For machines with a different operating system, or ifextra functionality is desired, the programmer must codea new RTRM. A system programmer can write a run-time manager to provide a wide range of services, fromsimple static scheduling to advanced hardware alloca-tion techniques.

    Importantly, many applications can use a bitstreamonce it has been added to the Vforce processing kernellibrary. A hardware specialist designs the bitstream once,and many applications can use it. In this way, the RTRMand associated library cleanly separate hardware specificsfrom application code and from the application coder.

    VFORCE BENEFITS AND SHORTCOMINGS Vforce contains several features that together enhance

    portability across different platforms. First, it is based onthe VSIPL++ standard and supports hardware imple-mentation options for VSIPL++ functions. Second, theframework cleanly separates hardware specifics, whichare fully encapsulated within the processing kernellibrary, from the application-level code. Third, hardwaresupport in the Vforce libraries encapsulates the hard-ware vendors APIs, which makes it easier to add newhardware. Finally, a system programmer can port ormodify the RTRM itself to support different special-pur-pose hardware platforms without requiring application-level code changes.

    Vforce has limitations as well. The framework does notprovide hardware bitstreams; a hardware design expertmust provide these, possibly by using an automated com-piler. Examples of such compilers include Celoxicas DKDesign Suite (www.celoxica.com/products/dk), DSPlogicsReconfigurable Computing Toolbox (www.dsplogic.com/home/products/retb), Impulse Accelerated TechnologiesCoDeveloper (www.impulsec.com/fpga_c_products.htm),and Mitrionics Mitrion Platform (www.mitrion.com/default.asp?pId=2).

    Another drawback is that layering the applicationfunction, processing object, and generic hardware objectthat interacts with the manager adds overhead.

    However, Vforce also derives benefits from this com-mon technique. The Linux kernel, for example, uses asimilar approach to provide hardware abstraction.Vforce adds overhead only in the setup and assignmentof reconfigurable hardware resources. Once the appli-cation is running, Vforce adds no overhead; all com-munication is direct between calling code and

    reconfigurable hardware.Because Vforce is based on the

    VSIPL++ standard, view or blockdata is opaque, so the datas internalstructure is unknown. Consequently,Vforce copies data from VSIPL++objects into buffers before transmit-ting them to the reconfigurable hard-ware, and vice versa. This short-

    coming could be addressed by exploiting the VSIPL++feature that lets the programmer pass a pointer to previ-ously allocated data storage.

    Vforce attempts to minimize the need for additionalcopying within hardware objects. For example, part ofthe generic hardware object interface includes a call usedto request a buffer specifically for direct-memory-accesstransactions. This can be applied in cases where buffersused for DMA transactions have special initializationprocedures, as with the Cray XD1. Requesting the cor-rect buffer initially avoids a second copy into a DMA-able memory buffer.


    We have implemented Vforce on a wide range of hard-ware architectures including the Cray XD1, MercuryComputer Systems 6U VME (www.mc.com/products/view/index.cfm?id=10&type=systems), and AnnapolisMicro Systems Wildcard II (www.annapmicro.com/wc2.html). The XD1 uses fixed configurations of gen-eral-purpose processors and FPGAs, with the GPPs con-trolling the FPGAs in a master-slave paradigm. Incontrast, the 6U VME offers flexible node configurationsand permits FPGA nodes to function independently, in apeer-to-peer model, without direct control. The WildcardII is a Personal Computer Memory Card InternationalAssociation (PCMCIA) CardBus card with one FPGAand a small amount of memory local to the card and a33-MHz peripheral component interconnect interface.

    Although each of these architectures represents dif-ferent choices in the design space, they all have similarAPIs that let the user program the reconfigurable hard-ware, transfer data, and free the hardware for use byanother application kernel. Our approach encapsulateseach of the vendor APIs in a common hardware object,thus supporting portability across platforms.

    We implemented an FFT on an FPGA for the Vforce ker-nel library for all three systems. The manager looks up thecorrect FFT bitstream in its library, and the application

    March 2007 45

    Vforce adds overhead only in the setup and

    assignment of reconfigurablehardware resources.

    r3.less.QXP 23/2/07 12:11 PM Page 45

  • 46 Computer

    uses it to perform the FFT computation (fft_obj in the pre-vious code snippet). For all of the FFT bitstreams, we usedan FFT core from the Xilinx IP core library.

    The manager for the 6U VME is compatible with theproprietary Mercury Multicomputer OperatingEnvironment. It starts VSIPL++ executables on avail-able processors, then waits to handle requests from therunning programs. When the manager receives a ker-nel_init request, it searches its internal database for anavailable FPGA compute node (FCN) and a processingkernel that matches the requested kernel_id (the pro-cessing algorithmin this case, FFT). If it finds both,the manager configures the compute node, then returnsa handle to that node so the requesting program can useit directly without processing overhead. If the managerdoes not allocate an FCN due to unavailability of eitherthe processing kernel or the hardware itself, the man-ager indicates there is no SPP available so the programcan execute the desired function in software. In addi-tion, the manager attempts to minimize configurationoverhead by keeping track of what configurations areresident on each FPGA. If the manager can execute aresource allocation without having to reconfigure thedevice, it will do so.

    The manager written for the Linux-based XD1 is a ser-vice that waits for incoming requests instead of spawn-ing the users VSIPL++ programs. It relies on a text filethat lists the installed FPGA hardware and what hard-ware object is needed to control each FPGA. The XD1manager was written in Posix-compliant C and containsno platform-specific code. This manager should be ableto be recompiled on other machines that provide a Unix-like environment with minimal modification.

    For the Wildcard II system, which tightly couples theFPGA to a single processor, we chose not to implementa separate manager. The user application performsboth the manager functions and runs the user VSIPL++code. The implementation supports concurrent pro-cessing on the FPGA and host processor, exceptionhandling, and defaulting to software if the FPGA hard-ware is unavailable.

    BEAMFORMING: A VFORCE APPLICATIONWe used Vforce to implement a 3D adaptive time-

    domain beamformer, an algorithm that focuses a sen-sor array to reduce the impact of noise and interference.2

    As Figure 3 shows, the application computes weightsfor incoming sensor data based on previous data andthen applies those weights. The Vforce beamformer usessoftware for weight computation, and it can use eithersoftware or SPP hardware for weight application. Theweight computation and application tasks can run con-currently. Periodically, after calculating a new set ofweights, the application downloads new weights andretrieves data to compute the next weights.

    An important consideration when splitting function-ality between hardware and software is to balance theload between the different partitions to minimize com-munication, which is often the bottleneck in applica-tions using SPPs. Another consideration is that theappropriate granularity for hardware/software parti-tioning is frequently not at the level of a VSIPL++ stan-dard function. Weight application, which consists ofseveral multiply and accumulate (MAC) steps, is coarsergrained than many functions specified in the VSIPL++standard. Our beamforming application demonstratesthat Vforce can support hardware/software partition-ing at a coarser level than a single VSIPL++ function.

    We implemented hardware functions for weight appli-cation for the Cray XD1 and the Mercury 6U VME sothe same Vforce application could use hardware accel-eration on either platform. The results presented hereare for experiments using a Mercury 6U VME chassiswith two daughtercardsone with two PowerPCs andone with two FPGA compute nodes. The Vforce man-ager runs on one PowerPC and the beamforming appli-cation on the other. When SPP hardware is used, weightapplication migrates from software to FPGA hardwareon an FCN.

    The beamforming applications parameters are thenumber of beams, number of sensors, number of equa-tions used to compute weights, and number of samplesprocessed between weight updates. We conducted twosets of experiments. The first set compared performanceusing weight application implemented in software andin hardware as the number of beams varies; the secondset compared performance when only the samplingperiod varies.3

    To compare performance as the number of beamsvaries, we chose five different settings for sensors, equa-tions, and the sampling period as Table 2 shows. Allexperiments processed one million samples per sensor.For each of the five configurations, we measured per-formance for 1, 10, 100, and 1,000 beams, for a total of20 experiments.

    The four results using configuration A in Table 2appear in Table 3, which shows that software weightapplication (MAC steps) scales linearly with the num-

    Figure 3. Vforce beamformer.The application computesweights for incoming sensor data based on previous data andthen applies those weights.

    Weights Weightapplication


    Sensor data Parameter data


    r3.less.QXP 23/2/07 12:11 PM Page 46

  • ber of beams. Weight application is sig-nificantly faster in hardware. As withmost applications using SPPs, overallperformance improves less than thehardware speedup because of overhead.Most of the overhead here is in hard-ware setup time. Overhead also occurs from communicating data to and fromSPPs. In addition, communication andSPP processing do not overlap in our 6UVME Vforce implementation. Over-lapping does occur in our Cray XD1Vforce implementation.

    Table 4 shows the results for 1,000beams for a varying number of sensors,equations, and sampling periods asdefined by configurations B through E.The table indicates that hardware MACspeedup improves as the number of sen-sors increases. Hardware runtime is rela-tively constant compared with theincreasing runtime for software weightapplication. The total application speedupis less dramatic because sampling alsoincreased with the number of sensors.

    Table 5 shows the results from a second set of fiveexperiments that varied only the frequency of weightcomputation for 10,000 beams, 64 sensors, and 64equations. We extrapolated the software weight appli-cation times because of the extremely long (95+ hours)runtimes. The software MAC runtimes are identical inall five experiments because the number of beams andsensors are static. The hardware speedup increases as

    the sampling rate decreases; fewer weight computationsmean fewer costly interruptions to transfer results andcompute new weights in software. Total applicationruntime for a 256K-cycle sampling period droppedfrom about 95 hours (extrapolated) to about 23 min-utes (measured) when the same Vforce application usedweight application implemented in hardware ratherthan software.

    March 2007 47

    Table 2. Parameter configurations for variable-beam experiments on Mercury6U VME system.

    Sampling period Configuration Sensors Equations (cycles)

    A 4 1,024 256 K B 8 512 128 K C 16 256 64 K D 32 128 32 K E 64 64 16 K

    Table 3. Results on Mercury 6U VME system for configuration A.

    Software Hardware Hardware TotalMAC MAC MAC hardware

    Beams (seconds) (seconds) speedup speedup

    1 2.24 1.93 1.16 1.16 10 22.41 6.96 3.22 3.06

    100 224.09 57.28 3.91 3.65 1,000 2,240.90 560.46 4.00 3.73

    Table 5. Results for 10,000 beams, 64 sensors, and 64 equations with varying sampling periods.

    Sampling period Software MAC Hardware MAC Hardware MAC Total hardware (cycles) (seconds) (seconds) speedup speedup

    16 K 342,155.00 5,690.82 60.12 17.89 32 K 342,155.00 2,969.53 115.22 34.81 64 K 342,155.00 1,547.62 221.08 67.91

    128 K 342,155.00 850.85 402.13 131.31 256 K 342,155.00 498.79 685.97 247.99

    Table 4. Results for 1,000 beams with varying sensors, equations, and sampling periods.

    Sampling period Software MAC Hardware MAC Hardware MAC Total hardware Sensors Equations (cycles) (seconds) (seconds) speedup speedup

    8 512 128 K 4,372.68 562.54 7.77 6.62 16 256 64 K 8,636.19 566.64 15.24 10.93 32 128 32 K 17,162.90 575.61 29.82 15.59 64 64 16 K 34,214.80 594.45 57.56 17.73

    r3.less.QXP 23/2/07 12:11 PM Page 47

  • 48 Computer

    OTHER PROJECTSThe increasing availability of supercomputing archi-

    tectures with reconfigurable hardware has sparked inter-est in runtime support for these architectures. Severalother research projects share our goals of maintainingapplication code portability and making the task ofusing reconfigurable hardware easier for programmers.All these projects are aimed at developing an executionmodel and not on automatically compiling to the recon-figurable hardware, which is a related but distinct areaof study.

    Several researchers treat the hardware as a separateexecution thread that runs concurrently with software.Scientists at the University of Kansas have developedhthreads for specifying application threads runningwithin a hybrid CPU/FPGA system.4 Their system sup-ports a master-slave model with one CPU tied to anFPGA. The support for hardware threads requires partof the system to run in hardware on the FPGA andimposes a fair amount of overhead.

    A similar project5 uses threadsboth in master-slave mode and in amore general network with FPGAsacting as peer processing elementsand is based on an abstraction layerthat uses a virtual memory model. Avirtual memory handler must run inFPGA hardware to resolve accessesthat are not in local memory. In themore general network, the hardwaremust include an agent that handlescommunication over the network and resolves memoryaccesses. This approach also requires considerable hard-ware overhead.

    University of Florida researchers have developed aframework to provide runtime services for systems thatinclude heterogeneous hardware. This framework con-sists of two parts, the Universal Standard for UnifiedReconfigurable Platforms6 and the ComprehensiveApproach to Reconfigurable Management Archi-tecture,7 and is designed to support general distributedsystems wherein individual processors can have anattached reconfigurable hardware accelerator. USURPis built on top of the message passing interface and isdistributed, with a small manager running on everynode. These researchers propose a standard interface forhardware designers to use at design time to support run-time portability and services including performancemonitoring and debugging. Their API is of a lower levelthan ours, and requires that the user specify bitstreamdownloading, data transfer, and so on. In our model,these operations are hidden inside functions and notexposed to the programmer.

    Vforce differs from these other projects in severalimportant ways. First, application code does not changeat all from an all-software implementation to a soft-

    ware/hardware implementation. Second, Vforce doesnot require any support on the reconfigurable hardwareitself. This makes our approach more flexible as it canuse any vendors API. The vendor can specify all thedetails of how the hardware is programmed. We do notchange the way hardware is implemented, only the waysoftware invokes it. Finally, Vforce is lighter weight thanother frameworks, introducing minimal overhead.

    V force is not specific to FPGAs and can be used tosupport many different types of SPPs, includinggraphics processing units, digital signal processors,IBMs Cell processor, and others, as it separates SPP pro-gramming from coding applications that benefit fromhardware acceleration. Vforce also ensures that the appli-cation code will run on systems where SPPs are not avail-able, easing application development.

    Future plans include providing support for additionalreconfigurable architectures beyondthe XD1, 6U VME, and Wildcard IIincluding the RASC RC100 andSRC-7. Since the RC100 runs Linux,it should be able to use the genericmanager implemented for the XD1.

    In addition, the RTRM we havethus far implemented is straightfor-ward, scheduling at runtime on afirst-come, first-served basis. We areexploring resource manager imple-mentations with static knowledge of

    the hardware platform, or that dynamically determinethe resources available and how best to allocate themat runtime. This would support more features, such asload balancing and monitoring the state of existingprocessors to avoid faulty hardware components.

    AcknowledgmentsThis research was supported in part by a subcontract

    from ITT Industries under a grant from the US AirForce and by donations from Mercury ComputerSystems and Xilinx. This research is part of the HighPerformance Embedded Computing Software Initiative(HPEC-SI). The Cray XD1 we are using is at the OhioSupercomputing Center. We thank Benjamin Cordesand Kris Kieltyka for their contributions to the Vforceproject.

    References1. High Performance Embedded Computing Software Initiative

    (HPEC-SI); www.hpec-si.org.2. B.D. Van Veen and K.M. Buckley, Beamforming: a Versatile

    Approach to Spatial Filtering, IEEE ASSP Magazine, Apr.1988, pp. 4-24.

    Vforce is not specific to FPGAs and can be used to support many different

    types of SPPs including GPUs, DSPs, and

    IBMs Cell processor.

    r3.less.QXP 23/2/07 12:11 PM Page 48

  • 3. A. Conti, A Hardware/Software System for Adaptive Beam-forming, masters thesis, Dept. of Electrical and ComputerEng., Northeastern Univ., 2006.

    4. D. Andrews, D. Niehaus, and P. Ashenden, ProgrammingModels for Hybrid CPU/FPGA Chips, Computer, Jan. 2004,pp. 118-120.

    5. M. Vuletic, L. Pozzi, and P. Ienne, Seamless Hardware-Soft-ware Integration in Reconfigurable Computing Systems, IEEEDesign & Test of Computers, Mar./Apr. 2005, pp. 102-113.

    6. B.M. Holland et al., Compile- and Run-Time Services forDistributed Heterogeneous Reconfigurable Computing,Proc. Intl Conf. Eng. Reconfigurable Systems & Algorithms,CSREA Press, 2006, pp. 33-41.

    7. R.A. DeVille, I.A. Troxel, and A.D. George, PerformanceMonitoring for Runtime Management of ReconfigurableDevices, Proc. Intl Conf. Eng. of Reconfigurable Systemsand Algorithms, CSREA Press, 2005, pp. 175-181.

    Nicholas Moore is an MS student in the Department of Elec-trical and Computer Engineering at Northeastern Univer-sity, Boston. His research interests include hybrid computerarchitectures and hardware/software codesign. Moorereceived a BS in electrical engineering from the Universityof Rochester. Contact him at nmoore@ece.neu.edu.

    Albert Conti is currently an engineer at the MITRE Corpo-ration and worked on the Vforce project while at North-eastern University, where he received an MS in electricalengineering. His research there focused on the use of FPGAsfor accelerating signal- and image-processing applications.Contact him at aconti@ece.neu.edu.

    Miriam Leeser is a professor in the Department of Electri-cal and Computer Engineering and head of the Reconfig-urable Computing Laboratory at Northeastern University.Her research interests include reconfigurable computing andcomputer arithmetic. Leeser received a PhD in computerscience from Cambridge University. She is a senior memberof the IEEE and the Society of Women Engineers and amember of the ACM. Contact her at mel@coe.neu.edu.

    Laurie Smith King is an associate professor of computerscience at the College of the Holy Cross, Worcester, Mass.Her research interests include hardware/software codesignand programming languages. King received a PhD in com-puter science from the College of William and Mary. She isa member of the IEEE and the ACM. Contact her atlking@holycross.edu.

    March 2007 49

    ISO 9001 provides a tried and tested framework for taking a systematic approach to software engineering practices. Readers are provided with examples of over 55 commonwork products. This in-depth reference expedites the design and development of the documentation required in support of ISO 9001 quality activities. Also available:

    Practical Support for CMMI - SW Software Project Documentation: Using IEEE Software Engineering Standards

    Jumpstart CMM/CMMI Software Process Improvements: Using IEEE Software Engineering Standards

    Practical Support for ISO 9001Software Project Documentation: UsingIEEE Software Engineering Standards

    978-0-471-76867-8 October 2006418 pages Paperback $89.95A Wiley-IEEE Computer Society Press

    To Order:1-877-762-2974 North America+ 44 (0) 1243 779 777 Rest of World

    15 %







    r3.less.QXP 23/2/07 12:12 PM Page 49

    /ColorImageDict > /JPEG2000ColorACSImageDict > /JPEG2000ColorImageDict > /AntiAliasGrayImages false /CropGrayImages true /GrayImageMinResolution 150 /GrayImageMinResolutionPolicy /OK /DownsampleGrayImages true /GrayImageDownsampleType /Bicubic /GrayImageResolution 300 /GrayImageDepth -1 /GrayImageMinDownsampleDepth 2 /GrayImageDownsampleThreshold 2.00333 /EncodeGrayImages true /GrayImageFilter /DCTEncode /AutoFilterGrayImages false /GrayImageAutoFilterStrategy /JPEG /GrayACSImageDict > /GrayImageDict > /JPEG2000GrayACSImageDict > /JPEG2000GrayImageDict > /AntiAliasMonoImages false /CropMonoImages true /MonoImageMinResolution 1200 /MonoImageMinResolutionPolicy /OK /DownsampleMonoImages true /MonoImageDownsampleType /Bicubic /MonoImageResolution 600 /MonoImageDepth -1 /MonoImageDownsampleThreshold 1.00167 /EncodeMonoImages true /MonoImageFilter /CCITTFaxEncode /MonoImageDict > /AllowPSXObjects false /CheckCompliance [ /None ] /PDFX1aCheck false /PDFX3Check false /PDFXCompliantPDFOnly false /PDFXNoTrimBoxError true /PDFXTrimBoxToMediaBoxOffset [ 0.00000 0.00000 0.00000 0.00000 ] /PDFXSetBleedBoxToMediaBox true /PDFXBleedBoxToTrimBoxOffset [ 0.00000 0.00000 0.00000 0.00000 ] /PDFXOutputIntentProfile () /PDFXOutputConditionIdentifier () /PDFXOutputCondition () /PDFXRegistryName /PDFXTrapped /False

    /Description >>> setdistillerparams> setpagedevice


View more >