
VForce: An environment for portable applications on high performance systems with accelerators




J. Parallel Distrib. Comput. 72 (2012) 1144–1156


Nicholas Moore a,∗, Miriam Leeser a, Laurie Smith King b

a Department of Electrical and Computer Engineering, Northeastern University, Boston, MA 02115, United States
b Department of Mathematics and Computer Science, College of the Holy Cross, Worcester, MA, United States

Article info

Article history: Available online 12 August 2011

Keywords: GPU; Heterogeneous systems; FPGA; Middleware; Portability

Abstract

Special Purpose Processors (SPPs), including Field Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs), are increasingly being used to accelerate scientific applications. VForce aims to aid application programmers in using such accelerators with minimal changes in user code. VForce is an extensible middleware framework that enables VSIPL++ (the Vector Signal Image Processing Library extension) programs to transparently use Special Purpose Processors (SPPs) while maintaining portability across platforms with and without SPP hardware. The framework is designed to maintain a VSIPL++-like environment and hide hardware-specific details from the application programmer while preserving performance and productivity. VForce focuses on the interface between application code and accelerator code. The same application code can run in software on a general purpose processor or take advantage of SPPs if they are available. VForce is unique in that it supports calls to both FPGAs and GPUs while requiring no changes in user code. Results on systems with NVIDIA Tesla GPUs and Xilinx FPGAs are presented. This paper describes VForce, illustrates its support for portability, and discusses lessons learned for providing support for different hardware configurations at run time. Key considerations involve global knowledge about the relationship between processing steps for defining application mapping, memory allocation, and task parallelism.

© 2011 Elsevier Inc. All rights reserved.

1. Introduction

Many scientists are using accelerators in a range of systems from stand-alone computers to high performance supercomputers to improve the performance of their applications. These accelerators include Field Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs). This is a rapidly changing field, with new accelerator architectures being introduced and previous generations discontinued very quickly. Programmers would prefer to focus on their applications rather than have to constantly keep up with changes in the accelerator field. VForce is a middleware framework that allows user application code to run unchanged on different hardware architectures that support a mix of General Purpose Processors (GPPs) and Special Purpose Processors (SPPs). The VForce framework is flexible and lightweight; user applications written now can run on future hardware that has not yet been introduced. Supported SPPs include GPUs and FPGAs, and may soon include manycore architectures such as Knights Corner [22]. In addition, the VForce model can easily be applied to chips where the SPP and GPP are integrated on a single die, such as in the Sandy Bridge [23] and Fusion [1] architectures.

∗ Corresponding author. E-mail addresses: [email protected] (N. Moore), [email protected] (M. Leeser), [email protected] (L. Smith King).

doi:10.1016/j.jpdc.2011.07.014


The VForce system focuses on easing the use of SPPs for writers of VSIPL++ programs. VSIPL++ (the Vector Signal Image Processing Library extension) is a library of commonly used processing algorithms [20]. VSIPL++ focuses on high performance, code portability and end-user productivity. VSIPL++ programs are portable; they do not require recoding to move from one platform to another. Implementations of VSIPL++ can be tailored to specific platforms and make use of optimized libraries to maximize performance. VSIPL++ offers object-oriented interfaces to both data and processing objects, which makes it straightforward to support hardware replacements for functions and to interpose middleware classes for abstraction. The VForce framework extends VSIPL++ portability to encompass systems that contain SPPs.

FPGAs and GPUs are particularly well suited to accelerating programs written in VSIPL++. The massively parallel nature of GPUs and FPGAs and the exposed memory hierarchy of FPGAs allow these devices to perform certain classes of operations, including many signal and image processing routines, many times faster than traditional processors. Some functions available through VSIPL++, such as FFT and FIR filters, are proven candidates for acceleration with FPGAs, GPUs, and other special purpose hardware. While SPPs offer performance potential, they are generally more difficult to use than GPPs.


Programming SPPs may require specialized hardware knowledge and different, often vendor-specific, tool chains and development paradigms. Software written to take advantage of SPPs is often full of device-specific code, hampering portability. These characteristics, combined with the rapid rate of change and short lifetime of SPP hardware, make it desirable to have approaches that allow effective SPP hardware utilization while maintaining program portability and separating application development from SPP hardware concerns.

VForce is an extensible middleware framework that enables VSIPL++ programs to transparently utilize various SPPs while maintaining portability across platforms with and without SPP hardware. The framework is designed to maintain a VSIPL++-like environment and hide hardware-specific details from the application programmer while maintaining existing VSIPL++ performance and productivity. VForce can be extended to include new algorithmic functionality as well as new SPP platforms. VForce also introduces optional concurrent execution on GPP and SPP hardware — a technique that can improve the performance of SPP use. Previous publications [8,30–32] have introduced VForce and presented results on using VForce with systems that incorporate FPGAs. In this paper, we extend this work to NVIDIA CUDA GPUs. In order to flexibly and robustly handle both FPGAs and GPUs, some of the internals of the VForce implementation have changed, including communicating more information between the user application and the Run Time Resource Manager. In addition, we present results that illustrate the benefits of exposing concurrent operation using a beamforming application. We discuss lessons learned that are applicable to frameworks that support different hardware configurations at run time. Key considerations involve global knowledge about the relationship between processing steps for defining application mapping, memory allocation, as well as task parallelism.

The rest of this paper is organized as follows. In Section 2, we present background on heterogeneous architectures and on the VSIPL++ standard. Next, in Section 3, we present the VForce framework, followed by a discussion of ways VForce can be extended to support new platforms and new applications in Section 4. In Section 5, we give details of the platforms currently supported, followed by results of experiments on those platforms in Section 6. Finally, in Sections 7–9, we discuss related work, future directions, and draw conclusions. Appendix A contains a glossary of acronyms used in this paper. Appendix B lists the interfaces used by various components of VForce.

2. Background

In this section, we discuss heterogeneous architectures and provide background on the VSIPL++ standard. Heterogeneous architectures that incorporate GPUs and FPGAs have become popular in a range of platforms from laptops and desktops to high performance supercomputers. Examples of GPU accelerated computation are prevalent at scientific conferences. For example, three of the top five supercomputers in the world, announced at Supercomputing 2010, make use of NVIDIA GPUs. The world's fastest, the Tianhe-1A in China, scores 2.507 PetaFlops (PF) in LINPACK thanks to having 7168 GPUs. The third fastest, Nebulae, also in China, scores 1.27 PF in LINPACK thanks to having 4640 GPUs. The fourth fastest, Tsubame 2.0, in Japan, is not far behind at 1.192 PF with 4200 GPUs. Tianhe-1A and Tsubame 2.0 are also on the top ten greenest supercomputer list due to the high efficiency of scientific computation on GPUs [41]. GPUs not only dominate supercomputing but are prevalent in desktops and laptops as well, and they are used by scientists and engineers to accelerate many applications.

FPGAs are also used to accelerate applications in supercomputers and workstations. Examples include the Novo-G cluster [43,10] that includes 96 Altera FPGAs. The Convey HC-1 can be used as a single board accelerator or combined into a cluster [9]. Convey has demonstrated acceleration of algorithms in the life sciences, such as Smith-Waterman. Nallatech demonstrated deep packet inspection on FPGAs at Supercomputing 2010 [33].

Many systems, both academic and commercial, combine FPGAs and GPUs in a single chassis. An example of a research system is the Axel system [42], built at Imperial College, London, and used for n-body simulation. Several commercial FPGA systems, including Convey computers [9], have been configured with both FPGAs and GPUs to enable users to make use of different types of heterogeneous acceleration.

It is interesting to note that all of the systems mentioned above were built within the last couple of years, and most of these systems will be replaced by new models in a year or two. Scientists and engineers writing applications would like the benefits of using the latest accelerator hardware without the effort necessary to port their application to each new platform. VSIPL++ was developed to support portability, productivity, and performance for CPU-based clusters of computers. The VForce framework extends VSIPL++ to systems with FPGAs and GPUs.

VSIPL++ [20], a parallel C++ library specification designed to address signal and image processing applications, is being developed by the High Performance Embedded Computing Software Initiative (HPEC-SI) [18]. The initiative extends the Vector Signal Image Processing Library (VSIPL) [19]. VSIPL++ makes heavy use of C++ templates and object-oriented techniques and is aimed at a variety of applications that make use of signal processing, including radar, sonar, imaging, and other scientific applications. VSIPL++ allows the user to write high level programs while still achieving performance and preserving the portability of the code across parallel computing hardware platforms. The Parallel Vector, Signal, and Image Processing Library (Parallel VSIPL++) [26] extends VSIPL++ by providing high level C++ array constructs, a simple mechanism for mapping data and functions onto parallel hardware, and a community-defined portable interface. Parallel VSIPL++ supports adaptive optimization at many levels. The C++ arrays are designed to support automatic hardware specialization by the compiler and library implementation. The computation objects (e.g., fast Fourier transforms) are built with explicit setup and run stages to allow for run time optimization. However, Parallel VSIPL++ does not support heterogeneous processing on devices such as FPGAs and GPUs. Extending VSIPL++ to these Special Purpose Processors (SPPs) is the topic of this research.

3. VForce

The VSIPL++ FOr Reconfigurable Computing Environments (VForce) framework, based on the object-oriented VSIPL++ standard, presents a standard object-oriented API to application writers that is similar to VSIPL++ and hides implementation details specific to special purpose processors (SPPs) from the user.

VForce achieves three major design objectives:

• Encapsulating access to special purpose hardware to insulate the application from SPP-specific coding.

• Enabling application portability across SPP platforms.
• Maintaining high performance.

With VForce, portability encompasses both application portability and framework portability; the same application can run unchanged on different platforms and the VForce framework can be extended to utilize new SPPs. This paper presents extensions to VForce [31] to use NVIDIA GPUs as well as Xilinx FPGAs.


Fig. 1. VForce at SPP initialization (a) and kernel execution (b). Note that the RTRM is not involved in data transfer or control between the client application and the SPP after initialization.

3.1. VForce architecture

VForce consists of a number of modular components that interact to provide all of the services offered by the framework. These include components compiled into the user application as well as runtime components. The compiled components include VForce processing classes and the generic processing element (GPE). These compiled components do not change from one SPP type to another — a user application does not need to be recompiled to target a new type of SPP. The runtime components include a resource manager (RTRM), SPP control libraries, and SPP algorithm kernel libraries. After a high level operational overview, each of these components is discussed in detail.

One of the important characteristics of the VForce framework is that user applications are compiled without any SPP-specific information in the source or binary. Additionally, all decisions about which, if any, SPP to use are made at run time. VForce introduces new VSIPL++-like processing classes that contain both software-only and SPP-accelerated versions of the same algorithm. When a user application instantiates a VForce processing class, the object creates an instance of the GPE, which represents an abstract SPP device, and constructs an Algorithm Information Container (AIC) that stores the characteristics of the requested algorithm. The processing object passes the AIC to the GPE's initialization method, as shown in Fig. 1(a). The GPE sends the AIC data to a standalone system service through IPC to request an SPP device on which to execute the given problem instance.

The run time resource manager (RTRM), upon receiving a request for an SPP, queries each SPP device through SPP-specific control libraries to determine the availability of SPP device/kernel pairs and, when appropriate, programs the device. If the RTRM is able to find SPP hardware to run the specified algorithm, it indicates which SPP and corresponding SPP control library the GPE should use. The GPE can now dynamically load the SPP control library and bind to the needed SPP-specific functionality. After this step, the RTRM is no longer involved in the user application execution until the user application relinquishes control of the hardware. As shown in Fig. 1(b), the VForce user application communicates directly with the SPP kernel without the RTRM as an intermediary. In other words, during data transfer and processing, there is no overhead due to the RTRM when running VForce applications. When processing is complete, the VForce processing object returns the SPP to the RTRM.
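As a rough illustration of this flow, the sketch below shows how the SPP path of a hypothetical VForce processing object might drive the GPE. The stand-in types and all method names other than kernel_init() are illustrative assumptions rather than the actual VForce interfaces (those are listed in Appendix B).

```cpp
// Sketch only: stand-in classes so the example compiles; the real types live in VForce.
#include <complex>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Minimal stand-in for the Algorithm Information Container (key/value pairs).
struct AIC {
    std::map<std::string, std::string> kv;
    void set(const std::string& k, const std::string& v) { kv[k] = v; }
};

// Minimal stand-in for the Generic Processing Element (abstract SPP handle).
struct GPE {
    void kernel_init(const AIC&) { /* contact RTRM via the communication library */ }
    void write(const std::complex<float>*, std::size_t) { /* host-to-SPP transfer */ }
    void execute() { /* launch the SPP kernel */ }
    void read(std::complex<float>*, std::size_t) { /* SPP-to-host transfer */ }
    ~GPE() { /* surrender the SPP back to the RTRM */ }
};

// SPP path of a hypothetical VForce processing object.
void fft_on_spp(std::vector<std::complex<float>>& data) {
    AIC aic;
    aic.set("algorithm", "fft");                 // describe the problem instance
    aic.set("length", std::to_string(data.size()));

    GPE pe;
    pe.kernel_init(aic);                         // RTRM consulted exactly once, over IPC
    pe.write(data.data(), data.size());          // direct SPP access from here on
    pe.execute();
    pe.read(data.data(), data.size());
}
```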

3.1.1. VForce processing objects and the generic processing element

VForce introduces new custom processing classes that support SPPs for use by VSIPL++ application programmers.

These processing classes can replace existing VSIPL++ functionality, such as FFTs, or introduce new functionality that better matches SPP strengths. In the latter case, processing classes of larger granularity than what is available in the VSIPL++ specification may make sense to minimize data transfer and communications overhead. The relationship between VForce and VSIPL++ is shown in Fig. 2.

Fig. 2. Relation of VForce to VSIPL++.

Fig. 3. Constituent components and hierarchical structure of VForce processing objects and the GPE.

While VForce processing classes are distinct from the default set of classes provided by the VSIPL++ implementation, application code treats VForce processing objects and existing VSIPL++ processing objects identically. In order to work seamlessly in environments both with and without SPP hardware, VForce processing objects contain two implementations of the specified functionality: one for software execution and one for execution on an SPP, as shown in Fig. 3. The VForce software implementation may be realized in any fashion and can optionally use VSIPL++ functionality to perform the appropriate processing.

Within a VForce processing class, the SPP implementation uses the GPE, which represents an abstract SPP, to interact with a corresponding SPP algorithm kernel. The GPE presents the VForce processing class with a set of operations common to most SPPs, and includes methods for loading and configuring SPP kernels, data transfer, and kernel execution and synchronization. Using the GPE's interface insulates VForce processing objects from SPP-specific code while maintaining the same SPP control behavior [31,32]. The abstraction offered by the GPE is possible because most SPP APIs offer similar functionality. The complete list of methods provided by the GPE for use in VForce processing objects is given in Table B.3.

To maximize performance, it is important for an application to be able to exploit concurrency: simultaneous execution on both the GPP and SPP. Concurrent execution allows larger speedups than would be possible from the speedup provided by the SPP on a single kernel alone. VForce processing objects accommodate this need at two levels. First, the GPE methods provide non-blocking data transfer and kernel execution control operations that allow concurrent execution on both the GPP and SPP hardware. This can be used to implement coordinated GPP and SPP execution internal to the VForce processing class; at this level the concurrency is not exposed to the user.


VForce processing objects may also extend the VSIPL++ specification with asynchronous start(), status(), and finish() methods to allow user applications to start a VForce processing object's computation and poll for status while running other tasks concurrently. The decisions to utilize these concurrency capabilities and whether or not to expose task concurrency to the user application are made on a case-by-case basis. VForce objects that do expose the concurrency methods to the user application also operate using the usual blocking function call methods. Section 6.1 covers the FFT VForce processing object, which exposes the concurrency to the user application; Section 6.2 presents a beamforming VForce processing object that takes advantage of concurrency internally.
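The sketch below illustrates how a user application might overlap its own work with such an object. Only the method names start(), status(), and finish() come from the text; their signatures and the stand-in class are assumptions.

```cpp
// Sketch only: polling-based overlap of GPP work with an SPP-backed processing object.
#include <complex>
#include <vector>

struct VforceFftLike {                       // stand-in for a VForce processing class
    void start(const std::vector<std::complex<float>>&,
               std::vector<std::complex<float>>&) {}   // kick off SPP computation
    bool status() const { return true; }     // true once the SPP has finished
    void finish() {}                         // blocks until results are available
};

void do_other_gpp_work() { /* unrelated host-side processing */ }

void overlapped_fft(const std::vector<std::complex<float>>& in,
                    std::vector<std::complex<float>>& out) {
    VforceFftLike fft;
    fft.start(in, out);                      // non-blocking
    while (!fft.status())                    // poll while the SPP works...
        do_other_gpp_work();                 // ...and keep the GPP busy
    fft.finish();                            // synchronize and collect results
}
```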

One of the goals of VForce is to hide SPP-related issues from the user application, including seamless failsafe execution in the case of a hardware error. There are situations where the process described in Section 3.1 may fail: a VForce application could run on a system without an RTRM, or there could be no SPP hardware available to execute the desired algorithm. Also, while not common, errors may be encountered at run time. If at any point while using the SPP an error occurs, the GPE throws an exception. The VForce processing class is expected to catch the error and transfer execution to the software-only implementation. No diagnostic or recovery actions need be taken by the VForce processing class, as other parts of the framework (RTRM) handle error recovery. This behavior guarantees that the user program will never receive hardware-related errors.
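A minimal sketch of this failsafe dispatch is shown below. The exception type and helper names are assumptions; the behavior (catch the GPE's exception and fall back to the software implementation) is as described above.

```cpp
// Sketch only: failsafe dispatch inside a VForce-style processing object.
#include <exception>

struct SppError : std::exception {};         // stand-in for the GPE's exception type

struct ProcessingObjectLike {
    bool use_spp = true;
    void compute() {
        if (use_spp) {
            try {
                compute_on_spp();            // GPE-driven SPP path
                return;
            } catch (const SppError&) {
                use_spp = false;             // RTRM handles device recovery; the
            }                                // object simply stops using the SPP path
        }
        compute_in_software();               // VSIPL++-based software failsafe
    }
    void compute_on_spp() {}
    void compute_in_software() {}
};
```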

The VForce processing classes and the GPE are compiled into the user application. To maintain portability, there is no SPP-specific code in the VForce processing classes or the GPE, keeping application code SPP agnostic. The SPP-specific and system-specific code resides in SPP control libraries that are loaded at run time through the standard POSIX dynamic linking API [36]. The GPE expects to find a standard set of functions within an SPP control library that implements a device's specific API and is responsible for converting between the VSIPL++ types used by the VForce processing classes and the C data types used by the SPP control library interface. The control library for a particular SPP provides control functions that mirror the API provided by the GPE to VForce processing classes. However, not all SPP control libraries provide identical functionality. For example, the basic library functions that SPPs provide, including support for asynchronous data transfers, are shown in Table B.4. Table B.5 shows the additional functions for SPPs that also furnish DMA support.
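The loading step itself uses the standard POSIX dynamic linking API, roughly as sketched below. The library path and the symbol name "pe_write" are placeholders; the actual function set is the GPE SPP control interface of Table B.4.

```cpp
// Sketch only: resolving an SPP control library entry point at run time.
#include <dlfcn.h>
#include <cstdio>

typedef int (*pe_write_fn)(const void* buf, unsigned long bytes);   // assumed signature

int main() {
    void* lib = dlopen("libvforce_spp_cuda.so", RTLD_NOW);           // path is illustrative
    if (!lib) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }

    pe_write_fn pe_write =
        reinterpret_cast<pe_write_fn>(dlsym(lib, "pe_write"));       // placeholder symbol
    if (!pe_write) { std::fprintf(stderr, "%s\n", dlerror()); dlclose(lib); return 1; }

    // The GPE would now route data-transfer and execution calls through
    // the function pointers it resolved here.

    dlclose(lib);
    return 0;
}
```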

Like the SPP control library, the mechanism used to contact the RTRM may be platform dependent. Since no platform-specific information is compiled into the user application, there must be a mechanism for the user application to be provided with the correct SPP control library. In VForce, this is done through another module loaded by the GPE. When a VForce processing object calls the kernel_init() method on a GPE instance, the GPE relies on a separate library containing the two functions shown in Table B.6.

This GPE communication library is expected to have the capability to contact the run time resource manager (RTRM) to make decisions about which, if any, SPP to use for a given algorithm. The GPE communication library may be implemented in any fashion as long as it conforms to the API. This design allows the GPE communication library to contact a non-local RTRM over a network connection or implement all of the SPP allocation decisions locally inside the user VForce process. As currently implemented, the GPE communication library uses inter-process communication (IPC) to contact a system-wide RTRM. The GPE communication library is loaded using the same POSIX dynamic loading API and is located using an environment variable. The IPC functionality is not compiled into the user application, increasing portability. Mechanisms in addition to IPC may be used to communicate with the RTRM as well.

3.1.2. Run time resource manager and algorithm information container

To complete platform abstraction, our VForce implementation relies on a run time resource manager (RTRM). The RTRM is a system daemon that is independent of user applications. It is responsible for managing the available SPP resources in a given system and making decisions regarding which hardware, if any, to assign for use by a given VForce processing class. The mechanism used for the RTRM discovering available SPPs and the corresponding RTRM SPP control libraries is not specified. It may be a simple text file indicating the system configuration, as is the case for our RTRM implementation, or a more advanced probing of system resources. The RTRM SPP control libraries may or may not be the same physical shared library implementing the GPE SPP control library for a given SPP. Control libraries are loaded using the standard POSIX dynamic loading API in a similar fashion to the GPE. Table B.7 lists the set of functions that make up the RTRM SPP control library interface.

In the current version of VForce, the RTRM SPP control library for each device is responsible for maintaining a library of pre-built SPP algorithm implementations. This policy allows each SPP vendor to manage their own library of solutions.

VForce provides an Algorithm Information Container (AIC) data type. The AIC stores key/value pairs that describe the requested algorithm type as well as the relevant parameters for the current problem. VForce also provides a set of functions to add, modify, remove, and query key/value pairs. In the user application, a VForce processing object passes an Algorithm Information Container (AIC) to a GPE's kernel_init() method.

The GPE SPP communication library sends this data structure to the RTRM, where an RTRM SPP control library examines it to determine whether its associated SPPs are capable of performing not only the requested algorithm type, but also the given problem instance. For example, although a given SPP may have an FFT kernel, it may not be able to perform a particular FFT instance, such as a real-to-complex FFT or a very large FFT. If the pe_can_do() function from an SPP control library indicates that an SPP can perform the requested functionality, the RTRM may assign that SPP to the requesting VForce application.
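The sketch below shows the shape of this RTRM-side check. pe_can_do() is named in the text; its signature and the AIC representation are assumptions.

```cpp
// Sketch only: RTRM asking an SPP control library whether it can handle an instance.
#include <map>
#include <string>

struct AIC { std::map<std::string, std::string> kv; };   // stand-in key/value container

// Signature assumed for the control-library entry point (resolved with dlsym()).
using pe_can_do_fn = int (*)(const AIC* request);

bool rtrm_can_assign(pe_can_do_fn pe_can_do, const AIC& request) {
    // Even if the SPP has an FFT kernel, this particular instance (size,
    // real-to-complex, precision, ...) may be unsupported; the control
    // library makes that determination from the AIC.
    return pe_can_do != nullptr && pe_can_do(&request) != 0;
}
```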

Either or both the RTRM and GPE may call the pe_program_pe() function, depending on particular characteristics of a given SPP. pe_get_spp_info() provides the GPE and RTRM with information about several of the SPP's characteristics, including whether a programmed SPP can be passed between processes. While less useful for GPUs, this feature is important for FPGAs, where bitstream loading incurs significant overhead. In the case that an SPP supports RTRM-controlled programming, the RTRM tracks which kernel has been loaded and will attempt to avoid reprogramming the device with the same kernel.

Once a client VForce application is finished using an SPP (usually at VForce processing object destruction time), the GPE calls surrender_pe() to inform the RTRM that the SPP is no longer in use and can be reassigned.

The pe_recover() function is called by the RTRM whenever a client application indicates that a hardware error was encountered. The pe_recover() function is expected to perform any operations on the SPP needed to return it to a non-error state. The return status of the function indicates whether or not the recovery operations were successful and whether the RTRM should continue to consider the SPP for future client requests.
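Taken together, the entry points named above form the RTRM SPP control library interface. A plausible C-style rendering is sketched below; the function names come from the text, but the signatures are assumptions, with the authoritative list in Table B.7.

```cpp
// Sketch only: hypothetical declarations for the RTRM SPP control library interface.
#pragma once
extern "C" {

struct aic;            // opaque Algorithm Information Container
struct spp_info;       // device characteristics (e.g., survives process handoff)

int  pe_can_do(const struct aic* request);      // can this SPP run the given instance?
int  pe_program_pe(const struct aic* request);  // load the kernel/bitstream
int  pe_get_spp_info(struct spp_info* out);     // report device characteristics
void surrender_pe(void);                        // client has returned the SPP
int  pe_recover(void);                          // clear error state; nonzero if the
                                                // SPP may be considered for reuse
}
```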

4. Extensibility and portability

VForce's modular design makes VForce and applications written using it both portable and extensible. Portability and runtime binding to a particular SPP type are achieved through a hierarchy of APIs and loading resources on demand.


Fig. 4. The interfaces in VForce.

The middle and the right-hand side of Fig. 3 represent the SPP-related VForce components that are resident in a VForce user application process. The system-specific inter-process communication (IPC) and SPP control APIs are separate, and are shown in Fig. 4. Each API is used to interact with components, represented with dashed outlines in Fig. 3, that are loaded on demand and are not part of the user application binary. The hierarchy isolates concerns so that each of these components can be considered independently, starting with the interface used by VForce processing objects. VForce applications will run on any platform that supports dynamic shared library loading, and if any component, even the RTRM, is not present, the VForce application will default to software execution.

To support SPP acceleration of functions, several components are required. Internally, the GPE and RTRM each use two interfaces: one for controlling SPP hardware through the SPP control libraries and one for performing communication between a GPE in a VForce user application and the RTRM, as shown in Fig. 4. The SPP communication libraries provided for the GPE and RTRM must use the same mechanism and protocol for communication, including the representation of the AIC.

VForce's runtime architecture enables adding functionality and capabilities to a user program without sacrificing portability. Together, the GPE, DLSLs (Dynamically Linked Shared Libraries), and RTRM exclude SPP-specific code from VForce applications at compile time. The VForce components compiled into user applications are lightweight, and they do not add significant overhead at either compile time or run time. This provides both source code portability and binary compatibility across platforms that share the same application binary interface (ABI). Additionally, logic to control the mapping of an application to the heterogeneous elements in a system is not embedded in the user application. Different RTRMs may be used to create static mappings for applications or perform dynamic mappings based on problem parameters or the characteristics of the available processing elements. When an SPP control library for the target device and a corresponding SPP implementation for the desired functionality are added to the kernel library, the same VForce application running unchanged will gain the ability to use the new SPP implementation.

The SPP kernels used by VForce (shown in Fig. 1(a) and (b)) are part of a pre-built library for any given SPP and not part of VForce. This allows VForce to take advantage of the best available kernel implementation or automatic tool for each target architecture. VForce imposes minimal requirements on the kernels designed for SPPs, allowing kernel implementers to take advantage of an individual device's characteristics and feature set, thus maximizing possible performance. As long as the control library for the SPP implements the standard interface between the GPE and RTRM, the functionality can be implemented in any fashion. Different approaches to SPP control library implementation are discussed in Section 5. In general, there is a one-to-many mapping of VForce processing objects to SPP kernels, as a single processing object can be used on any platform and SPP kernels are generally tied to a single SPP type.

The SPP kernel library is per hardware architecture, and it is shared by many VForce applications. For example, in our kernel libraries, the FFT processing object has a software implementation based on VSIPL++, a Xilinx FPGA implementation based on the Xilinx CoreGen FFT, and an NVIDIA CUFFT-based version for NVIDIA GPUs.

VForce can be extended in several dimensions. Adding support for a new algorithm on an already supported platform requires creating a new VForce processing object for that algorithm; without an SPP kernel the processing will take place in software. Adding SPP support for an existing VForce processing object requires creating an SPP kernel compatible with the device's SPP control library. Note that once the VForce processing object for the new algorithm has been created, the program is still portable, because there will always be a software failsafe, even if other target platforms lack SPPs or the specific SPP kernel. Adding support for a new SPP type requires implementing an SPP control library and any desired SPP kernels. If existing RTRM and communication libraries are acceptable for the new platform, in terms of both compatibility and features, no further work is required. Otherwise, these must be implemented as well. Regardless, the VForce application code will not require any changes.

Finally, VForce provides a simple mechanism for adding SPP acceleration to legacy VSIPL++ applications when the source code is available for recompilation. By compiling in VForce implementations of standard VSIPL++ processing classes, the VForce-enabled SPP-accelerated versions of these classes will be used instead. An example is the VForce FFT replacement discussed in Section 6.1.
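The sketch below illustrates this "recompile to accelerate" path. The vforce::Fft class shown here is a stand-in with an assumed VSIPL++-like interface; the real replacement class ships with VForce and mirrors the standard VSIPL++ FFT closely enough that the surrounding application code is unchanged.

```cpp
// Sketch only: a stand-in drop-in FFT replacement for a legacy VSIPL++ application.
#include <complex>
#include <cstddef>
#include <vector>

namespace vforce {
class Fft {                                      // hypothetical VForce processing class
public:
    explicit Fft(std::size_t n) : n_(n) {}
    // Blocking call: uses an SPP if the RTRM can provide one,
    // otherwise falls back to the VSIPL++ software FFT.
    void operator()(const std::vector<std::complex<float>>& in,
                    std::vector<std::complex<float>>& out) {
        out.assign(in.begin(), in.end());        // placeholder for the real transform
    }
private:
    std::size_t n_;
};
} // namespace vforce

int main() {
    const std::size_t n = 1024;
    std::vector<std::complex<float>> in(n), out(n);
    vforce::Fft fft(n);    // replaces the VSIPL++ FFT object at compile time;
    fft(in, out);          // the rest of the legacy application is untouched
    return 0;
}
```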

4.1. Reference RTRM and communication library

VForce includes reference implementations of the RTRM and GPE communication library. At startup, the reference RTRM reads a configuration file that lists available SPPs in the system and a corresponding RTRM SPP control library for each SPP. During regular operation, the RTRM queries devices about performing a given problem in the order listed in the configuration file and assigns hardware on a first-come-first-served basis. No performance estimation is done, and SPP hardware, if available, is always assigned. The reference RTRM uses standard POSIX API calls and may be used on any platform where more advanced capability is not needed.

The current reference RTRM implementation does not fully implement the interface modularity shown in Fig. 4, as it incorporates the communication into the RTRM itself. A GPE communication library matching the communication mechanism and protocol implemented in the reference RTRM is provided. Communication between the VForce user application and the reference RTRM occurs through System V-style IPC message queues. The message queues were chosen for the reference implementation because they provide automatic arbitration and ordering of messages in the case that multiple VForce client applications attempt to contact the RTRM simultaneously.
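For illustration, the flavor of that message-queue traffic is sketched below. The queue key, message type, and payload layout are placeholders; the actual protocol is internal to the reference implementation.

```cpp
// Sketch only: a client sending a request to an RTRM over a System V message queue.
#include <sys/ipc.h>
#include <sys/msg.h>
#include <cstdio>
#include <cstring>

struct RequestMsg {
    long mtype;          // message type; lets the RTRM multiplex clients
    char aic[512];       // serialized Algorithm Information Container (placeholder)
};

int main() {
    key_t key = ftok("/tmp", 'V');                       // placeholder key
    int qid = msgget(key, IPC_CREAT | 0600);
    if (qid < 0) { std::perror("msgget"); return 1; }

    RequestMsg msg{};
    msg.mtype = 1;
    std::strncpy(msg.aic, "algorithm=fft;length=1024", sizeof(msg.aic) - 1);

    // The kernel arbitrates and orders messages if several VForce clients
    // contact the RTRM at the same time.
    if (msgsnd(qid, &msg, sizeof(msg.aic), 0) < 0) { std::perror("msgsnd"); return 1; }
    return 0;
}
```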

5. Supported platforms

VForce places no constraints on the SPP control library implementation. However, the SPP device must be able to be controlled with the basic set of operations in the SPP control interface (Table B.3), which include initialization, algorithm loading, data transfer, and execution control. To demonstrate framework portability, VForce has been implemented for a range of hardware architectures [8,31]. Previous publications describe VForce on Mercury Computer Systems' 6U VME [28] that features FPGA nodes operating independently in a peer-to-peer model. This paper discusses the extension of VForce to support both the Cray XD1 and NVIDIA CUDA-enabled GPUs.


The differences in the SPP control library implementations between the platforms help to show the benefits associated with SPP control library implementation flexibility.

5.1. Cray XD1

The Cray XD1 [12] is a supercomputer using fixed configurations of general purpose processors and FPGAs operating in a master–slave paradigm. A single chassis encloses 12 GPPs and 6 FPGAs [11]. The XD1 supports both single and dual core processors paired into groups of six two-way or four-way SMP nodes. Each node's processors share an FPGA and are directly connected to the processors' HyperTransport bus via Cray's RapidArray interconnect, which provides high bandwidth, low latency, direct point-to-point connections between processors as well as a mechanism for direct memory access to the CPUs' RAM by the FPGA. Each node on the XD1 runs an instance of the Linux operating system and includes system libraries modified to use the RapidArray interconnect for inter-processor communication and memory accesses, offering up to 8 GB/s bandwidth (with a 1.7 µs MPI message latency) between nodes.

VForce was ported to a Cray XD1 installed at the Ohio Supercomputing Center [34]. This XD1 system had 248 Opteron processors (2.2 GHz, 1 MB L2 cache) and Virtex-II Pro 50-7 FPGAs, and ran Cray software release 1.3 (build 1005), which included GCC 3.3.3. Each node of the system had 4 GB of RAM installed, and each node's FPGA was paired with 16 MB of local second generation Quad Data Rate (QDR-II) SRAM, which was arranged into four 64-bit wide 4 MB banks that could achieve four transfers per clock cycle [37].

Developing complete VForce support for a new platform may consist of porting the RTRM, developing kernel bitstreams for the platform's SPPs, and crafting an SPP control library to implement the GPE and RTRM SPP control library interfaces to the SPP-specific functionality. Because the XD1 used a fully featured Linux-based operating system and the GCC tools, VForce's reference RTRM could be compiled directly. Kernel bitstreams were created for the XD1's FPGAs using VHDL along with cores generated with the Xilinx CORE Generator [45]. The bitstreams made use of IP from Cray for some tasks common to many bitstream designs, including the register file design, DMA engines, QDR-II SRAM cores, and HyperTransport bus communication cores. Designs were synthesized, placed, and routed using Xilinx ISE 9.1i.

The SPP control library implemented for the XD1 allows VForce to take advantage of the XD1's DMA memory transfer functionality. The XD1 requires that buffers used for DMA memory transfer to and from the FPGA be page aligned and that the sizes of the buffers be integer multiples of the system's page size. The XD1 SPP control library implements the DMA-related SPP functions (Table B.5), providing the GPE with a memory buffer that meets the Cray's requirements for DMA, thus allowing VForce to take advantage of DMA for data transfers to and from the FPGA.

The Cray XD1 does not allow a programmed FPGA to be closed and reopened in a non-destructive manner. Since the RTRM and VForce user application are separate processes, only one is allowed to open a given FPGA device at a time. As a result, the GPE in the VForce user application must program the SPP; the RTRM on the XD1 could not reuse programmed FPGAs to minimize reprogramming, as discussed in Section 3.1.

VForce does not specify anything lower level than the GPE SPP control library interface. For the Cray XD1, a relatively simple control library was used, with both the RTRM and GPE control library interfaces implemented with the same shared library. Despite being provided with the AIC by the RTRM, the RTRM control library relied on a simple library mechanism to find available bitstreams based only on matching filenames to algorithm names and data types from the AIC.


A consistent memory layout for bitstream configuration registers and RAM banks was used for all implemented Cray XD1 bitstreams. This allowed a single control library to control all the FPGA bitstreams. Once the bitstream is loaded and the configuration registers set, the control library checks a specific FPGA register that indicates whether the bitstream is capable of performing the specific algorithm configuration. If the register indicates an unsupported configuration, the control library generates an error, causing the GPE to throw an exception to the VForce processing object and triggering the use of the software-only implementation.

5.2. NVIDIA CUDA-enabled GPUs

In the past few years, the use of graphics processing units (GPUs) to perform general purpose computations (GPGPU) has significantly increased. To date, the most widely used GPGPU environment is NVIDIA's CUDA. While there are many significant differences between FPGA and GPGPU development, one of the main issues for VForce is the level of dynamism present in GPU kernel library elements. Modern GPUs execute an ISA as opposed to employing fixed function hardware. Due to the fixed GPU architecture and memory hierarchy, some algorithmic implementations may need to be broken down into several passes of different GPU kernels. In CUDA, memory usage is also highly dynamic and managed at run time.

Since these issues require specialized GPP host code, there may be significant variation in host-side GPU management between two different GPGPU algorithm implementations. As a result, VForce SPP support for CUDA is different in implementation than that for the Cray XD1. The RTRM and GPE SPP control libraries are implemented as distinct shared libraries, and the RTRM SPP control library for CUDA uses AICs to determine if there is an available kernel capable of performing the specific problem instance. This allows VForce to determine whether or not SPP acceleration can occur for a given problem before the RTRM replies to the GPE in the user VForce application.

On the GPE side, algorithm-specific SPP control libraries are used. Assuming that a CUDA implementation is available, the RTRM SPP control library provides the name of a GPE control library tailored to the current algorithm, which the RTRM forwards to the GPE in the user application. The use of algorithm-specific GPE control libraries allows the libraries to contain any custom host-side management code and GPU kernels needed for the current problem. To help reduce redundant code among the algorithm-specific GPE control libraries, VForce code is linked with a library containing a set of common utility functions for CUDA. Since these differences are implemented in the control libraries behind an API specified by VForce, no changes to the user application or to the VForce framework are required. VForce's mechanism for loading SPP-specific code is flexible enough to support a wide variety of device and device kernel management styles.

6. Experiments and results

Two different case studies, an FFT processing object and an adaptive beamforming application, demonstrate that the VForce framework enables SPP hardware performance benefits without sacrificing portability, illustrate characteristics of the VForce framework, and highlight some practical implementation issues.

6.1. FFT

To demonstrate the ability of VForce to accelerate functionality provided by VSIPL++, as well as VForce user application portability, an FFT VForce processing class, largely conforming to the VSIPL++ standard, was created.


Table 1
Speedup of VForce FFT defaulting to software over a regular VSIPL++ FFT.

FFT size (log2)   4      5      6      7      8      9      10     11     12     13     14     15     16
Speedup           0.867  0.947  0.983  0.980  0.997  1.000  1.014  0.991  0.994  1.004  1.017  0.994  0.994

Fig. 5. FFT performance results on a Tesla C1060/Intel Xeon W3580. (a) Performance (FLOPS) of various FFTs versus FFT size for VSIPL++, VForce SW, CUDA, and VForce on CUDA. (b) Data copies as a percentage of VForce GPU run time versus FFT size.

The VForce processing class relies on the standard VSIPL++ FFT for its software implementation. The class also exposes asynchronous start(), status(), and stop() methods, as discussed in Section 3.1.1.

FFT library elements were created for both the Cray XD1 [32] and NVIDIA GPU. The Cray XD1 FFT runs on an FPGA; Xilinx Coregen was used to generate the implementation. The GPU FFT library element is from NVIDIA's CUFFT library, which provides an FFTW-like API for executing FFTs on CUDA-enabled GPUs.

A simple benchmarking application relying on the VForce FFT processing object was written, and the same application source was run on both the XD1 (Section 5.1) and a CUDA-enabled workstation.

For both machines, a reference application that uses native APIs was written to compare the overhead associated with using the VForce framework. The Cray XD1 reference C application utilizes the same FPGA bitstream used by the VForce application, but directly calls the Cray APIs to control the FPGA. Similarly, a C application directly calling the CUDA and CUFFT APIs was used as a reference for CUFFT performance.

Additionally, to compare the performance of the software failsafe, the FFT application was compiled using the standard VSIPL++ FFT without the VForce framework. On both machines (Cray XD1 and CUDA workstation) the FFT library element was removed, resulting in the application compiled with the VForce FFT defaulting to software processing, which in turn invoked the VSIPL++-provided FFT. We compare the performance of the two scenarios: the VSIPL++-provided FFT only, and the VSIPL++-provided FFT software failsafe in the VForce FFT processing object. Comparing the performance on a single platform exposes the overhead associated with running a VForce-enabled application on a machine with no SPPs. For these results, the VForce RTRM was left running. The GPE contacts the RTRM through the communication library and requests a lookup for any available SPP-accelerated FFT implementations. The RTRM replies that the SPP-supported FFT is not found, causing the VForce FFT to switch to the software failsafe. Running without the RTRM will reduce overhead, as the GPE communication library will not be able to contact the RTRM, thus causing an exception earlier in the VForce SPP setup process.

Results presented previously [32] show no significant performance difference between the software failsafe FFT called from the VForce framework and the VSIPL++ FFT running on the Cray XD1. Running this comparison on the CUDA workstation shows similar results (Table 1). Note that slowdown (speedup less than 1) is expected. At very small FFT sizes it is possible to see the effects of VForce overhead. However, for non-trivial amounts of computation, VForce does not add any appreciable overhead to this application. The small fluctuations at larger FFT sizes are likely due to measurement noise.

Overhead is observed in data copying. As an extension built on top of the VSIPL++ API, VForce must rely on the public functionality. An important concept in the VSIPL++ specification is ownership of a given memory buffer. When memory is allocated by the user, it must be admitted into VSIPL++. Once admitted, the user code is only allowed to use standard VSIPL++ library functionality on the memory and may not directly access or manipulate it. This allows a VSIPL++ implementation to reorganize data layouts to better match target processors.

While the replacement VForce processing objects such as the FFT may provide an interface that is conformant to the VSIPL++ specification, they do not, in general, have direct access to the underlying memory. This forces the VForce processing objects to use the element-wise extraction of the underlying data that is supported for admitted memory before sending it to the SPP for processing.
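The sketch below shows the element-wise copy a VForce processing object is forced to perform: it sees only the public VSIPL++ view interface, not the (possibly reorganized) underlying memory. Buffer handling is simplified, assuming Vector::get() and size() from the standard VSIPL++ view interface.

```cpp
// Sketch only: staging data out of an opaque VSIPL++ view before SPP transfer.
#include <complex>
#include <vector>
#include <vsip/vector.hpp>

std::vector<std::complex<float>>
copy_out(const vsip::Vector<std::complex<float>>& view) {
    std::vector<std::complex<float>> staging(view.size());
    for (vsip::length_type i = 0; i < view.size(); ++i)
        staging[i] = view.get(i);        // element-wise extraction
    return staging;                      // now safe to hand to the SPP control
}                                        // library for transfer (or DMA)
```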

Underlying memory access restrictions affect the acceleration performance of the NVIDIA CUDA FFT as well as on the Cray XD1. Fig. 5(a) shows the performance of the VSIPL++ software FFT, the VForce FFT replacement running on an NVIDIA Tesla C1060 GPU, and a C application directly accessing the CUFFT library on the same GPU. In all cases, the CPU was an Intel Xeon W3580 (4 i7 cores at 3.33 GHz with 8 MB L3 cache) and CodeSourcery VSIPL++ 2.2 was compiled to use FFTW 3.2.2 for the FFT implementation. All the GPU results use CUDA 3.2, which includes the CUFFT library. The data copying overhead prevents the VForce user from getting the full performance advantages associated with using the CUFFT library. As can be seen in Fig. 5(a) by comparing Native CUDA and VForce CUFFT, as the FFT size grows the overhead associated with copying data increases. Fig. 5(b) shows the percent of the total application run time involved in the element-wise data copy out of the opaque VSIPL++ container and into memory that can be used for data transfers.


Fig. 6. A high level representation of the beamforming application.

With this overhead removed, the VForce CUFFT times are indistinguishable from the application directly targeting CUDA.

The FFT example highlights the ease of code portability with VForce. The same application source was compiled and run on the Cray XD1, the NVIDIA Tesla C1060, and the VForce FFT software default. Additionally, the VForce framework does not impose any noticeable overhead when running in a software-only environment.

It should be noted that VSIPL++ provides a mechanism to allow users to substitute their own opaque containers to manage the memory underlying the data. It may be possible to use this feature in VForce to introduce a custom container to manage proper allocation of underlying memory so that VForce can gain access to the memory and use it directly in a DMA transfer without a copy. However, since memory is allocated independently of invoking a processing object, it is not possible to know the SPP type, and therefore its memory allocation requirements, before memory allocation time. Additional custom blocks may be created for each SPP type, but this will reduce the source code portability between platforms, as the SPP type must be indicated in advance.

6.2. Adaptive beamforming

A beamforming application was constructed using the VForce framework. Beamforming is a signal processing technique that controls the sensitivity pattern of an array of antennas. By adjusting the gain and phase or time delay of the incoming sensor data, a beamformer can increase gain in selected directions or frequencies and place nulls in the sensitivity pattern in the direction of interfering signals. Beamforming can be performed in either the frequency domain, where phase shifts are used, or in the time domain, where time delays are used. In both cases weights are computed for each sensor and for each beam that modify the gain and phase of the sensor reading. Our implementation performs time domain adaptive beamforming [8,31].

The beamformer application demonstrates the importance of processing object granularity, a high level performance consideration for VForce. SPPs are able to perform certain operations much faster than the host GPP, but there are overheads associated with using SPPs, including setup (e.g. FPGA programming) and data transfer.

The overhead varies, but in general there must be enough work performed on the SPP so the processing speedup compensates for the overhead. The VSIPL++ specification includes many objects whose computations are not large enough to overcome SPP overheads. Consequently, VForce allows for the creation of processing objects that are much larger in granularity than a typical VSIPL++ processing object.

The beamforming application exemplifies this scenario. As shown in Fig. 6, the processing required can be broken down into two large pieces: weight update and weight application. Weights are continually applied to incoming sensor data with a periodic update of the weights in order to minimize or emphasize certain characteristics of the data. The mathematical operation performed in the weight application is a complex, indexed weighted multiply accumulate. This operation can be effectively implemented with FPGAs, and a bitstream to perform this operation was developed for the XD1 FPGAs. The weight application bitstream relies on the communication, register, QDR-II SRAM, and DMA infrastructure provided by Cray, and it was synthesized, placed, and routed with Xilinx ISE 9.1i. Details of the implementation of the beamformer can be found in [32]. The weight update algorithm used for the beamformer involves a least squared error solver, and is well suited for GPP execution. A weight computation class was implemented that relies on the VSIPL++ qrd processing object, which provides a QR decomposition-based linear system solver.

An important feature of VForce that the beamforming application illustrates is the concurrent operation of processing on GPP and SPP. Weight application and weight update are computed simultaneously, using coarse-grained task parallelism. The overlapping of processing as well as data transmission is illustrated in Fig. 7. Send data transfers to the SPP; return results returns them to the GPP. Weight application is done on the SPP; weight computation is done on the GPP. As Fig. 7 illustrates, sending sensor data to the FPGA, weight application on the FPGA, returning data from the FPGA, and weight computation on the host occur simultaneously. The sensor data is double buffered, allowing the transfer of new data to the FPGA while processing the previous set. The results are immediately streamed back to the host without temporary storage on the FPGA, eliminating the need to wait for an additional data transfer after each group of sensor data. The beamformer processing object coordinates these simultaneous operations using the asynchronous data transfer and kernel execution capabilities of the GPE. Note that the parameter data transfer to the FPGA, which includes newly computed weights, is a serialization point, as this data is not double buffered. The concurrent execution of the various activities required to perform the beamforming creates a processing pipeline, where the effects of a set of sensor data do not affect the outcome of the beamformer immediately. Fig. 7 indicates this delay through the letter labels and arrows. For example, the sensor data entering the system at the first time step shown, labeled A, does not affect the weights used on incoming data for two time steps, the steps labeled C.
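The shape of this double-buffered, overlapped loop is sketched below against an assumed asynchronous GPE-like interface; all method names here are illustrative, and the coordination pattern (overlap transfers and SPP work with GPP weight computation, serialize only on parameter pushes) is the point.

```cpp
// Sketch only: double-buffered beamformer pipeline with GPP/SPP overlap.
#include <cstddef>
#include <vector>

struct SppHandle {                                        // stand-in for the GPE
    void send_async(const float*, std::size_t) {}         // non-blocking host-to-SPP
    void start_kernel() {}                                 // non-blocking weight application
    void receive_async(float*, std::size_t) {}             // non-blocking SPP-to-host
    void wait() {}                                          // synchronize outstanding work
    void push_parameters(const std::vector<float>&) {}     // new weights: serialization point
};

void compute_weights_on_gpp(std::vector<float>&) { /* QR-based least squares solve */ }

void beamform(SppHandle& spp, const std::vector<std::vector<float>>& blocks,
              std::vector<std::vector<float>>& results, std::size_t update_period) {
    if (blocks.empty()) return;
    std::vector<float> weights(64, 1.0f);                  // placeholder weight vector
    spp.send_async(blocks[0].data(), blocks[0].size());    // prime the first buffer
    for (std::size_t b = 0; b < blocks.size(); ++b) {
        spp.start_kernel();                                // apply weights to block b
        if (b + 1 < blocks.size())                         // double buffering: ship the
            spp.send_async(blocks[b + 1].data(),           // next block while b runs
                           blocks[b + 1].size());
        spp.receive_async(results[b].data(), results[b].size()); // stream results back
        if ((b + 1) % update_period == 0)
            compute_weights_on_gpp(weights);               // overlaps with SPP work
        spp.wait();                                        // sync before new parameters
        spp.push_parameters(weights);                      // not double buffered
    }
}
```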

Numerous simulations of the beamforming application using data sets consisting of 2^20 time steps were analyzed [32].

Fig. 7. Data flowing through the beamformer’s weight application pipeline (not to scale).


[Fig. 8 chart: "Execution Time vs. Update Period". X-axis: Update Period (Samples), 1024 to 32768; Y-axis: Time (seconds), 0 to 100; legend: Weight Application, Weight Update, Push Data, Push Parameters, FPGA Configuration, Observed Runtime.]

Fig. 8. Cumulative component and observed run times for the accelerated FPGA version of the beamforming application. The individual task times are shown stacked on top of each other in the order shown in the legend.

The results show that concurrent execution on both the SPP and the GPP allowed overall application speedups greater than the speedup of the accelerated portion alone; for some parameter settings the total application speedup exceeds the weight application speedup. This occurs when the weight computation and weight application are well balanced. Note that there are a large number of parameters that can be varied, including weight update period, number of sensors, and number of beams. Fig. 8 plots the observed run time of the beamformer application, as well as the times of several smaller parts of the whole beamforming application, while varying the update period. The observed run time is plotted as a solid line with respect to the horizontal axis, and the individual components are shown as blocks of color graphed cumulatively, so that each curve is plotted relative to the next lowest curve. With short update periods, the weight computation time dominates and the weight application time is hidden due to concurrency. When the update periods are long, the weight computation is infrequent, and as a result there is little GPP processing time to overlap with weight application; in this case, the observed run time is close to the sum of the components. Concurrent execution is essential to realizing peak performance on heterogeneous systems, and is a feature added to VSIPL++ by VForce.

7. Related work

Our focus with VForce is the calling mechanism for applications, both at compile time and at run time. There are many tools for generating application kernels for both GPUs and FPGAs that we plan to leverage. The most common approach to programming SPPs is to use vendor-supplied tools. We use these for developing library kernels, as we have for our experiments. For interfacing with user code, however, vendor tools such as CUDA and those provided by FPGA manufacturers do not result in portable code. In this section, we discuss work similar to VForce aimed at providing high performance, portable code for GPPs and SPPs.

One approach is to compile from a common source language. Howes et al. [17] describe compiling to GPUs, FPGAs, and the Playstation 2. A similar project, Accelerator [3,38], targets both FPGAs and GPUs from F# and C# source code. The GPU code is automatically generated and compiled at run time, while the FPGA target generates VHDL for offline compilation. Other compilation frameworks typically target FPGAs or GPUs but not both. FPGA compilers from C source include ImpulseC and ROCCC [21,39,16]. The Portland Group provides support for compiling to NVIDIA GPUs from C, C++, and Fortran [40]. Pai et al. [35] describe an approach called PLASMA, where different source languages are translated to a common intermediate representation (IR) before being compiled to a SIMD accelerator. They support GPUs as well as SIMD extensions to GPPs. While PLASMA has a runtime component, only runtime support for memory management is described. Diamos et al. [13] have created Ocelot, a runtime framework that translates CUDA PTX kernels to LLVM IR before retargeting the code for multicore CPUs, GPUs, or the Cell processor.

Any of the tools mentioned above, including vendor-supplied tools and libraries, can be used in our approach. VForce imposes few limitations on implementations of SPP control libraries and loads any needed support libraries at run time. This permits the designer to use the best tools available to achieve performance for each different accelerator platform and allows VForce to support a wide range of SPP types.

Intel QuickAssist [24] has similar goals to VForce in terms of separating user code from accelerator code in a portable manner. QuickAssist targets Intel processors for interfacing to FPGA chips over the Front Side Bus (FSB), making it less portable than VForce.

Software support for GPUs and multicore processors is more widespread than for FPGA-based platforms. OpenCL [25] supports different GPU vendors, specifically NVIDIA and AMD, as well as programming for multicore platforms. However, in many cases each SPP requires a unique kernel, managed by the application programmer, to take advantage of the device specifics. The OpenCL runtime environment is at a much lower level than the target of VForce. OpenCL may be a good virtual SPP target for VForce, which we plan to investigate. In addition, while OpenCL specifically targets digital signal processors, support for FPGAs is not yet available.

VForce targets application code written in VSIPL++. Two other groups, Codesourcery and GTRI, are investigating interfaces between VSIPL and accelerators. Neither of these efforts is as general as VForce, since neither considers both FPGAs and GPUs. Codesourcery has implemented support for NVIDIA GPUs and for the Cell/B.E. for a handful of VSIPL++ processing objects [7]. This is all done below the level at which the application programmer is involved, and memory management is integrated into the running of the VSIPL++ program. The Georgia Tech GPU-VSIPL library supports NVIDIA CUDA calls from C-VSIPL [6,15]; however, VSIPL++ is not supported. Excellent results have been demonstrated on some examples. VForce can take advantage of the code generated by these projects, so this work is complementary to ours. The VForce environment would add a layer such that GPU-VSIPL libraries could be used (or not) without changing the user application code.

These efforts do not include several hardware and platform abstractions that VForce provides, including support for heterogeneous hardware within the same application, resource management through the RTRM, and a software failsafe for functionality not supported on an SPP. The greater flexibility and adaptability of VForce, however, requires more initial effort to add support for new functions and hardware.

Approaches for supporting accelerators similar to VForce are represented by Auto-Pipe [14], StarPU [2], Merge [27], and the framework proposed by Wernsing and Stitt [44]. All of these projects use a runtime component to execute tasks on heterogeneous systems and are designed to separate application specification from kernel creation and SPP-specific concerns. Auto-Pipe focuses on streaming applications and makes use of its own coordination language, called X, and Merge is based on the Map-Reduce paradigm. VForce builds on the VSIPL++ standard, allowing for more flexible application specification. Wernsing and Stitt [44] focus on empirical performance profile planning for mapping applications to systems, while StarPU employs its own task graph and scheduling policies. VForce specifies no application- or system-specific mapping in the framework itself.


While the current RTRM implementation relies on a greedy algorithm, alternatives may be provided without recompiling the user application. Additionally, VForce differs from these frameworks by providing a failsafe execution mechanism for faulty SPPs or implementations, as well as by working on both FPGAs and GPUs.

8. Discussion and future directions

The VForce framework shows great potential for introducing portable and largely transparent SPP acceleration to VSIPL++ applications. In this section, we present several directions to improve VForce as well as general lessons learned that can be applied to similar frameworks.

The overhead of data copying, as discussed in Section 6.1, affects performance. We are investigating methods to reduce this overhead and eliminate unnecessary copying. One possibility is to take advantage of a VSIPL++ mechanism that allows users to substitute their own opaque containers to manage the memory underlying data. This may make it possible to allocate memory according to constraints imposed by an SPP and allow VForce direct access to the memory for use in DMA transfers without copies. However, this requires that the hardware type is known at memory allocation time. With VForce, SPP allocation decisions are made late, at run time, within a GPE, while VSIPL++ data containers allocate their own memory asynchronously and outside the visibility of VForce components. The challenges are to provide target SPP memory allocation requirements to a data container without requiring changes to VSIPL++ source code, and to avoid over-allocating SPP-specific memory in a way that is detrimental to overall performance; allocating too much page-locked memory for NVIDIA GPUs is an example of the latter issue.
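As a concrete illustration of this direction, the following minimal sketch backs a VSIPL++ Dense block with page-locked host memory so that a GPU control library could DMA from it directly. It assumes the user-specified storage mechanism (admit()/release()) from the VSIPL++ specification and CUDA's cudaHostAlloc(); it is not part of the current VForce implementation.

  #include <cuda_runtime.h>
  #include <vsip/initfin.hpp>
  #include <vsip/domain.hpp>
  #include <vsip/dense.hpp>
  #include <vsip/vector.hpp>

  int main()
  {
    vsip::vsipl init;                          // VSIPL++ library init/finalize
    vsip::length_type const n = 1 << 20;

    float* pinned = 0;
    cudaHostAlloc(reinterpret_cast<void**>(&pinned),
                  n * sizeof(float), cudaHostAllocDefault);  // page-locked allocation

    // Dense block bound to user storage; admit() hands the memory to VSIPL++.
    vsip::Dense<1, float> block(vsip::Domain<1>(n), pinned);
    block.admit(false);
    vsip::Vector<float, vsip::Dense<1, float> > data(block);
    data = 0.f;                                // ordinary VSIPL++ use of the view

    block.release(true);                       // hand the memory back before freeing it
    cudaFreeHost(pinned);
    return 0;
  }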

We are also investigating mechanisms to increase the granularity of functionality assigned to SPPs. Ideally, several functions would be grouped together and the resulting set of kernels would be processed on the SPP. This would allow the user application to keep data on the SPP between function calls and would thus reduce unnecessary data transfers between the host and the SPP. This could also be used in an FPGA design where the data is held statically in memory local to the FPGA while the bitstream is reconfigured. VForce, which is in charge of managing all functionality running on SPPs, could use knowledge of the function-to-SPP assignment to improve overall system behavior.

Another area of improvement involves SPP assignment by the RTRM. Currently, the RTRM queries each device and assigns the first SPP discovered that is capable of performing the requested functionality. SPP assignment makes no attempt to select the best performing SPP for the current request, and could clearly be much more sophisticated. The RTRM SPP control library interface could be extended to support the communication of performance-related information. Decisions on which SPP to use could then be made statically, or dynamically at run time, based on this information. This should further improve VForce application performance.
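A minimal sketch of such a policy follows. The estimate_runtime() hook is a hypothetical extension of the SPP control library interface (it does not exist today), and the structure and function names are invented for illustration; the current RTRM simply returns the first capable device.

  #include <cstddef>
  #include <limits>
  #include <string>
  #include <vector>

  struct SppInfo {
    std::string name;
    bool (*can_do)(const std::string& problem);              // cf. pe_can_do()
    double (*estimate_runtime)(const std::string& problem);  // hypothetical extension
  };

  // Return the index of the capable SPP with the lowest estimated runtime,
  // or -1 if no device can execute the requested problem.
  int select_spp(const std::vector<SppInfo>& spps, const std::string& problem)
  {
    int best = -1;
    double best_time = std::numeric_limits<double>::max();
    for (std::size_t i = 0; i < spps.size(); ++i) {
      if (!spps[i].can_do(problem))
        continue;
      double t = spps[i].estimate_runtime(problem);
      if (t < best_time) { best_time = t; best = static_cast<int>(i); }
    }
    return best;
  }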

The beamforming application highlights the importance of coordinated simultaneous host and SPP execution, something that requires better support in both VSIPL++ and VForce. VSIPL++ needs polling or interrupt mechanisms to support asynchronous function calls; these were added in VForce and are the one place where VForce does not conform to the VSIPL++ specification. Concurrent execution requires a mechanism to ensure data consistency if multiple processors may potentially be accessing the same data. Exploiting concurrency internally to a VForce processing object, as done in the beamforming example, alleviates some of these issues by preventing application programmers from running into data consistency problems.

VForce was designed to be flexible and to allow application code to run on accelerators that were not available when the application was written. VForce applications, both in source and binary form, are highly portable, and this portability is achieved by postponing decisions about application mapping until very late, at run time. The consequence of these decisions is that the VForce framework lacks global visibility into the overall structure of the application. This is a tradeoff: more effective mapping decisions can be made with more information about the complete hardware and software environments.

For example, extra data transfers are a direct result of this late binding. In VSIPL++, memory objects and processing objects are instantiated independently from each other, and any processing object may be used on a given memory object. More information about the overall structure of the application is needed to effectively allocate the right type of memory at the right time. One approach to accomplish this is the Task and Conduit framework [29]. Here, processing objects (Tasks) and data pipes (Conduits) between processing objects are fully instantiated and connected before the application processing begins. This approach allows each data object to know its role: it may be temporary storage between two stages of processing within the same memory space (two processing objects on the same GPP or SPP), or data that must be transferred across a memory space boundary between a GPP and an SPP. The tasks and conduits approach also allows for a global mapping of algorithms to computational resources. This knowledge could be combined with a more sophisticated API for SPP performance estimation to produce better application mappings onto a heterogeneous computer system.

The VForce project has shown that the main concepts behind VForce, including just-in-time dynamic binding of algorithms to SPPs and highly portable application code that runs in environments with and without SPP hardware, are sound. These considerations, combined with a more global optimization approach, are the focus of current research involving portability on heterogeneous systems [4,5].

9. Conclusions

VForce provides the VSIPL++ application programmer with the ability to use SPP acceleration in a transparent, portable, and efficient manner. The application code is portable, and it can be directly recompiled in many VSIPL++ environments, even those without SPP hardware or RTRMs. There is binary portability across systems with a common ABI but different SPP hardware. Through the separation of concerns provided by VForce, the application programmer need not worry about SPP details or the availability of various types of SPP devices. Hardware expertise, optimized libraries, or specialized tools are used to build SPP kernels. In addition, separate runtime components allow adding features and capabilities after the application code is complete, such as making use of accelerator hardware that was not available when the application was written. Hardware interactions are encapsulated within VForce processing objects, and specifics of the hardware platform are encapsulated in the RTRM and SPP control libraries.

Our experiments show that it is possible to take advantage of SPPs and task concurrency in VForce applications while maintaining VSIPL++ portability. We plan to apply VForce to more user applications and to broaden architecture support to demonstrate the effectiveness of VForce in portability, reusability, and performance across platforms.

Acknowledgments

The Cray XD1 used in this research was at the Ohio Supercomputing Center. This research is part of the High Performance Embedded Computing Software Initiative (HPEC-SI). We would like to thank Benjamin Cordes, Al Conti, and Kris Kieltyka for their contributions to the VForce project. We would also like to thank The MathWorks for their support.


Table A.2
Glossary of acronyms.

AIC: Algorithm Information Container
DLSL: Dynamically Linked Shared Library
FPGA: Field Programmable Gate Array
GPE: Generic Processing Element
GPP: General Purpose Processor
GPU: Graphics Processing Unit
IPC: Inter-Process Communication
RTRM: Run Time Resource Manager
SPP: Special Purpose Processor (FPGA or GPU)
VSIPL: Vector Signal Image Processing Library
VForce: VSIPL++ for Reconfigurable Computing Environments

Table B.3
The generic processing element's SPP-independent interface.

get_data(), put_data(): Blocking functions to transfer data to and from the SPP.
get_data_start(), put_data_start(): Non-blocking functions to transfer data to and from the SPP.
get_data_status(), put_data_status(): Non-blocking functions to check if an outstanding data transfer to or from the SPP has completed.
get_data_finish(), put_data_finish(): Blocking (if necessary) functions to complete data transfer to or from the SPP.
get_const(), put_const(): Read or write SPP kernel configuration parameters.
kernel_init(): Obtain an SPP and set up the specified kernel (uses the RTRM).
kernel_dest(): Relinquish ownership of the SPP (uses the RTRM).
kernel_run(): Start the kernel on the SPP (blocking or non-blocking, as specified).
poll_int(): Check for SPP kernel execution completion (only useful for non-blocking calls to kernel_run()).
clear_int(): Prepare the SPP for another execution.
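For orientation, the following is an illustrative call sequence against this interface, roughly as a VForce processing object might use it for a simple blocking kernel. The kernel name, argument types, and the template wrapper are assumptions; the paper does not give full method signatures.

  #include <vector>

  template <typename GPE>
  void run_kernel_once(GPE& gpe, std::vector<float>& in, std::vector<float>& out)
  {
    gpe.kernel_init("fft");   // obtain an SPP via the RTRM and load the kernel
    gpe.put_data(in);         // blocking transfer of input data to the SPP
    gpe.kernel_run(true);     // execute the kernel, blocking until completion
    gpe.get_data(out);        // blocking transfer of results back to the GPP
    gpe.kernel_dest();        // relinquish the SPP back to the RTRM
  }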

Table B.4
The basic GPE SPP control library interface.

pe_vforce_api_version(): Returns the version of the VForce API implemented by the SPP control library.
pe_kernel_init(): Perform any SPP or kernel initialization required.
pe_get_spp_info(): Returns SPP device characteristics.
pe_get_kernel_info(): Returns characteristics of the SPP's current kernel.
pe_program_pe(): Program the SPP with the specified kernel. Depending on the characteristics of the SPP, this function may not be required.
pe_kernel_run(): Execute the previously loaded kernel (specify whether blocking).
pe_poll_int(): Check the execution status of the SPP (non-blocking).
pe_clear_int(): Prepare the SPP for another kernel execution.
pe_kernel_dest(): Finalize the SPP when the VForce user application is finished.
pe_get_data(), pe_put_data(): Synchronous read or write of data to or from the SPP.
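To make the plug-in model concrete, below is a hypothetical sketch of a few of these entry points for an NVIDIA GPU, using a CUDA stream and event to provide the non-blocking execution and polling semantics. The exact VForce ABI (argument lists, return codes, how the kernel itself is located and launched) is not specified in this paper, so the signatures and return conventions shown are assumptions.

  #include <cuda_runtime.h>

  namespace {
    cudaStream_t g_stream;
    cudaEvent_t  g_done;
  }

  extern "C" int pe_kernel_init()
  {
    cudaStreamCreate(&g_stream);
    cudaEventCreate(&g_done);
    return 0;                                   // 0 == success (illustrative convention)
  }

  extern "C" int pe_kernel_run(int blocking)
  {
    // A real control library would launch the previously loaded kernel on
    // g_stream here; the launch itself is omitted from this sketch.
    cudaEventRecord(g_done, g_stream);          // mark the completion point
    if (blocking)
      cudaEventSynchronize(g_done);
    return 0;
  }

  extern "C" int pe_poll_int()
  {
    return cudaEventQuery(g_done) == cudaSuccess;  // 1 if the kernel has finished
  }

  extern "C" int pe_clear_int()
  {
    return 0;                                   // nothing to reset in this sketch
  }

  extern "C" int pe_kernel_dest()
  {
    cudaEventDestroy(g_done);
    cudaStreamDestroy(g_stream);
    return 0;
  }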

Table B.5
The additional GPE SPP control library functions for SPPs that support DMA.

pe_get_dma_buffer(), pe_free_dma_buffer(): Allocate and deallocate a DMA buffer for the SPP.
pe_get_data_dma_start(), pe_put_data_dma_start(): Start a DMA transfer from or to the SPP.
pe_get_data_dma_status(), pe_put_data_dma_status(): Check the status of a DMA transfer from or to the SPP.
pe_get_data_dma_finish(), pe_put_data_dma_finish(): Complete and finalize an outstanding DMA transfer.

Table B.6
The GPE communication library interface.

request_pe(): Indicate which, if any, SPP control library to load and which device to use to execute the specified problem.
surrender_pe(): Relinquish control of an SPP previously obtained with request_pe().

Appendix A. Glossary

See Table A.2.

Appendix B. Interfaces in VForce

See Tables B.3–B.7.


Table B.7
The RTRM SPP control library interface.

pe_vforce_api_version(): Returns the version of the VForce API implemented by the SPP control library.
pe_get_spp_info(): Returns SPP device characteristics.
pe_program_pe(): Program the SPP with the specified kernel. Depending on the characteristics of the SPP, this function may not be required.
pe_can_do(): Indicates whether or not the given SPP can execute a specified problem.
pe_recover(): Performs any appropriate SPP recovery operations and indicates to the RTRM whether or not the SPP should continue to be assigned to VForce user applications.
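As a small example of the RTRM side, a control library for an SPP that ships a fixed set of kernels might implement pe_can_do() along the following lines. The argument list and problem-description format are not given in this paper, so a plain string name is assumed here, and the kernel names are illustrative.

  #include <cstring>

  extern "C" int pe_can_do(const char* problem_name)
  {
    static const char* const supported[] = { "fft", "fir", "weight_apply" };
    for (unsigned i = 0; i < sizeof(supported) / sizeof(supported[0]); ++i)
      if (std::strcmp(problem_name, supported[i]) == 0)
        return 1;               // this SPP has a kernel for the requested problem
    return 0;                   // fall back to software or to another SPP
  }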

References

[1] AMD, The AMD Fusion family of APUs, 2011, http://sites.amd.com/us/fusion/apu/Pages/fusion.aspx.
[2] C. Augonnet, S. Thibault, R. Namyst, P.-A. Wacrenier, StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, Concurrency and Computation: Practice and Experience, Special Issue: Euro-Par 2009, 23 (2011) 187–198.
[3] B. Bond, K. Hammil, L. Litchev, S. Singh, FPGA circuit synthesis of accelerator data-parallel programs, in: Field-Programmable Custom Computing Machines, FCCM, 2010, pp. 167–170.
[4] J. Brock, M. Leeser, M. Niedre, Adding support for GPUs to PVTOL: the parallel vector tile optimizing library, in: High-Performance Embedded Computing Workshop, 2010.
[5] J. Brock, M. Leeser, M. Niedre, Portable application framework for heterogeneous systems, 2010, http://www.coe.neu.edu/Research/rcl/projects/tasks_and_conduits.php.
[6] D. Campbell, M. McCans, M. Davis, M. Brinkmann, Using GPU VSIPL and CUDA to accelerate RF clutter simulation, in: Symposium on Application Accelerators in High Performance Computing, SAAHPC, 2010.
[7] Codesourcery, Sourcery VSIPL++, 2011, http://www.codesourcery.com/vsiplplusplus/.
[8] A.A. Conti, A hardware/software system for adaptive beamforming, Master's Thesis, Dept. of ECE, Northeastern University, 2006.
[9] Convey Computer, Convey Cluster Framework 1600, 2011, http://www.conveycomputer.com/Resources/ClusterDataSheet.pdf.
[10] S. Craciun, A. George, H. Lam, J. Principe, A parallel hardware architecture for information-theoretic adaptive filtering, in: Proc. of High-Performance Reconfigurable Computing Technology and Applications Workshop, HPRCTA, 2010.
[11] Cray Inc., Cray XD1 Datasheet, 2004, http://www.hpc.unm.edu/∼tlthomas/buildout/Cray_XD1_Datasheet.pdf.
[12] Cray Inc., Cray Legacy Products, 2010, http://www.cray.com/products/Legacy.aspx.
[13] G.F. Diamos, A.R. Kerr, S. Yalamanchili, N. Clark, Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems, in: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT '10, ACM, New York, NY, USA, 2010, pp. 353–364.
[14] M. Franklin, J. Maschmeyer, et al., Auto-Pipe: a pipeline design and evaluation system, in: International Parallel and Distributed Processing Symposium, IPDPS, 2006.
[15] Georgia Tech Research Institute, GPU VSIPL, 2011, http://gpu-vsipl.gtri.gatech.edu/.
[16] Z. Guo, W. Najjar, B. Buyukkurt, Efficient hardware code generation for FPGAs, ACM Trans. Archit. Code Optim. 5 (6) (2008) 1–26.
[17] L. Howes, O. Beckmann, O. Mencer, O. Pell, P. Price, Comparing FPGAs to graphics accelerators and the Playstation 2 using a unified source description, in: International Conference on Field-Programmable Logic, 2006.
[18] HPEC-SI, 2010, http://www.hpec-si.org.
[19] HPEC-SI, Vector Signal Image Processing Library, 2010, http://www.vsipl.org.
[20] HPEC-SI, VSIPL++ Specification 1.02, 2010, http://hpec-si.org/spec-1.02-final.pdf.
[21] Impulse, Impulse Accelerated Technologies, 2011, http://www.impulseaccelerated.com/.
[22] Intel, Intel unveils new product plans for high-performance computing, 2010, http://www.intel.com/pressroom/archive/releases/20100531comp.htm.
[23] Intel, Intel Microarchitecture Codename Sandy Bridge, 2011, http://www.intel.com/technology/architecture-silicon/2ndgen/.
[24] Intel, Intel QuickAssist technology, 2011, http://www.intel.com/technology/platforms/quickassist/.
[25] Khronos Group, OpenCL overview, 2011, http://www.khronos.org/opencl/.
[26] J. Lebak, J. Kepner, H. Hoffmann, E. Rutledge, Parallel VSIPL++: an open standard software library for high-performance parallel signal processing, Proceedings of the IEEE 93 (2) (2005) 313–330.
[27] M.D. Linderman, J.D. Collins, H. Wang, T.H. Meng, Merge: a programming model for heterogeneous multi-core systems, in: Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XIII, ACM, New York, NY, USA, 2008, pp. 287–296.
[28] Mercury Computer Systems, Inc., 2010, http://www.mc.com/.
[29] S. Mohindra, J. Daly, R. Haney, G. Schrader, Task and conduit framework for multi-core systems, in: HPCMP Users Group Conference, 2008, pp. 506–513.
[30] N. Moore, A. Conti, M. Leeser, L. Smith King, Writing portable applications that dynamically bind at run time to reconfigurable hardware, in: IEEE Symposium on FPGAs for Custom Computing Machines, FCCM, 2007, pp. 229–238.
[31] N. Moore, A. Conti, L. Smith King, M. Leeser, An extensible framework for application portability between reconfigurable supercomputing architectures, IEEE Computer Magazine (2007) 39–49.
[32] N.J. Moore, VForce: VSIPL++ for reconfigurable computing environments, Master's Thesis, Dept. of ECE, Northeastern University, 2007.
[33] Nallatech, Nallatech will demonstrate deep packet inspection at Supercomputing 2010, 2010, http://www.nallatech.com/Latest-News/nallatech-will-demonstrate-deep-packet-inspection/-fpga-network-accelerator-at-supercomputing-2010.html.
[34] Ohio Supercomputing Center, 2009, http://www.osc.edu.
[35] S. Pai, R. Govindarajan, M.J. Thazhuthaveetil, PLASMA: portable programming for SIMD heterogeneous accelerators, in: Workshop on Language, Compiler, and Architecture Support for GPGPU, 2010.
[36] POSIX.1-2008, IEEE Standard for Information Technology-Portable Operating System Interface (POSIX) Base Specifications, Issue 7, IEEE Std 1003.1-2008 (Revision of IEEE Std 1003.1-2004).
[37] QDR Consortium, QDR: the high bandwidth SRAM family, 2007, http://www.qdrconsortium.org/.
[38] Microsoft Research, Accelerator, 2011, http://research.microsoft.com/en-us/projects/accelerator/.
[39] ROCCC, Riverside optimizing compiler for configurable computing, 2011, http://roccc.cs.ucr.edu/index.php.
[40] The Portland Group, PGI accelerator compilers, 2011, http://www.pgroup.com/resources/accel.htm.
[41] top500.org, Top 500 supercomputers of 2010, 2010, http://www.top500.org/lists/2010/11.
[42] K.H. Tsoi, W. Luk, Axel: a heterogeneous cluster with FPGAs and GPUs, in: Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA '10, ACM, New York, NY, USA, 2010, pp. 115–124.
[43] UF HCS Research Lab, Novo-G cluster, 2010, http://www.hcs.ufl.edu/lab/novog.php.
[44] J.R. Wernsing, G. Stitt, Elastic computing: a framework for transparent, portable, and adaptive multi-core heterogeneous computing, in: LCTES '10: Proceedings of the ACM SIGPLAN/SIGBED 2010 Conference on Languages, Compilers, and Tools for Embedded Systems, ACM, New York, NY, USA, 2010, pp. 115–124.
[45] Xilinx, Inc., Xilinx CORE Generator System Overview, 2005, http://www.xilinx.com/ise/products/coregen_overview.pdf.

Nicholas Moore is a Ph.D. student in the Department of Electrical and Computer Engineering at Northeastern University. His research interests include tools and techniques for GPGPU programming. Moore received a B.S. in electrical engineering from the University of Rochester.


Miriam Leeser is a professor in the Department of Electrical and Computer Engineering and head of the Reconfigurable Computing Laboratory at Northeastern University. Her research interests include reconfigurable computing and computer arithmetic. Leeser received a Ph.D. in computer science from Cambridge University. She is a senior member of the IEEE and the Society of Women Engineers and a member of the ACM.

Laurie Smith King is an associate professor of computer science at the College of the Holy Cross, Worcester, Mass. Her research interests include hardware–software codesign and programming languages. King received a Ph.D. in computer science from the College of William and Mary. She is a member of the IEEE and the ACM.