
Hardware Libraries: An Architecture for Economic Acceleration in Soft Multi-Core Environments

David Meisner
Division of Engineering
Brown University
Providence, RI 02912
david [email protected]

Sherief Reda
Division of Engineering
Brown University
Providence, RI 02912
sherief [email protected]

Abstract

In single processor architectures, computationally-intensive functions are typically accelerated using hardware accelerators, which exploit the concurrency in the function code to achieve a significant speedup over software. The increased design constraints from power density and signal delay have shifted processor architectures in general towards multi-core designs. The migration to multi-core designs introduces the possibility of sharing hardware accelerators between cores. In this paper, we propose the concept of a hardware library, which is a pool of runtime accelerated functions that are accessible by multiple cores. We find that sharing provides significant reductions in the area and logic usage required for hardware acceleration. Contention for these units may exist in certain cases; however, the savings in chip area are appealing to many applications, particularly in the embedded domain. We study the performance implications of our proposal using various multi-core arrangements, with actual implementations in FPGA fabrics. FPGAs are particularly appealing due to their cost effectiveness, and the attained area savings enable designers to easily add functionality without significant chip revision. Our results show that it is possible to save up to 37% of a chip's available logic and interconnect resources at a negligible impact (<3%) on performance.

1 Introduction

Recently there has been a migration from uni-processor designs to multi-core processors. The majority of vendors are quickly turning to this solution in order to solve problems arising from increased spatial integration (e.g., power density, wire delay and complexity). With current technology scaling trends, the number of cores on-chip could quickly approach 80 [11]. The introduction of multi-core designs in processor architectures presents new opportunities and challenges to designers. In transitioning to a multi-core processor, simple replication of previous architectures on a die will not suffice for an efficient design [12, 19]. Accordingly, it becomes prudent to reassess the system's design as a whole, as often one can share a system feature between multiple cores, thus freeing up silicon [13, 16]. The reduction in silicon area through sharing is particularly important in Field Programmable Gate Arrays (FPGAs), which are constrained in resources and where chip area is at a premium. For example, the largest FPGA from Altera (Stratix II EP2S180C3) cost almost ten thousand dollars as of August 2007. Thus any reduction in the silicon area of designs destined for such systems will directly reduce implementation costs.

Our design's inspiration comes from the domain of software engineering, where software libraries are created to provide the same functionality to multiple programs without the need for redundant copies of identical code. This task is relatively simple in software because shared libraries can be mapped into the memory space of multiple processes, where the same instructions are read by the various processes with small overhead. Software libraries have the properties of depth, the ability of libraries to depend on other libraries in a hierarchy, and reusability, the ability of multiple applications to use the same library. It is the latter property that inspires this work.

Reusability of hardware enables designers to save both static power and chip area. When not in use, excess hardware leaks current, contributing to power loss, and should be eliminated if possible. We propose the concept of a shared hardware library. A shared hardware library exists as a pool of hardware resources, or accelerators, that any core can use to accelerate the execution of the processes running on it. Similar to software libraries, redundancy of resources is avoided by a sharing scheme. Unlike software, however, hardware units cannot be simultaneously used by multiple applications. Therefore, depending on the popularity of a shared library, multiple copies may be necessary in order to avoid contention and performance degradation. In order to fully evaluate a sharing scheme for multi-core environments, this paper investigates the following.

  • We propose the concept of a hardware library, a pool of shared accelerators that takes the place of software routines that need acceleration. Access to the library is controlled via a dispatcher that is capable of sharing the accelerators among requesting CPUs without significant overhead.

  • We analyze the performance and chip-economy tradeoffs for different sharing multi-core organizations.

  • We provide a multi-core system that significantly reduces chip logic and interconnect usage (by up to a third) with a negligible impact on performance (less than 3%).

  • We realize a working system using Altera's Cyclone II FPGAs and the NIOS II soft-processor.

The structure of this paper is as follows. Section 2 discusses the relevant previous work. In Section 3, we describe our hardware library proposal; we provide our design for a sharing system architecture and the models we used to analyze its performance and implementation costs. In Section 4, we provide the experimental details of our implementation prototype. Finally, Section 5 provides the overall conclusions of our work.

2 Previous Work

When processors are embedded in FPGA fabrics, there are two possible options: either one can have a microprocessor permanently built into the silicon, in what is called a hard processor, or it can be synthesized out of reconfigurable logic organized in combinational logic blocks (CLBs). The latter option, creating a soft processor out of configurable logic, is the focus of this investigation. Soft processors possess numerous advantages over hard processors. In theory, one can have complete control over the design of the processor, from the pipeline through the instruction set to the coprocessors. In practice, and particularly in research environments, it is not worth the investment of time to build up a processor from scratch. Furthermore, reducing the time to realize one's hardware can help bridge the gap between hardware and software designers [18]. Examples of soft processors include Altera's Nios II and Xilinx's MicroBlaze [1, 2]. Open source options are also available, possessing the advantages of zero engineering cost and customizability [3].

There have been recent efforts to optimize the implementation of multiple soft cores in FPGAs. In particular, one important obstacle is making the performance of FPGAs comparable to ASICs despite their tendency to consume more area and run at slower clock speeds. Trimming unneeded hardware by Instruction Set Architecture (ISA) subsetting for a given application is one method of reducing power [20]. Another strategy is to reuse hardware that is shared between soft processors. In contrast to our work, previous investigations have focused on small-scale sharing methodologies, such as sharing internal CPU units, e.g. multipliers, between a couple of processors [17]. Another recent suggestion proposes that reconfigurable logic can be mutated by runtime profilers to accelerate and map the binaries executed by a couple of CPUs [13]. This scheme has the overhead of a profiling and mapping system and requires dynamically reconfigurable FPGA fabric.

Many applications require hardware acceleration in order to meet their performance requirements. For instance, many signal processing applications that require an FFT cannot run in real time when implemented in software alone. Also, many applications in scientific computing require the evaluation of numerically complex equations or systems. Implementing a specific function in hardware cuts down the runtime drastically while maintaining the general structure and execution flow of a general-purpose CPU.

The field of biotechnology is driving a demand for increased computational throughput. In particular, DNA sequence alignment and protein identification from mass spectrometry data [5] are becoming computational bottlenecks. The sheer amount of sequencing and mass spectrometry data overwhelms general-purpose processors. Custom hardware acceleration can greatly increase the rate at which the data can be analyzed [15]. Automation of genomic and proteomic sequencing tasks typically uses some form of the edit distance algorithm, which efficiently computes the smallest number of changes (e.g., shifts, insertions and deletions) needed to align two strings [14].

The edit distance algorithm relies on dynamic programming, where the overall problem is broken into smaller subproblems that are later combined to give the overall solution [9]. A general framework for the edit-distance function is given in Algorithm 1. The critical section of code is the pair of nested for loops that traverse a two-dimensional matrix representing the possible alignment scenarios. Given an n×n matrix, this algorithm runs in O(n^2) time because it has two nested loops of length n. With hardware acceleration, it is possible to exploit the inherent concurrency to run the algorithm in just O(n) time [8, 21]. This gives an impressive speedup over software alone.

Algorithm 1 Edit Distance
Input: String A of length N, String B of length M
Output: The edit distance of the two strings
Let E be an M x N matrix
for i = 0..M do
    E[i,0] = 0
end for
for j = 0..N do
    E[0,j] = 0
end for
for i = 0..M do
    for j = 0..N do
        E[i,j] = min( E[i-1,j] + 1, E[i,j-1] + 1, match(A[i],B[j]) + E[i-1,j-1] )
    end for
end for
return E[M,N]

function match(x, y)
    if x equals y then
        return 0
    else
        return 1
    end if
end function

Consequently, we use an accelerated version of the edit distance as the backbone of a shared hardware library embedded in a multi-core environment that is suitable for accelerating computationally-critical genomic and proteomic tasks. Our proposed concepts apply to any general multi-core system that requires hardware acceleration.
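For concreteness, a minimal software realization of Algorithm 1 in C is sketched below (an illustration, not the accelerator itself); it keeps the pseudocode's zero boundary initialization and unit mismatch cost, and the accelerated version obtains its O(n) runtime by evaluating the anti-diagonals of this recurrence in parallel [8, 21].

/* Software sketch of Algorithm 1 (zero boundary, unit mismatch cost). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int match(char x, char y) { return (x == y) ? 0 : 1; }

static int min3(int a, int b, int c) {
    int m = (a < b) ? a : b;
    return (c < m) ? c : m;
}

/* O(M*N) dynamic program; A has length N (indexed by j), B has length M
 * (indexed by i), and E is stored row-major as E[i*(N+1)+j]. */
int edit_distance(const char *A, const char *B) {
    size_t N = strlen(A), M = strlen(B);
    int *E = calloc((M + 1) * (N + 1), sizeof(int)); /* boundary rows are 0 */
    if (!E) return -1;
    for (size_t i = 1; i <= M; i++) {
        for (size_t j = 1; j <= N; j++) {
            E[i * (N + 1) + j] = min3(E[(i - 1) * (N + 1) + j] + 1,
                                      E[i * (N + 1) + j - 1] + 1,
                                      E[(i - 1) * (N + 1) + j - 1] +
                                          match(A[j - 1], B[i - 1]));
        }
    }
    int result = E[M * (N + 1) + N];
    free(E);
    return result;
}

int main(void) {
    printf("%d\n", edit_distance("GATTACA", "GCATGCU"));  /* example strings */
    return 0;
}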

3 System Design with Hardware Libraries

Our proposed system architecture is centered around multiple CPU cores accessing a common shared hardware library. This library contains multiple "functions" that are realized as hardware accelerators. A dispatcher unit mediates requests and data to and from each core. For example, Figure 1 gives the overall organization (showing only one CPU), where the dispatcher unit "sits" between the requesting CPUs and the shared library. A CPU accesses the library by calling custom functions or instructions, just as in software, but in this case the function calls are trapped and sent to the dispatcher. The dispatcher checks whether the library is available to execute the requested function. If the library is busy with a previous request, the dispatcher inserts the call into a priority queue, so that the function call is executed once the hardware library is available. Note that the hardware library should be able to access memory through the system bus, which gives it the ability to receive function calls with pointers as arguments.
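For illustration, the following behavioral sketch in C models the dispatcher's request handling under simplifying assumptions (a FIFO in place of the priority queue, a single accelerator, and illustrative type and function names); it is not the RTL used in our prototype.

/* Behavioral sketch of dispatcher arbitration (assumptions: FIFO policy,
 * one accelerator, illustrative names). */
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 8

typedef struct {
    int core_id;         /* requesting CPU */
    int func_id;         /* which accelerated function (e.g., edit distance) */
    unsigned operand_a;  /* 32-bit operands, as on the custom-instruction port */
    unsigned operand_b;
} Request;

typedef struct {
    Request queue[MAX_PENDING];
    size_t head, count;
    bool accelerator_busy;
} Dispatcher;

/* A core issues a library call: grant immediately or queue until free. */
static bool dispatch(Dispatcher *d, Request r) {
    if (!d->accelerator_busy) {
        d->accelerator_busy = true;   /* operands drive the accelerator now */
        (void)r;
        return true;                  /* caller proceeds without stalling */
    }
    if (d->count == MAX_PENDING)
        return false;                 /* queue full: caller must retry/stall */
    d->queue[(d->head + d->count) % MAX_PENDING] = r;
    d->count++;
    return true;                      /* queued; caller stalls until served */
}

/* The accelerator finished: hand it to the oldest queued request, if any. */
static void on_accelerator_done(Dispatcher *d) {
    if (d->count == 0) {
        d->accelerator_busy = false;  /* library becomes idle */
    } else {
        Request next = d->queue[d->head];
        d->head = (d->head + 1) % MAX_PENDING;
        d->count--;
        (void)next;                   /* its operands would now be driven
                                         onto the accelerator's ports */
    }
}

int main(void) {
    Dispatcher d = {0};
    Request r1 = {0, 0, 5u, 9u}, r2 = {1, 0, 2u, 8u};
    dispatch(&d, r1);             /* granted immediately */
    dispatch(&d, r2);             /* queued behind r1 */
    on_accelerator_done(&d);      /* r2 is now granted */
    on_accelerator_done(&d);      /* library becomes idle */
    return 0;
}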

Figure 1. System architecture: each CPU connects through its custom-instruction interface (32-bit Operand_A, Operand_B, Return, done, clk, clk_en signals) and the dispatch bus to the dispatch unit, which arbitrates access to the hardware function pool (edit distance unit, FFT unit, RSA encryptor); an optional Avalon interface and slave port connect the pool to memory, and the system is assembled with SOPC Builder.

Impact of Sharing on Silicon Real Estate. One of the primary goals of our investigation is to see how the system's sharing ratio affects the overall implementation requirements, where the sharing ratio is defined as the ratio of the number of CPUs to the number of accelerators in the system. The main goal of sharing is to reduce the logic needed to synthesize the system. Logic usage is directly related to the silicon real estate used by the design; decreasing the logic requirement opens the possibility of adding extra functionality, reducing leakage power dissipation, or using a smaller device to house the design. Figure 2 shows a number of possible sharing organizations for a six-core system, where the sharing ratio is equal to 6, 3, 2, and 1. Note that one would essentially like the dispatcher to coordinate calls to multiple accelerators at the same time. Such a scheme will likely improve system performance, at the expense of a more complicated dispatcher design.

The complexity and number of interconnections needed to realize a design is also an important factor when sharing is considered. FPGAs have various forms of interconnect that can connect different regions of the chip. It is possible for a design's logic to be complex enough to create congestion that limits routability. Thus, one must consider the interconnect implications of such a sharing mechanism.

Impact of Sharing on Performance. The main performance degradation caused by sharing an accelerator is core stalling, which arises when one core already has a lock on the library and another core requests it.


Figure 2. Various sharing configurations for a six-core system: sharing ratios of 6 to 6, 6 to 3, 6 to 2, and 6 to 1, with one dispatcher per shared library.

Therefore it is informative to analyze the trends in stalling for different sharing configurations and traffic generation rates. Two factors directly impact the amount of stalling: (1) the sharing ratio and (2) the rate at which requests are generated by the cores to the library.

A metric for the raw performance of a program on a microprocessor is its execution time. Since the clock frequency remains constant in a design, the number of cycles is directly proportional to the time it takes to execute a program. Another important performance metric of the system is the utilization of the accelerator. The utilization is defined as the total time the accelerator is busy divided by the total time the system is running:

U = time_busy / time_total = (#requests × time_request) / time_total

Hence, if the hardware library receives many requests over a "small" period of time, it is heavily utilized. Conversely, if over a "long" time span the accelerator barely receives any requests, it is lightly utilized. As the sharing ratio increases, stalling becomes more likely and library utilization rises, and vice versa. Furthermore, for a given sharing ratio, an increase in the rate of requests from the cores is likely to increase stalling and library utilization. A key question is: given a set of applications, what is the best sharing ratio that leads to the largest savings in silicon real estate at a negligible penalty (say, <3%) to performance as measured by the execution time? To answer this question, we resort to queueing theory.

Predicting the Performance Using Queuing Theory. Queuing theory models the flow of traffic through servers that service customers (for example, customers waiting in a checkout line). In computing, queuing theory has been applied to analyze the scenario of a CPU accessing multiple devices such as hard disk drives [10].

Figure 3. Exponential probability density function with λ = 0.01.

For the proposed sharing systems, however, we must model one or more hardware libraries shared by multiple cores. Using queueing theory, the probability that an individual core i waits xi cycles before issuing a request for the library is given by the exponential probability density function below, where λi is the average issue rate of core i, or alternatively, the average arrival rate at the accelerator from CPU i.

f(xi, λi) = λi e^(−λi xi) for xi ≥ 0, and 0 otherwise.

Note that the expected value of xi is equal to 1/λi. Figure 3 plots an example of an exponential distribution for 1/λi = 100. Depending on the processes running on the different cores, the cores might have different average issuing rates. The mixture of a number of exponential distributions with different λi leads to an overall gamma distribution. Upon arrival at the dispatcher, each request or function call is dispatched once the library is available.
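The following discrete-time simulation sketch in C reflects this model under simplifying assumptions (one shared accelerator, equal per-core issue rates, illustrative constants): each core draws exponentially distributed gaps between requests, service is serialized on the accelerator, and stall cycles and utilization are tallied.

/* Simplified simulation sketch (not the authors' simulator): m cores
 * sharing one accelerator with exponential inter-request times. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define CORES    6
#define LAMBDA   0.01     /* average issue rate per core (requests/cycle) */
#define SERVICE  16       /* accelerator latency in cycles */
#define TOTAL    1000000  /* simulated cycles */

static long exp_wait(double lambda) {
    double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0); /* u in (0,1) */
    return (long)(-log(u) / lambda);                      /* inverse-CDF sample */
}

int main(void) {
    long next_request[CORES], stall_cycles = 0, busy_cycles = 0;
    long busy_until = 0;  /* cycle at which the accelerator becomes free */
    for (int i = 0; i < CORES; i++) next_request[i] = exp_wait(LAMBDA);

    for (long t = 0; t < TOTAL; t++) {
        for (int i = 0; i < CORES; i++) {
            if (next_request[i] != t) continue;
            long start = (busy_until > t) ? busy_until : t;  /* wait if busy */
            stall_cycles += start - t;                       /* core stalls until granted */
            busy_until = start + SERVICE;
            busy_cycles += SERVICE;
            next_request[i] = busy_until + exp_wait(LAMBDA); /* next call after return */
        }
    }
    printf("utilization = %.1f%%, total stall cycles = %ld\n",
           100.0 * busy_cycles / TOTAL, stall_cycles);
    return 0;
}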

Performance Upper Bound. We determine a simple upper bound on the impact of sharing on performance. Given a collection of m cores, each running a process, the worst case occurs when all m cores request access to a shared hardware library at the same time. In this case, the requests must be serialized. If each core makes a total of N requests and the library takes t cycles to complete a request, then the worst-case system runtime, from the perspective of the last core to be serviced, is

Cycles_Hardware = m × t × N.

On the other hand, if one were to execute the requests in software instead (with no hardware library acceleration), the running time is determined by a single core. In this case, if a request takes k cycles to execute in software, the total running time becomes


Cycles_Software = k × N.

To gain any advantage from acceleration through hardware libraries, Cycles_Hardware must be less than Cycles_Software, which translates to

m < k/t.

Note that k/t is essentially the speedup gained by hardware acceleration. Thus there is essentially no gain in performance if the number of requesting cores is larger than or equal to the speedup attained by the acceleration. For example, if the hardware library offers a 6× speedup over software, then the number of cores accessing the library should be limited to at most 5 to avoid the worst-case scenario. This motivates us to consider hybrid software/hardware library schemes.

Hybrid Software/Hardware Libraries. A possible hybrid scheme is to have the shared library in both software and hardware forms. When a core requests a hardware library for acceleration, the dispatcher examines the number of cores waiting for the library. If the number of waiting cores is less than the acceleration speedup offered by the library, then the request is queued for later acceleration; otherwise, the dispatcher returns to the calling core with a special return value or through an interrupt mechanism. The return value instructs the calling core to use the software version of the library instead. In principle, the hybrid scheme has no software engineering overhead because the function is written in C (or any other language) before being compiled to hardware. The only penalties for such a feature are the increased dispatcher complexity and the larger software code size.
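A sketch of such a hybrid call path is shown below; the function names, the stubbed dispatcher call, and the special return value are illustrative placeholders rather than an existing API.

/* Sketch of the hybrid call path. The hardware-side call is stubbed out;
 * on real hardware it would trap to the dispatcher via a custom instruction. */
#include <stdio.h>

#define HW_LIB_BUSY (-1)  /* assumed "declined" return value; a real design
                             would use a separate status flag instead */

/* Stub standing in for the dispatcher path (always declines here). */
static int hw_dispatch_or_decline(unsigned a, unsigned b) {
    (void)a; (void)b;
    return HW_LIB_BUSY;
}

/* Stand-in for the compiled C version of the library function. */
static int sw_edit_distance(unsigned a, unsigned b) {
    return (int)(a > b ? a - b : b - a);  /* placeholder computation */
}

int edit_distance_hybrid(unsigned a, unsigned b) {
    int r = hw_dispatch_or_decline(a, b);   /* try the hardware library first */
    if (r == HW_LIB_BUSY)
        return sw_edit_distance(a, b);      /* too many waiters: run in software */
    return r;                               /* accelerated result */
}

int main(void) {
    printf("%d\n", edit_distance_hybrid(7, 3));  /* falls back to software here */
    return 0;
}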

4 Implementation and Experimental Results

Our system is realized using an Altera Cyclone II FPGA device, manufactured in 90nm technology with 35K logic blocks. We use the NIOS II soft-core RISC processor for our system implementations. The NIOS II processor possesses a custom instruction interface, which we use to connect the dispatcher to the NIOS II cores [7]. In the prototype, all data to and from main memory goes through the CPU bus; this data is then fed to the custom instruction and exported to the accelerator, as shown in Figure 1. Each accelerator unit reports to a single dispatcher, which can service numerous cores as discussed previously. We used Altera's SOPC Builder and Quartus II to implement the various software and hardware components of the system. As indicated earlier, we use the edit-distance function as the backbone of our hardware library, which is embedded in a system designed for accelerating genomic and proteomic computations.
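From software, a call into the hardware library reduces to a custom-instruction invocation; the sketch below assumes the Nios II GCC toolchain's custom-instruction built-in, with the opcode value shown as a placeholder for the index assigned by SOPC Builder.

/* Issuing a library call from C on a NIOS II core (requires the Nios II
 * GCC toolchain). The built-in passes two 32-bit operands to the dispatch
 * unit and blocks until "done" is asserted. */
#define EDIT_DIST_OPCODE 0  /* hypothetical: the actual index comes from system.h */

static inline int hw_edit_distance(unsigned operand_a, unsigned operand_b)
{
    /* __builtin_custom_inii: int result, two int operands (Nios II convention) */
    return __builtin_custom_inii(EDIT_DIST_OPCODE, (int)operand_a, (int)operand_b);
}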

Hardware libraries can be created either through hardware description languages (HDLs) or through software-based languages such as SystemC. The latter option makes shared hardware libraries a natural extension of software libraries. In recent years, there has been an increase in commercial tools that directly translate a synthesizable subset of C code to hardware. These include, for example, Altera's C-to-HDL tool and Celoxica's Agility compiler [4, 6]. These tools are currently "captive" to the single-core computing paradigm. Our proposed idea of a shared hardware library allows current and future tools to migrate to multi-core designs, with graceful demand on the limited FPGA silicon real estate.

We implement our six-core system for the different sharing configurations given in Figure 2. The hardware library takes 16 cycles to finish its computation. We note that our prototype currently allows function calls using direct operands. The system can be extended to accept pointer arguments, which would be used to access the memory system via Altera's Avalon bus architecture. Future implementations can include such access; the only downside would be contention on the data bus with other cores accessing it as well. We now study the implications of our system for both silicon real estate and performance.

Logic Usage and Chip Area. Figure 4 shows the trends in logic usage for the various sharing configurations. Clearly, the higher the sharing ratio, the less logic is needed to obtain the same functionality. Remarkably, the six-to-one configuration, which has the highest sharing ratio, utilizes nearly one third less of the total logic compared to no sharing at all (the six-to-six configuration). In fact, the six-to-six configuration nearly uses all of the Cyclone II device's available logic.

Figure 5 shows the floorplan of the FPGA for the six-to-three configuration. Each small rectangle represents a CLB region, and darker blue coloring indicates denser logic usage. The magenta regions represent the area used by the dispatcher units. Clearly, the dispatcher and accelerator regions take up a significant amount of space. Increasing the sharing ratio would eliminate up to one of the regions colored in magenta.

Interconnect Complexity. The percentage of interconnect used increases as the sharing ratio decreases, as shown in Figure 6. This trend is logical: the lower ratios require more logic, and this increase implies more connections between logic blocks. Although a dispatcher servicing six accelerators requires more complicated routing, it still uses less interconnect than six copies of a simpler design.


Figure 4. Silicon logic usage (logic elements, combinational functions, and dedicated registers) for the different configuration ratios in a six-core system.

Figure 5. Floorplan for the six-to-three configuration.


Impact on Performance. We run our simulations while varying the sharing ratio and the average request rate λ. In the first experiment, we assess the impact of these two parameters on stalling, as given in Figure 7. As would be expected, as the number of cores increases, so does the average stalling. Likewise, an increase in the average rate of request generation also increases stalling. However, the number of stalls does not increase linearly with either of these quantities. One can observe that increasing the number of cores drastically increases the stalling time for any value of λ (especially for higher values of λ). Also, increasing λ for a given number of cores has a significant negative effect. These trends indicate that continuing to increase the core count or the rate of requests can lead to a major slowdown.

Figure 6. Interconnect usage (block, local, C4, and R4 resources) for the different sharing configurations.

In the second experiment, we assess the impact of the sharing ratio and the average issue rate on the execution time in cycles, as shown in Figure 8. The execution time increases with the number of cores for all issuing rates; however, the magnitude of the increase depends on the issuing rate, and smaller rates tolerate a greater increase in the number of cores.

Figure 9 shows the impact of the request arrival rate (λ) and the number of cores on the utilization of the accelerator. Clearly, more cores yield a higher utilization percentage, and likewise a higher rate of requests increases utilization. At a certain point, too many cores in the system or too high a request rate will completely saturate the unit, so that it is always busy throughout the execution. This scenario is unwanted and will probably require a lower sharing ratio. Ideally, one would want a utilization of just under 100%: the accelerator is then used as much as it can be, but is not so busy that it causes mass stalling.

Figure 7. Average stall time as a function of the sharing ratio and the average issue rate.


Configuration   Logic Savings   Interconnect Savings   Performance Penalty
                                                       λ = 0.006   λ = 0.013   λ = 0.021
Six to one      33.72%          37.27%                 1.98%       16.76%      48.95%
Six to two      26.09%          28.77%                 0.03%       2.93%       8.23%
Six to three    19.53%          21.35%                 0.00%       0.99%       2.85%
Six to six      0.00%           0.00%                  0.00%       0.00%       0.00%

Table 1. Tradeoff between silicon real-estate savings and performance penalty for different multi-core configurations. The issue rate (IR) is equal to λ.

Figure 8. Application execution time (cycles) versus the number of cores, for λ = 0.01, 0.05, and 0.1.


Table 1 summarizes the implications of sharing on logic, interconnect, and performance. We report the average issue rate (λ) above which the performance penalty, as measured by the execution time, is greater than 3%. The results give a Pareto frontier that allows an effective tradeoff between savings in silicon real estate and multi-core system performance.

Figure 9. Accelerator utilization as a function of the number of cores and λ.

Depending on the processes' average issuing rates, it is possible to save up to 37% in silicon real estate at only a negligible (<3%) penalty to performance.

5 Conclusion

In the sub-100nm era, designers are no longer able to continue the trend of monotonically increasing clock speeds to improve performance; nevertheless, transistor feature sizes continue to shrink. In order to utilize the available area, designers incorporate multiple cores on the die. At the same time, FPGAs have risen to prominence for their ability to rapidly enable product development without the soaring cost of custom fabrication. However, designs implemented in FPGAs tend to consume more area and run at slower speeds than designs implemented in ASICs. For applications where performance is absolutely the dominant factor, ASICs perhaps remain the more viable option. However, when considerations such as power, engineering cost, and device size become important, the proposed sharing method enhances the abilities of FPGAs.

We have proposed the concept of a shared hardware library. This library consists of hardware accelerators that run commonly used algorithms faster than is possible in software alone. Like software libraries, these units are shared between multiple running programs. However, because hardware cannot be replicated without cost, one tries to minimize the number of copies required for a given application.

The main goal of a hardware library is to distribute and share multiple hardware accelerators among many cores. By developing a set of guidelines and a general design flow for such a library, engineers can more effectively utilize the available resources in a design to execute their accelerated functions. Such a hardware library would be highly desirable in many systems. In computationally intensive applications, software is often not enough and hardware accelerators are necessary. Moreover, embedded systems in particular have limited chip area yet still require speed in specialized functions. A digital signal processing unit, for example, could have separate threads running on multiple cores that access an FFT unit when necessary. An FFT unit is costly in terms of logic, and if multiple processes can share one unit, there will be major savings in area with little impact on performance when the system is carefully designed.


When area is a precious resource, as in embedded systems, such a setup becomes an attractive option.

The need to minimize the number of copies of the library naturally lends itself to sharing between cores. By creating a dispatching unit, multiple cores can use the same accelerator in turns. Because only one core may use the accelerator at a time, a performance penalty is incurred if a core needs the accelerator while another core holds the lock. We have shown that one can increase the number of cores sharing an accelerator library only to a certain point, after which the performance penalty becomes significant. However, with a reasonable number of cores, one gains significant logic and area savings. This reduces power consumption and leaves more room for additional functionality.

Future revisions of this shared hardware library will most likely focus on enhancing the performance of the system. While there are clear advantages in terms of chip economy, performance does degrade slightly under nominal operation. The source of the slowdown is cores needing to stall when another core holds a lock on the accelerator they wish to use; in the current implementation, these cores simply stall until the lock is released.

Acknowledgments. The authors would like to thank Altera Corporation for their generous donation of FPGA devices and design software. In particular, the authors are grateful to Mike Hutton, Mike Phipps and Steven Brown from Altera's university support program.

References

[1] [Online]. Available: http://www.altera.com.

[2] [Online]. Available: http://www.xilinx.com.

[3] [Online]. Available: http://www.opencores.org.

[4] [Online]. Available: http://www.celoxica.com.

[5] A. T. Alex, M. Dumontier, J. S. Rose, and C. W. V. Hogue, "Hardware-Software Protein Identification for Mass Spectrometry," Rapid Communications in Mass Spectrometry, vol. 19, pp. 833–837, 2005.

[6] Altera, "Automated Generation of Hardware Accelerators With Direct Memory Access From ANSI/ISO Standard C Functions," Tech. Rep., May 2006.

[7] NIOS II Processor Reference Handbook, Altera, San Jose, 2006.

[8] T. V. Court and M. C. Herbordt, "Families of FPGA-Based Algorithms for Approximate String Matching," in ASAP '04: Proceedings of the 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors. Washington, DC, USA: IEEE Computer Society, 2004, pp. 354–364.

[9] S. Dasgupta, C. Papadimitriou, and U. Vazirani, Algorithms. University of California, May 2006.

[10] P. J. Denning and J. P. Buzen, "The Operational Analysis of Queueing Network Models," ACM Comput. Surv., vol. 10, no. 3, pp. 225–261, 1978.

[11] S. Gaudin, "Intel builds 80-core chip," EE Times, 2007.

[12] Y. Jin, N. Satish, K. Ravindran, and K. Keutzer, "An Automated Exploration Framework for FPGA-based Soft Multiprocessor Systems," in CODES+ISSS '05: Proceedings of the 3rd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, 2005, pp. 273–278.

[13] R. Lysecky and F. Vahid, "A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning," in DATE '05: Proceedings of the Conference on Design, Automation and Test in Europe. Washington, DC, USA: IEEE Computer Society, 2005, pp. 18–23.

[14] A. Marzal and E. Vidal, "Computation of Normalized Edit Distance and Applications," IEEE Trans. Pattern Anal. Mach. Intell., vol. 15, no. 9, pp. 926–932, 1993.

[15] L. Roberts, "New Chip May Speed Genome Analysis," Science, vol. 244, pp. 655–656, May 1989.

[16] D. Sheldon, R. Kumar, R. Lysecky, F. Vahid, and D. Tullsen, "Application-specific customization of parameterized FPGA soft-core processors," in ICCAD '06: Proceedings of the 2006 IEEE/ACM International Conference on Computer-Aided Design. New York, NY, USA: ACM Press, 2006, pp. 261–268.

[17] D. Sheldon, R. Kumar, F. Vahid, D. Tullsen, and R. Lysecky, "Conjoining soft-core FPGA processors," in ICCAD '06: Proceedings of the 2006 IEEE/ACM International Conference on Computer-Aided Design. New York, NY, USA: ACM Press, 2006, pp. 694–701.

[18] J. Wawrzynek, M. Oskin, C. Kozyrakis, D. Chiou, D. A. Patterson, S.-L. Lu, J. C. Hoe, and K. Asanovic, "RAMP: A Research Accelerator for Multiple Processors," Electrical Engineering and Computer Sciences, University of California at Berkeley, Tech. Rep., February 2006.

[19] P. Yiannacouras, J. Rose, and J. G. Steffan, "The Microarchitecture of FPGA-based Soft Processors," in CASES '05: Proceedings of the 2005 International Conference on Compilers, Architectures and Synthesis for Embedded Systems. New York, NY, USA: ACM Press, 2005, pp. 202–212.

[20] P. Yiannacouras, J. G. Steffan, and J. Rose, "Application-specific customization of soft processor microarchitecture," in FPGA '06: Proceedings of the 2006 ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM Press, 2006, pp. 201–210.

[21] C. Yu, K. Kwong, K. Lee, and P. Leong, "A Smith-Waterman Systolic Cell," in Proceedings of the Tenth International Workshop on Field Programmable Logic and Applications (FPL'03), 2003, pp. 375–384.