
Dynamic Memory Management in Vivado-HLS for Scalable Many-Accelerator Architectures

D. Diamantopoulos, S. Xydis, K. Siozios and D. Soudris

School of Computer Engineering, National Technical University of Athens, Greece

Abstract. This paper discusses the incorporation of dynamic memory management during High-Level Synthesis (HLS) for effective resource utilization in many-accelerator architectures targeting FPGA devices. We show that in today's FPGA devices, the main limiting factor in scaling the number of accelerators is the starvation of the available on-chip memory. For many-accelerator architectures, this leads to severe inefficiencies, i.e. memory-induced under-utilization of the rest of the FPGA's resources. Recognizing that static memory allocation – the de-facto mechanism supported by modern design techniques and synthesis tools – forms the main source of this resource under-utilization, we introduce the DMM-HLS framework, which extends conventional HLS with dynamic memory allocation/deallocation mechanisms incorporated during many-accelerator synthesis. The proposed DMM-HLS framework enables each accelerator to dynamically adapt its allocated memory to the runtime memory requirements, thus maximizing the overall accelerator count through effective sharing of the FPGA's memory resources. We integrated the proposed framework with the industrial-strength Vivado-HLS tool, and we evaluate its effectiveness with a set of key accelerators from emerging application domains. DMM-HLS delivers a significant increase in FPGA accelerator density (3.8× more accelerators) in exchange for affordable overheads in terms of delay and resource count.

1 Introduction

Breaking the exascale barrier [1] has recently been identified as the next big challenge in computing systems. Several studies [2] showed that reaching this goal requires a design paradigm shift towards more aggressive hardware/software co-design architecture solutions. Recently, many-accelerator heterogeneous architectures have been proposed to overcome the utilization/power wall [3–5].

From an electronic design automation (EDA) perspective, high-level synthesis (HLS) tools are expected to play a central role [6] in enabling effective design of many-accelerator computing platforms. By raising the design abstraction layer, designers can quickly evaluate the performance, power, area and cost requirements of differing accelerator configurations, thus providing controllable system specifications with reduced effort over traditional HDL-based development flows.

From a hardware perspective, heterogeneous FPGAs have proven to form an interesting platform solution for many-accelerator architectures. Their inherent flexibility/programmability and the continuous scaling of their hardware density enable the accommodation of several types of hardware accelerators in comparison with conventional ASICs. In [7], it has been shown that FPGAs can potentially provide orders-of-magnitude speedups over conventional processors for compute-intensive algorithms, with greatly reduced power consumption. Similar to the multi-core trend, many parallel datapaths can be built and programmed into the FPGA to increase performance.

Fig. 1: Memory bottleneck of the Kmeans clustering algorithm, with Ai accelerators, Np points, Pk = 3 clusters and point values Vj = Rand(0, 1000): (a) workload scalability (Ai = 1, Np = [2×10^4 : 2× : 256×10^4]), (b) accelerator scalability (Ai = [1 : 2× : 128], Np = 2×10^4).

However, in such a diverse and large pool of accelerators the memory organization forms a significant performance bottleneck, thus a carefully designed memory subsystem is required in order to keep the accelerator datapaths busy [8], [9]. In [8], the authors propose a many-accelerator memory organization that statically shares the address space between active accelerators. Similarly, in [9] a many-accelerator architectural template is proposed that enables the reuse of accelerator memory resources by adopting a non-uniform cache architecture (NUCA) scheme. Both [8] and [9] target ASIC-like many-accelerator systems and mainly focus on the performance implications of the memory subsystem.

In this paper, we extend the scope of the aforementioned research activities by showing that the memory subsystem is not only a performance bottleneck but rather forms the main limiting factor for the scalability of such many-accelerator architectures on FPGA devices. Specifically, we show that in modern high-end FPGA devices, the main limiting factor in scaling the number of accelerators is the starvation of the available on-chip memory. This leads to severe resource under-utilization of the FPGA. In fact, modern FPGA design tools (at both the RTL and HLS level) allow only static memory allocation, which dictates the reservation of the maximum memory that an application needs for the entire execution window. While static allocation works fine for a limited number of accelerators, it does not scale to a many-accelerator design paradigm. Figure 1 shows a study of the resource demands when scaling (i) the dataset workload and (ii) the number of parallel accelerators implementing the Kmeans clustering algorithm on the Virtex Ultrascale XVCU190 FPGA device, i.e. the FPGA with the largest on-chip block RAM (BRAM) storage, with a total size of 132.9 Mb. It is clearly shown that in both cases the BRAM memory is the resource that saturates faster than the rest of the FPGA resource types (FFs, LUTs, DSPs), thus forming the main limiting factor for higher accelerator densities as well as generating large fractions of under-utilized resources.

In this paper, we propose to alleviate the "resource under-utilization" problem by eliminating the pessimistic memory allocation forced by static approaches. Specifically, we propose the adoption of dynamic memory management (DMM) during HLS for the design of many-accelerator FPGA-based systems, enabling each accelerator to dynamically adapt its allocated memory to its runtime memory requirements.

To the best of our knowledge, only [10] and [11] have studied dynamic memory allocation for high-level synthesis. However, they do not target many-accelerator systems, supporting only fixed-size intra-accelerator allocations. We introduce the DMM-HLS framework that (i) extends typical HLS with DMM mechanisms and (ii) provides API function calls, similar to the malloc/free API of glibc, which enable source-to-source modifications that transform statically allocated code into dynamically allocated code. We extensively evaluated the effectiveness of the proposed DMM-HLS framework over several many-accelerator architectures for representative applications of emerging computing domains. We show that DMM-HLS delivers on average a 3.8× increase in accelerator count in comparison to typical HLS with static memory allocation, leading in all cases to more uniform resource allocation and better scalability.

2 On the memory-induced resource under-utilization in many-accelerator FPGAs

In this section, we further analyze the memory-induced resource under-utilization problem found in many-accelerator FPGAs. We show that the on-chip BRAM memory size is the resource that exhibits the most severe limitation regarding accelerator count scaling. We further strengthen this argument by showing that memory-induced resource under-utilization is consistent across several commercial FPGA devices with different platform characteristics. To drive the analysis, we consider a many-accelerator FPGA architecture targeting the Kmeans clustering kernel. Each accelerator implements an instance of the Kmeans kernel with statically allocated memory and is scheduled independently from the rest of the accelerators. The minimization of resource under-utilization in such systems, i.e. the maximization of the accelerator count, directly affects their throughput characteristics [12].

We study the FPGA resource utilization when scaling the number of allocated Kmeans hardware kernels. We consider four types of resources, i.e. BRAMs, look-up tables (LUTs), flip-flops (FFs) and digital signal processing blocks (DSPs). We are interested in identifying which resource types form the most severe scaling bottleneck. Thus, for each of these resources, we scale the number of accelerators up to the point that the specific resource becomes starved, under the ideal assumption that the remaining resources are infinite. In order to characterize accelerator scaling under differing platform configurations, we perform the analysis on three Xilinx FPGA devices [13], i.e. (i) Virtex Ultrascale XVCU190 (highest on-chip memory on the market), (ii) Virtex-7 VC707, XC7VX485T (high-end development board) and (iii) Zynq-7000, ZC706 XC7Z045 (embedded development board).

Fig. 2: Resource under-utilization impact on the Kmeans algorithm: as we move from low- to high-end Xilinx FPGA devices the number of usable resources (e.g. the number of accelerators) increases; however, the portion of unusable resources remains high. (The figure plots, on a logarithmic scale from 1 to 1000, the per-resource Kmeans accelerator limit for BRAMs, DSPs, FFs and LUTs on the Zynq ZC706/XC7Z045, Virtex-7 VC707/XC7VX485T and Ultrascale XVCU190 devices, with annotated resource under-utilization of 98.5%–99.4%, 90.1%–96.9% and 82.6%–96.8%, respectively.)
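For clarity, the per-resource accelerator limit used in this analysis can be stated compactly (the notation is ours, not the paper's): assuming static allocation, so that every accelerator instance consumes a fixed amount u_r of each resource type r on a device with capacity C_r,

\[ N_{\max}^{r} = \left\lfloor \frac{C_r}{u_r} \right\rfloor, \qquad N_{\max} = \min_{r \in \{\mathrm{BRAM},\,\mathrm{LUT},\,\mathrm{FF},\,\mathrm{DSP}\}} N_{\max}^{r}. \]

Figure 2 reports N_max^r per resource type; on all examined devices BRAM determines N_max.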

Figure 2 depicts the maximum number of accelerators that can be allocated, up to the point of resource starvation. As shown, the on-chip memory (BRAM type) is the resource type that starves fastest, a behavior that is consistent across all the examined FPGA devices. As expected, the size of the on-chip memory has a significant impact on the maximum number of allocated accelerators, ranging from 41 for the Ultrascale down to 1 for the Zynq device. In more detail, the memory-induced resource under-utilization ranges between 98.5%–99.4%, 90.1%–96.9% and 82.6%–96.8% for the Zynq, Virtex-7 and Ultrascale FPGA devices, respectively. The previous analysis indicates that memory-induced resource under-utilization is an existing and critical factor limiting the effective scaling of many-accelerator FPGA designs. We identify static memory allocation, i.e. managing each accelerator's memory based on worst-case memory footprint estimates, as the main source of this resource under-utilization problem, which strongly motivated our DMM-HLS proposal to enable dynamic adaptation of allocated memory according to the accelerator's runtime memory requirements.

3 The DMM-HLS Framework for Many-Accelerator Architectures

This section presents the DMM-HLS framework, which enables dynamic memory management during HLS to alleviate the memory resource starvation problems in FPGA-based many-accelerator architectures. We first present the DMM-HLS flow integrated in the Vivado-HLS tool, along with the architectural implications on the organization of the many-accelerator system, and we then describe the implemented DMM mechanisms.


3.1 Dynamic Memory Management in Vivado-HLS for Many-Accelerator System Synthesis

Figure 3 shows an overview of the proposed DMM-HLS design and verification flow. The flow is based on Xilinx Vivado-HLS, a state-of-the-art and industrial-strength HLS tool. The DMM-HLS extension operates exclusively on the high-level source code of the application, thus keeping the implementation overhead for designers to a minimum. A source-to-source code modification stage is the step where the original code is transformed from statically allocated to dynamically allocated, using specific function calls from the proposed DMM-HLS API. For the targeted many-accelerator systems, the static-to-dynamic allocation code transformation is performed on data structures found in the global scope of the application, as illustrated in the sketch below. The transformed code, augmented with the DMM-HLS function calls, is then synthesized into an RTL implementation through the back-end of the Vivado HLS tool.
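A minimal sketch of this transformation is given below, under our own naming assumptions: the buffer name, sizes, heap index 0 and the header file hls_dmm.h are illustrative and not part of the DMM-HLS distribution; only HlsMalloc/HlsFree and their signatures come from the API described in Section 3.2.

#include <stddef.h>
#include "hls_dmm.h"   /* assumed header declaring the DMM-HLS API (HlsMalloc/HlsFree) */

/* Original, static version (what conventional HLS would synthesize):
 *     int points[20000 * 3];   // worst-case buffer held in BRAM for the whole run
 */

void kmeans_accel(int n_points, int n_dims) {
    /* Transformed version: request exactly the memory this invocation needs,
     * served from heap 0 of the shared on-chip BRAM pool. */
    int *points = (int *)HlsMalloc((size_t)n_points * n_dims * sizeof(int), 0);

    /* ... accelerator computation over points[0 .. n_points*n_dims-1] ... */

    /* Return the space to heap 0 so that other accelerators can reuse it. */
    HlsFree(points, 0);
}

In the static version the worst-case buffer is locked in BRAM for the entire execution window; in the transformed version the same BRAM bytes return to the heap at HlsFree and can be re-allocated by other accelerators.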

Fig. 3: Proposed extension of the Vivado HLS flow to support dynamic memory management for many-accelerator FPGA-based systems. (The standard flow synthesizes high-level C/C++/SystemC code with static allocation, together with a testbench wrapper, synthesis directives for design optimization and the target FPGA technology libraries, into RTL (VHDL/Verilog/SystemC), which is verified through Vivado co-simulation (cycle-accurate execution) and implemented (ISE/EDK/Vivado) into a bitstream; the DMM extension adds a source-to-source code modification step for the DMM-API that produces transformed code with dynamic allocation, synthesized together with the DMM source code.)

The DMM-HLS framework extends the architectural template originally supported in Vivado-HLS by (i) supporting emerging many-accelerator systems and (ii) allowing the on-chip BRAM to be dynamically allocated among accelerators at runtime. The proposed architecture, Fig. 4, is divided into two subsystems: the processor subsystem and the accelerators subsystem. The processor subsystem typically consists of soft or hard IPs, such as the processor (e.g. the ARM A9 in Zynq), executing three main code segments: the application control flow; the computationally non-intensive code kernels; and the code that is executed on the CPU due to the need for high accuracy or a high FPGA implementation cost (e.g. floating-point division). The accelerators subsystem holds the computationally intensive kernels of the application. The accelerators are modeled in a high-level language and synthesized using Vivado-HLS. The on-chip BRAMs are considered as a shared-memory communication link between the processor and the accelerator datapaths. Although we propose the dynamic allocation of on-chip memory, the DMM-HLS framework supports the description of accelerators with both statically and dynamically allocated data stored in BRAMs.


Fig. 4: (a) Proposed architectural template for memory-efficient many-accelerator FPGA-based systems. (b) Free-list organization, using a bit-map array. (In (a), a processor subsystem (host CPU, floating-point unit, off-chip memory, external SATA storage) and an accelerators subsystem with accelerators 1..N, each with a DMM port, communicate over the many-accelerator system interconnection network and address/data buses with DMM allocators 1..M; the on-chip BRAM comprises statically allocated BRAM and dynamically allocated BRAM organized as heaps 1..M, each paired with a FreeBitMap. In (b), for the example "int *A = HlsMalloc(1 * sizeof(int), i); A[0] = DATA;", the addresses of heap #i map to FreeBitMap indices (FrBMi), the first location of the allocation stores the allocation size, and the corresponding bits of FreeBitMap register Reg0 are set, giving Reg0 = 2^5+2^4+2^3+2^2+2^1+2^0 = 63.)

A key performance factor of the system is the ability to feed accelerators with data so that no processing stalls occur due to memory read/write latency. Moving from static to dynamic allocation using an aggressive approach could lead to the allocation of all BRAMs under the same memory module, so that all allocation/deallocation requests occur on this unique module. However, bigger memory banks mean wider wire buses, which impact design closure (clock speed). In addition, throughput is limited by the BRAM port configuration (single or dual port). With hundreds of accelerators, the parallel processing capability becomes highly memory-constrained.

Similar to multi-threaded dynamic memory management [14], [15], DMM-HLS supports fully parallel memory access paths by grouping BRAM modules into unique memory banks, named heaps (Fig. 4). Every heap instantiation has its own DM allocator. The composition of heaps is described in a C header configuration file, enabling new heap instantiations to be easily described. Every heap i is highly parameterizable over a number of design options, the most important of which are (i) the heap depth D_i^H (the total number of unique addresses), (ii) the heap word length L_i^H (the number of bytes of every single word of the heap), (iii) the allocation alignment A_i (the minimum number of bytes per allocation, so that every new allocation starts from a unique address) and (iv) the meta-data header size H_i (the number of bytes reserved in the first address(es) of every allocation to store meta-data related to the allocation, e.g. the allocation length). In a similar manner we define the free depth D_i^F and free word length L_i^F for the FreeBitMap memory.

The proposed DM allocator for HLS supports arbitrary-size allocation/deallocation, i.e. it enables allocation on the same heap of any simple data type (integers, floats, doubles etc.), as well as of more complex data types (structs of the same or combined simple data types, and 1-D arrays). A FreeBitMap structure, i.e. a bit-map free-list holding information about the free space inside the heap, tracks the occupied memory space. FreeBitMap is an array of registers, every bit of which maps to a single byte of the heap. The term "maps" refers to the allocation status (allocated or free) of this byte. Given that D_i^F × L_i^F = (D_i^H × L_i^H)/8, in every case the FreeBitMap's size is 8× smaller than the heap size. Figure 4(b) depicts the organization of FreeBitMap, using the example of the allocation of 1 integer (4 bytes on x86). A sketch of such a heap configuration header is given below.

3.2 DMM-HLS Allocation Mechanisms

The DM allocator can be employed through the DMM API, which is composed of two main function calls for memory allocation and deallocation. We adopt a function-call API similar to glibc's malloc/free:

– void* HlsMalloc(size_t size, uint heap_id)
– void HlsFree(void *ptr, uint heap_id)

where size is the requested allocation size in bytes, heap_id is the identification number of the heap on which the allocation shall occur, and *ptr is the pointer to be freed. HlsMalloc and HlsFree are given in Algorithms 1 and 2, respectively.

ALGORITHM 1: void* HlsMalloc(size_t size, uint heap_id)
Input: size: requested allocation size in bytes
Input: heap_id: the identification number of the heap
Output: generic pointer to the allocated memory address
Data: *HeapArray, *FreeBitMapArray
1 Heap_struct* cur_heap ← HeapArray[heap_id];
2 FreeBitMap_struct* cur_fbm ← FreeBitMapArray[heap_id];
3 int FrBMi ← FirstFit(size + 2, cur_fbm);
4 int* CandidateAddr ← MapAddr(FrBMi, cur_heap);
5 [int* AlignedAddr, int offset] ← Padding(CandidateAddr);
6 MarkBitMap(FrBMi, size + offset + 2, cur_fbm);
7 cur_heap[AlignedAddr] ← size;      // 1st addr: alloc. size
8 cur_heap[AlignedAddr + 1] ← FrBMi; // 2nd addr: FrBMi
9 return (void*) AlignedAddr;

ALGORITHM 2: void HlsFree(void* ptr, uint heap_id)
Input: ptr: the pointer to free
Input: heap_id: the identification number of the heap
Data: *HeapArray, *FreeBitMapArray
1 Heap_struct* cur_heap ← HeapArray[heap_id];
2 FreeBitMap_struct* cur_fbm ← FreeBitMapArray[heap_id];
3 int size ← cur_heap[ptr];
4 int FrBMi ← cur_heap[ptr + 1];
5 UnmarkBitMap(FrBMi, size + offset + 2, cur_fbm);

The allocator currently implements a first-fit algorithm to find a free continuous memory space according to the requested bytes. Since we reserve two extra bins to hold allocation-related meta-data, the first-fit function returns an index of FreeBitMap such that there are (size + 2) continuous free positions (equal to 0) at FreeBitMap[index]–FreeBitMap[index + size + 2] (line 3). Next, the allocator calculates the heap address corresponding to this index (line 4). However, the heap word length L_i^H defines the smallest unit of memory access, i.e. each memory address refers to a word of L_i^H bytes rather than a single byte. Thus, in order to allow any arbitrary allocation size, we implemented an L_i^H-byte aligned memory address access. For that purpose, the mapped address CandidateAddr is padded with extra bytes in order to be aligned (line 5). Next, the corresponding bits in FreeBitMap for the reserved aligned address space are marked (set to 1) (line 6), and finally the meta-data are written to the first two bins (lines 7, 8), while the aligned address is returned.
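For illustration, a minimal software-level sketch of a first-fit search and bit marking over such a byte-granular bit-map is given below, assuming a single heap, L_i^H = 1 and no alignment padding; all names are ours, and the synthesizable DMM-HLS allocator differs (register-array FreeBitMap, HLS directives, the meta-data handling of Algorithm 1).

#include <stdint.h>

#define FBM_BITS 1024                               /* assumed bit-map size: one bit per heap byte */
static uint32_t free_bitmap[FBM_BITS / 32];          /* 0 = free, 1 = allocated */

static int  bit_is_set(int i) { return (free_bitmap[i / 32] >> (i % 32)) & 1u; }
static void set_bit(int i)    { free_bitmap[i / 32] |=  (1u << (i % 32)); }
static void clear_bit(int i)  { free_bitmap[i / 32] &= ~(1u << (i % 32)); }

/* First fit: return the start index of the first run of 'len' consecutive free bits, or -1. */
static int first_fit(int len) {
    int run = 0;
    for (int i = 0; i < FBM_BITS; i++) {
        run = bit_is_set(i) ? 0 : run + 1;
        if (run == len)
            return i - len + 1;                      /* start of the free run */
    }
    return -1;                                       /* request cannot be served */
}

/* Mark (allocate) or unmark (free) 'len' bits starting at 'start'. */
static void mark_bitmap(int start, int len)   { for (int i = 0; i < len; i++) set_bit(start + i); }
static void unmark_bitmap(int start, int len) { for (int i = 0; i < len; i++) clear_bit(start + i); }

Algorithm 1 composes the same primitives: FirstFit(size + 2, ...) locates a free run that also covers the two meta-data bins, and MarkBitMap sets the bits of the padded range.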

The HlsFree function has minimal execution overhead, since the meta-data of every allocation store all the information necessary for deallocation (lines 3, 4). The function exits after clearing the corresponding FreeBitMap bits for the requested pointer (line 5).

Finally, the shared hardware interface of HlsMalloc and HlsFree limits execution parallelism when multiple accelerators request to allocate/free memory simultaneously, even if the accesses affect data across different heaps. To overcome this issue, we apply function inlining of HlsMalloc and HlsFree (using the "#pragma AP inline" preprocessor pragma of Vivado-HLS), which increases resource occupation but allows unconstrained parallel access to the heaps from many accelerators.
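For illustration only, the inlining directive is placed inside the body of the function to be inlined; the skeleton below is a sketch (the actual allocator body is Algorithm 1), and newer Vivado-HLS releases spell the equivalent directive as "#pragma HLS inline".

#include <stddef.h>
typedef unsigned int uint;           /* as used in the DMM-HLS prototypes */

void* HlsMalloc(size_t size, uint heap_id) {
#pragma AP inline                    /* replicate the allocator logic into every calling accelerator */
    /* ... first-fit search, bit-map marking and meta-data writes (Algorithm 1) ... */
    return 0;                        /* placeholder; the real body returns the aligned heap address  */
}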

3.3 DMM-HLS Allocation Runtime Issues

There are three major runtime issues related to DMM in many-accelerator systems, i.e. memory fragmentation, memory coherency and memory access conflicts.

Fragmentation is defined as the amount of memory used by the allocator relative to the amount of memory requested by the program. In many-accelerator systems, we recognize two fragmentation types, alignment and request fragmentation. Alignment fragmentation accounts for the extra bytes reserved to keep every allocation padded to the heap word length L_i^H, including the allocation size and header meta-data information; a short worked example is given below. As long as the size of DMM requests (malloc/free) is a multiple of L_i^H, the alignment fragmentation is zero. Request fragmentation refers to the situation where a memory request skips freed memory blocks in order to find a continuous memory space equal to the size of the request. In the worst-case scenario the request cannot be served, even though there would be enough memory available if the free blocks in the heap were merged. Request fragmentation is strongly dependent on the memory allocation patterns of each accelerator. In the case of a homogeneous many-accelerator system, i.e. where each accelerator allocates the same memory size, request fragmentation is zero. Due to the first-fit policy, the worst request fragmentation appears in many-accelerator systems with heterogeneous memory-size allocation requests, and only in the case of subsequent memory requests where the second allocation requests a larger memory block than the last freed one.
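As a short worked example (the numbers and the exact accounting are our illustration): with a heap word length of L_i^H = 4 bytes, the data payload of a request of size bytes is padded to the next word boundary, so the alignment fragmentation of a single allocation is

\[ \mathrm{pad} = \left\lceil \frac{size}{L_i^H} \right\rceil \cdot L_i^H - size, \]

e.g. a 5-byte request occupies ceil(5/4) × 4 = 8 payload bytes (3 bytes of alignment fragmentation) on top of the two fixed meta-data bins, whereas an 8-byte request (a multiple of L_i^H) incurs zero alignment fragmentation.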

In the proposed architectural template of many-accelerator systems, memory coherency problems are inherently eliminated, since every accelerator has its own memory space and thus no other accelerator may access it. However, memory access conflicts may become a performance bottleneck in case a large set of accelerators shares the same heap. We alleviate this issue by enabling multi-heap configurations that relax the pressure on the dynamic memory management infrastructure.

4 Experimental Results

We evaluated the efficiency of the proposed DMM-HLS framework considering many-accelerator architectures targeting emerging application domains, e.g. artificial intelligence, scientific computing, enterprise computing etc. Recently, Microsoft showed that such many-accelerator systems on reconfigurable fabrics form a promising solution for accelerating portions of large-scale software [12]: a medium-scale deployment on a bed of 1,632 servers, measuring its efficacy in accelerating the Bing web search engine, reported a 95% improvement in the ranking throughput of each server. In this paper, we considered six applications from the evaluation test-bed of the Phoenix MapReduce framework for shared-memory systems [16]. The characterization setup of the employed applications is summarized in Table 1.

We evaluate the practical performance improvements delivered by the proposed DMM-HLS framework in a twofold manner. First, we evaluate the maximum number of accelerators that can be programmed onto the FPGA (for all measurements reported in Section 4 we selected the Virtex Ultrascale XVCU190 as the target device), in order to find the realistic saturation point for many-accelerator systems due to resource starvation. Figure 5 shows that the proposed DMM-HLS framework is able to deliver many-accelerator architectures with 3.8× more accelerators on average, compared to static allocation. On some kernels the gains reach up to 9.7×, i.e. for the Histogram application. The high gains of DMM-HLS in accelerator density come from the usage of one single heap (DMM-1). The DMM-1 configuration has the least possible overhead regarding the resources consumed by the DM manager. As configurations with more heaps are adopted (e.g. DMM-4 etc.), the extra resources needed to implement the corresponding allocators decrease the maximum number of accelerators, i.e. the instantiation of 16 heaps delivers an average gain of 1.7×.

In order to evaluate the per-accelerator latency overhead due to the DMM mechanisms, Figure 6 depicts the normalized latency of the employed kernels for the case of 16 accelerators using 1, 2, 4, 8 and 16 heaps. As shown, the average overhead over all accelerators is 19.9× when only one heap is employed, while it drops to 10×, 4.7×, 2.3× and 1.2× as the number of heaps is successively doubled. The increase in heaps allows higher memory-level parallelism to be achieved, since fewer accelerators share the same heap. Yet, even in the case that every accelerator has its own unique heap, i.e. the 16-accelerators/16-heaps configuration, a latency overhead still persists (1.2×), due to the DMM internal operation (free-list check, first-fit operation, etc.). However, the DMM-HLS framework enables effective system configurations which trade off accelerator density and heap sharing.

To quantify the aforementioned effectiveness of multiple-heap DMM setups, we measure the throughput achieved when allocating the maximum number of accelerators under differing multiple-heap DMM configurations. The throughput metric is obtained as the workload size (in Mbytes) over the latency of a kernel to process it (in us). Figure 7 depicts the aggregated results. Each diagram corresponds to the many-accelerator architectures of one specific application. The x-axis refers to the system setup: static allocation, and DMM allocation with 1, 2, 3, 4, 16 and 32 heaps. The left y-axis refers to the normalized resources: every stacked vertical bar contains the cumulative percentage of the four studied resources (BRAMs, DSPs, FFs and LUTs), so the theoretical maximum of the left y-axis in the diagrams of Fig. 7 is 400%. All the stacked vertical bars refer to this axis. The right y-axis refers to the normalized throughput, with reference to the throughput obtained from static allocation, which is highlighted as a dashed horizontal line at value 1. The accelerator density (i.e. the maximum number of deployed accelerators) for every system setup is depicted on top of every stacked bar.


Table 1: Applications Characterization

Application Domain      | Kernel                | Description                                                              | Parameters
Image Processing        | Histogram             | Determine frequency of RGB channels in an image.                         | Msize = 640×480 pixels
Scientific Computing    | Matrix Multiplication | Dense integer matrix multiplication.                                     | Msize = 100×100
Enterprise Computing    | String Match          | Search a file with keys for an encrypted word.                           | Nfile-keys = 307,200, Mwords = 4
Artificial Intelligence | Linear Regression     | Compute the best-fit line for a set of points.                           | Npoints = 100,000
Artificial Intelligence | PCA                   | Principal components analysis on a matrix.                               | Msize = 250×250
Artificial Intelligence | Kmeans                | Iterative clustering algorithm to classify n-D data points into groups.  | Npoints = 20,000, Pclusters = 10, ndimensions = 3

Fig. 5: DMM-HLS efficiency on the maximum number of deployed accelerators: 3.8× more accelerators on average, compared to static allocation.

Fig. 6: Per-accelerator latency overhead for different numbers of heaps.


Fig. 7: Comparison of resource breakdown and accelerator density versus system throughput, between static and multiple-heap DMM-HLS setups.

While the DMM-1 setup (one heap for all accelerators) allows the highest number of accelerators (3.8×), it exhibits low throughput, since all accelerators access the single heap sequentially; the resulting increase in system latency outweighs the massive performance budget of the increased number of accelerators. As more heaps are added, the overall system exhibits higher throughput, up to the point where the extra overhead of the multiple heaps causes the number of accelerators to decrease, i.e. although there are idle heaps available, there are not enough accelerators to fully utilize them, leading to a throughput decrease. However, the proposed DMM-HLS framework enabled the instantiation of many-accelerator systems that exhibit an average throughput increase of 21.4× over the static allocation obtained by the conventional Vivado-HLS flow. On average, considering both the DMM infrastructure and the accelerators, the on-chip memory utilization is around 60%, while the utilization of the rest of the FPGA resources is increased by factors of 6.3×, 17.6× and 29.7× for DSPs, FFs and LUTs, respectively.

5 Conclusions

This paper targeted the scalability issues of modern many-accelerator FPGA systems. We showed that the on-chip memory resource imposes a severe bottleneck on the maximum number of deployed accelerators, leading to large resource under-utilization in modern FPGA devices. With the DMM-HLS framework, we incorporate dynamic memory management during HLS to alleviate the resource under-utilization inefficiencies mainly induced by the static memory allocation strategies used in modern HLS tools. The proposed approach has been extensively evaluated over real-life many-accelerator architectures targeting emerging applications, showing that its adoption delivers significant gains in accelerator density and throughput.

References

1. M. J. Flynn, O. Mencer, V. Milutinovic, G. Rakocevic, P. Stenstrom, R. Trobec, and M. Valero, "Moving from petaflops to petadata," Commun. ACM, vol. 56, no. 5, pp. 39–42, May 2013. [Online]. Available: http://doi.acm.org/10.1145/2447976.2447989

2. J. Shalf, D. Quinlan, and C. Janssen, "Rethinking hardware-software codesign for exascale systems," Computer, vol. 44, no. 11, pp. 22–30, Nov 2011.

3. G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor, "Conservation cores: Reducing the energy of mature computations," SIGARCH Comput. Archit. News, vol. 38, no. 1, pp. 205–218, Mar. 2010. [Online]. Available: http://doi.acm.org/10.1145/1735970.1736044

4. Y.-T. Chen, J. Cong, M. Ghodrat, M. Huang, C. Liu, B. Xiao, and Y. Zou, "Accelerator-rich CMPs: From concept to real hardware," in Computer Design (ICCD), 2013 IEEE 31st International Conference on, Oct 2013, pp. 169–176.

5. J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, and G. Reinman, "Architecture support for domain-specific accelerator-rich CMPs," ACM Trans. Embed. Comput. Syst., vol. 13, no. 4s, pp. 131:1–131:26, Apr. 2014. [Online]. Available: http://doi.acm.org/10.1145/2584664

6. J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang, "High-level synthesis for FPGAs: From prototyping to deployment," Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 30, no. 4, pp. 473–491, April 2011.

7. M. Araya-Polo, J. Cabezas, M. Hanzich, M. Pericas, F. Rubio, I. Gelado, M. Shafiq, E. Morancho, N. Navarro, E. Ayguade, J. Cela, and M. Valero, "Assessing accelerator-based HPC reverse time migration," Parallel and Distributed Systems, IEEE Transactions on, vol. 22, no. 1, pp. 147–162, Jan 2011.

8. M. J. Lyons, M. Hempstead, G.-Y. Wei, and D. Brooks, "The accelerator store: A shared memory framework for accelerator-based systems," ACM Trans. Archit. Code Optim., vol. 8, no. 4, pp. 48:1–48:22, Jan. 2012. [Online]. Available: http://doi.acm.org/10.1145/2086696.2086727

9. E. Cota, P. Mantovani, M. Petracca, M. Casu, and L. Carloni, "Accelerator memory reuse in the dark silicon era," IEEE Computer Architecture Letters, vol. 99, no. RapidPosts, p. 1, 2012.

10. L. Semeria and G. De Micheli, "SpC: Synthesis of pointers in C: Application of pointer analysis to the behavioral synthesis from C," in Computer-Aided Design, 1998. ICCAD 98. Digest of Technical Papers. 1998 IEEE/ACM International Conference on, Nov 1998, pp. 340–346.

11. M. Shalan and V. J. Mooney, "A dynamic memory management unit for embedded real-time system-on-a-chip," in Proceedings of the 2000 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, ser. CASES '00. New York, NY, USA: ACM, 2000, pp. 180–186. [Online]. Available: http://doi.acm.org/10.1145/354880.354905

12. A. Putnam, A. Caulfield, E. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray, M. Haselman, S. Hauck, S. Heil, A. Hormati, J.-Y. Kim, S. Lanka, J. Larus, E. Peterson, S. Pope, A. Smith, J. Thong, P. Y. Xiao, and D. Burger, "A reconfigurable fabric for accelerating large-scale datacenter services," in 41st Annual International Symposium on Computer Architecture (ISCA), June 2014. [Online]. Available: http://research.microsoft.com/apps/pubs/default.aspx?id=212001

13. Xilinx, Inc. [Online]. Available: http://www.xilinx.com

14. S. Xydis, A. Bartzas, I. Anagnostopoulos, D. Soudris, and K. Z. Pekmestzi, "Custom multi-threaded dynamic memory management for multiprocessor system-on-chip platforms," in ICSAMOS, 2010, pp. 102–109.

15. Y. Sade, M. Sagiv, and R. Shaham, "Optimizing C multithreaded memory management using thread-local storage," in Proceedings of the 14th International Conference on Compiler Construction, ser. CC'05, 2005, pp. 137–155.

16. C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, "Evaluating MapReduce for multi-core and multiprocessor systems," in High Performance Computer Architecture, 2007. HPCA 2007. IEEE 13th International Symposium on, Feb 2007, pp. 13–24.