

Architecture Support for Custom Instructions with Memory Operations

Jason Cong
[email protected]

Karthik Gururaj
[email protected]

Department of Computer Science, University of California, Los Angeles

ABSTRACT
Custom instructions (CIs) implemented using custom functional units (CFUs) have been proposed as a way of improving the performance and energy efficiency of software while minimizing the cost of designing and verifying accelerators from scratch. However, previous work allows CIs to communicate with the processor only through registers, or with limited memory operations. In this work we propose an architecture that allows CIs to seamlessly execute memory operations without any special synchronization operations to guarantee the program order of instructions. Our results show that our architecture can provide 24% energy savings with a 14% performance improvement for 2-issue and 4-issue superscalar processor cores.

Categories and Subject Descriptors: C.1.0 [ProcessorArchitectures]: General

Keywords: ASIP, Custom instructions, Speculation.

1. INTRODUCTION
Instruction-set customizations have been proposed in [1, 3, 6, 4, 10, 9] which allow certain patterns of operations to be executed efficiently on CFUs added to the processor pipeline. Integration of CFUs with a superscalar pipeline provides additional opportunities: typical superscalar processors have hardware for speculatively executing instructions and for rolling back and recovering to a correct state when there is a mis-speculation. In our work we propose an architecture for integrating CFUs with the superscalar pipeline such that the CFUs can perform memory operations without depending on the compiler to synchronize accesses with the rest of the program.

2. RELATED WORK AND OUR CONTRIBUTIONS

In [18, 4, 10, 9, 17, 12], the CFUs read (write) their inputs (outputs) directly from (to) the register file of the processor and cannot access memory. However, since the CFU cannot

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
FPGA'13, February 11–13, 2013, Monterey, California, USA.
Copyright 2013 ACM 978-1-4503-1887-7/13/02 ...$15.00.

1  ld  [A], R1
2  st  R2, [R1]
3  CI

Figure 1: Memory ordering example

access memory, the size of the computation pattern that can be implemented on a CFU is constrained. The primary problem with allowing CFUs to launch memory operations is to ensure that memory is updated and read in a consistent fashion with respect to other in-flight memory instructions in the superscalar pipeline. In the GARP [13], VEAL [8] and OneChip [7] systems, the compiler needs to insert special instructions to ensure all preceding memory operations have completed before launching the custom instruction. Systems with architecturally visible storage (AVS) [5, 14] also depend on the compiler for synchronization.

In this paper, our goal is to design an architecture such that custom instructions (CIs) with memory operations can execute seamlessly along with other instructions in the processor pipeline without any special synchronization operations. More precisely, we present an architecture for integrating CFUs with the processor pipeline with the following properties: (1) CFUs can launch memory operations to directly access the L1 D-cache of the processor. (2) No synchronization instructions need to be inserted before/after the CI; this greatly reduces the burden on the compiler, especially for applications with operations whose memory addresses cannot be determined beforehand. Our proposed microarchitecture ensures the correct ordering among the different memory operations.

3. CHALLENGES AND OUR PROPOSED SOLUTION FOR SUPPORTING MEMORY OPERATIONS IN CFUS

Note: For a more detailed explanation of our architecture, additional results and analysis, we refer the reader to our technical report [11].

In this section we explain the issues that arise when CFUs connected to an out-of-order (OoO) core are allowed to launch memory operations directly, and we propose modifications to the compilation flow and the underlying microarchitecture of the core to support such CFUs.



Lifetime of a CI: A CI is essentially a set of operations grouped together by the compiler to be executed as a single instruction. The primary inputs of a CI always come from the registers of the processor; all input registers must be ready before a CI is ready to execute. Once a CI starts executing, it can issue a series of memory operations into the processor's pipeline. Outputs of the CI can be written to the processor's registers or to memory.

3.1 Issue 1: Maintaining program order for memory operations

Consider the simple example in Figure 1. The CI to be executed launches three memory operations: a read from location M2 and writes to M1 and M3. In previous work [7, 14], the CI must at least wait until the addresses of all preceding memory operations are computed before proceeding; this is enforced by inserting synchronization instructions before the CI. We overcome this limitation by modifying the core's microarchitecture. The key difference of our approach is that we make the memory operations launched by a CI visible to the OoO processor's pipeline in program order. In the decode stage of the pipeline, in which program order is still maintained, the CI in Figure 1 will launch three memory operations (which we call mem-rops) into the dispatch stage. In the dispatch stage, each operation is assigned a tag representing its program order. The OoO pipeline will assign one entry for the store and three entries for the CI in the LSQ. Even if the store instruction's address is computed after the CI begins execution, the LSQ will check whether any successive operations have an address conflict. In the case of a conflict, the OoO pipeline will ensure that the CI is squashed and a pipeline flush occurs.
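The ordering mechanism above can be sketched abstractly. The following Python model is only illustrative, not the paper's hardware: the names (`LSQEntry`, `resolve_store`) and the two-entry scenario are our invention. Each operation receives a sequence id (SID) at dispatch; when the store from Figure 1 resolves its address after the CI's mem-rop has already executed, the LSQ finds the younger conflicting operation and marks it for squashing.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LSQEntry:
    sid: int                    # sequence id: program order, assigned at dispatch
    is_store: bool
    addr: Optional[int] = None  # None until the address is computed
    executed: bool = False

class LSQ:
    def __init__(self):
        self.entries = []       # kept in program (SID) order

    def dispatch(self, sid, is_store):
        e = LSQEntry(sid, is_store)
        self.entries.append(e)
        return e

    def resolve_store(self, entry, addr):
        """A store's address becomes known, possibly after younger
        operations (e.g. a CI's mem-rops) have already executed.
        Returns the younger conflicting entries that must be squashed."""
        entry.addr = addr
        return [e for e in self.entries
                if e.sid > entry.sid and e.executed and e.addr == addr]

lsq = LSQ()
st = lsq.dispatch(sid=2, is_store=True)      # 'st R2, [R1]': address unknown
ci_ld = lsq.dispatch(sid=3, is_store=False)  # mem-rop launched by the CI
ci_ld.addr, ci_ld.executed = 0x40, True      # CI ran ahead of the store
squash = lsq.resolve_store(st, 0x40)         # store resolves to the same address
print([e.sid for e in squash])               # the CI's mem-rop must be replayed
```

In the real pipeline the squash triggers a flush and re-execution of the CI; the sketch only identifies which entries conflict.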

3.2 Issue 2: Ordering of memory operations within a CI

For the CI in Figure 1, assume that the first write operation to address M1 and the read operation reading from M2 overlap/conflict. In the case of normal memory instructions, if the read executes before the write, the read will be squashed and re-executed in the OoO pipeline. However, the instruction stream of the program contains only the CI and not the individual memory operations. Re-execution would need to begin from the CI, but the same conflict would occur again. To overcome this problem, we place a constraint during the CI compilation phase: the compiler cannot cluster a memory operation into a CI if there is a preceding memory write operation within the CI which may cause a conflict. The compiler uses alias analysis (possibly in a conservative way) to satisfy this constraint. Memory dependences between different CIs are handled as described in Section 3.1.
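The clustering constraint can be sketched as follows. This is a hypothetical Python rendering (all names are ours, and the `may_alias` stand-in is far cruder than real compiler alias analysis): a memory operation is excluded from a CI as soon as it may conflict with a write already clustered into that CI.

```python
def may_alias(ref_a, ref_b):
    """Conservative stand-in for compiler alias analysis: two references
    are assumed to conflict if they name the same object or if either
    one cannot be analyzed."""
    return ref_a == ref_b or "unknown" in (ref_a, ref_b)

def cluster_ci(ops):
    """ops: list of (kind, ref) in program order, kind in {'load','store'}.
    Greedily cluster a prefix of the operations, stopping before the
    first one that may conflict with a write already inside the CI
    (such a conflict could not be resolved by replaying the CI)."""
    ci, writes = [], []
    for kind, ref in ops:
        if any(may_alias(w, ref) for w in writes):
            break                       # would deadlock on replay: exclude
        ci.append((kind, ref))
        if kind == "store":
            writes.append(ref)
    return ci

ops = [("store", "A"), ("load", "B"), ("load", "A"), ("store", "C")]
print(cluster_ci(ops))  # the load of A conflicts with the earlier store to A
```

Here the CI keeps only the first two operations; the load of A (and everything after it) stays outside the CI, so any conflict with it is handled by the inter-CI mechanism of Section 3.1.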

3.3 Issue 3: Possible partial commit to memory

For the CI in Figure 1, assume that the first write operation, to address M1, commits and updates memory. However, after this commit, it is determined that the write operation to M3 fails because of a TLB translation fault. This would leave the memory in an inconsistent state, since the write to address M1 was already committed. We overcome this problem by delaying the commit of all memory write operations launched by the CI until the successful completion of the CI (no TLB faults). The compiler inserts additional instructions that are executed in case a CI causes a TLB fault.
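A minimal sketch of the delayed-commit rule, assuming a hypothetical `translate` interface for the TLB (the function names and the address arithmetic are ours): every write of the CI is translated before any write becomes visible, so a fault on a later write leaves memory untouched.

```python
class TLBFault(Exception):
    pass

def commit_ci_writes(memory, writes, translate):
    """Drain the write mem-rops of one CI at retire time.
    writes: list of (vaddr, value); translate: vaddr -> paddr, raising
    TLBFault on a translation fault. All addresses are translated
    before any write is made visible, so memory is never half-updated."""
    staged = [(translate(va), v) for va, v in writes]  # may raise TLBFault
    for pa, v in staged:
        memory[pa] = v

mem = {}
commit_ci_writes(mem, [(0x10, 1), (0x30, 3)], lambda va: va + 0x1000)
print(mem)                 # both writes become visible together

def faulty(va):            # the second write faults, as in the M3 example
    if va == 0x30:
        raise TLBFault(hex(va))
    return va + 0x1000

mem2 = {}
try:
    commit_ci_writes(mem2, [(0x10, 1), (0x30, 3)], faulty)
except TLBFault:
    pass
print(mem2)                # empty: no partial commit occurred
```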

Figure 2: Layout of the processor pipeline with tightly integrated CFU

3.4 Issue 4: Handling a variable number of memory operations

Since CIs can span multiple basic blocks in the program, the number of memory operations launched by a CI can vary across executions and need not be known before the CI starts executing. This issue is solved by launching, during the decode stage, the maximum number of memory operations that the CI can execute. In the case where a particular memory operation is not executed, the CI supplies a 'dummy' address for these operations, effectively turning them into nops.
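The padding scheme can be sketched as follows (a hypothetical rendering; the sentinel `DUMMY_ADDR` and the helper name are ours):

```python
DUMMY_ADDR = object()   # sentinel address: a mem-rop carrying it is a nop

def launch_mem_rops(max_rops, actual_addrs):
    """At decode, a CI always launches max_rops mem-rops (the static
    maximum for its opcode). Operations the CI does not actually take
    on this execution receive the dummy address and execute as nops."""
    return list(actual_addrs) + [DUMMY_ADDR] * (max_rops - len(actual_addrs))

rops = launch_mem_rops(4, [0x100, 0x104])   # only 2 of 4 ops taken this time
print(sum(a is DUMMY_ADDR for a in rops))   # the other 2 rops become nops
```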

4. DETAILS OF PROPOSED ARCHITECTURE
Figure 2 shows the basic layout of how our reconfigurable CFU units interact with the rest of the processor pipeline. We briefly explain each component of this interaction.

4.1 Decode stage
Since the number of ports of the register file (and of other components in the microarchitecture of a superscalar processor, such as the RAT, free list and reservation station) is limited, we split a complex CI into multiple simple operations which we call rops (similar to the micro-ops in x86 processors). Sreg-rops only read from registers, dreg-rops only write to registers, and mem-rops read/write from/to memory.

The opcode of the CI is used to determine the number and type (read/write) of mem-rops that the CI will launch. We introduce an SRAM table, called the CFU memop table, to store the mapping between the CI opcode and the number of memory operations launched by the CI. This is a programmable table which is filled when a new application is launched.
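A sketch of how such a table might be encoded and consulted at decode. The opcode value and the `(reads, writes)` encoding below are our assumptions for illustration, not the paper's actual table format:

```python
# Hypothetical CFU memop table: CI opcode -> (read mem-rops, write mem-rops).
# The table is programmable and is filled when a new application is launched.
cfu_memop_table = {}

def program_memop_table(entries):
    cfu_memop_table.update(entries)

def decode_ci(opcode):
    """Decode stage: determine how many mem-rops (and of which kind)
    this CI will launch, so they can enter dispatch in program order."""
    n_reads, n_writes = cfu_memop_table[opcode]
    return ["mem-read"] * n_reads + ["mem-write"] * n_writes

program_memop_table({0x51: (1, 2)})   # e.g. the CI of Figure 1: 1 read, 2 writes
print(decode_ci(0x51))                # ['mem-read', 'mem-write', 'mem-write']
```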

4.2 Dispatch stage
The dispatch stage is the last stage in the pipeline where instructions are processed in order (apart from the commit stage). Each instruction is assigned a sequence id, or SID, which represents the program order. The main job of the dispatch stage is to perform resource allocation, i.e., creating entries (and assigning SIDs) in the ROB for each instruction/rop, in the reservation station, and in the LSQ for memory instructions/rops. Since mem-rops get their address (and data) operands directly from the CIs (and not the register file), each entry in the reservation station needs to be expanded to accommodate this information. Our analysis shows that for a modern superscalar processor with 256 physical registers and a 128-entry ROB, where CIs can have at most 16 mem-rops, the reservation station in the baseline processor would occupy 0.17 mm², while the modified reservation station would occupy 0.21 mm² at the 45 nm node (from McPAT [15]).

4.3 Scheduler and execute stage
A CI is ready to execute when all the sreg-rops issued by it are ready. The scheduler decides which CFU to assign to a particular CI. Once the CFU has obtained all its register operands, it begins execution. When a memory operation is encountered, the CFU computes the address (and possibly the data) for the operation and sends it over the bypass path to the mem-rop waiting in the RS.

The mem-rop, using the address obtained from the CFU, proceeds with the memory operation in a manner identical to load/store instructions. Assuming there are no conflicts in the LSQ or TLB misses, the memory operation completes and waits in the ROB for retirement. If the mem-rop is a memory read operation, the read value is forwarded to the CFU over the bypass paths.

The CFU completes execution and forwards the results to the dreg-rops waiting in the RS, after which the dreg-rops wait in the ROB for retirement.

4.4 Retire stage
Unlike store instructions, memory-write mem-rops launched by CIs are allowed to update memory only when all the rops launched by the CI have retired, for the reasons explained in Section 3.3.

5. RESULTS
We use the LLVM compiler framework [2] to analyze and extract CIs from the SPEC integer 2006 benchmark suite. Our baseline core is a 2-issue superscalar core with tightly integrated CFUs. Our chip is a tiled CMP system where each tile can be a core or a reconfigurable fabric tile with 2000 slices and 150 DSP blocks (density based on Virtex-6 numbers). We assume a 5-cycle, pipelined link between the FPGA fabric and the core pipeline.

We use the AutoESL HLS tool to synthesize our CFUs and Xilinx XPower to obtain energy numbers for the FPGA. We use Wisconsin's GEMS simulation infrastructure [16] to model the performance of our system and McPAT [15] to estimate the energy of the processor core. Our cores and the CFUs run at different frequencies: the core runs at 2 GHz, while the frequency of the CFU is provided by Xilinx ISE. To keep the interface logic as simple as possible, the CFU is operated at a clock period which is an integer multiple of the CPU clock period. For five of our benchmarks, the CFUs selected by our compiler pass could operate at 200 MHz (1 FPGA cycle = 10 core cycles), while for the other two benchmarks the CFUs operated at 125 MHz (1 FPGA cycle = 16 core cycles).
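The integer-multiple clocking rule amounts to rounding the CFU period up to a whole number of core cycles. A small sketch; the synthesized frequencies (210 MHz and 130 MHz) are invented inputs chosen to reproduce the 200 MHz and 125 MHz operating points reported above:

```python
import math

def cfu_operating_freq(core_hz, synth_hz):
    """Run the CFU at the fastest clock whose period is an integer
    multiple of the core clock period and no shorter than the period
    achieved by synthesis. Returns (operating frequency, core cycles
    per CFU cycle)."""
    ratio = math.ceil(core_hz / synth_hz)   # core cycles per CFU cycle
    return core_hz / ratio, ratio

core = 2_000_000_000                        # 2 GHz core clock
print(cfu_operating_freq(core, 210e6))      # 200 MHz, 10 core cycles
print(cfu_operating_freq(core, 130e6))      # 125 MHz, 16 core cycles
```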

Table 1 shows the performance when the CFUs are pipelined. The initiation interval of the pipelining varies between 1 and 3 FPGA cycles (as determined by AutoPilot). With pipelining, we see significant performance improvements: an average of around 14%. The key point of our approach is the comparison of a 2-issue core augmented with CFUs (column 2) against a 4-issue core (column 5): our architecture, using a 2-issue core with CFUs, can beat the performance of a 4-issue core. For benchmarks with significant ILP (libquantum, hmmer, sjeng, h264), the speed-up is reasonable. Benchmarks such as mcf, which have a large working set, see very little improvement, mainly because they are limited by cache misses.

Table 2 shows the energy consumption (normalized to the 2-issue baseline core). Here we see that adding CFUs provides significant energy savings: on average, a 24% energy reduction. Of the total energy savings, 41% comes from the reduced number of accesses to the I-cache, instruction buffer and decode logic; 32% comes from the reduced energy consumption of the ALUs (since many arithmetic operations are now performed in the FPGA) and register files; and the remaining 27% is distributed among the reservation station, rename logic and ROB.

6. CONCLUSIONS
In this paper we present an architecture by which CIs can launch memory operations and execute speculatively when integrated with a superscalar processor pipeline. Our approach does not need any synchronization or detailed memory analysis by the compiler. Our experiments show that even for pointer-heavy benchmarks, our approach can provide an average of 24% energy savings and a 14% performance improvement over software-only implementations.

7. ACKNOWLEDGMENT
The authors acknowledge the support of the GSRC, one of six research centers funded under the FCRP, an SRC entity, and Xilinx.

8. REFERENCES
[1] Altera NIOS-II processor. http://www.altera.com/devices/processor/nios2/ni2-index.html.

[2] The LLVM compilation infrastructure. http://llvm.org.

[3] Xtensa customizable processor. http://www.tensilica.com/products/xtensa-customizable.

[4] K. Atasu, C. Ozturan, G. Dundar, O. Mencer, and W. Luk. CHIPS: Custom hardware instruction processor synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 27(3):528–541, March 2008.

[5] P. Biswas, N. D. Dutt, L. Pozzi, and P. Ienne. Introduction of architecturally visible storage in instruction set extensions. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 26(3):435–446, March 2007.

[6] P. Brisk, A. Kaplan, and M. Sarrafzadeh. Area-efficient instruction set synthesis for reconfigurable system-on-chip designs. In Proceedings of the 41st Design Automation Conference, pages 395–400, July 2004.

[7] J. E. Carrillo and P. Chow. The effect of reconfigurable units in superscalar processors. In Proceedings of the



Table 1: Normalized performance (#cycles elapsed) with pipelined CFUs on FPGAs

                 2-issue/128 entries   2-issue/256 entries   4-issue/128 entries   4-issue/256 entries
                 baseline    CFU       baseline    CFU       baseline    CFU       baseline    CFU
bzip2             1.000     0.924       0.999     0.924       0.958     0.885       0.956     0.884
libquantum        1.000     0.703       1.000     0.703       0.653     0.459       0.653     0.459
hmmer             1.000     0.781       1.000     0.781       0.775     0.605       0.775     0.605
mcf               1.000     0.938       1.000     0.938       0.973     0.913       0.973     0.913
gobmk             1.000     0.940       0.999     0.940       0.982     0.924       0.981     0.923
h264              1.000     0.802       0.998     0.801       0.765     0.613       0.763     0.612
sjeng             1.000     0.899       0.999     0.898       0.895     0.805       0.894     0.804
Average           1.000     0.855       0.999     0.855       0.857     0.743       0.857     0.743
Improvement(%)        -    14.460           -    14.461           -    13.277           -    13.278

Table 2: Normalized total energy consumption with pipelined CFUs on FPGAs

                 2-issue/128 entries   2-issue/256 entries   4-issue/128 entries   4-issue/256 entries
                 baseline    CFU       baseline    CFU       baseline    CFU       baseline    CFU
bzip2             1.000     0.736       1.046     0.806       1.367     1.027       1.452     1.098
libquantum        1.000     0.746       1.058     0.809       1.044     0.787       1.141     0.825
hmmer             1.000     0.726       1.094     0.816       1.079     0.798       1.291     1.007
mcf               1.000     0.748       1.060     0.836       1.035     0.800       1.123     0.876
gobmk             1.000     0.734       1.011     0.788       1.391     1.060       1.412     1.102
h264              1.000     0.782       1.051     0.757       1.137     0.881       1.224     0.923
sjeng             1.000     0.731       1.029     0.769       1.290     0.992       1.343     1.001
Average           1.000     0.743       1.050     0.797       1.192     0.907       1.284     0.976
Improvement(%)    0.000    25.672       0.000    24.075       0.000    23.942       0.000    23.948

2001 ACM/SIGDA Ninth International Symposium on Field Programmable Gate Arrays, FPGA '01, pages 141–150, New York, NY, USA, 2001. ACM.

[8] N. Clark, A. Hormati, and S. Mahlke. VEAL: Virtualized execution accelerator for loops. In Proceedings of the 35th International Symposium on Computer Architecture (ISCA '08), pages 389–400, June 2008.

[9] N. Clark, M. Kudlur, H. Park, S. Mahlke, and K. Flautner. Application-specific processing on a general-purpose core via transparent instruction set customization. In Proceedings of the 37th International Symposium on Microarchitecture (MICRO-37), pages 30–40, December 2004.

[10] J. Cong, Y. Fan, G. Han, and Z. Zhang. Application-specific instruction generation for configurable processor architectures. In Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays, FPGA '04, pages 183–189, New York, NY, USA, 2004. ACM.

[11] J. Cong and K. Gururaj. Architecture support for custom instructions with memory operations. Technical Report 120019, University of California, Los Angeles, November 2012.

[12] Q. Dinh, D. Chen, and M. D. F. Wong. Efficient ASIP design for configurable processors with fine-grained resource sharing. In Proceedings of the 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays, FPGA '08, pages 99–106, New York, NY, USA, 2008. ACM.

[13] J. Hauser and J. Wawrzynek. Garp: a MIPS processor with a reconfigurable coprocessor. In Proceedings of the 5th Annual IEEE Symposium on FPGAs for Custom Computing Machines, pages 12–21, April 1997.

[14] T. Kluter, S. Burri, P. Brisk, E. Charbon, and P. Ienne. Virtual ways: Efficient coherence for architecturally visible storage in automatic instruction set extensions. In HiPEAC, volume 5952 of Lecture Notes in Computer Science, pages 126–140. Springer, 2010.

[15] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42, pages 469–480, New York, NY, USA, 2009. ACM.

[16] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset. SIGARCH Comput. Archit. News, 33:92–99, November 2005.

[17] L. Pozzi and P. Ienne. Exploiting pipelining to relax register-file port constraints of instruction-set extensions. In Proceedings of the 2005 International Conference on Compilers, Architectures and Synthesis for Embedded Systems, CASES '05, pages 2–10, New York, NY, USA, 2005. ACM.

[18] Z. Ye, A. Moshovos, S. Hauck, and P. Banerjee. Chimaera: a high-performance architecture with a tightly-coupled reconfigurable functional unit. In Proceedings of the 27th International Symposium on Computer Architecture, pages 225–235, June 2000.
