Automatic Generation of ApplicationSpecific SystemsBased on a Micro-programmed Java Core
F. Gruian, P. Andersson, K. Kuchcinski,Dept. of Computer Science, Lund University
Box 118, S-221 00 LundSweden
Strausseng. 2-10/2/55, A-1050 ViennaAustria
ABSTRACTThis paper describes a co-design based approach for auto-matic generation of application specific systems, suitable forFPGA-centric embedded applications. The approach aug-ments a processor core with hardware accelerators extractedautomatically from a high-level specification (Java) of theapplication, to obtain a custom system, optimised for thetarget application. We advocate herein the use of a micro-programmed core as the basis for system generation in orderto hide the hardware access operations in the micro-code,while conserving the core data-path (and clock frequency).To prove the feasibility of our approach, we also presentan implementation based on a modified version of the JavaOptimized Processor soft core on a Xilinx Virtex-II FPGA.
Categories and Subject DescriptorsC.3 [Special Systems and Application-Based Systems]:real-time and embedded systems
Keywordssystem-on-chip, co-design, FPGA, Java
1. INTRODUCTIONIt is well known that hardware accelerators can give bothspeed-up and reduced power consumption. An attractivetechnology for implementing accelerators is reconfigurablelogic. Today it is even possible to build entire systems ona single FPGA, leading to a dramatic decrease in designtime and cost compared to ASIC based systems, makingthem increasingly useful in embedded systems. Single-chipsystems can be more tightly coupled, since the interconnectsare no longer limited by the pad amount, as in multi-chipsolutions.
The features just mentioned open new possibilities forapplication-specific systems. Hardware accelerators and co-processors can now be more deeply integrated with processor
Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.SAC05 March 13-17, 2005, Santa Fe, New Mexico, USACopyright 2005 ACM 1-58113-964-0/05/0003 ...$5.00.
cores. Furthermore, accelerators may be automatically ex-tracted from the software source code, grouped and mergedtogether depending on the space available on the device.In addition, this process may also be made transparent tothe user, who, unlike with other approaches for application-specific system design, involving application-specific proces-sors and accelerators, does not have to be aware of the un-derlying hardware, nor provide specific compilers or librariesfor accessing the accelerators.
In this paper, we propose a co-design flow for generat-ing application-specific systems-on-chip, hosted on a FPGA.The target architecture consists of a processor enhancedwith one or several accelerators. We also describe and dis-cuss trade-offs at different steps in the design flow. Specifi-cally we look at interconnection structures, communicationand synchronisation issues with respect to both the acceler-ators and the modifications to the processor. In addition, weillustrate the proposed design flow using our version of theJava Optimized Processor (JOP, ), which is augmentedwith accelerators. The accelerators are automatically ex-tracted form the application Java source code, the same codethat can be run on JOP without accelerators.
In traditional methodology, the use of hardware accelera-tors implied a considerable design effort. The function to beaccelerated had to be identified and then re-implemented,by an expert, using a hardware description language. Toreduce the design overhead for using FPGA accelerators,much work has been done to automate this process. In, Gokhale et al. present the NAPA C compiler for theNAPA1000, a hybrid RISC/FPGA processor. They extendANSI C with #pragmas, which steer whether variables andexpressions are to be mapped onto the RISC, or the FPGA.Their hardware generation is however limited to scalar ex-pressions (no memory accesses, conditions, or nested loops).
An approach for automatic accelerator generation in aJava environment is presented in . It uses the Java byte-code as input for the accelerator generation. Data path gen-eration is based on predefined and pre-synthesized macros,and only the generated controllers need to pass through logicsynthesis. In contrast we generate RTL-level vhdl code forboth the data path and controllers, starting from sourcecode. We believe this is more flexible and increase the pos-sibilities for optimizations. In , the Java Virtual Machine(JVM) and the accelerators do not share a common mem-ory, so data are copied between the JVM memory and theaccelerator local memory before and after the acceleratedcomputation. This leads to a significant difference in the
generated interface compared to our approach, where theJVM and accelerator do share a common memory.
In  Kent et al. present a hardware/software co-designof a JVM, on a general purpose processor connected to anFPGA. Unlike, that approach which focuses on speedingup a JVM in general, for any Java application, our ap-proach targets application-specific systems-on-chip by au-tomatic extraction of accelerators at compile time.
There have been many more approaches to acceleratorgeneration, most of them more or less automated. A surveyof such systems and tools can be found in , for example.
Recently, much work has been done on the topic of com-piling traditional software languages directly into hardware[1,5,11]. This makes a good foundation for fully automatedaccelerator generation directly from a software description.
The remainder of the paper is organized as follows. Sec-tion 2 gives an overview of our application specific systemdesign flow. Section 3 presents the architectural options forimplementing such systems, and motivates our choices. Insection 4, we give the implementation details specific for ourdesign flow and platform. Section 5 presents an experimen-tal evaluation that validates our design methodology. Thepaper concludes with a summary in section 6.
2. CO-DESIGN FLOW OVERVIEWOur view on the co-design flow for generating an application-specific system is detailed in Figure 1. The input to thewhole process is a high-level specification of the application(in our case in Java), including all the necessary sources,preferably together with libraries and packages. The appli-cation is profiled in order to obtain essential data for select-ing code regions suitable for hardware acceleration. Timeconsuming loops, frequently called short functions, and I/Ooperations are good candidates for hardware implementa-tion. The application is then partitioned into software andhardware, taking into account a number of requirementssuch as performance, device utilization, micro-code mem-ory size, etc. Although this step could be fully automated,we believe a tight interaction between the designer and thepartitioning tool is beneficial at this point. For example, anexperienced designer might directly recognize reusable coderegions, only slightly different, that could share the samehardware. At this point, the application is described by asoftware part, further handled by the software flow, and anumber of hardware functions (still described in Java in ourimplementation) that will undergo the hardware flow.
2.1 Software FlowAfter partitioning, the software part includes, besides theactual code (bytecode), also specific hardware calls that re-place the code to be implemented by the hardware accel-erators. In fact hardware calls are new instructions, im-plemented at the micro-program level. To make the soft-ware flow more generic and keep a unique micro-code forall applications, we adopted in principle three new instruc-tions (hardware calls): hwWrite to write a word to a cer-tain hardware accelerator, hwSynch to synchronize (waitfor results) with a certain accelerator, and finally hwReadto read a word from a given accelerator. In our implemen-tation, these hardware calls are specific functions, replacedby dedicated bytecodes using a custom JavaCodeCompactor(jcc). Otherwise the software flow involves the classic com-pilation and static linking steps, to get the executable image.
Custom JOP version
Compile & Link(JavaCodeCompact)
Hw Accelerator 1Hw Accelerator 1Hw Accelerator 1(VHDL)
microROM Core Base(JOP)
Synthesize and Generate Netlist/Bitmap(XPS)
Configuration(FPGA programming file)
Source + Hw Calls
Hw Merge &System Generation
Figure 1: Our co-design flow for generatingapplicationspecific systems.
2.2 Hardware FlowOnce the hardware functions are decided, each of the corre-sponding Java code regions will be translated into a registertransfer level specification rtl (a vhdl component in ourcase).
Our translator supports the semantic overlap of Java andC, i.e. loops, arrays and object references. We do not sup-port yet recursion, inheritance, or the dynamic aspects ofJava, i.e. object creation or dynamic class loading. The fo-cus of our translator is on the memory accesses. We analysethe memory accesses looking for data reuse and patterns, al-lowing memory accesses trough bus bursts. To exploit datareuse we introduce local memories, drastically reducing thenumber of accesses over the shared bus . The translatoris intended for small loops and therefore we do not considerresource sharing. However, if large portions of code are tobe translated, high-level synthesis employing resource shar-ing should be considered . The translated entities will belater on connected to the initial base core (in our case JOP)and possibly the system bus, according to one of the choicesdescribed in section 3. All the components are then glued to-gether during a hardware merge and system generation step.The system based on the micro-programmed processor, in-cluding the hardware accelerators and necessary peripheralsis the result of this process (specified as vhdl, micro-codememory contents, and possible third party IPs).
Finally, this specification is synthesized, producing a con-figuration file for the desired FPGA (at this point our flowmakes use of the Xilinx Platform Studio, CoreGenerator,and Xilinx ISE).
3. ARCHITECTURAL CHOICESThe specific hardware functionality extracted during thepartitioning step can be incorporated into the system in anumber of different ways, each of these having certain advan-tages and drawbacks. Next we review briefly three architec-tural options, pointing out our choices and the motivationbehind them.
A first, straightforward option is to simply connect allthe hardware accelerators as slaves (or master/slaves) tothe common system bus (see Fig. 2, Pheripheral 1 ). Allaccesses occur exclusively through the bus, with the excep-tion perhaps of optional interrupt requests to the processor.Note that the accelerators have to be accessible through nor-mal bus read/write operations. The advantage of such anarchitecture is that, in principle, any processor core can beaccelerated this way. No special instructions for accessingthe hardware registers are required. From the software part,hardware calls can be simply implemented as read/writes tothe memory addresses associated with the accelerators. Fur-thermore, the accelerators can run in parallel with the pro-cessor. However, because of using the common system busas a point of access, each read/write operation to a hardwareregister takes a rather long time and competes with otherbus transfers. For these reasons, this architecture is suitableonly for complex/coarse-grain hardware operations.
A second, more tightly coupled architecture implies theuse of a local access structure and special micro-instructionsfor reading/writing hardware registers (see Fig. 2, Hw1 ).The advantages over the previous method are the follow-ing. The hardware accelerators are not mapped over com-mon addresses and cannot be accessed from outside exceptthrough dedicated micro-code. Malicious or badly designedJava code cannot therefore directly affect the hardware ac-celerators. In addition, the read/write operations can beconstructed to be very simple and fast, since they do nothave to abide the system bus protocol. As a consequence,even the hardware may become less complex, since the businterfaces are not necessary. These features make this archi-tecture better suited for finer grain operations. However, ifthe accelerators must access data from the memory, they canonly do this through the core fetch unit, using micro-codeoperations. For large amounts and constant flow of data,this becomes an issue. This problem can be simply solvedby combining the two approaches just presented. In partic-ular, the accelerators can be connected as masters on thesystem bus, fetching the required data themselves insteadof using the core fetch unit (see Fig. 2, Hw2 ). This way thehardware accelerators are still unaccessible (unless throughthe dedicated hardware calls) from the Java code, yet theycan freely use the (shared) memory and peripherals whilethe core carries out other tasks. In our implementation, wedecided to use both the pure local access structure and thecombined localbus master access choice, as in Fig. 2.
Finally, a third possibility would be to integrate the hard-ware functions with the core execution unit. This wouldallow for the new hardware operations to access the core reg-isters and also share functional units among themselves andthe core. However, this approach (designing an Application-Specific Instruction-set Processor) would drastically alterthe data-path of the processor. In addition, the micro-decoder unit would have to accommodate specific instruc-tions for each hardware function. These changes would im-pact the processor clock frequency to an extent that could
System Bus (OPB)
RAM Peripheral 1Peripheral
HWA Local Bus
Hw 1 Hw 2
Figure 2: Our architectural choice.
need a detailed examination and re-design by an experiencedhardware designer. Furthermore, the parallelism existing inthe other architectures is lost, as the execution stage cancarry out only one operation at one time. Although thisapproach might be useful for very fine grained operations(such as the dsp multiply-accumulate), we considered it tobe unsuitable for fast and automatic system generation.
4. IMPLEMENTATION DETAILSThe current section describes some of the implementationdetails specific to the design flow introduced above. In par-ticular, we present the basic system organisa...