
Automatic Instruction-set Architecture Synthesis for VLIW Processor Cores in the ASAM Project

Roel Jordans a,*, Lech Jóźwiak a, Henk Corporaal a, Rosilde Corvino b

a Eindhoven University of Technology, Postbus 513, 5600MB Eindhoven, The Netherlands
b Intel Benelux B.V., Capronilaan 37, 1119NG Schiphol-Rijk

Abstract

The design of high-performance application-specific multi-core processor systems is still a time-consuming task which involves many manual steps and decisions that need to be performed by experienced design engineers. The ASAM project sought to change this by proposing an automatic architecture synthesis and mapping flow aimed at the design of such application-specific instruction-set processor (ASIP) systems. The ASAM flow separated the design problem into two cooperating exploration levels, known as the macro-level and micro-level exploration. This paper presents an overview of the micro-level exploration, which is concerned with the analysis and design of the individual processors within the overall multi-core design, starting at the initial exploration stages and continuing up to the selection of the final design of the individual processors within the system. The designed processors use a combination of very-long instruction-word (VLIW), single-instruction multiple-data (SIMD), and complex custom DSP-like operations in order to provide an area- and energy-efficient, high-performance execution of the program parts assigned to the processor node.

In this paper we present an overview of how the micro-level design space exploration interacts with the macro-level, how early performance estimates are used within the ASAM flow to determine the tasks executed by each processor node, and how an initial processor design is then proposed and refined into a highly specialized VLIW ASIP. The micro-level architecture exploration is then demonstrated with a walk-through of the process on an example program kernel to further clarify the exploration and architecture specialization process.

The main findings of the experimental research are that the presented method enables automatic instruction-set architecture synthesis for VLIW ASIPs within a reasonable exploration time. Using the presented approach, we were able to automatically determine an initial architecture prototype that met the temporal performance requirements of the target application. Subsequent refinement of this architecture considerably reduced both the design area (by 4x) and the active energy consumption (by 2x).

Keywords: Very Long Instruction Word (VLIW), Application Specific Instruction-set Processor (ASIP), Instruction-set architecture synthesis

1. Introduction

The recent nano-dimension semiconductor technology nodes enable the implementation of complex (heterogeneous) parallel processors and very complex multi-processor systems on a single chip (MPSoCs).

* Corresponding author
Email address: [email protected] (Roel Jordans)

Preprint submitted to Microprocessors and Microsystems April 19, 2017


Such MPSoCs may involve tens of different complex parallel processors, substantial memory and communication resources, and can realize complex high-performance computations in an energy-efficient way. This facilitates rapid progress in mobile and autonomous computing, global networking, and wireless communication which, combined with substantial progress in sensor and actuator technologies, creates important new opportunities. Many traditional applications can now be served much better but, more importantly, numerous new sorts of smart communicating mobile and autonomous cyber-physical systems have become technologically feasible and economically justified. Examples include systems performing monitoring, control, diagnostics, communication, visualization, or a combination of these tasks, representing (parts of) different mobile, remote, or poorly accessible objects, installations, machines, vehicles or devices, or even being wearable or implantable in human or animal bodies. A new wave of the information technology revolution is arriving, creating much more coherent and fit-for-use modern smart communicating cyber-physical systems (CPS) that form part of the Internet of Things (IoT).

The rapidly developing markets of modern CPS and IoT represent great opportunities, but these opportunities come with the high system complexity and stringent requirements of many modern CPS applications. Two essential characteristics of mobile CPS are:

• the requirement of (ultra-)low energy consumption, often combined with the requirement of high performance, and

• heterogeneity, in the sense of the convergence and combination of various previously separate applications, systems, and technologies of different sorts in one system, or even in a single chip or package.

Smart wearable devices represent a huge heterogeneous CPS system area, covering many different fields and kinds of applications (from mobile personal computing and communications, through security and safety systems, to wireless health-care or sport applications), with each application having its own specific requirements. A modern car, too, involves numerous sub-systems imposing different specific requirements. What is common to virtually all these applications, however, is that they are autonomous with respect to their limited energy sources, and therefore required to ensure (ultra-)low energy consumption. Some of these applications do not impose any high demands on computing and communication performance, and can satisfactorily be served with simple micro-controllers and short-distance communication with, e.g., a smartphone. Other applications impose more substantial computing and communication performance requirements and need more sophisticated low-power micro-controllers or micro-controller MPSoCs, equipped among others with on-chip embedded memories, multiple sensors, A/D converters, and Bluetooth Low Energy, ZigBee, or multi-standard communication.

However, many of the modern complex and highly demanding mobile and autonomous CPS impose ultra-high demands that are difficult to satisfy and for which micro-controller or general-purpose processor based solutions are not adequate. These are mainly applications that are required to provide continuous autonomous service over a long time and involve big instant data generated or consumed by video sensors/actuators and other multi-sensors/actuators, and in consequence demand, at the same time, (ultra-)low energy consumption (often below 1 W) and (ultra-)high computing and communication performance (often above 1 Gbps). Examples of such applications include: mobile personal computing, communication and media, video sensing and monitoring, computer vision, augmented reality, image and other multi-sensor processing for health-care and well-being, mobile robots, intelligent car applications, etc. Specifically, smart cars and various wearable systems to a growing degree involve big instant data from multiple complex sensors (e.g. camera, radar, lidar, ultrasonic, EEG, sensor network tissues, etc.) or from other systems. To process these big data in real time, they demand a guaranteed (ultra-)high performance. Being used for safety-critical applications and required to provide continuous autonomous service over a long time, they also have to be highly reliable, safe, and secure.

The (ultra-)high demands of modern mobile CPS, combined with today's unusual silicon and system complexity, result in serious system development challenges, such as:

• guaranteeing the real-time high performance, while at the same time satisfying the requirements of (ultra-)low energy consumption, and high safety, security and dependability;


• accounting in the design for many aspects and for changed relationships among aspects (e.g. leakage power, negligible in the past, is a very serious issue now);

• complex multi-level multi-objective optimization (e.g. processing power versus energy consumption and silicon area) and adequate resolution of numerous complex design trade-offs (e.g. between various conflicting requirements, solutions at different design levels, or solutions for different system parts);

• reduction of the design productivity gap for the increasingly complex and sophisticated systems;

• reduction of the time-to-market and development costs without compromising the system quality, etc.

To satisfy the above-discussed demands and overcome the challenges, a substantial adaptation of system architecture and design technology is necessary. New sophisticated high-performance and low-power heterogeneous computing architectures are needed, involving highly parallel application-specific instruction-set processors (ASIPs) and/or HW accelerators, as well as new effective design methods and design automation tools to efficiently create the sophisticated architectures and perform the complex processes of application mapping onto those architectures.

The systemic realizations of complex and highly demanding applications, for which the customizable multi-ASIP MPSoC technology is especially suitable, demand performance and energy-usage levels comparable to those of ASICs, while requiring a short time-to-market and remaining cost-effective. Satisfaction of these stringent and often conflicting application demands requires the construction of highly optimized application-specific hardware architectures and corresponding software structures mapped onto the architectures. This can be achieved through an efficient exploitation of the different kinds of parallelism involved in these applications, implementation of critical parts of their information processing in application-specific hardware, and efficient trade-off exploitation among various design characteristics, as well as between solutions considered at different design levels and for different system parts.

Modern complex mobile applications include many different parts and numerous different algorithms involving various kinds of information processing with various kinds of parallelism (task-level, loop-level, operation/instruction-level, and data parallelism). They are by their very nature complex and heterogeneous and require a heterogeneous application-specific multi-processor system approach. In a customizable multi-ASIP MPSoC, the various ASIPs have to be customized together, in strict relation to the selection of the number of ASIPs, as well as to the mapping of the application's required computations onto the particular ASIPs in both space and time. The MPSoC macro- and micro-architectures, at the multi-ASIP system level and at the single ASIP processor level, are strictly interrelated. Important trade-offs have to be resolved regarding the number and granularity of the individual processors, and the amount of parallelism and resources at each of the two architecture levels. Moreover, at both architecture levels, the optimized parallel software structures have to be implemented on parallel hardware structures optimized for them. The two architecture levels are also strongly interwoven through their relationships with the memory and communication structures. Each micro-/macro-architecture combination with a different parallel computation structure organization requires different compatible memory and communication architectures.

Moreover, the traditional algorithm and software development approaches require an existing and stable computation platform (HW platform, compilers, etc.), while for modern embedded systems such as MPSoCs based on adaptable ASIPs the hardware and software architectures have to be application-specific and must be developed largely in parallel. Unfortunately, the efficiency of the required parallel HW and SW development is much too low with the currently available development technology, due to the lack of effective automated methods of industrial strength for many MPSoC design problems, and the weak interoperability of the HW/SW architecture design and hardware synthesis tools. The inefficiencies identified above result in a substantially lower than attainable quality of the resulting systems, much longer than necessary development times, and much higher development costs.


In consequence, optimization of the performance/resources trade-off required by a particular application can only be achieved through the careful construction of an adequate application-specific macro-/micro-architecture combination, where at both architecture levels the optimized parallel software structures are implemented on parallel hardware structures optimized for them. The aim here is thus to find an adequate balance between the number and kinds of parallel processors, the complexity of the inter-processor communication, and the intra-processor parallelism and complexity. To achieve this aim, several promising macro-architecture/micro-architecture combinations have to be automatically constructed and evaluated in an iterative process, based on the application analysis and restructuring, and finally the best of them has to be selected for actual realization.

This paper considers the highly relevant problem of an automated design technology for 1) constructing high-quality heterogeneous highly parallel computing platforms for the above briefly discussed highly demanding cyber-physical applications, and 2) efficiently mapping the complex applications onto the platforms. Such a new design technology for ASIP-based MPSoCs, satisfying the requirements sketched above, has been developed in the scope of the European research project ASAM (Architecture Synthesis and Application Mapping for heterogeneous MPSoCs based on adaptable ASIPs) of the ARTEMIS program. Its overview can be found in [1]. This paper is devoted to the part of the ASAM design technology related to the design methods and tools for a single ASIP-based HW/SW sub-system, developed by the ASAM research team from the Eindhoven University of Technology led by the second author of the paper. Since the design flow developed by the ASAM project substantially differs from earlier published flows, after discussing the related work the paper briefly explains the ASAM design flow. Subsequently, it focuses on the part of the design flow related to the automatic VLIW ASIP-based HW/SW architecture exploration and synthesis (which has not been published in as much detail previously), and explains the design methods and automation tools of this part. The paper proposes a novel formulation of the ASIP-based HW/SW sub-system design problem as the actual hardware/software co-design problem, i.e. the simultaneous construction of an optimized parallel software structure and a corresponding parallel ASIP architecture, as well as the mapping in space and time of the constructed parallel software onto the parallel ASIP hardware. It proposes a novel solution to the so-formulated problem, composed of a new design flow, its methods, and corresponding design automation tools. Specifically, it extensively elaborates on the instruction-set architecture synthesis of the adaptable VLIW ASIPs and discusses the related results of experimental research. The experimental results clearly confirm the high effectiveness and efficiency of the developed design methods and design automation tools.

2. Related work

The development of contemporary digital systems heavily relies on electronic design automation (EDA) tools. Placing, sizing, and connecting the 1 billion transistors of a contemporary MPSoC simply is not possible without a huge amount of fully automated design assistance. Historically, EDA tools focused solely on the placement and routing of transistors. Over time, however, this limited approach became infeasible as circuit complexity increased. As a result, EDA tools adopted libraries of higher-level standard components. Initially these components were simple logic gates (and, or, etc.), but later the usage of only these small blocks also proved insufficient and larger so-called Intellectual Property (IP) blocks were added to the libraries. These IP blocks can be as simple as a memory controller, but may also contain complete processors including local cache memories. Nowadays the design and support of such IP libraries containing complex processor IP has become an important part of the digital electronics design industry and the sole reason for the existence of companies such as ARM or Imagination Technologies.

Managing a system-level design involving several such complex IP blocks is a very complex task which requires highly specialized tools.


Currently, three major EDA tool vendors deliver such tools (Synopsys1, Cadence2, and Mentor Graphics3), and virtually everyone designing or using IP blocks will be using the EDA tools of one or more of these companies. Mentor Graphics, the smallest of the three, focuses mostly on the realization of designs provided by human experts and does not (by itself) provide much support for choosing between alternative high-level designs. The tool that comes closest to providing an automated application-to-design path is Calypto Design Systems' Catapult-C, which started off as a product from Mentor Graphics. Catapult-C, however, is mostly aimed at the high-level synthesis of hardware accelerators only and has no special advantage when used to design application-specific processor architectures. However, it can be useful when creating complex custom operations for integration into a new processor design. In contrast, Cadence and Synopsys do provide tools which allow for a more automatic design of both hardware accelerators and application-specific processor architectures.

This paper presents an overview of several of the currently available methods for the (automated) design of customized application-specific processor architectures, which are quite closely related to the presented research. However, the original design choices of these previous approaches often have a lasting impact on the abilities and strengths of each of these tool flows, as will be discussed below.

This section first presents the commercially available tools from both Cadence and Synopsys, and then continues with a presentation of the recent research on the topic. We conclude the related work with a discussion of the SiliconHive/Intel ASIP-based MPSoC design framework that was used within the ASAM project.

2.1. Commercial EDA tools

Both Cadence and Synopsys provide a large portfolio of EDA tools. These various tools are aimed at different phases of the design process, and can often be used in combination with each other in a semi-integrated fashion to offer a complete design flow from a high-level design problem specification to a detailed circuit design. In the last decade, through a series of external acquisitions, both vendors have been moving to include more high-level design tools in their tool frameworks. This section briefly discusses the tools of both vendors which are relevant to the automated instruction-set architecture synthesis of VLIW processors, the topic of this work.

2.1.1. Cadence

Similarly to Mentor Graphics, Cadence traditionally focused on providing tools that take a complete design and implement it in the latest technology. As such, Cadence mostly provides EDA tools that take a prepared high-level system design and iteratively translate it into a more detailed lower-level design until the actual circuit is realized. More recently, however, Cadence has strengthened its position in the automated high-level design market, first by acquiring Tensilica in 2013, and thereafter by the acquisition of Forte in 2014.

Forte's Cynthesizer tool together with the Cadence C-to-Silicon design flow provided Cadence with a high-level synthesis design flow similar to that of Mentor Graphics. However, as with Mentor Graphics, Forte's tools and Cadence's C-to-Silicon design flows mostly focus on the high-level synthesis of hardware accelerators and less on the automatic synthesis of application-specific processor architectures. The acquisition of Tensilica changed this.

Tensilica was a company that specialized in customizable programmable IP solutions. Its tools include a language that allows a designer to describe a new processor architecture at the instruction-set architecture level and to re-use the processor IP. Based on this architecture description, the Tensilica tools automatically generate the processor architecture hardware design together with the required support software to program the newly designed processor architecture. These tools now live on as part of the Cadence Xtensa tool suite.

1 http://www.synopsys.com
2 http://www.cadence.com
3 http://www.mentor.com


The Cadence Xtensa design-flow4 automates the construction of new processor architectures and their corresponding support software. The designer is presented with a configurable base processor architecture which can be extended with extra operations. These operations are specified manually by the expert designer and are included directly in the processor datapath as designer-defined instructions. Both the hardware description (RTL) of an instantiated and/or extended processor architecture and the supporting system modeling and software tools (simulator/compiler) are then generated for the revised architecture in minutes. This provides an expert user with a methodology to quickly evaluate the effect of different processor architecture variations on the performance of the target application. This design flow helps a lot when designing an application-specific processor architecture, but still relies on design exploration and decisions by an experienced human designer. Identifying customization possibilities and other extensions, such as the addition of custom operation patterns, requires either the usage of external tools or the presence of an expert user.

2.1.2. Synopsys

Synopsys, currently the largest of the three main EDA companies, has been involved in electronic system-level design a bit longer than the other two but, like Cadence, has also recently been expanding its interest in processor architecture synthesis tools. These tools include Synopsys Processor Designer5, which features the design flow that was acquired from CoWare in 2010, and the Synopsys ASIP Designer tools6 (formerly IP Designer), which were acquired from Target in 2014.

The Processor Designer tool flow allows a user to describe a processor architecture in the LISA architecture description language and automatically creates both the hardware description and the software support tools. The LISA language provides high flexibility to describe the instruction sets of various processors, such as SIMD, MIMD and VLIW-type architectures [2, 3]. Moreover, processors with complex pipelines can be easily modeled. The original purpose of LISA was to automatically generate instruction-set simulators and assemblers [3]. It was later extended to also include hardware synthesis [2].

Synopsys ASIP Designer provides a different architecture description language (nML [4, 5]) which is also aimed at the description and synthesis of application-specific processor architectures and their support software. The nML language is very similar in intent and purpose to the LISA language, but several subtle differences exist. For example, nML aims more directly at describing the detailed instruction-set architecture of the processor, including the semantics and encoding of instructions, which makes it slightly more suited for generating a retargetable C compiler together with the simulator and processor hardware [4]. This stronger focus on generating a full compiler does, however, restrict the possible hardware optimizations compared to the LISA-based flow. For example, sharing hardware resources within a function unit is more limited for nML-based architectures than it is for those described using LISA [2].

Apart from these small differences, Processor Designer and ASIP Designer remain very similar. In both cases an expert user has to provide an architecture description from which the tools are then able to generate a compiler and simulator. Using these generated tools, the user can then compile and simulate the target application on the proposed architecture. Cycle count and resource usage statistics are then gathered by the user, upon which further alterations to the processor architecture can be proposed. The selection and implementation of these extensions is done manually by the user. Hardware (RTL) generation is usually only performed after the user is satisfied with the performance of a simulated version of the processor, because of the time-consuming nature of running the actual hardware synthesis and the extremely slow RTL simulation.

Synopsys also offers a high-level synthesis tool called Synphony C Compiler.

4 http://ip.cadence.com/ipportfolio/tensilica-ip/xtensa-customizable
5 http://www.synopsys.com/systems/blockdesign/processordev
6 http://www.synopsys.com/dw/ipdir.php?ds=asip-designer


Again, this high-level synthesis tool is aimed more at non-programmable hardware accelerators and less at application-specific processor architecture design. However, this has not always been the case. The Synphony C Compiler was the product of Synfora, which originated in 2003 from the PICO project as a start-up company. PICO, which stands for Program-In, Chip-Out, did much more than the synthesis of hardware accelerators and will be discussed in more detail in Section 2.2.3.

2.2. Tools from research projects

In parallel to the commercial offerings discussed above, several research projects have recently been carried out in relation to the high-level design of application-specific processor architectures. Most of these research projects focused on providing or improving architectural description languages for the construction of application-specific processor architectures (see Section 2.2.1). Two projects differentiate themselves from the others in that they provide support for automated architecture exploration. These projects, the TCE framework and the PICO project, will be discussed separately and in more detail in Sections 2.2.2 and 2.2.3, respectively.

2.2.1. Architecture description languages

Several domain-specific languages have been developed for the description of both the functionality and the structure of application-specific processor architectures. Using such an architecture description language (ADL) and its related EDA tools allows a designer to quickly instantiate variations of a processor architecture and consider the effects of design choices on the cost and performance of the resulting processor. Various design analysis tools, including simulation, are commonly provided with the ADL tools for this purpose. Examples of such languages are the nML and LISA languages as used by Synopsys ASIP Designer and Synopsys Processor Designer, respectively. More diversity exists in research: ArchC [6] is still being developed at the University of Campinas in Brazil7, Codal [7, 8] is being researched by both Codasip8 and the Technical University of Brno in the Czech Republic, and a new version (3.0) of LISA [2, 9, 10] is in development at RWTH Aachen University9.

The current research on these languages and their respective frameworks focuses mostly on the translation of a high-level processor architecture description into a corresponding structural description in a hardware description language, such as VHDL or Verilog, as well as on the generation of programming tools such as a C/C++ compiler or assembler, debugging and instruction-set simulation tools, and application analysis and profiling tools. In some cases (e.g. LISA 3.0 and ArchC), support for system-level or multi-processor integration is also being added. LISA 3.0 also improves on its previous incarnation by adding support for reconfigurable computing [9, 10]. A reconfigurable processor introduces a reconfigurable logic component in or near the datapath. Such a component can implement either fine-grained reconfigurability, similar to a small field-programmable gate array (FPGA), or coarse-grained reconfigurability, more like a coarse-grained reconfigurable array (CGRA) [11]. This addition allows for further customization of the instruction set even after the finalization of the processor silicon, for example during processor initialization or possibly even at runtime.

In general, the process of using these ADL-based tools is very similar to that of the Synopsys tools described above. Support is provided for constructing development tools such as a compiler and simulator, as well as an RTL description of the hardware of a processor specified using the ADL. Application analysis tools focusing on highlighting hot spots and candidate instruction-set extensions can also be provided to the designer. However, as with the Cadence and Synopsys tools, the final decision making on which processor variation to consider for the next design iteration is left to the expert designer.

Many more ADL frameworks exist; this section named only a few which were relevant to the research presented in this paper. For more information on this topic see the book "Processor Description Languages" by Mishra and Dutt [12].

7 http://www.archc.org
8 http://www.codasip.com
9 http://www.ice.rwth-aachen.de/research/tools-projects/lisa/lisa


Figure 1: Transport Triggered Architecture processor template (source: http://tce.cs.tut.fi/screenshots/designing_the_architecture)

2.2.2. TCE: TTA-based Co-design Environment

The TTA-based Co-design Environment10 is a set of tools aimed at designing processor architectures according to the Transport Triggered Architecture (TTA) template. TTA processors are, like VLIW processors, a sub-set of the explicitly programmed processor architectures and can be seen as exposed-datapath VLIW processors. Unlike VLIW processors, TTA processors do not directly encode which operations are to be executed, but are programmed by specifying data movements. As a result, all register file bypassing from function units and register files is fully exposed to the compiler. The TTA programming model has the benefit of enabling software bypassing, a technique where short-lived intermediate results of computations are directly forwarded to the function unit consuming the data, while completely bypassing the register file. This reduces both the register file size and port requirements, as well as the energy consumption, for the TTA architecture compared to more traditional VLIW architectures. A similar reduction of the register file energy consumption can be obtained using hardware bypassing [13, 14], but that technique generally has a larger hardware overhead as it requires run-time detection of bypassing opportunities. Figure 1 illustrates the TTA processor architecture template. It shows how the function units, register file, and control unit are connected through sockets to the transfer buses. Programming is achieved by controlling the connections of the sockets with the buses. From this figure it is also clear that register file bypassing can be implemented in software, simply by forwarding a result from one function unit directly to the input of another.

Research on Transport Triggered Architectures started with the MOVE project [15, 16] at Delft University of Technology during the 1990s. Later, when the Delft MOVE project was discontinued, Tampere University of Technology continued the research and created the next generation of the MOVE framework, which they named the TTA-based Co-design Environment. Hoogerbrugge and Corporaal [17, 15, 16] investigated the automatic synthesis of TTA processor architectures as part of the MOVE project. Derivatives of this work are still available as the MOVE-PRO architecture [18] and within the TCE. Chapter 6 of the TCE manual [19], titled "Co-design tools", is dedicated to the tools available for supporting automatic processor architecture exploration.

10 http://tce.cs.tut.fi/


Figure 2: PICO design-flow organization [20]

These tools are somewhat similar to the tools and techniques presented in this paper. However, this paper presents several techniques and tools that target another processor architecture style (VLIW).

Most of the architecture exploration techniques proposed in this paper, after small modifications, could also apply to the TCE and could help to further reduce TTA-based processor architecture exploration times.

2.2.3. PICO: Program-In Chip-Out

As mentioned above, the PICO project, grandparent of some parts of the current Synopsys design flow, offered more than only the automatic synthesis of hardware accelerators which was incorporated into the Synphony C Compiler. In its original form, PICO covered the automatic synthesis of a processor system containing a set of non-programmable hardware accelerators combined with a single VLIW processor [20, 21].

In essence, the goals of PICO were very similar to those of the ASAM project. Both projects aimed to automatically develop an application-specific multi-processor system. However, there are also several key differences. PICO approaches the problem by synthesizing a system with a single VLIW processor and a set of hardware accelerators, whereas the ASAM project utilizes one or more heavily specialized highly parallel VLIW processors and no hardware accelerators. Several other substantial differences between the two approaches stem mostly from the differences in their VLIW processor templates. For example, the PICO VLIW processor template uses a single register file for each data type (integer, floating point, etc.), which severely limits the number of operations that can be executed in parallel: many read ports need to be available to provide the operands for each operation executed in parallel. Large, many-ported register files quickly become very expensive in terms of both area and energy, and limit the maximum operating frequency of the PICO VLIW processor [22, 23].


The PICO design flow, illustrated in Figure 2, provides a fully automated design flow for developing the non-programmable accelerator (NPA) subsystems, the VLIW control processor, and the cache memory hierarchy. To limit the size of the design space, each of these three components (NPAs, VLIW architecture, and cache hierarchy) is explored independently of the others; considering all three components together resulted in a design space too large for any automated exploration to be effective [20, 21]. This differs from our approach. During the exploration, a Pareto-optimal set of solutions is obtained for each of the three system components. First, compute-intensive kernels are identified (2) (see step 2 in Figure 2) in the input C code (1) and NPA architectures are explored for each of these kernels (3), (4). The compute-intensive parts are then replaced with calls to the hardware accelerators in the original C code (5), and a set of alternative VLIW processor architectures is then designed (6), (7), (8). Finally, the cache hierarchy is tuned for the memory requirements of the application (9), and compatible VLIW, cache, and NPA designs are combined to form a set of Pareto-optimal designs (10). The focus of the system architecture exploration by PICO is on the trade-off between area and timing. Area is measured in either physical chip area or gate count, whereas an estimate of the application's runtime on the processor is used for the timing. Similar to our approach, the timing estimate is computed as the total sum of each basic block's schedule length multiplied by its profiled execution count.
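As an illustration of this estimate, the following sketch (a minimal example with hypothetical data structures, not PICO's actual implementation) accumulates each basic block's schedule length weighted by its profiled execution count:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical per-basic-block data gathered by scheduling and profiling. */
    struct basic_block {
        unsigned schedule_length;  /* cycles of the block's VLIW schedule      */
        uint64_t exec_count;       /* times the block ran in the profiling run */
    };

    /* Estimated application runtime in cycles:
     * sum over all blocks of (schedule length x execution count).            */
    static uint64_t estimate_cycles(const struct basic_block *blocks, size_t n)
    {
        uint64_t total = 0;
        for (size_t i = 0; i < n; ++i)
            total += (uint64_t)blocks[i].schedule_length * blocks[i].exec_count;
        return total;
    }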

2.3. The SiliconHive tools

The initial work on the SiliconHive VLIW ASIP development technology started at Philips. The results from that research were carried over into SiliconHive when it was spun out of Philips Research as part of the Philips Technology Incubator program in 2003. Intel acquired SiliconHive in 2011 after a period of growth and further development. This section uses a selection of the information made available through the ASAM project and publications from both the earlier Philips and SiliconHive periods to introduce the SiliconHive VLIW ASIP architecture template and the pre-existing SiliconHive development framework which were exploited by the ASAM project.

2.3.1. Overview

Similarly to the other tool flows discussed in this section, the SiliconHive tool flow, illustrated in Figure 3, offers an architecture description language (called TIM) and a set of tools to generate hardware RTL descriptions, an instruction-set simulator, a retargetable C compiler, and various other debugging and software development tools. In parallel, the SiliconHive tools also offer a second language (called HSD) that allows a user to construct a multi-processor system consisting of one or more VLIW processors (specified using the TIM language) and other hardware components (such as hardware accelerators, memory units, and peripherals) taken from a library or imported from external sources.

Using an original application description (usually expressed in sequential C), the user starts by composing or selecting an initial MPSoC platform design and then decomposing the application code into a parallel version tuned for the initial platform. After mapping the application onto the platform, the user can compile and simulate the HW/SW system designed in this way to find the remaining critical points of the design. Using this information, the user can then manually refine the parallelization, the mapping, and/or the platform composition in an iterative design process. Several pre-selected processor and platform designs are available for an easy start, but the parallelization, mapping, and design-space exploration steps have to be performed manually.

2.3.2. Architecture template

The SiliconHive ASIP design technology offers a highly flexible template of a customizable VLIW processor. Figure 4 illustrates this processor architecture template. A processor (cell) is organized in two parts: one part (coreio) contains the local memories and handles the interface with the external world, while the other part (core) performs the actual operations as described by the program in the local program memory.


Figure 3: SiliconHive flow showing the steps required for creating a custom MPSoC platform and its corresponding application implementation.

Interfacing to the external world and memories. SiliconHive processors usually contain one or more local scratchpad memories which are used for low-latency storage of (intermediate) data used by the algorithm running on the processor. Several memories with different organizations (e.g. different sizes and/or data widths) and different implementations (e.g. register-based or SRAM) can be included in a single processor, as required by the target application. Local memories are also connected through a slave interface to the global MPSoC interconnect hierarchy, so that the other elements of the MPSoC containing the SiliconHive processor can access these memories when providing input to and/or consuming output from the SiliconHive processor. One or more master interfaces may also be present in the coreio. These master interfaces can connect either to an external memory, a direct memory access (DMA) controller, or the local memory of another SiliconHive processor. Stream interfaces (FIFOs) can also be added to the processor. Such FIFO interfaces are mainly suitable for small data transactions and are usually utilized for providing hand-shake signals while larger transactions are performed through a master interface transfer.

In parallel to these data storage and transfer components, the coreio of a SiliconHive processor also contains the program memory, as well as a set of status and control registers which can be used for reconfiguring the processor. Examples of such reconfiguration are actions like reprogramming the program memory, entering/leaving the low-power sleep mode, and starting/stopping a kernel.

The core itself contains a very small sequencer block which interfaces the datapath with the status and control registers and fetches the appropriate instructions from the program memory. Instruction decoding is kept relatively cheap in the SiliconHive processor architecture by using a horizontally programmed processor style. Instruction bits usually correspond directly to a subset of the configuration bits of the processor registers and input select multiplexers.

Core structure. The actual execution of the program is performed inside the datapath. Here, operations are executed within issue-slots.


Figure 4: SiliconHive processor architecture template [1]

Each issue-slot is composed of a set of function-units which implement the actual operations. This division of issue-slots into function-units can be somewhat confusing, as most of the related work [24, 22, 16, 15, 23] uses the term function-unit (or functional-unit) to designate what is called an issue-slot in SiliconHive terminology. However, this paper will use the SiliconHive terminology since it builds upon their processor architecture framework.

Within the datapath, issue-slots are connected to multiple register files using an (optionally shared) interconnect [24, 25, 26, 27]. This provides an efficient implementation for very wide VLIW processor architectures without incurring the overhead of a large, centralized register file [22, 23].

The explicitly programmed nature of the SiliconHive processors makes it relatively easy to compute the final instruction width for newly constructed processor architectures. This is especially useful when modeling the impact of architecture changes.
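For illustration, a first-order model of such an instruction-width computation is sketched below; the field structure (one opcode field plus one source-select field per input port, per issue-slot) is an assumption made for this sketch and does not reflect the actual SiliconHive encoding rules:

    #include <math.h>

    /* Hypothetical description of one issue-slot of a horizontally encoded VLIW. */
    struct issue_slot {
        int num_opcodes;       /* operations selectable in this slot              */
        int num_inputs;        /* operand input ports of the slot                 */
        int sources_per_input; /* selectable sources (register/bypass) per port   */
    };

    static int bits(int choices)           /* bits needed to select one of n */
    {
        return choices > 1 ? (int)ceil(log2((double)choices)) : 0;
    }

    /* Rough instruction width: per slot, an opcode field plus one source-select
     * field per input port. Real encodings add result routing, immediates, etc. */
    static int estimate_instruction_width(const struct issue_slot *slots, int n)
    {
        int width = 0;
        for (int i = 0; i < n; ++i)
            width += bits(slots[i].num_opcodes)
                   + slots[i].num_inputs * bits(slots[i].sources_per_input);
        return width;
    }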

2.4. Compiler support for VLIW processors

One of the major difficulties in the development and usage of VLIW processors is the ability of the programmer to obtain a sufficiently high resource utilization of the processor.


The high amount of programming freedom provided by explicitly programmed processor architectures, which enables the high performance benefits of these architectures, also increases the complexity of the scheduling process. Combining this with the presence of large, complex, custom operations and VLIW scheduling techniques such as software pipelining [28] easily results in a highly complex compiler.

This high compiler complexity, up to the point where the creation of a compiler becomes infeasible, is often one of the motivations for moving parts of the decision-making process to the programmer, the user of the compiler. Annotations added to the input C code allow for a simplification of the compiler's decision-making process, which often leads to a better final result when an experienced programmer is using the compiler. Even with a good compiler, annotations may bring a substantial profit as they allow for localized overrides of the compiler heuristics in cases where sub-optimal results are being produced.

Source code annotation. Source code annotation is a popular technique for enabling some of the more esoteric processor features without invasive changes to the compiler itself. Classical examples are intrinsics and annotations that capture the mapping of data into one of the (possibly many) different memories of a processor. However, even when the compiler itself is capable of optimizing for such complex processor features, hints for enabling and disabling these specific optimizations can also be provided as source code annotations. Complex optimizations performed by the compiler can be very time consuming, while most of their benefit can only be observed for a (small) portion of the application code. Allowing these optimizations to be enabled for limited parts of the application code can significantly speed up the compilation process.

Complex operations. Complex operations, such as an FFT butterfly operation or those working on vector elements, are difficult to represent or efficiently detect in the C language. For this reason, most compilers provide direct access to such operations through intrinsics. A compiler intrinsic looks like a C function call, but it is translated within the compiler directly into the (complex) operation it represents. This allows the programmer to force the compiler to select the intended operation and allows for a strong simplification of the custom operation selection heuristics in the compiler itself.
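The following sketch illustrates the idea; the intrinsic names and signatures are hypothetical stand-ins for whatever complex operations a target processor provides:

    /* Illustrative only: the intrinsic names and signatures are hypothetical. */
    typedef struct { short re, im; } cplx;

    /* These look like ordinary C functions, but the compiler expands each call
     * directly into the processor's custom butterfly operation instead of
     * emitting a function call.                                               */
    cplx bfly_hi(cplx a, cplx b, cplx w);   /* upper butterfly output */
    cplx bfly_lo(cplx a, cplx b, cplx w);   /* lower butterfly output */

    void butterfly_stage(cplx *x, const cplx *w, int half)
    {
        for (int i = 0; i < half; ++i) {
            cplx hi = bfly_hi(x[i], x[i + half], w[i]);
            cplx lo = bfly_lo(x[i], x[i + half], w[i]);
            x[i] = hi;
            x[i + half] = lo;
        }
    }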

2.4.1. Memory mapping

The mapping of (global) data arrays onto one of the local memories of a processor is often controlled by added annotations. For example, OpenCL [29] recognizes private as a keyword which denotes that the data marked in this way should be mapped into the private memory of the processor running the kernel.
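A minimal OpenCL C fragment illustrating such mapping annotations (the kernel itself is hypothetical):

    /* OpenCL C: address-space qualifiers capture the intended memory mapping. */
    __kernel void scale(__global const float *in,   /* external/shared memory  */
                        __global float *out)
    {
        __private float acc;                        /* per-work-item private   */
        int i = get_global_id(0);
        acc = in[i] * 2.0f;
        out[i] = acc;
    }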

2.4.2. Optimization hints

Optimization hints can often be provided by using standardized C keywords such as inline, restrict, and register, by in-line annotations such as __builtin_expect() in GCC, or by compiler directives using #pragma statements.
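For example, a short GCC-style fragment combining these kinds of hints (illustrative only; the exact pragmas available differ per compiler):

    /* Standard C keywords and GCC-style hints used as optimization annotations. */
    static inline int clamp_u8(register int v)
    {
        return v < 0 ? 0 : (v > 255 ? 255 : v);
    }

    void saturate(int * restrict dst, const int * restrict src, int n)
    {
        #pragma GCC unroll 4                        /* directive-style hint      */
        for (int i = 0; i < n; ++i) {
            if (__builtin_expect(src[i] < 0, 0))    /* branch expected not taken */
                dst[i] = 0;
            else
                dst[i] = clamp_u8(src[i]);
        }
    }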

2.4.3. Code transformations

Code transformations form a basis for enabling high instruction-level parallelism (ILP) and data-level parallelism (DLP), but also for controlling the buffer sizes required for intermediate data. Code transformations, such as speculation and unrolling, and scheduling techniques, such as software pipelining, mostly aim at increasing the explicitly available ILP within important sections of the program. However, such ILP-enhancing techniques usually also come at the cost of an increase in program size. The SiliconHive compiler supports these ILP-enabling optimizations as automatic optimizations, but also provides the user with direct control by enabling/disabling these optimizations for parts of the code using annotations.
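As a small illustration of this ILP/code-size trade-off, the following hand-written sketch shows a simple accumulation loop and a 4x-unrolled variant with separate partial sums; in the ASAM flow such transformations are applied by the tools or enabled through annotations rather than written by hand:

    /* Original loop: one multiply-accumulate per iteration, little explicit ILP. */
    int dot(const short *a, const short *b, int n)
    {
        int acc = 0;
        for (int i = 0; i < n; ++i)
            acc += a[i] * b[i];
        return acc;
    }

    /* Unrolled by 4 with separate partial sums: exposes four independent
     * multiply-accumulates per iteration at the cost of a larger program.      */
    int dot_unrolled(const short *a, const short *b, int n)
    {
        int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i = 0;
        for (; i + 3 < n; i += 4) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        for (; i < n; ++i)                         /* remainder iterations       */
            s0 += a[i] * b[i];
        return s0 + s1 + s2 + s3;
    }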

Loop transformations, a subset of the code transformations that focus on the structure of loop nests, generally aim at controlling the distribution and size of the data elements communicated between consecutive loop nests, but also affect the sizes of data communicated between the processor and external inputs and outputs (such as external memories or other processing tiles).


The loop transformations used within the ASAM project are mainly loop fusion, loop tiling, and loop vectorization. These loop transformations are currently not provided by the SiliconHive compiler and are performed within the ASAM tools as source-level code transformations.

For the purpose of loop transformations, formal mathematical models of loop nests, such as the polyhedral model [30], are used. Such formal models allow for a direct analysis of the effects of loop transformations on the memory requirements of the transformed code. Although the SiliconHive compiler does not yet support such loop transformations, several tools are already available [31, 32, 33, 30, 34] which allow for translations to and from the polyhedral domain, as well as for the automatic exploration of loop transformations. Section 3 further illustrates how these methods are used within the instruction-set architecture synthesis and the ASAM project in general.
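To make the loop-nest transformations more concrete, the sketch below shows a simple loop tiling of an affine loop nest; affine nests of this kind are exactly what polyhedral tools can analyze and transform automatically. The kernel and the tile size are chosen arbitrarily for illustration:

    #define TS 32   /* tile size; chosen so that a tile fits in a local memory */

    /* Original affine loop nest over an N x N image. */
    void scale_image(float *dst, const float *src, int N)
    {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                dst[i * N + j] = 0.5f * src[i * N + j];
    }

    /* Tiled version: the same iteration space, walked tile by tile, which bounds
     * the amount of data live between the inner loops (and hence buffer sizes). */
    void scale_image_tiled(float *dst, const float *src, int N)
    {
        for (int ii = 0; ii < N; ii += TS)
            for (int jj = 0; jj < N; jj += TS)
                for (int i = ii; i < ii + TS && i < N; ++i)
                    for (int j = jj; j < jj + TS && j < N; ++j)
                        dst[i * N + j] = 0.5f * src[i * N + j];
    }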

2.4.4. Extensions for architecture exploration

Using a compiler in the context of an architecture construction or exploration framework increases the demands on the compiler. A lot of time can be saved if the same compiler can be used for several variations of a specialized processor without the need for rebuilding (parts of) the compiler. In the SiliconHive compiler, this is achieved through the addition of several compiler controls. These compiler controls allow the programmer to override the set of available processor resources that the compiler is allowed to use. Such compiler flags make it very easy to investigate the effects of removing a specific function unit, or even a complete issue-slot, from a proposed processor architecture.

However, code annotations pertaining to resource allocation, such as an explicit function unit binding or the use of a complex operation through an intrinsic, are considered definitive by the SiliconHive compiler. Thus, removing an explicitly used resource from a processor will result in a conflict with source code annotations present in the target application. For example, removing a function unit may result in the removal of an intrinsic from the compiler which was used in the target application. In some cases, replacement or emulation code can be provided by the programmer. Such emulation code needs to be provided in source code form itself, as pre-compiled emulation libraries will not have the opportunity to take removed resources into account.
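A hedged sketch of such source-level emulation is given below; the intrinsic name and the guard macro are hypothetical:

    /* Hypothetical: HAS_SAD_UNIT would be defined when the candidate processor
     * still contains the function unit implementing the sad8() intrinsic.       */
    #ifdef HAS_SAD_UNIT
    unsigned sad8(const unsigned char *a, const unsigned char *b);  /* custom op */
    #else
    /* Plain-C replacement used when the exploration removes that function unit. */
    static unsigned sad8(const unsigned char *a, const unsigned char *b)
    {
        unsigned acc = 0;
        for (int i = 0; i < 8; ++i)
            acc += (a[i] > b[i]) ? (unsigned)(a[i] - b[i])
                                 : (unsigned)(b[i] - a[i]);
        return acc;
    }
    #endif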

A similar difficulty appears when a load-store unit is removed, making one of the memories of the candidate processor inaccessible. Such a removal can invalidate the current memory mapping of the target application in a way that cannot easily be covered by the use of replacement code. It is quite likely that the entire memory mapping needs to be reconsidered in such a case. As a result, either the exploration cannot have the freedom of removing load-store units and their related issue-slot and memory interfacing hardware, in which case the exploration needs to be provided with a set of different initial architectures representing different memory mappings; or the compiler needs to provide an automated method for finding an appropriate distribution of data across memories, so that the distribution of data across the different local memories will not require annotation.

3. ASAM design flow

As previously mentioned, the ASAM project focuses on the automatic synthesis of heterogeneous VLIW ASIP-based multi-processor systems. The ASAM design flow is built upon the VLIW MPSoC design framework of Intel Benelux (formerly SiliconHive) that was presented above. The aims of the ASAM project included the automation of several of the (previously) manual design steps. In particular, the exploration, analysis, and decision making regarding the multi-ASIP platform design, application parallelization, ASIP customization, and application mapping steps were considered for automation.

One of the key problems addressed by ASAM is that it is impossible to perform an efficient application parallelization and mapping without information on the performance of the application parts on specific processing elements of the platform, but it is equally impossible to construct a reasonable multi-processor platform and each of its application-specific processing elements without knowledge of the parallelization and mapping, due to the cyclic dependency between the two.


The ASAM project tries to break the cyclic dependency of this ‘chicken and egg’ problem by tight coherent coupling of various design stages and phases, as well as through statically computing early performance estimates of single tasks, and combining them to propose promising application mappings using a probabilistic application partitioning and parallelization phase.

The ASAM approach divides the MPSoC design space exploration into two main stages: the macro-architecture exploration and the micro-architecture exploration. These two stages are tightly coupled to form a coherent MPSoC architecture exploration framework. The macro-level architecture exploration focusses on the construction of the system out of a set of (initially unknown) VLIW ASIP processors, whereas the micro-level architecture exploration focusses on the analysis of a set of tasks assigned to a single ASIP and the construction of new VLIW ASIPs specialized for (groups of) specific tasks. Figure 5 illustrates the tight bi-directional cooperation between the macro- and micro-level architecture exploration. Further information on the ASAM flow can be found in [1]. The work presented in this paper lies within the micro-level architecture exploration and focusses mostly on the application analysis and mapping and the corresponding ASIP instruction-set architecture synthesis (highlighted in Figure 5). However, to adequately place this work in its context, some more information is needed about the macro-micro interaction, as well as about the application intra-task parallelization within the micro-level stage.

3.1. Macro- and micro-architecture exploration

The ASAM flow starts (1) with its input composed of the C-code of the target application, as well as user supplied constraints and design objectives. From these, it extracts the overall application structure and a set of compute intensive kernels using the Compaan Compiler [35], which translates these kernels into tasks in a Polyhedral process network (2). The usage of the Compaan Compiler poses several constraints on the provided C-code to enable automatic analysis (e.g. affine loop structures and access patterns and statically allocated buffers). However, not all applications translate equally well to this structure. We therefore allow for more complex applications by providing an option to start the architecture exploration with a user provided graph model of the application. In this case, the only constraint is that the application can be decomposed into several communicating data-flow tasks. The difficulty here is that such a manually provided model only has limited optimization potential in the ASIP memory hierarchy exploration (discussed below), which also works within the Polyhedral model.
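The following fragment, written purely for illustration and not taken from the ASAM benchmarks, shows the difference between code that satisfies these constraints and code that does not:

int in[1024], out[1024], idx[1024];     /* statically allocated buffers */

void filter(void)
{
    int i;

    /* affine loop bounds and affine access patterns: analyzable */
    for (i = 1; i < 1023; i++)
        out[i] = (in[i - 1] + in[i + 1]) >> 1;

    /* data-dependent (indirect) access: not affine, so a loop like this
       could not be translated automatically */
    for (i = 0; i < 1024; i++)
        out[idx[i]] = in[i];
}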

The user supplied constraints and design objectives consist of both structural and parametric requirements. These requirements guide the automated tools and are used to control exploration aspects, such as the granularity of the computational tasks and the available processor architecture components (both structural requirements), but also to allow user defined limits on the energy consumption, throughput, and maximal area occupation (parametric requirements). Through these constraints and objectives, the user is able to control both the size and complexity of the overall exploration problem and can influence the trade-offs that are considered during the design process.

3.2. Finding good task combinations to be executed in single ASIPs

Each of the tasks of the Polyhedral process network is then analyzed by the micro-level application analysis (3). The best-case and worst-case execution times of tasks for a future VLIW processor of unknown issue-width, designed according to the SiliconHive template, are then determined using the parallelism estimation methods presented in [36, 37]. These best-case and worst-case execution time estimates, computed by the micro-level for the set of tasks assigned by the macro-level to be executed on a single ASIP, are then taken by the macro-level and used during the probabilistic system exploration [38, 1, 39]. Using an evolutionary algorithm combined with Monte Carlo simulation and models of the inter-processor communication, the probabilistic system exploration finds promising task clusterings. For each of the task clusterings, the micro-level is then consulted to produce an initial parallel computation structure of the task cluster and a corresponding coarse customized VLIW processor architecture (4).


Figure 5: ASAM flow overview with the micro-level architecture exploration covered in this paper highlighted in red. (Numbered steps referenced in the text: 1 system input, 2 task parallelization, 3 WCET estimation, 4 ASIP design, 5 ASIP design + simulation, 6 ASIP instantiation, 7 C transformation, 8 Intel tools, 9 instruction-set, 10 BuildMaster, 11 instruction-set + simulation, 12 communication optimization, 13 power control, 14 UNICA simulator, 15 system prototyping, 16 system output.)


3.2.1. Deciding the ASIP memory hierarchy and initial VLIW ASIP architecture

In the micro-level application-parallelization and coarse architecture synthesis stage, multiple tasks within a single cluster may be transformed so that their data-locality and re-use are optimized. The micro-level application parallelization tool [1, 40] transforms the Polyhedral representation of the clustered loop kernels to reorder and fuse the kernel executions in such a way as to find possible promising trade-offs between the data-throughput and the processor area and power consumption. At this stage, the power consumption is assumed to be proportional to the area, which is a well-founded assumption [41] that sufficiently serves this early exploration stage. Currently, loop fusion, loop tiling, and kernel vectorization are considered during this exploration, but the repertoire of transformations can be extended.

Loop fusion combines two kernels into a single new kernel, which localizes any intermediate results which were communicated between the original kernel pair. This reduces the total memory requirements for the task cluster.

Loop tiling enlarges the granularity of data on which the kernel is running. This allows for a trade-off between the size of the local cache or scratch-pad memories in the processor versus the bandwidth required between the considered processor and its external memory.

Loop vectorization changes the data granularity for the computations performed within the kernel and reduces the number of instructions required for the execution of a given application (part).

The application parallelization uses estimates of the instruction-level parallelism (ILP) available in each kernel. These estimates are obtained as one of the results of the application analysis, as explained in [36, 37]. The application parallelization creates an optimized parallel structure of the application part mapped to a single VLIW ASIP and constructs a corresponding initial VLIW architecture with sufficient resources to achieve the required throughput. A Pareto-set of such coarse VLIW architecture designs is then returned to the macro-level exploration (5), which then uses the performance metrics of these more accurate designs in combination with a set of communication models (obtained from the communication and global memory exploration) to determine the final multi-processor system architecture. Each of these returned candidate VLIW processor architectures can be synthesized (6) and a transformed C code can be generated to match the selected high-level loop transformations (7). After the ASIP synthesis and corresponding C code generation, the ASIP based hardware/software subsystem is ready for simulation using the SiliconHive tools (8).

3.2.2. Finalizing the ASIP design

The final optimization step is applied when the area and/or energy consumption of the thus far synthesized ASIP based platform does not yet satisfy the design constraints, or for further optimization of the design objectives. This optimization is performed by the micro-level instruction-set architecture synthesis, separately for each single VLIW ASIP in the system and its corresponding application part, when requested by the macro-level architecture exploration (9). This processor architecture optimization step tries to improve the ASIP architecture both by the addition of application-specific instruction-set extensions as custom operations [42, 43, 44], and by the removal of unused or scarcely used processor components which were included as part of the processor building blocks used during the construction of the initial processor prototype. Several improvements to the processor area and energy models used in this step were previously presented in [45], the exploration algorithms of this step in [46, 47], and techniques for reducing the exploration time (10) in [48]. The resulting refined ASIP design, together with more precise area, energy, and execution time estimates, is then returned to the macro-level architecture exploration (11).

3.2.3. Finalizing the MPSoC platform

Finally, the macro-level architecture exploration continues with an exploration of the MPSoC interconnect and global memory structure (12), based on the available design alternatives for the various VLIW processor cores in the system.


int input_image[N][M];
int temp_image[N/2][M];
int output_image[N/2][M/2];

void downsample2d(void)
{
    int h, w;

    // kernel 1: vertical down sampling
    for (h = 0; h < N/2; h++) {
        for (w = 0; w < M; w++) {
            temp_image[h][w] =
                (input_image[2*h][w] + input_image[2*h+1][w]) >> 1;
        }
    }

    // kernel 2: horizontal down sampling
    for (h = 0; h < N/2; h++) {
        for (w = 0; w < M/2; w++) {
            output_image[h][w] =
                (temp_image[h][2*w] + temp_image[h][2*w+1]) >> 1;
        }
    }
}

Listing 1: 2D down sampling a N×M image, original code

The macro-level also selects the appropriate VLIW instances from the set of refined architectures produced through the micro-level architecture exploration. Combining the selected VLIW instances and the synthesized interconnect and global memory structure allows the introduction of system-level power control (13) (e.g. voltage scaling, power gating), after which the full system can be simulated through the SiliconHive tools (14) or emulated on FPGA (15). After this final validation of the system design, the tools can finalize the design and (semi-)automatically produce the required RTL and software descriptions for performing further (ASIC) hardware synthesis (16) and the production of the final system prototype.

4. ASIP architecture exploration: An example

This section presents an example walk-through of the automatic ASAM micro-level architecture exploration and synthesis process, to further illustrate the ASAM approach for designing a fully customized VLIW ASIP processor together with its corresponding parallel software structure. The application shown in Listing 1, 2D down sampling, was selected for this demonstration. Down sampling is a common function in image processing which benefits from most of the considered transformations without being overly complex, and therefore it is appropriate to be used for the explanation. However, the methods presented here are equally applicable to much more complex applications and algorithms.

As can be seen from the code in Listing 1, 2D down sampling consists of two main kernels, performing vertical and horizontal down sampling respectively. As the kernels are fairly small (with only a few operations each) and at the same time have a quite high communication requirement (half of the original image is communicated from the first kernel to the second kernel), it is fairly likely that the macro-level architecture exploration will decide to map both kernels onto a single processor.

Array-OL [49], a graphically enriched representation of the polyhedral model, is used within the ASAM project in the second phase of the micro-level architecture exploration for performing the application restructuring and coarse processor architecture synthesis. Figure 6a shows a graphical representation of the Array-OL model corresponding to the input code of the 2D down sampling application. Both vertical and horizontal down sampling kernels (V scale and H scale, respectively) are shown as grey rectangles.


Figure 6: Array-OL representation of the horizontal and vertical down-sampling kernels after different optimizations. (a) Original code structure: V scale (N×M → N/2×M) followed by H scale (N/2×M → N/2×M/2). (b) After kernel fusion (N×M → N/2×M/2). (c) After both kernel fusion and vectorization, with vectorized data elements colored gray and a double box denoting the vectorized iteration domain.

The input and output sizes for each kernel iteration are illustrated next to the input and output ports. Both kernels consume two data elements and produce a single data element. The main difference between them is the orientation of the two consumed elements in the 2D data domain. The repetition domain surrounds each kernel and represents the loop-nest that wraps around each kernel in Listing 1.

From both the Array-OL model and the original source code, it can be observed that this implementation of the algorithm requires half of the original input image in temporary storage locations, as well as both the complete input and output image directly accessible by the processor. This data needs to be stored either in the ASIP local memories or in a, usually slower, external memory.

4.1. Application code restructuring and initial architecture construction

The application parallelization phase of the micro-level architecture exploration [1, 40, 50] starts to optimize the data locality and memory architecture of the customized processor with the Array-OL model of Figure 6a as input.

The first transformation that will be considered is loop (or kernel) fusion. The main advantages of this transformation are a reduced temporary data storage and an increased kernel size. Figure 6b illustrates the effect of kernel fusion on the example code. In order to perform loop fusion, a single repetition domain needs to be put around both kernels. Before this is possible, the V scale kernel will need to be executed twice so that it produces the appropriate data elements for the H scale kernel. This results in a second repetition domain wrapping only the V scale kernel. This second repetition domain only has two iterations and will be unrolled in the generated C code. Unrolling repetition domains with a small repetition count increases the kernel size and usually has a positive effect on the instruction-level parallelism available in the application. As can be seen in Figure 6b, loop fusion significantly reduces the temporary storage requirements of the application. In this case, loop fusion successfully removed the entire temporary storage except for a few single data elements.
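For the code of Listing 1, the fused and unrolled loop nest would look roughly as follows; this is a hand-written sketch of the structure described above, not the C code actually emitted by the tools:

for (h = 0; h < N/2; h++) {
    for (w = 0; w < M/2; w++) {
        /* V scale executed twice (unrolled inner repetition domain),
           producing the two neighbouring intermediate values */
        int t0 = (input_image[2*h][2*w]   + input_image[2*h+1][2*w])   >> 1;
        int t1 = (input_image[2*h][2*w+1] + input_image[2*h+1][2*w+1]) >> 1;
        /* H scale consumes both intermediates directly, so the
           temp_image buffer is no longer needed */
        output_image[h][w] = (t0 + t1) >> 1;
    }
}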

After the application of fusion, both vectorization and tiling are explored using a genetic algorithm [50, 40]. Vectorization is mainly used to increase the number of data elements that are processed per cycle, as it changes the data granularity at which the computations are performed. Two repetition domains exist in the fused version of the application: the inner and the outer repetition domain. Both have no dependencies on previous iterations and can be vectorized at will. However, the inner repetition domain wrapping the V scale kernel only has two iterations and will provide only very limited performance impact when vectorized. The more logical choice is to unroll this inner loop and to vectorize the outer repetition domain. The main limitation of the vectorization here is that vectorization puts a constraint on the possible input image sizes (unless strip-mining is used or explicit padding is added to the data). In this case we assume that the input image width is a multiple of 32, which results in a maximum vector width of 16 since the application needs to read two (vector) elements next to each other to feed the inner repetition domain. Figure 6c illustrates the vectorized and fused kernels. A double box on the outer repetition domain and the colored data elements illustrate the changed data granularity. Care should be taken when vectorizing the H scale kernel, as it consumes two consecutive data elements to produce its result. When vectorizing such a kernel, vector shuffling operations are required to reorganize the data in such a way that the consecutive elements are put into separate vector elements so that the original kernel code can be kept.

Finally, tiling is applied to the restructured loop nest to enable more freedom in the global mapping of the input and output data arrays. Tiling is used in the ASAM project to improve the data locality of the processor cores and to adequately distribute data between the global memory and the local memories of each processor. This allows a significant reduction in the required size of local memories for the final processor designs. It also allows us to use direct-memory-access (DMA) controllers to perform data transfers in parallel to the actual computations. A tiled implementation therefore enables streaming operation of the kernel. The restructured kernels can start processing data while it is still being produced by the source of the data (an image sensor or another processing step in a larger application), which represents a major benefit. The only requirement for a streaming application is that initially there is sufficient data to perform at least one iteration and that there is space available to store the (partial) results.
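The streaming structure produced by tiling can be sketched as follows; the tile size T, the dma_load/dma_store helpers, and the process() body are placeholders for illustration, not the interface generated by the ASAM tools:

for (t = 0; t < NUM_TILES; t++) {
    /* fetch the next tile of input data into a small local buffer; the
       DMA transfer can run in parallel with the previous computation */
    dma_load(local_in, &external_in[t * T], T);

    /* process one tile entirely from the local memory */
    for (i = 0; i < T; i++)
        local_out[i] = process(local_in[i]);

    /* write the produced tile back to the external memory */
    dma_store(&external_out[t * T], local_out, T);
}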

During the loop transformation exploration process, the parallelism available in the core loop nest(s) of the code is estimated, using the methods presented in [36]. This enables the code restructuring and initial architecture construction process to construct an ASIP architecture that has an issue-width that is appropriate for obtaining the predicted throughput. Figure 7 shows the processor that was constructed for the transformed C-code of the down-sampling application. The constructed processor is a wide VLIW ASIP with four 16-way vector issue-slots, two vector memories for storing the data of the input and output buffers, three scalar issue-slots (including fifo and control operations), and a small scalar memory for storing the width and height parameters of the kernel.

Running the restructured code on the so constructed initial processor architecture demonstrates that this ASIP architecture and parallel application code combination is indeed capable of processing pixels at the predicted throughput of a full vector width of input pixels (16 pixels) per cycle. The transformed code also utilizes the proposed processor architecture quite well, as it has an 85% utilization of the issue-slots in the core loop. However, large parts of the instruction-set remain unused, which results in an inefficient use of the provided function units and requires an unnecessarily wide program memory. As such, this architecture is still quite overdimensioned and can be substantially reduced during the architecture refinement phase to decrease both the area and power consumption, while still realizing the required throughput.

At this point, the tasks of the code restructuring and initial architecture construction are completed and the second phase of the micro-level architecture exploration can be concluded. The proposed initial processor architecture is returned to the macro-level architecture exploration, which can now explore the system memory architecture based on each processor's minimal memory size requirements.


Figure 7: Initial processor architecture for the down-scaling application based on code restructuring and initial architecture construction exploration decisions.


Extra buffer space can be added based on the overall mapping of the target application tasks and their respective communication buffer size requirements during the global interconnect and memory hierarchy exploration.

4.2. ASIP instruction-set architecture synthesis through architecture refinement

As already mentioned above, the initial coarse processor architecture proposed by the second phase of the micro-level architecture exploration is usually overdimensioned. It is composed of issue-slots taken from a standard library and thereby supports a large variation of operations in each issue-slot. However, usually not all of these operations need to be replicated into each issue-slot and many of them can be removed without any impact on the execution time of the target application. In some specific cases, even the number of the VLIW issue-slots can be reduced. Furthermore, register files have also been introduced with quite large sizes and may be reduced as well. Removing these redundant resources from the initial architecture will greatly simplify the structure of the interconnect between the issue-slot outputs and the register file inputs. This in turn can result in a large reduction of the number of program word bits required for each instruction in the program memory, which further reduces the processor's area and energy consumption.

The third phase of the micro-level architecture exploration, the instruction-set architecture synthesis, performs the architecture refinement using, in combination, the following three refinement techniques: instruction-set extension, (passive) architecture shrinking, and (active) architecture reduction.

4.2.1. Instruction-set extension

Instruction-set extension can (optionally) be applied as the first step of the instruction-set architecture synthesis when a very high performance and/or an extremely low power solution is required. During this step, frequently occurring operation patterns are identified in the target application, function units implementing each of them in hardware are constructed, and then the application specific instructions corresponding to them are added to the instruction-set of the initial prototype which was obtained from the previous exploration phase. For example, the down sampling application has a frequently occurring pattern where two values are added and the result is shifted by one place, which effectively computes the average of both values. This operation pattern is the key computation in both the vertical and horizontal down sampling kernels. Implementing the complete pattern as a single complex operation provides two benefits: it results in a smaller kernel body, as fewer operations are required to encode the entire algorithm, and it improves the latency of the kernel execution by executing both the addition and the shift in the same clock cycle. As an additional advantage, the use of complex operations also reduces the number of register file accesses, as intermediate values between the operations in a pattern are no longer stored in the register file. This results in a further reduction in the energy consumption and can result in a lowered register file pressure, which allows us to further reduce the register file size.
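In terms of the example code, the replaced pattern is the addition followed by a shift that appears in both kernels; conceptually (the operation name vavg used here is only a placeholder, not the name used by the tools):

/* original pattern: two operations plus an intermediate register value */
t = (a + b) >> 1;

/* after instruction-set extension: one custom operation, executed in a
   single cycle, with no intermediate value written to the register file */
t = vavg(a, b);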

The detection and selection of candidate operation patterns for the creation of custom operations, as well as the insertion of these custom operations into the initial prototype, is performed by the designer of the processor with the help of an operation pattern enumeration tool [42, 43, 44]. This tool supports the designer by enumerating candidate operation patterns based on the frequency of their occurrence, as well as the possibilities for hardware sharing between custom operations.

4.2.2. Architecture shrinking

Architecture shrinking passively strips components from an oversized ASIP architecture that are unused for the execution of a given parallel application structure, and estimates the effects of the removal of these components on the area and performance of the so modified processor architecture design. During the shrinking process, individual issue-slots, function-units, register-files, memories, and/or (custom) operations can get removed from the architecture. Register files and memories can also be resized to provide exactly the required amount of space. As a result, the connectivity of the interconnect, as well as the size of the instruction word and program memory, can be drastically reduced, without any negative impact on the temporal performance (latency, throughput) of the overall ASIP design.

The passive shrinking approach is fully implemented in the area and energy modeling [48]. Doing so allows for an efficient estimation of the benefits of a specific architectural shrinking. In our implementation, we allow for user control of the kinds of elements which are removed during the shrinking process. For example, enabling or disabling the removal of single operations from a candidate architecture can have a large impact on the functional completeness and re-programmability of the resulting processor. Removing all operations except those that are required for the functioning of the target application will result in an architecture that may only (effectively) support small variations on the original algorithm, but will also provide a higher efficiency that is closer to a non-programmable hardware implementation. However, keeping some of the less costly (though currently unused) instructions in some of the issue-slots of the final architecture design can improve the support for variations of the original algorithm at the cost of a somewhat lower energy efficiency. Deciding the granularity at which to perform the exploration depends strongly on the intended purpose of the design and, as such, is left to the designer of the processor architecture.

4.2.3. Architecture reduction

Often, performing only the passive architecture shrinking for a given compiled parallel application structure will not result in the most efficient architecture design. For instance, instruction scheduling heuristics may have decided to map several operations of the same kind onto different issue-slots, while these could have been mapped into the same issue-slot. This may result in the same operation being supported by multiple issue-slots when this is not strictly required for achieving the required performance of the algorithm. The ASIP architecture reduction actively tries to suppress the usage of specific architecture components (issue-slots or function-units) by disabling them as selection alternatives for an operation in the compiler. Doing so will render them unused in the resulting application mapping, which allows the successive architecture shrinking to remove them from the final design. The instruction-set exploration strategies implemented for this active reduction stage have been described in [46, 47]. An appropriate combination of the above three architecture refinement techniques results in a fast and efficient method for investigating instruction-set architecture variants of the initial prototype produced in the second stage of the micro-level architecture exploration and for selecting the most preferred architecture.

4.3. Experimental results

Application of our architecture refinement techniques on the example down-sampling application demonstrates their high effectiveness. Figure 8 shows the architecture optimization effects during various stages of the optimization on both the area and energy consumption of different architecture variations. Each bar in the graphs is subdivided to show the area and energy distribution across the various components of the processor architectures and shows the most costly components at the bottom. It should be noted, however, that these figures only show the energy and area cost of the core processor architecture with its local memories. The other components of the system, such as the (usually larger but much cheaper per bit) external memories, are not accounted for in the graphs. The first (leftmost) bar shows the area and energy requirement of the initial prototype that was proposed by the previous micro-level architecture exploration phase. All bars are normalized in relation to the initial prototype.

The second bar illustrates the effect of adding the add-shift (avg) custom operation, which is applied on two vectors of data elements. It shows the increase in the processor area (due to the extra hardware) and a decrease in the active energy consumption due to a decreased number of reads from the register file. The total execution time of the algorithm did not change due to the inclusion of this operation. This can be explained since the cycle-count of the down-sampling application is limited by the initiation interval of the main loop kernel, which in turn is dominated by the amount of load operations from the input memory.


Figure 8: Estimated area and energy during different stages of optimization, normalized to the initial prototype. Both the initial prototype and an extended version (with the averaging operation) are shown. The architecture cost of the final actual (full-custom) implementation is also shown for reference to illustrate the estimation error in the exploration. (a) area; (b) dynamic energy. Architecture variants shown: initial, initial+avg, shrinking, shrinking+avg, reduced, reduced+avg, full-custom.


Instead, the introduction of the custom operation results in a decrease of the required number of issue-slots for the processor. This is achieved by joining operations that were already scheduled in parallel into a single custom operation. The main area and energy savings from this optimization are the removal of an issue-slot and its register files, as well as the corresponding reduction in the width of the program memory, as fewer operations need to be encoded per instruction.

The result of passively shrinking both the initial architecture and the version which includes the custom operation is shown in bars three and four respectively. Shrinking has a large impact on the size of the issue-slots, as many standard operations can be completely removed and others are only needed in a few issue-slots. The register files provided by the initial architecture are also significantly over-dimensioned. The results shown in this example demonstrate those obtained using a full customization of the processor architecture and include the removal of all unused elements from the initial (extended) architecture.

Continuing the process by actively exploring further possible architecture reductions results in another decrease in both the area and energy requirements for the processor architecture design, as shown by the fifth and sixth bars. In both cases, the design was optimized to improve the energy-delay product of the final processor in an attempt to find a smaller, more efficient version of the architecture without giving up too much of the temporal performance. As a result, both proposed reduced architectures (with and without custom operation) consume about 5% less energy when compared to their shrunk versions. For the processor architecture which includes the custom operation, the architecture reduction phase was able to remove a complete vector issue-slot. This resulted in a final proposed architecture with an issue-slot utilization as high as 91%.

A significant speedup of the exploration process can be achieved when it is observed that there are many symmetrical intermediate exploration points when an initial architecture composed of templated issue-slots is reduced. For example, removing an operation from one issue-slot may only result in the compiler using that same operation from another issue-slot. In some cases this can still be beneficial, but in many cases an equivalent solution will be found. To improve upon this, we have introduced the BuildMaster framework [48]. This framework recognizes when the compilation result for a new architecture alternative will be equivalent to one that was considered earlier, and will directly reuse the performance estimation of the previously considered equivalent architecture. This significantly reduces the number of compilation and performance estimation runs that may be required during the overall exploration. Furthermore, the BuildMaster framework also recognizes when two architecture alternatives have different resulting compiled applications but share the same internal control-flow structure. In this case the compilation of the target program is still required, but we may be able to re-use the simulation results that were obtained for the previously considered alternative. More detail on these techniques is available in [48].
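Conceptually, this reuse behaves like a memoization table that is keyed on (a canonical form of) the compilation result rather than on the architecture description itself; the sketch below is our own illustration of that idea, not the actual BuildMaster implementation:

#include <stdint.h>

#define MAX_VARIANTS 1024                 /* assumed upper bound            */

typedef struct {
    uint64_t schedule_key;                /* canonical key of the schedule  */
    double   cycle_estimate;              /* previously obtained estimate   */
} memo_entry_t;

static memo_entry_t memo[MAX_VARIANTS];
static int memo_count = 0;

extern double run_simulation(uint64_t schedule_key);  /* placeholder for the
                                                          expensive estimation step */

/* Return the cached estimate when an equivalent compilation result was
   seen before; otherwise estimate once and remember the outcome. */
double estimate_cycles(uint64_t schedule_key)
{
    for (int i = 0; i < memo_count; i++)
        if (memo[i].schedule_key == schedule_key)
            return memo[i].cycle_estimate;            /* reuse, no re-run */

    double cycles = run_simulation(schedule_key);
    if (memo_count < MAX_VARIANTS)
        memo[memo_count++] = (memo_entry_t){ schedule_key, cycles };
    return cycles;
}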

The exploration time required for both the original and extended initial architectures also clearly shows the effectiveness of our intermediate result caching techniques. The exploration of both architectures took only 54 minutes in total on a 2.8 GHz Intel Core i7 with 6 GB of RAM. Our BuildMaster caching framework was able to reuse a significant number of compilation and simulation results: over 72% of the compilation time and over 94% of the simulation time were avoided by the use of our caching techniques. Overall, the automatic exploration framework considered 407 different processor architecture variations in less than 1 hour and produced two highly optimized processor designs. Both final designs proposed by the automated architecture exploration reduce the area of the initial design by more than a factor of 4x, and the energy consumption by almost 2x.

After the exploration, we verified the predicted results by comparing them to the results obtained by actually constructing this proposed architecture using the SiliconHive tools. The result of this verification is shown in the final bar of the bar-graph and the resulting architecture is shown in Figure 9. From this experiment we can learn that the area required for actually constructing the proposed processor architecture is slightly higher, while the energy required for running our algorithm on this architecture is slightly lower than predicted. These effects demonstrate some of the limitations of our current area and energy model in relation to the architecture template. In the case of our down sampling application, three discrepancies between the predicted and obtained area and energy numbers can be observed.


Figure 9: The final full-custom architecture based on the exploration result.


Firstly, the load-store unit connected to the input memory is only used for load operations. However, being a load-store unit, it is derived from a standard template library element which has a fixed set of input and output ports. Therefore, an extra register file must be added to be able to actually construct the processor without diverging from the current template, and in order to keep all input and output ports correctly connected. This is the third register file from the right in Figure 9, which is not used but does provide space for one 512-bit wide vector element and represents approximately 10% of the register file area in the final architecture. Improving the architecture template library such that it allows for a load-store unit with only load operations (and thus fewer input ports) would enable the removal of this extra register file and bring the total register file area back down to the predicted space. The automated tools presented in this paper are currently not able to perform this optimization, but the SiliconHive template does allow the construction of such a load-only unit.

Secondly, the architecture model is completely agnostic of the operations implemented within function units (except for their names) and only counts accesses to register files. This is a limitation of the current implementation of the exploration tools and should be resolved as part of future work. As a result, the architecture exploration is currently unable to recognize when a specific register file read or write port is unused in the proposed architecture. Removing such unused register file ports during the final architecture construction further simplifies the interconnect, reduces the cost of the register files in general, and reduces the number of instruction word bits required for programming the final processor architecture.

Finally, the automated exploration framework resizes the program memory based on the number of instructions required for encoding the target application. However, in its current implementation, it does not take into account the small processor initialization routine that also needs to be loaded into the program memory. In the case of the down sampling application, the addition of this initialization code results in a different rounding of the program memory size (from 32 lines to 64 lines). This results in a program memory area which is larger than predicted, though not twice as large, as fewer bits are required to encode the instruction word due to the reduction of the interconnect complexity as explained previously.

5. Conclusion

In the scope of the ASAM project, a novel automated design technology has been developed for the construction of high-quality heterogeneous highly-parallel ASIP-based multi-processor platforms for highly-demanding cyber-physical applications and for the efficient mapping of the complex applications on the platforms. This paper discussed a part of the ASAM technology. The paper briefly explained the ASAM design flow that substantially differs from earlier published flows, focused on the design methods and tools for a single VLIW ASIP-based HW/SW sub-system synthesis, extensively elaborated on the Instruction-Set Architecture Synthesis of the adaptable VLIW ASIPs, and discussed the related results of experimental research.

As explained in the previous sections, the ASAM design flow and its tools implement an actual coherent HW/SW co-design process through performing a quality-driven simultaneous co-development and co-tuning of the parallel application software structure and processing platform architecture to produce HW/SW systems highly optimized for a specific application. Moreover, the ASAM flow and its tools consider the macro-architecture and micro-architecture synthesis as one coherent complex system architecture synthesis task, and not as two separate tasks, as in the state-of-the-art methods. There are common aims and a strong consistent collaboration between the two sub-tasks. The macro-architecture synthesis proposes a certain number of customizable ASIPs of several types with a part of the application assigned to each of the proposed ASIPs. The micro-architecture synthesis proposes an architecture for each of the ASIPs, together with its local memories, communication and other blocks, and correspondingly restructures its software to implement the assigned application part as effectively and efficiently as possible. The decisions on the application-specific processor, memory and communication architectures are made in a strict collaboration to ensure their compatibility and effective trade-off exploration among the different design aspects. Subsequently, the RTL-level HDL descriptions of the constructed ASIPs are automatically generated and synthesized to an actual hardware design, and the restructured software of the application parts of particular ASIPs is compiled. From several stages of the application restructuring and ASIP design, including the actual HW/SW implementation, the micro-architecture synthesis provides feedback to the macro-architecture synthesis on the physical characteristics of each particular sub-system implemented with each ASIP core. This way, well-informed system-level architecture decision re-iterations are possible and the macro-/micro-architecture trade-off exploitation is enabled. After several iterations of the combined macro-/micro-architecture exploration and synthesis, an optimized MPSoC architecture is constructed. To our knowledge, the ASIP-based MPSoC design flow as explained above has not yet been explored in any of the previously performed and published works. The related research in the MPSoC, ASIP, application analysis and restructuring, and other areas considers only some of the sub-problems of the adaptable ASIP-based MPSoC design in isolation. As a result, the proposed partial solutions are usually not directly useful in the much more complex actual context.

The paper proposed a novel formulation of the ASIP-based HW/SW sub-system design problem as the actual hardware/software co-design problem, i.e. the simultaneous construction of an optimized parallel software structure and a corresponding parallel ASIP architecture, as well as the mapping in space and time of the constructed parallel software on the parallel ASIP hardware. It discussed a novel solution of the so formulated problem composed of a new design flow, its methods and corresponding design automation tools. As explained in Sections 3 and 4, the ASIP-based HW/SW sub-system architecture exploration and synthesis flow involves the following three main phases:

Phase 1: performing application analysis and characterization;

Phase 2: performing coherent co-synthesis of parallel software structures and related parallel processing, communication and storage architectures;

Phase 3: performing instruction-set architecture synthesis and related refined application restructuring.

While Phase 2 performs an actual coarse software/hardware co-development through exploring and selecting promising parallel software structures and the corresponding ASIP hardware architectures, Phase 3 performs a fine hardware and software co-tuning through refinement of the coarse architecture and application software.

Phase 2 produces one or several promising coarse designs of a parallel ASIP-based HW/SW system. Each of these initial prototypes guarantees satisfaction of all the hard constraints (e.g. computation speed), but can be oversized. For instance, an initial prototype may involve too many and/or too large register files, memories or communication structures, it may involve more issue slots than necessary for the satisfaction of the hard constraints, its issue slots may only include the standard operations and no application-specific operations, and some of its issue slots may involve too many operations. Phase 3 accepts such a coarse, possibly oversized, ASIP-based HW/SW sub-system design as its input and performs its precise instruction-set architecture synthesis. The ISA synthesis of Phase 3 involves several collaborating processes, such as the instruction-set extension that replaces some of the most frequent compound operation patterns with application-specific instructions, the architecture shrinking that removes the unused components from an oversized architecture, and the architecture reduction that disables the usage of specific operations for application mapping, observes the consequences, and removes them and their related hardware from the ASIP architecture if the consequences are positive. Moreover, we introduced two intermediate result caching techniques which greatly accelerate the iterative process of architecture refinement by memorizing and reusing information on the previously considered architectures. The BuildMaster caching framework automatically recognizes when previous compilation and/or simulation results can be re-used and in this way provides a big reduction of the required architecture exploration time. Finally, the paper discussed the results of experimental research that compared several implementations of the presented architecture refinement methods, demonstrated the benefits of combining the shrinking and reduction techniques, and compared their effectiveness and exploration time.


The main findings of the experimental research are that the presented method enables an automatic instruction-set architecture synthesis for VLIW ASIPs within a reasonable exploration time. Using the presented approach, we were able to automatically determine an initial architecture prototype that was able to meet the temporal performance requirements of the target application. Subsequently, refinement of this architecture considerably reduced both the design area (by 4x) and the active energy consumption (by 2x). This was achieved by automatically exploring 407 architecture variations in a targeted search for the most suitable design. The entire process for proposing and refining the custom VLIW ASIP into this final design took only about 1 hour. Furthermore, we found that the final design has a low resource overhead and comes very close to the design that an experienced engineer would propose for the same target application.

The experimental results confirm the high effectiveness and efficiency of the developed design methods and design automation tools.

Acknowledgement

The research reported in this paper has been performed in the scope of the ASAM project of the European ARTEMIS Research Program and has been partly supported by the ARTEMIS Joint Undertaking under Grant No. 100265.

References

[1] L. Jozwiak, M. Lindwer, R. Corvino, P. Meloni, L. Micconi, J. Madsen, E. Diken, D. Gangadharan, R. Jordans, S. Pomata, P. Pop, G. Tuveri, L. Raffo, G. Notarangelo, ASAM: Automatic architecture synthesis and application mapping, Microprocessors and Microsystems 37 (8) (2013) 1002–1019. doi:10.1016/j.micpro.2013.08.006.

[2] O. Schliebusch, A. Hoffmann, A. Nohl, G. Braun, H. Meyr, Architecture implementation using the machine description language LISA, in: Design Automation Conference, 2002. Proceedings of ASP-DAC 2002. 7th Asia and South Pacific and the 15th International Conference on VLSI Design, IEEE, 2002, pp. 239–244.

[3] S. Pees, A. Hoffmann, V. Zivojnovic, H. Meyr, LISA – machine description language for cycle-accurate models of programmable DSP architectures, in: Proceedings of the 36th Annual ACM/IEEE Design Automation Conference, ACM, 1999, pp. 933–938.

[4] A. Fauth, J. Van Praet, M. Freericks, Describing instruction set processors using nML, in: European Design and Test Conference, 1995. ED&TC 1995, Proceedings., IEEE, 1995, pp. 503–507.

[5] G. Goossens, D. Lanneer, W. Geurts, J. Van Praet, Design of ASIPs in multi-processor SoCs using the Chess/Checkers retargetable tool suite, in: System-on-Chip, 2006. International Symposium on, IEEE, 2006, pp. 1–4.

[6] R. Azevedo, S. Rigo, M. Bartholomeu, G. Araujo, C. Araujo, E. Barros, The ArchC architecture description language and tools, International Journal of Parallel Programming 33 (5) (2005) 453–484.

[7] J. Podivinsky, O. Cekan, Z. Kotasek, et al., FPGA prototyping and accelerated verification of ASIPs, in: Design and Diagnostics of Electronic Circuits & Systems (DDECS), 2015 IEEE 18th International Symposium on, IEEE, 2015, pp. 145–148.

[8] L. Charvat, A. Smrcka, T. Vojnar, Automatic formal correspondence checking of ISA and RTL microprocessor description, in: Microprocessor Test and Verification (MTV), 2012 13th International Workshop on, IEEE, 2012, pp. 6–12.


[9] A. Chattopadhyay, I. G. Ascheid, P. Ienne, Language-driven exploration and implementationof partially re-configurable ASIPs (rasips), Ph.D. thesis, Lehrstuhl fur Integrierte Systemeder Signalverarbeitung (2008).

[10] K. Karuri, A. Chattopadhyay, X. Chen, D. Kammler, L. Hao, R. Leupers, H. Meyr, G. As-cheid, A design flow for architecture exploration and implementation of partially reconfig-urable processors, Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 16 (10)(2008) 1281–1294.

[11] M. Wijtvliet, L. Waeijen, H. Corporaal, Coarse grained reconfigurable architectures in the past25 years: Overview and classification, in: Proceedings of the 16th International Conferenceon Embedded Computer Systems: Architectures, MOdeling and Simulation (SAMOS), IEEE,2016.

[12] P. Mishra, N. Dutt, Processor Description Languages, Vol. 1, Morgan Kaufmann, 2011.

[13] D. She, Y. He, H. Corporaal, Energy efficient special instruction support in an embedded pro-cessor with compact ISA, in: Proceedings of the 2012 International Conference on Compilers,Architectures and Synthesis for Embedded Systems, ACM, 2012, pp. 131–140.

[14] S. Park, A. Shrivastava, N. Dutt, A. Nicolau, Y. Paek, E. Earlie, Register file power reductionusing bypass sensitive compiler, IEEE Transactions on Computer Aided Design of IntegratedCircuits and Systems 27 (6) (2008) 1155.

[15] H. Corporaal, Transport triggered architectures; design and evaluation, Ph.D. thesis, Tech-nische Universiteit Delft (1995).

[16] J. Hoogerbrugge, Code generation for transport triggered architectures, Ph.D. thesis, Tech-nische Universiteit Delft (1996).

[17] J. Hoogerbrugge, H. Corporaal, Automatic synthesis of transport triggered processors, in:Proceedings of ASCI, 1995, pp. 1–10.

[18] Y. He, D. She, B. Mesman, H. Corporaal, Move-pro: a low power and high code density ttaarchitecture, in: Embedded Computer Systems (SAMOS), 2011 International Conference on,IEEE, 2011, pp. 294–301.

[19] Tampere University of Technology, Department of Computer Systems, TTA-based Co-designEnvironment v1.9 User Manual (Januari 2014).

[20] V. Kathail, S. Aditya, R. Schreiber, B. Ramakrishna Rau, D. Cronquist, M. Sivaraman,PICO: automatically designing custom computers, Computer 35 (9) (2002) 39–47.

[21] S. Aditya, V. Kathail, High-Level Synthesis: From Algorithm to Digital Circuit, SpringerScience, 2008, Ch. Algorithmic Synthesis using PICO, pp. 53–74.

[22] P. Mattson, W. J. Dally, S. Rixner, U. J. Kapasi, J. D. Owens, Communication scheduling, in: Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS IX, ACM, New York, NY, USA, 2000, pp. 82–92. doi:10.1145/378993.379005.

[23] A. Terechko, Clustered VLIW architectures: a quantitative approach, Ph.D. thesis, Eindhoven University of Technology (2007).

[24] M. Bekooij, Constraint driven operation assignment for retargetable VLIW compilers, Ph.D. thesis, Eindhoven University of Technology (2004).

[25] Y. Okmen, SIMD floating point processor and efficient implementation of ray tracing algorithm, Master’s thesis, TU Delft, Delft, The Netherlands (October 2011).

[26] J. Leijten, G. Burns, J. Huisken, E. Waterlander, A. van Wel, Avispa: A massively parallel reconfigurable accelerator, in: System-on-Chip, 2003. Proceedings. International Symposium on, IEEE, 2003, pp. 165–168.

[27] E. van Dalen, S. G. Pestana, A. van Wel, An integrated, low-power processor for image signal processing, in: Multimedia, 2006. ISM’06. Eighth IEEE International Symposium on, IEEE, 2006, pp. 501–508.

[28] M. Lam, Software pipelining: An effective scheduling technique for VLIW machines, ACM SIGPLAN Notices 23 (7) (1988) 318–328.

[29] J. E. Stone, D. Gohara, G. Shi, OpenCL: A parallel programming standard for heterogeneous computing systems, Computing in Science & Engineering 12 (3) (2010) 66.

[30] C. Bastoul, Code generation in the polyhedral model is easier than you think, in: Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, IEEE Computer Society, 2004, pp. 7–16.

[31] M. M. Baskaran, J. Ramanujam, P. Sadayappan, Automatic C-to-CUDA code generation for affine programs, in: Compiler Construction, Springer, 2010, pp. 244–263.

[32] T. Grosser, A. Groesslinger, C. Lengauer, Polly: Performing polyhedral optimizations on a low-level intermediate representation, Parallel Processing Letters 22 (04) (2012).

[33] S. Verdoolaege, T. Grosser, Polyhedral extraction tool, in: Second International Workshop on Polyhedral Compilation Techniques (IMPACT’12), Paris, France, 2012.

[34] U. Bondhugula, A. Hartono, J. Ramanujam, P. Sadayappan, A practical automatic polyhedral parallelizer and locality optimizer, ACM SIGPLAN Notices 43 (6) (2008) 101–113.

[35] B. Kienhuis, E. Rijpkema, E. Deprettere, Compaan: Deriving process networks from Matlab for embedded signal processing architectures, in: Proceedings of the 8th International Workshop on Hardware/Software Codesign, ACM, 2000, pp. 13–17.

[36] R. Jordans, R. Corvino, L. Jozwiak, H. Corporaal, Exploring processor parallelism: Estimation methods and optimization strategies, in: DDECS 2013 - 16th IEEE Symposium on Design and Diagnostics of Electronic Circuits and Systems, Karlovy Vary, Czech Republic, 2013, pp. 18–23.

[37] R. Jordans, R. Corvino, L. Jozwiak, H. Corporaal, Exploring processor parallelism: Estimation methods and optimization strategies, International Journal of Microelectronics and Computer Science 4 (2) (2013) 55–64.

[38] L. Micconi, D. Gangadharan, P. Pop, J. Madsen, Multi-ASIP platform synthesis for real-time applications, in: SIES 2013 - 8th IEEE International Symposium on Industrial Embedded Systems, Porto, Portugal, 2013. doi:10.1109/SIES.2013.6601471.

[39] L. Micconi, A probabilistic approach for the system-level design of multi-ASIP platforms, Ph.D. thesis, Technical University of Denmark (2014).

[40] R. Corvino, E. Diken, A. Gamatie, L. Jozwiak, Transformation based exploration of data parallel architecture for customizable hardware: A JPEG encoder case study, in: DSD 2012 - 15th Euromicro Conference on Digital System Design, Cesme, Izmir, Turkey, 2012.

[41] L. Jozwiak, D. Gaweowski, A. Slusarczyk, A. Chojnacki, Static power reduction in nano CMOS circuits through an adequate circuit synthesis, in: Mixed Design of Integrated Circuits and Systems, 2007. MIXDES ’07. 14th International Conference on, 2007, pp. 172–177.

[42] A. S. Nery, L. Jozwiak, M. Lindwer, M. Cocco, N. Nedjah, F. M. Franca, Hardware reuse in modern application-specific processors and accelerators, Microprocessors and Microsystems 37 (6) (2013) 684–692.

[43] A. S. Nery, N. Nedjah, F. M. Franca, L. Jozwiak, H. Corporaal, Automatic complex instruction identification for efficient application mapping onto ASIPs, in: Circuits and Systems (LASCAS), 2014 IEEE 5th Latin American Symposium on, IEEE, 2014, pp. 1–4.

[44] A. S. Nery, N. Nedjah, F. M. G. Franca, L. Jozwiak, H. Corporaal, A framework for automatic custom instruction identification on multi-issue ASIPs, in: Proceedings of the 12th IEEE International Conference on Industrial Informatics, IEEE Computer Society, IEEE, 2014, pp. 428–433.

[45] R. Jordans, R. Corvino, L. Jozwiak, H. Corporaal, An efficient method for energy estimation of application specific instruction-set processors, in: DSD 2013 - 16th Euromicro Conference on Digital System Design, Santander, Spain, 2013, pp. 471–474. doi:10.1109/DSD.2013.120.

[46] R. Jordans, R. Corvino, L. Jozwiak, H. Corporaal, Instruction-set architecture exploration strategies for deeply clustered VLIW ASIPs, in: ECyPS 2013 - EUROMICRO/IEEE Workshop on Embedded and Cyber-Physical Systems, Budva, Montenegro, 2013, pp. 38–41. doi:10.1109/MECO.2013.6601361.

[47] R. Jordans, L. Jozwiak, H. Corporaal, Instruction-set architecture exploration of VLIW ASIPs using a genetic algorithm, in: MECO 2014 - 3rd Mediterranean Conference on Embedded Computing, 2014, pp. 32–35. doi:10.1109/MECO.2014.6862720.

[48] R. Jordans, E. Diken, L. Jozwiak, H. Corporaal, BuildMaster: Efficient ASIP architecture exploration through compilation and simulation result caching, in: DDECS 2014 - 17th IEEE Symposium on Design and Diagnostics of Electronic Circuits and Systems, 2014.

[49] P. Boulet, Array-OL revisited, multidimensional intensive signal processing specification, Rapport de Recherche, Institut National de Recherche en Informatique et en Automatique 6113 (2007) 1–27.

[50] R. Corvino, A. Gamatie, M. Geilen, L. Jozwiak, Design space exploration in application-specific hardware synthesis for multiple communicating nested loops, in: SAMOS XII - 12th International Conference on Embedded Computer Systems, Samos, Greece, 2012.

Roel Jordans received both the MSc and PhD degrees in the field of Electrical Engineering from Eindhoven University of Technology in 2009 and 2015, respectively. He subsequently worked as a researcher on the MAMPS tool flow within the PreMaDoNA project. His dissertation focused on the automatic design space exploration of VLIW ASIP instruction-set architectures within the ASAM project. His research interests include compilers and compilation techniques for application specific systems, digital signal processing systems based on customized VLIW architectures, and reliability and fault-tolerant design for space applications. Currently he is employed at Radboud University Nijmegen, where he is active as science DSP architect in the Radboud Radio Lab, working on the software-defined radio astronomy receiver for the NCLE mission. In parallel, he is the primary lecturer of the parallelization, compilation, and platforms course at Eindhoven University of Technology. He also serves as a program committee member for the EUROMICRO Symposium on Digital System Design.

Lech Jozwiak is an Associate Professor, Head of the Section of Digital Circuits and Formal Design Methods, at the Faculty of Electrical Engineering, Eindhoven University of Technology, The Netherlands. He is the author of a new information-driven approach to digital circuit synthesis, of theories of information relationships and measures and of general decomposition of discrete relations, and of a methodology of quality-driven design, all of considerable practical importance. He is also a creator of a number of practical products in the fields of application-specific embedded systems and EDA tools. His research interests include system, circuit and information theory, artificial intelligence, embedded systems, re-configurable and parallel computing, dependable computing, multi-objective circuit and system optimization, and system analysis and validation. He is the author of more than 150 journal and conference papers, some book chapters, and several tutorials at international conferences and summer schools. He is an Editor of “Microprocessors and Microsystems”, “Journal of Systems Architecture” and “International Journal of High Performance Systems Architecture”. He is a Director of EUROMICRO; co-founder and Steering Committee Chair of the EUROMICRO Symposium on Digital System Design; Advisory Committee and Organizing Committee member of the IEEE International Symposium on Quality Electronic Design; and program committee member of many other conferences. He is an advisor and consultant to industry, the Ministry of Economy, and the Commission of the European Communities. He recently advised the European Commission in relation to Embedded and High-performance Computing Systems for the purpose of the Framework Program 7 preparation. In 2008 he was a recipient of the Honorary Fellow Award of the International Society of Quality Electronic Design for “Outstanding Achievements and Contributions to Quality of Electronic Design”. His biography is listed in “The Roll of Honour of the Polish Science” of the Polish State Committee for Scientific Research and in Marquis “Who’s Who in the World” and “Who’s Who in Science and Technology”.

Henk Corporaal received the M.S. degree in theoretical physics from the University of Groningen, Groningen, The Netherlands, and the Ph.D. degree in electrical engineering, in the area of computer architecture, from the Delft University of Technology, Delft, The Netherlands. He has been teaching at several schools of higher education. He has been an Associate Professor with the Delft University of Technology in the field of computer architecture and code generation. He was a Joint Professor with the National University of Singapore, Singapore, and was the Scientific Director of the joint NUS-TUE Design Technology Institute. He was also the Department Head and Chief Scientist with the Design Technology for Integrated Information and Communication Systems Division, IMEC, Leuven, Belgium. Currently, he is a Professor of embedded system architectures with the Eindhoven University of Technology, Eindhoven, The Netherlands. He has co-authored over 250 journal and conference papers in the (multi)processor architecture and embedded system design area. Furthermore, he invented a new class of very long instruction word architectures, the Transport Triggered Architectures, which is used in several commercial products and by many research groups. His current research interests include single and multiprocessor architectures and the predictable design of soft and hard real-time embedded systems.

Rosilde Corvino is a software engineer at Intel Benelux B.V., Eindhoven. Before that, she worked as a scientist and project manager in the Electronic Systems group of the Department of Electrical Engineering at Eindhoven University of Technology, The Netherlands. In 2010, she was a post-doctoral research fellow in the DaRT team at INRIA Lille Nord Europe and was involved in the Gaspard2 project. She earned her PhD in 2009 from University Joseph Fourier of Grenoble, in micro and nanoelectronics. In 2005/06, she obtained a double Italian and French MSc in electronic engineering. Her research interests involve design space exploration, parallelization techniques, data transfer and storage mechanisms, high-level synthesis, and application specific processor design for data intensive applications. She is the author of numerous research papers and a book chapter.
