WRC Workshop on Reconfigurable Computing Workshop Proceedings January 28, 2007 Sofitel, Ghent, Belgium



W R C

Workshop on Reconfigurable Computing

Workshop Proceedings

January 28, 2007

Sofitel, Ghent, Belgium


Table of Contents

Foreword
Stamatis Vassiliadis and Dirk Stroobandt . . . . . . 1

Reconfigurable Supercomputing means to brave the Paradigm Chasm
Reiner Hartenstein . . . . . . 5

Profiling floating point value ranges for reconfigurable implementation
Ashley W Brown, Paul H J Kelly, and Wayne Luk . . . . . . 6

Exposed datapath for efficient computing
Magnus Björk, Magnus Själander, Lars Svensson, Martin Thuresson, John Hughes, Kjell Jeppson, Jonas Karlsson, Per Larsson-Edefors, Mary Sheeran, and Per Stenström . . . . . . 17

Design and system level evaluation of a high performance memory system for reconfigurable SoC platforms
Holger Lange and Andreas Koch . . . . . . 23

An FPGA-based system for development of real-time embedded vision applications
Hongying Meng, Nick Pears, and Chris Bailey . . . . . . 33


Welcome to the First HiPEAC Workshop on Reconfigurable Computing

Stamatis Vassiliadis, General Chair∗,1
Dirk Stroobandt, Program Chair†,2

∗ Delft University of Technology, The Netherlands
† Ghent University, Belgium

Foreword

On behalf of the organizing committee, we would like to cordially welcome you to the very first HiPEAC Workshop on Reconfigurable Computing!

HiPEAC, the European Network of Excellence on High-Performance Embedded Architecture and Compilation, addresses the design and implementation of high-performance computing devices in the 10+ year horizon, covering the processor design, the optimising compiler infrastructure, and the evaluation of upcoming applications made possible by the increased computing power of future devices. Likewise, the HiPEAC Conference provides a high-quality forum for computer architects and compiler builders working in the field of high performance computer architecture and compilation for embedded systems. This second HiPEAC conference starts off with a day of workshops, of which you are now attending the Workshop on Reconfigurable Computing.

We feel that the introduction of the Workshop on Reconfigurable Computing in the HiPEAC Conference comes at the right time. More and more embedded systems need very high performance as well as high flexibility. Traditional computer architectures can deliver perfect flexibility but are sequential in nature. Even if a handful of computing cores are put together to work in parallel, the basic sequential nature of the programming instructions does not allow the exploitation of massive parallelism. Reconfigurable Computing cores, generally based on Field Programmable Gate Arrays (FPGAs), take an implementation approach from hardware ASICs in that they are not programmed by a sequence of instructions but configured to perform a single task. This task can be either a sequence of simple subtasks that each take a portion of the FPGA chip, or a massive number of computations all done in parallel. It is through the exploitation of massive parallelism that these components can outperform highly optimised and inherently faster (in terms of clock speed) general purpose computer architectures. Yet, FPGA-based systems are much more flexible than hardwired ASIC solutions because they can be reconfigured at arbitrary time intervals. Hence they combine massive parallel computing power with high flexibility. It is thus no surprise that reconfigurable systems are often the only way to enable the use of current embedded systems with high enough performance. It is for this reason that, within the HiPEAC network, a cluster on reconfigurable computing was formed. This workshop is meant as an initiator of discussions on the various topics the cluster will deal with. It wants to explore the boundaries between classical computer architectures/tools and less general computing architectures.

1 E-mail: [email protected]
2 E-mail: [email protected]

Reconfigurable Computing, mainly by using FPGAs, has been successful in embedded systems for many years already. Due to the sufficient hardware expertise available in the embedded systems communities, astonishing speed-ups have been achieved in a variety of areas such as signal and image processing, bio-informatics, cryptology, communication processing, data and text mining, global optimization, and others. More recently, Reconfigurable Computing has also attracted an increasing number of experts from supercomputing and other high performance computing scenes where, however, software perspectives are dominant, leading to a clash of paradigms. This clash of paradigms will be the very interesting topic of our invited presentation “Reconfigurable Supercomputing means to brave the paradigm chasm”, presented by Prof. Reiner Hartenstein of the Technical University of Kaiserslautern in Germany. We are very fortunate to have the opportunity to schedule this talk in the workshop, as Prof. Hartenstein is a distinguished researcher and lecturer. He is not afraid of challenging the community with novel ideas and insights, so be prepared.

The remainder of the program is a collection of four papers that were submitted to the workshop. Ashley Brown (Imperial College London, UK) will present a profiling tool designed to identify where an application can benefit from reduced precision or reduced range in floating-point computations, entitled “Profiling floating point value ranges for reconfigurable implementation”. The other three papers present new reconfigurable computing architectures. The paper “Exposed Datapath for Efficient Computing”, presented by Magnus Själander (Chalmers University of Technology, Sweden), introduces a processor based on a new processor paradigm, as well as the approach to compile for this new architecture. In the paper “Design and System Level Evaluation of a High Performance Memory System for Reconfigurable SoC Platforms”, Holger Lange (Technical University Darmstadt, Germany) focuses on a high performance memory attachment for custom hardware accelerators. Finally, a full working system is presented in the paper “An FPGA-based System for Development of Real-time Embedded Vision Applications”, presented by Hongying Meng from The University of York, UK.

We are very pleased that the first HiPEAC Workshop on Reconfigurable Computing is a big success. This half-day workshop received 9 paper submissions, of which only 4 (44%) could be accepted into the program. Also, with registration figures well above 30 people (sample taken one month before the workshop date), this workshop is the best attended of the HiPEAC workshops. We hope you will all enjoy its program!

The workshop and the HiPEAC Conference itself are of course the main reason for you to be here. But be aware that you are also in one of the loveliest cities in Belgium and even in Europe. We encourage you to spend some additional time in Ghent and explore the cultural, historical, gastronomical and simply adorable delicacies this city has to offer. Enjoy!

Stamatis Vassiliadis, General Chair
Dirk Stroobandt, Program Chair


Organizing Committee

General Chair

Stamatis Vassiliadis
TU Delft, The Netherlands
[email protected]

Technical Program Chair

Dirk Stroobandt
Ghent University, Belgium
[email protected]

Publicity Chair

Stephan Wong
TU Delft, The Netherlands

Technical Committee Members

Lieven Eeckhout, Ghent University, Belgium
Skevos Evripidou, University of Cyprus, Cyprus
Georgi N. Gaydadjiev, TU Delft, The Netherlands
Mike Hutton, Altera, USA
Wolfgang Karl, University of Karlsruhe, Germany
Stefanos Kaxiras, University of Patras, Greece
Paul Kelly, Imperial College, UK
Wayne Luk, Imperial College, UK
Nacho Navaro, UPC Barcelona, Spain
Dionisios Pnevmatikatos, Technical University of Crete, Greece
Leonel Sousa, Technical Univ. Lisbon, Portugal
Per Stenström, Chalmers, Sweden
Pedro Trancoso, University of Cyprus, Cyprus
Theo Ungerer, University of Augsburg, Germany
Steve Wilton, UBC, Canada


List of reviewers

Muhammad Aqeel, TU Delft, The Netherlands
Lieven Eeckhout, Ghent University, Belgium
Georgi N. Gaydadjiev, TU Delft, The Netherlands
Mike Hutton, Altera, USA
Wolfgang Karl, University of Karlsruhe, Germany
Stefanos Kaxiras, University of Patras, Greece
Paul Kelly, Imperial College, UK
Wayne Luk, Imperial College, UK
Nacho Navaro, UPC Barcelona, Spain
Dionisios Pnevmatikatos, Technical University of Crete, Greece
Kamana Sigdel, TU Delft, The Netherlands
Vlad-Mihai Sima, TU Delft, The Netherlands
Leonel Sousa, Technical Univ. Lisbon, Portugal
Per Stenström, Chalmers, Sweden
Pedro Trancoso, University of Cyprus, Cyprus
Theo Ungerer, University of Augsburg, Germany
Steve Wilton, UBC, Canada

Acknowledgements

The organizing committee would like to thank HiPEAC Workshop Chair Lieven Eeckhout for providing all input and help on the organisation of this workshop, Thomas Van Parys from Ghent University for typesetting these proceedings, and Yana Yankova from Delft University of Technology for developing and updating the HiPEAC WRC website.


Reconfigurable Supercomputing means to brave the Paradigm Chasm*

Reiner Hartenstein, TU Kaiserslautern http://hartenstein.de

Reconfigurable Computing, mainly by using FPGAs, has been successful in embedded systems for many years. Due to the sufficient hardware expertise available in these communities, astonishing speed-ups have been achieved in a variety of areas such as signal and image processing, bio-informatics, cryptology, communications processing, data and text mining, global optimization, and others. An interesting side effect that was observed accidentally is going to become a national strategic issue: the speed-up obtained by software-to-configware migration comes along with a drastic reduction of the electricity bill.

More recently, Reconfigurable Computing has also attracted an increasing number of experts from supercomputing and other high performance computing scenes where, however, software perspectives are dominant, leading to a clash of paradigms. For many applications, growing conventional MPP (massively parallel processing) parallelism does not scale well and reduces programmer productivity, so that “The Law of More” is the problem, and not the Law of Moore. Will Reconfigurable Supercomputing solve these problems? Can portability be combined with very high performance? Is further progress limited by fundamental misconceptions of algorithmic complexity theory, instead of by hitting physical limits? For both paradigms, configware and software, and for the proper partitioning of dual-paradigm solutions, we have to re-think all basic assumptions behind computing. We need a new roadmap.

* Prepared for the HiPEAC Workshop on Reconfigurable Computing, Ghent, Belgium, January 28, 2007.


Profiling floating point value ranges for reconfigurable implementation

Ashley W Brown, Paul H J Kelly, Wayne Luk

∗ Department of Computing, Imperial College London, United Kingdom

ABSTRACT

Reconfigurable architectures offer potential for performance enhancement by specializing the implementation of floating-point arithmetic. This paper presents FloatWatch, a dynamic execution profiling tool designed to identify where an application can benefit from reduced precision or reduced range in floating-point computations. FloatWatch operates on x86 binaries, and generates a profile output file recording, for each instruction and line of source code, the overall range of floating-point values, the bucketised sub-ranges of values, and the maximum difference between 64-bit and 32-bit executions.

We present results from the tool on a suite of four benchmark codes. Our tool indicates potential performance loss due to denormal values, and helps to identify opportunities for using the dual fixed-point arithmetic representation, which has proved effective for reconfigurable designs. Our results show that applications often have highly modal value distributions, offering promise for aggressive floating-point arithmetic optimisations.

1 Introduction

Scientific applications often require high accuracy, high data throughput and high speed calculations. Most commodity hardware sacrifices accuracy for speed, potentially limiting its usefulness in the scientific arena. Single-instruction multiple-data (SIMD) architectures and graphics cards provide high speed calculations on IEEE single precision floating point only. Other architectures, such as IBM’s Cell [Hofs05], have also prioritised single precision floating point to target the games market.

In the reconfigurable computing space, placing a full-featured floating point unit onto an FPGA consumes vast amounts of on-chip resources, even for single-precision floating point. Re-using the same unit repeatedly is possible, but introduces an artificial bottleneck. Knowledge of the likely value ranges which will reach a floating point unit allows us to refine both the data representation and the functional units to reduce resources whilst still providing performance. The BitSize [Gaff04] tool allows this type of refinement looking at source code alone and may be a useful companion to FloatWatch. Cheung et al. [pub05] developed a method for generating fixed-point versions of elementary functions, such as logarithm and square root, while using IEEE floating point for input and output. The conversion between IEEE floating point and fixed point is transparent to the user.

Conversion of scientific software to single-precision floating point would permit better use of this commodity hardware, albeit with an associated loss of accuracy. With reconfigurable fabrics at our disposal, custom floating point representations are possible. Moreover, we are able to change the representation for different phases of an application if we are able to identify both the phases and appropriate representations, as demonstrated by Styles and Luk [Styl05].

2 Tool Structure

The FloatWatch tool operates on x86 binaries compiled with debugging information, under the Valgrind [Neth03] dynamic instrumentation framework.

Output consists of a raw data file containing profiling information and a dynamic HTML user interface to manipulate and explore the data. Alternatively, the data may be exported for plotting in GNUPlot or Excel.

FloatWatch provides the following information for each assembly instruction in the program:

• Overall range of values

• Bucketised sub-ranges of values

• Maximum difference between 64-bit and 32-bit floating point executions

The information can be aggregated for each line of source code, then explored on a line-by-line basis by the user. Figure 1 shows the source display and value graph provided by the HTML user interface.
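The three per-instruction statistics can be sketched in a few lines of Python. This is our own illustrative reconstruction, not FloatWatch code: the power-of-two binning scheme and all function names are assumptions.

```python
import math
import struct
from collections import Counter

def bucket(value):
    """Map a result to a power-of-two magnitude bucket. FloatWatch's real
    bucket boundaries are not given in the text; this binning is an
    illustrative assumption."""
    if value == 0.0:
        return "zero"
    sign = "-" if value < 0.0 else "+"
    exponent = math.floor(math.log2(abs(value)))
    return f"{sign}2^{exponent}"

def as_float32(x):
    """Round-trip a Python double through IEEE single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]

def profile(values):
    """Collect the three statistics listed above for a stream of results:
    overall range, bucketised sub-ranges, and the maximum 64-bit vs
    32-bit discrepancy."""
    counts = Counter(bucket(v) for v in values)
    max_diff = max(abs(v - as_float32(v)) for v in values)
    return (min(values), max(values)), counts, max_diff

rng, counts, max_diff = profile([0.75, -1.5, 3.2, 1e-5])
```

The real tool gathers these statistics inside Valgrind instrumentation rather than over a Python list, but the aggregation logic is the same in spirit.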

FloatWatch was conceived to provide insight into the behaviour of scientific code, which is often “dusty desk” software whose authors have long since left the organisation concerned. During attempts to accelerate some scientific applications, it was realised that we had very little insight into the overall behaviour of the code, or into the characteristics of particular frequently-executed sections. Valgrind provides a convenient base to build upon, although performance is currently an issue.

FloatWatch operates as a tool in the Valgrind framework, much like Cachegrind and Callgrind [Weid04]. Valgrind reads x86 and PowerPC binaries, converting them to an intermediate representation consisting of simple operations. Complicated x86 arithmetic with memory operands is flattened to a sequence of loads, stores and arithmetic with temporaries. Basic blocks are processed one at a time, with each block passed to an instrumentation tool (FloatWatch in this case) which inserts, removes or modifies instructions as necessary. The intermediate representation is then converted back to machine instructions, cached and executed.

The FloatWatch instrumentation tool adds instrumentation code to track the results of floating point operations in the target application, optionally inserting single precision versions of double precision operands and then tracking the difference between single and double results.


Figure 1: Exploring profile results using the FloatWatch user interface. Each highlighted line of source code can be expanded to show assembly-level code. Each line’s floating-point value distribution can be selected for graphical display.

After execution the tool creates a raw output file with the data it has collected, along with the intermediate representation of basic blocks containing floating point operations. The FloatWatch post-processor takes this raw output and the application source files, producing an HTML+JavaScript report which can be dynamically manipulated by the user to produce graphs of value ranges for particular lines of code. The values may then be exported to graphing software for use in reports.

The same technique may be used to track integer values if required; however, scientific applications are the focus here.

Figure 2: Block Diagram of the FloatWatch Tool Chain. An x86 binary and the source files (C, FORTRAN) are fed to FloatWatch running under Valgrind; the raw output is processed by the FloatWatch post-processor into HTML, which the user manipulates in a web browser, with CSV export to external graphing tools.


3 Optimisation Opportunities

The data produced by FloatWatch helps guide the process of optimisation for targets with a variety of floating point configurations. 32-bit vs 64-bit comparisons provide an intuition as to the safety of execution on a GPU, for example. Analysis of data ranges may guide the implementation of a custom floating point unit which is more efficient in the active ranges.

This section provides an overview of some possible modifications or optimised implementations.

3.1 Optimised Floating Point Unit

For some of our real-world applications, ranges are the same or similar across a wide variety of real-world datasets. With knowledge of likely value ranges available, it becomes possible to design optimised floating point units which have their highest performance within these ranges. Outside the “standard” ranges, a smaller, slower or software implementation could be used instead. This is similar to the handling of denormal floating point numbers in many microprocessors, which resort to software emulation of floating point operations when denormal numbers are seen.

The example program MORPHY [Pope96] illustrates one of the potential optimisations: the results of most operations fall within the orders of magnitude of [±2^0, ±2^-4]. Knowing this allows the floating point unit to be simplified by reducing some of its most expensive parts, the barrel shifters required for operand alignment and post-operation normalisation. A shifter capable of 4-bit rather than 52-bit shifts may be used, reducing congestion and resource usage. Values outside this range may be aligned via multi-cycle shifting or software emulation, as they occur infrequently enough for the penalty not to be significant.

3.2 Removal of Excessive Zero Values and Denormal Numbers

Our profiling of the SpecFP95 ‘mgrid’ benchmark indicates a large number of zeroes or denormal values in the results. Calculating zeroes implies input of zeroes, which is, in general, a waste of computing resources. Calculations with denormal numbers have an adverse effect on performance because typical processor designs implement them in software, whilst custom hardware designs do not implement them at all.

Identifying the use of denormal numbers allows optimised hardware to be produced for such numbers, or the underlying code to be modified to avoid them, if possible.
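The detection itself is a simple magnitude test. A minimal Python sketch, using the standard IEEE 754 double-precision boundary (the helper name is ours, not the tool's):

```python
import sys

def is_subnormal(x):
    """True for nonzero doubles below the smallest normal magnitude
    (sys.float_info.min == 2.0**-1022): these are denormal/subnormal
    values, which trigger the slow path described in the text."""
    return x != 0.0 and abs(x) < sys.float_info.min

# A kernel value that has decayed into the subnormal range:
decayed = 2.0 ** -1060    # subnormal: handled in microcode/software
healthy = 2.0 ** -1000    # still a normal double
```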

3.3 Alternative Representations

A variety of alternative representations are available in custom hardware, including a wide range of floating point formats, fixed point, and variations such as dual fixed point.

3.3.1 IEEE 32-bit float (vector instructions)

At its simplest level, a program may be converted from 64-bit to 32-bit floating point, providing possibilities for vectorisation by hand or with a vectorising compiler. FloatWatch provides information about the error that would accumulate should parts of the program be run in single precision.
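The accumulated-error idea can be illustrated by running the same reduction in double precision and in simulated single precision; this is a toy reconstruction of ours, not the tool's mechanism, and the names are assumptions:

```python
import struct

def f32(x):
    """Round a Python double to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate_both(values):
    """Sum in double precision and in simulated single precision,
    returning both results and the accumulated error between them."""
    d = 0.0
    s = 0.0
    for v in values:
        d = d + v
        s = f32(s + f32(v))   # every intermediate rounded to 32 bits
    return d, s, abs(d - s)

d, s, err = accumulate_both([0.1] * 1000)
```

A long chain of additions like this makes the 32-bit rounding error visible even though each individual operation loses very little.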


Figure 3: The Effect of Gradual Underflow. Convolution with a kernel containing progressively smaller values; execution time in seconds is plotted against the matrix initial value for a P4 3.2GHz, an Apple G5 and an Opteron 250. Performance on Intel and AMD processors falls off dramatically as values enter the subnormal range. On the Opteron, performance begins to fall off even before this range.

3.3.2 Fixed point

In applications where a very narrow range is required, fixed point arithmetic may be appropriate. Using pure integer arithmetic provides a dramatic performance improvement, or decrease in cost, when compared to floating point. FloatWatch is unable to guarantee that values outside a particular range will not appear, so appropriate handling of exceptional cases is required.

Many of the test runs we have performed show symmetry around 0, allowing the sign of a fixed-point number to be represented in the standard way for a processor or custom design. For those with asymmetric value ranges, a solution such as dual fixed-point may be suitable.

3.3.3 Dual fixed-point

Dual fixed-point (DFX) [Ewe04] is a variation on standard fixed point arithmetic, used where two distinct ranges of values must be represented. It occupies the middle ground between the flexibility of floating point and the efficiency of fixed point.

Fixed-point, by definition, has the binary point in a fixed place determined by the implementation. In DFX a single bit is reserved as a selector, allowing one of two positions for the binary point to be selected. Figure 4 compares 64-bit floating point against 64-bit DFX.
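The selector-bit idea can be sketched as a toy codec. The word width, the two binary-point positions, and the function names below are illustrative assumptions, not the exact format of [Ewe04]:

```python
BITS = 16          # toy word size: 1 selector bit + 15 value bits
FRAC_FINE = 12     # format 0: 12 fractional bits, for small magnitudes
FRAC_COARSE = 4    # format 1: 4 fractional bits, for large magnitudes

def dfx_encode(x):
    """Pick the fine format when the value fits its range, else fall
    back to the coarse one; return (selector bit, raw integer)."""
    fine_max = (2 ** (BITS - 2) - 1) / 2 ** FRAC_FINE
    if abs(x) <= fine_max:
        sel, frac = 0, FRAC_FINE
    else:
        sel, frac = 1, FRAC_COARSE
    return sel, round(x * 2 ** frac)

def dfx_decode(sel, raw):
    """Rescale by whichever binary-point position the selector names."""
    frac = FRAC_FINE if sel == 0 else FRAC_COARSE
    return raw / 2 ** frac
```

Small values keep 12 fractional bits of precision, while larger ones trade precision for range, which is exactly the asymmetric-range situation the text describes.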


Figure 4: Double precision floating-point and dual fixed point representations. The IEEE 64-bit ‘double’ consists of a sign bit s, an 11-bit exponent and a 52-bit mantissa (fractional part). The 64-bit dual fixed point word instead consists of a flag bit r, which switches between the two fixed point representations, and a 63-bit fixed point value whose most significant bits act somewhat like an exponent. The maximum effective precision is the maximum precision achievable with a given value range: if the range is 2^6 wide, this leads to a maximum effective precision of 57 bits for a 64-bit dual-format fixed point solution, trading increased precision for decreased range.

3.4 Dynamic Representations

Custom floating point units and the various alternative representations require a custom hardware implementation. The profiling results shown in the next section also reveal the possibility of dynamic representations, where the representation could change at each phase in the program, or even for each line. Implementing a new ASIC for each possibility would be prohibitively expensive, consuming vast amounts of time and resources. Reconfigurable architectures provide a solution to this problem, allowing the generation of custom hardware designs at compile time, to be loaded in sequence at run-time.

One of the biggest advantages of reconfigurable architectures is the ability to generate a complete pipeline for a block of code, converting between representations within the pipeline itself to preserve the maximum accuracy possible. Different pipelines can also be loaded onto a reconfigurable device as the program progresses, or as the execution context changes. The line-by-line refinement possible with FloatWatch allows this characteristic to be identified.

4 Results

We have profiled a sample of applications from different sources, including both “real-world” applications and benchmarks. This section describes the applications and their profiling results.

4.1 MORPHY

MORPHY [Pope96] is a commercial application under development at the University of Manchester which performs an automated topological analysis of a molecular electron density. It has two modes, fully analytical and semi-automatic, with the semi-automatic method running faster but not always able to produce a result.

The application was run with data for water, peroxide and methane molecules. Figure 5 shows the value ranges for this application, using the semi-automatic method. The y-axis shows the fraction of values falling within the range shown; the number of calculations performed varies widely between the datasets.

Examination of the results reveals some interesting features. Firstly, the graph has two distinct ranges, one on either side of zero. These ranges are slightly asymmetric but very narrow, indicating the possibility of a 64-bit fixed point or DFX implementation. Secondly, the ranges are similar across the three datasets tested so far. Further work is being carried out to determine why this should be the case.

4.2 “ydl_pij” (Molecular Mechanics)

“ydl_pij” is an iterative solver for computational chemistry, using the Molecular Mechanics - Valence Bond [BM03] method. The code has many uses, including modelling magnetism and other electromagnetic properties, and is currently being used in the Department of Chemistry at Imperial College. Figure 6 shows the key sections of the graph of value ranges on a variety of datasets. The number in each test name indicates the number of electrons.

This small graph does not adequately illustrate one of the key features of this application: while values concentrate in ranges that are not excessively wide, there are also a large number of values spread across a much wider range. This makes them almost unnoticeable on a full-size graph; however, this long tail indicates that a specialised implementation for a narrow range of values would not be appropriate: the application would spend a large proportion of its time executing with out-of-range data, which would most likely be handled using a slow but cheap method.

Viewing the graph as it develops over time may reveal that the low-level wide range of values disappears after the first few iterations, allowing an alternative representation to be used later in the program.

4.3 SpecFP95 ‘mgrid’

The SpecFP95 ‘mgrid’ benchmark is a simplified multigrid solver which calculates a 3D potential field. As with MORPHY, it has two primary ranges. In this case the ranges are evenly spread, with the exception of a spike at each end. A pronounced spike in the centre, indicating zero or denormal numbers, points to possible unnecessary performance problems which could be reduced.

4.4 SpecFP95 ‘swim’

The SpecFP95 ‘swim’ benchmark is a weather predictor based on the shallow-water equations, using finite difference approximations. It is the only single precision benchmark in SpecFP95; however, on x86 most calculations occur in double precision, with the result converted to single precision for storage only. It has several interesting features not seen in the previous (pure double-precision) test applications.

The graph shows two primary ranges of results, one on either side of the centre. A saw-tooth form is seen, with 4 “teeth”, indicating four separate sub-ranges. Work is currently taking place to analyse this trend over time; it may be that each sub-range corresponds to a different iteration of the algorithm, in which case some dynamic modification of the representation used may be possible.

4.5 Summary of Results

Table 1 summarises the potential optimisation techniques which could be used on each application.

Application   Optimisation
MORPHY        Conversion to dual fixed-point format
ydl_pij       Few likely candidates; temporal profiling may reveal options
mgrid         Potential to remove zero or denormal numbers
swim          Phase-dependent customisation of floating point units/representation

Table 1: Optimisation Options for Floating Point Applications

5 Current and Future Work

FloatWatch is currently able to return useful results; however, so far they can only act as an insight. Many additions are planned to improve the decisions which can be made on the results.

5.1 Improved Verification

The data generated by FloatWatch over multiple runs is only able to give an intuition about the general behaviour of a piece of code with multiple datasets. No verification is possible at present, so “fall-back” options must always be provided in case previously observed behaviour is not repeated with a new dataset.

One of the most useful possible extensions is to provide a verification framework, implementing techniques such as search-based testing to produce more rigorous results. The goal here is to look at real-world behaviour rather than theoretical behaviour; however, any improvement in the confidence one can have in the results is desirable. The BitSize [Gaff04] tool provides similar functionality to this.

5.2 Extended Simulation

At present FloatWatch is only able to track the error for 32-bit vs 64-bit floating point, rather than for the vast array of floating- and fixed-point formats available when creating custom hardware. A planned future extension is to allow plug-in modules for custom floating-point formats, providing a method of experimentation without resorting to RTL simulation.
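One way such a plug-in could be prototyped is to quantise every tracked result to a reduced mantissa width. The function below is a rough illustration under that assumption; the interface is ours, not a planned FloatWatch API:

```python
import math

def quantize(x, mantissa_bits):
    """Round x to mantissa_bits of fractional mantissa while leaving the
    exponent unbounded: a crude model of a narrower custom float format."""
    if x == 0.0 or not math.isfinite(x):
        return x
    exponent = math.floor(math.log2(abs(x)))
    scale = 2.0 ** (mantissa_bits - exponent)
    return round(x * scale) / scale

# Error a hypothetical 10-bit-mantissa format would introduce:
err = abs(math.pi - quantize(math.pi, 10))
```

Swapping this rounding function per profiled operation would let several candidate formats be compared in one run, as the text proposes.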

The proposed solution would allow the user to add a custom simulation object into the FloatWatch system, with results presented in the same way as at present. The option of testing several different custom formats at once would also prove useful.

Related to this is the ability to dynamically swap data formats as the profiled application progresses, simulating, for example, the possibility of dynamic reconfiguration on an FPGA.
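Such a plug-in simulation object could, at its simplest, be a rounding routine applied to every tracked value. The sketch below is purely illustrative (not FloatWatch's actual implementation): a quantizer for a hypothetical custom float format whose exponent and mantissa widths are parameters.

```python
import math

def quantize(value, exp_bits=8, man_bits=10):
    """Round a 64-bit float to a hypothetical custom format with the given
    exponent and mantissa widths (round-to-nearest, no denormals)."""
    if value == 0.0 or math.isnan(value) or math.isinf(value):
        return value
    mantissa, exponent = math.frexp(value)     # value = mantissa * 2**exponent
    bias = 2 ** (exp_bits - 1) - 1
    if exponent - 1 > bias:                    # overflow to infinity
        return math.copysign(math.inf, value)
    if exponent - 1 < -bias:                   # underflow to zero
        return math.copysign(0.0, value)
    scaled = round(mantissa * 2 ** man_bits)   # keep man_bits fraction bits
    return math.ldexp(scaled, exponent - man_bits)

# Relative error introduced when pi is stored with a 10-bit mantissa:
x = 3.14159265358979
err = abs(quantize(x) - x) / x
```

Running the same workload through several such quantizers at once would give the side-by-side format comparison described above.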


Figure 5: Profile results for 'MORPHY' (cumulative % against value magnitude, for methane, water, and peroxide datasets), showing the core range. Values sporadically fall out of this range.

Figure 6: Profile results for MMVB code ("ydl_pij"), cumulative % against value magnitude for five datasets (9a, 7, 6D2, 12_2, 13_d3h). This graph represents the range with the highest concentration of values.

Figure 7: Profile results for 'mgrid' (value occurrence in millions against value magnitude), showing the full range of values – the obvious key ranges at either side are very wide given the scale.

Figure 8: Profile results for 'swim' (value occurrence in millions against value magnitude), showing the full range of values.
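The profiles in Figures 5–8 bin each operand value by its power-of-two magnitude. A minimal sketch of that binning step (a simplified stand-in, not FloatWatch's actual Valgrind-based instrumentation):

```python
import math

def magnitude_histogram(values):
    """Count values per power-of-two magnitude bucket (sign ignored);
    zeros are tallied separately since log2(0) is undefined."""
    hist, zeros = {}, 0
    for v in values:
        if v == 0.0:
            zeros += 1
            continue
        bucket = math.floor(math.log2(abs(v)))   # e.g. 1.5 falls in bucket 0
        hist[bucket] = hist.get(bucket, 0) + 1
    return hist, zeros

hist, zeros = magnitude_histogram([1.5, 3.0, 0.0, -0.25])
```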


5.3 Using the Data

We intend to use the data gathered from our test runs to generate custom FPGA designs or GPU programs to accelerate our real-world applications and benchmarks. A number of further test runs are required, followed by an implementation of the most likely candidate applications.

6 Conclusion

The FloatWatch tool can provide a valuable insight into the floating-point behaviour of a variety of scientific applications, regardless of implementation language. At present, all implementation decisions based on the data generated by the tool would require a fallback, as no conclusive proof of the value ranges used is offered. A number of future enhancements will expand the capabilities of the system.

The behaviour of applications varies widely: some use a very narrow range of values for all tested datasets, whilst others use a wide range of values, or narrow ranges which are dataset-dependent. All tests so far have revealed near-symmetric behaviour around zero, with approximately balanced numbers of positive and negative values. The FloatWatch tool has helped identify promising candidates for implementation with optimised floating-point formats, and various reconfigurable designs are currently being developed.

References

[BM03] M. J. Bearpark. Excited states of conjugated hydrocarbon radicals using the molecular mechanics – valence bond (MMVB) method. Theoretical Chemistry Accounts, pages 105–114, 2003.

[Ewe04] C. Ewe, P. Cheung, and G. Constantinides. Dual Fixed-Point: An Efficient Alternative to Floating-Point Computation. In Proceedings of the International Conference on Field Programmable Logic 2004, pages 200–208. Springer-Verlag, 2004.

[Gaff04] A. Gaffar, O. Mencer, W. Luk, and P. Cheung. Unifying Bit-Width Optimisation for Fixed-Point and Floating-Point Designs. In FCCM '04: Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pages 79–88, Washington, DC, USA, 2004. IEEE Computer Society.

[Hofs05] H. Hofstee. Power Efficient Processor Architecture and The Cell Processor. In HPCA '05: Proceedings of the 11th International Symposium on High-Performance Computer Architecture, pages 258–262, Washington, DC, USA, 2005. IEEE Computer Society.

[Neth03] N. Nethercote and J. Seward. Valgrind: A Program Supervision Framework. Electronic Notes in Theoretical Computer Science, 89(2), 2003.

[Pope96] P. Popelier. MORPHY, a program for an automated "atoms in molecules" analysis. Computer Physics Communications, 93:212–240, February 1996.

[pub05] Automating Custom-Precision Function Evaluation for Embedded Processors, 2005.


[Styl05] H. Styles and W. Luk. Compilation and Management of Phase-Optimized Reconfigurable Systems. In Proc. International Conference on Field Programmable Logic, pages 311–316, 2005.

[Weid04] J. Weidendorfer, M. Kowarschik, and C. Trinitis. A Tool Suite for Simulation Based Analysis of Memory Access Behavior. In International Conference on Computational Science, pages 440–447, 2004.


Exposed Datapath for Efficient Computing

Magnus Björk, Magnus Själander, Lars Svensson, Martin Thuresson, John Hughes, Kjell Jeppson, Jonas Karlsson, Per Larsson-Edefors, Mary Sheeran, and Per Stenström

Chalmers University of Technology

Abstract— We introduce FlexCore, the first exemplar of a processor based on the FlexSoC processor paradigm. The FlexCore utilizes an exposed datapath for increased performance. Manually scheduled micro-benchmarks yield a performance boost of up to a factor of two over a traditional five-stage pipeline with the same functional units as the FlexCore. The compiler is always capable of scheduling the instructions of a general-purpose application onto the FlexCore on par with a traditional GPP in terms of cycle count.

The flexible interconnect allows the FlexCore datapath to be dynamically reconfigured as a consequence of code generation. Additionally, specialized functional units may be introduced and utilized within the same architecture and compilation framework.

The exposed datapath requires a wide control word. The conducted evaluation confirms that this increases the instruction bandwidth and memory footprint. This calls for efficient instruction decoding, as proposed in the FlexSoC paradigm.

I. INTRODUCTION

Cost- and performance-sensitive application areas, such as cellular phones and other battery-powered multimedia devices, are not well served by present-day general-purpose computing platforms. To meet user expectations of features and battery capacity, designers instead resort to highly heterogeneous systems where a collection of specialized hardware accelerators (built for encryption, image and video coding, audio playback, etc.) are controlled by an embedded microprocessor, such as an ARM core. For cost reasons, several accelerators will typically be colocated with the microprocessor on a single System-on-Chip (SoC). The SoC then constitutes a highly heterogeneous computing system, tuned for a set of specialized tasks.

The present practice has drawbacks. Tasks outside the set originally intended may not benefit from the computing capacity available: computing resources hidden inside an accelerator may be difficult or impossible to use in ways other than those considered by the accelerator designer. Even when possible, the software constructs necessary to access a "hidden" hardware block bear little resemblance to ordinary code.

For ease of software development and maintainability, a uniform hardware/software interface, similar to those offered by general-purpose processors (GPPs), would be highly desirable; but present-day GPPs cannot compete with the heterogeneous SoC style in terms of performance at a given power level. Merging the accelerator datapath elements into the GPP infrastructure would be possible in principle, but a very wide instruction word would be required to make fine-grained control possible. Moreover, unlike in a conventional VLIW machine, most of the specialized datapath elements

would be idle at any given moment, so instruction bandwidth and memory footprint would be wasted.

In our approach, called the FlexSoC scheme [1], we address these problems by moving away from the traditional GPP instruction set architecture (ISA) and using a very wide control word. This allows us to control the functional units in the datapath at a much finer-grained level. Other important differences between our approach and the standard GPP-like ISA are that each control word, or Native-ISA (N-ISA) word, controls the datapath in the current cycle only, and that forwarding needs to be done statically. Previous work on exposed control using co-design has shown significant performance improvements [2].

Fig. 1. AS-ISA instruction decoding into N-ISA instructions

Intuitively, both the instruction bandwidth and the static code size will be negatively affected by the wide control word. Therefore, in our FlexSoC scheme, we introduce an architecture with a programmable instruction decoder. This decoder allows the compiler to use a compressed Application-Specific ISA (AS-ISA) for each application, or set of applications, to be executed. Applications are stored as AS-ISA instructions, which are expanded on the fly to N-ISA instructions when fetched from the program memory. Figure 1 illustrates this scheme.

In this paper, we introduce one exemplar of the FlexSoC paradigm, dubbed the FlexCore [3]. The FlexCore is used to evaluate datapath utilization and to investigate the demands on the reconfigurable instruction decoder. Our results show that the performance increase indeed comes at a cost of increased instruction bandwidth and larger static code size.


II. THE BASELINE FLEXCORE ARCHITECTURE

To offer the full programmability of a GPP, we have decided to include all the functional units necessary to emulate a full-featured processor in our baseline architecture. Application studies in our field of interest, in particular comparisons [4] of two audio compression standards (MP3 and Ogg Vorbis), have convinced us that full GPP functionality is necessary for flexibility. We have opted to use a traditional, single-issue, five-stage processor similar to the Hennessy–Patterson 32-bit DLX and MIPS R2000 as a pattern [5]. A five-stage processor is not a high-performance design, but in FlexSoC, our ambition is to provide application performance mainly through the use of specialized accelerators rather than through conventional methods. Thus, our core is designed to be flexible and gradually extensible with accelerators according to application requirements.

The datapath is fully exposed and is controlled through the 91-bit-wide N-ISA control word (see Figure 2). The baseline FlexCore consists of four functional units: register file, arithmetic logic unit (ALU), load/store unit (LS unit), and a program control unit (PC unit), with all units connected to a flexible, fully connected interconnect. To allow for GPP functionality, each functional unit output port is connected to a data register. These act as pipeline registers in the various pipeline configurations that can be assembled using the interconnect. Thus, it is easy to create different pipelines by routing a result from one functional unit to the next.

To allow instructions to be scheduled on the FlexCore in the same way as on a conventional five-stage pipeline, two extra data registers (RegA and RegB) have been included. The two data registers are used in the execute and load/store stages and allow data to bypass the ALU and LS unit.

Fig. 2. Illustration of a baseline FlexCore. Note that each DATA Reg also has an enable signal not shown in the figure.

The baseline FlexCore can act as a GPP since it can emulate a conventional pipeline and run instructions in the same way¹. The FlexCore architecture also allows for high resource utilization due to the flexibility in scheduling, thanks to the fine-grained N-ISA control word.

¹ Excluding support for floating-point operations and multiplications.

A. N-ISA: Exposed Datapath

As described in Section I, the datapath of a FlexSoC processor can be precisely controlled through the N-ISA control word. The N-ISA depends heavily on the architecture and its functional units. The N-ISA for the baseline FlexCore architecture is shown in Figure 3. Starting from the least significant bit, the control word consists of bits that control the interconnect, the program counter unit (which also includes the 32-bit immediate value), the two data buffers, the load/store unit, the ALU, and finally the register bank. Most of the bits are very straightforward to determine from the expected functionality of the functional unit. For example, the bits controlling the register file contain fields for which two registers to read, which register to write, a write enable signal, and two stall signals for each read port data register². The PC unit handles the immediate value, and the ImmSel signal selects whether the value emitted from the PC unit should be the current immediate value or the address of the next instruction (which is used in jump-and-link-like instructions).

The N-ISA includes the bits controlling the interconnect. Since each output can be associated with any input in every cycle, the number of bits, n, needed to control an N-input, M-output interconnect is n = N · ⌈log₂(M)⌉.
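This bit count can be computed directly; note that the port counts in the sketch below are hypothetical examples, not the actual FlexCore figures:

```python
import math

def interconnect_bits(n_inputs, n_outputs):
    """n = N * ceil(log2(M)): each of the N input ports independently
    selects one of the M output ports every cycle."""
    return n_inputs * math.ceil(math.log2(n_outputs))

# A hypothetical interconnect with 6 input ports choosing among 16
# output ports would need 6 * 4 = 24 control bits per cycle:
bits = interconnect_bits(6, 16)
```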

Fig. 3. FlexCore N-ISA control word: Interconnect | PC | D | LS | A | Register. The fields are: Interconnect (24 bits), PC (37 bits, of which 32 bits are immediate), D (data buffers, 2 bits), LS (load/store, 5 bits), A (ALU, 5 bits), and Register (18 bits). The total length is 91 bits.

In each cycle, the controller outputs an N-ISA word, which controls all units in the datapath as well as the interconnect. The exposed-datapath approach differs from the traditional pipelined control word found in general-purpose processors and digital signal processors, where one control word (corresponding to one instruction) contains information about all the pipeline stages, for this and consecutive cycles. In a space/time diagram, the GPP creates a diagonal control word, while the N-ISA gives a horizontal control word.

As can be seen in Figure 3, the N-ISA word consists of 91 bits. Compared to a traditional 32-bit GPP, the FlexCore requires an instruction bandwidth that is almost three times as large in order to keep the datapath busy. The N-ISA is clearly not an efficient representation for storage of the program. Therefore, FlexSoC assumes a reconfigurable instruction decoder/expander in the hardware/software interface. Our results show that both the static code size and the instruction bandwidth must be addressed using the proposed scheme.

The goal of giving the compiler complete control of the hardware has the drawback that binary compatibility between processors with different datapath architectures is lost. With a reconfigurable instruction decoder, as proposed in the FlexSoC scheme [1], it may however be possible to reuse the same AS-ISA for different hardware configurations, but that is a topic for future research and thus not addressed here.

² These are used to stall the datapath, e.g. on a data cache miss.


In Section IV, we show performance gains from using an exposed datapath and a flexible interconnect with the same functional units as a traditional five-stage processor.

B. Extensions to the Baseline FlexCore

The FlexCore architecture can be extended with application-specific accelerator units simply by adding more ports to the interconnect and extending the N-ISA control word to include control signals for the new units. Since different units are treated equally, we hope to avoid the complex ad-hoc solutions usually found in irregular interconnects. For instance, for each unit added to a normal pipeline, the forwarding network with control logic has to be modified. A conventional fixed pipeline depth also makes it cumbersome to add functional units and utilize them efficiently: either the new unit is put in the execute stage and can thereby only be used if the ALU is not used; or a new pipeline stage is added, which changes the architecture considerably; or the unit can be added as a coprocessor, which causes communication overheads.

A fully connected crossbar guarantees that the interconnect will not restrict the scheduling of operations on the functional units. This motivates its use in the explorative phase of the processor design. As seen in Section IV, the full connectivity may not be needed for a given application domain; this provides an opportunity to reduce the area and power requirements of the processor, once a suitable collection of functional units has been determined.

III. COMPILING FOR FLEXCORE

The flexibility of FlexSoC architectures enables numerous compilation strategies. Given the ability of the FlexCore to emulate a conventional processor, we chose as our initial approach to translate conventional GPP-like code into N-ISA code. This work has shown that the FlexCore can indeed emulate a conventional five-stage processor in real time: on the examples we have tried out, the number of cycles required for running the same program on a DLX and on the FlexCore has differed by less than 3%.

The translation of single GPP instructions to N-ISA code is straightforward. We use the same pipeline structure as in the DLX, but the instruction fetch stage is implicitly handled by the FlexCore control unit. In other words, each instruction spans four cycles. The first cycle uses the immediate port and the read ports of the register bank. The second cycle uses the ALU and one data register. The third cycle uses the load/store unit and the other data register. Finally, the fourth cycle uses the write port of the register bank. Sequences of such instructions are merged using the static optimization techniques described below, to achieve pipelining and forwarding.

The technique of executing GPP programs on the FlexCore is useful for showing that it is at least as powerful as the DLX processor, and can be configured to work as one. Obviously, this is not the best way to exploit the architecture. Even though the interconnect allows communication between any two units, DLX programs use only the paths corresponding to those found in the DLX processor. We therefore aim to compile high-level code down to N-ISA along the lines of other compilation methods for general datapaths, such as [6]. This enables the pipeline length and structure to be changed as often as needed, and even allows programs to use the functional units in any order. Currently, profiling allows the programmer to manually schedule performance-critical regions.

A. Instruction-Level Static Code Optimization

Translating DLX assembly code into N-ISA code yields a number of N-ISA instruction sequences that should be scheduled as tightly as possible, overlapping each other as allowed by resource conflicts and data dependencies. Resource conflicts are not an issue in the case of DLX code, thanks to its pipeline structure. Data dependencies are more important, since consecutive operations often use the same register.

When one operation uses the contents of a register that is updated by the previous operation, several cycles can often be saved by forwarding: taking the value directly from the functional unit that produces it, rather than waiting for it to be written to the register bank first. Another simple optimization is to change the register read port, if two registers would otherwise be read by the same port in the same cycle.

Pipelined processors usually do these optimizations at run-time. However, due to the exposed control word of the FlexCore, we can and must do them statically. The basic operation is to compose two sequences of N-ISA instructions sequentially, with as much overlap as possible. This is done by annotating each instruction with information about which resources are used, and the status of all registers. Each register can have status available, unavailable, or rerouted(p), where p is the name of an output port of a functional unit. Normally, registers are marked as available, which means that their value can be read from the register bank. A register is unavailable when a new value for the register is currently being computed and is not yet available. When the value is available but not yet written to the register bank, the rerouted(p) annotation tells the compiler where the value can be found. In such a case, the register read is omitted, and the value is fetched from the port p instead of the register port.

Techniques such as these are not restricted to rescheduled conventional-pipeline programs, but can be used for any N-ISA code. They help determine whether any two annotated N-ISA sequences are composable with a fixed overlap. To find the maximal possible overlap, we begin by composing them without overlap, then with one cycle of overlap, thereafter with two cycles of overlap, and so on until we fail. It may be possible to continue even further, but then we must perform a more careful analysis to make sure that no write-order conflict occurs. However, we do not expect such aggressive optimizations to have a significant efficiency impact.
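The overlap search can be sketched as follows, modelling each N-ISA sequence as a list of per-cycle resource sets. This simplified version checks only resource conflicts, not the register-status annotations described above:

```python
def max_overlap(seq_a, seq_b):
    """Largest number of cycles by which seq_b may overlap the end of
    seq_a without any cycle claiming the same resource twice.
    Each sequence is a list of per-cycle sets of resource names."""
    best = 0
    for overlap in range(1, min(len(seq_a), len(seq_b)) + 1):
        tail = seq_a[len(seq_a) - overlap:]
        if any(not a.isdisjoint(b) for a, b in zip(tail, seq_b)):
            break       # stop at the first failing overlap, as in the text
        best = overlap
    return best

# A DLX-style instruction occupying regread, ALU, load/store, regwrite
# in four consecutive cycles:
instr = [{"regread"}, {"alu"}, {"ls"}, {"regwrite"}]
# Two such instructions can overlap by 3 cycles (classic pipelining):
overlap = max_overlap(instr, instr)
```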

B. Basic-Block Level Static Code Optimization

Basic blocks should also be scheduled as tightly as possible. We model basic blocks as sequences of N-ISA instructions, ending with a branch, a static jump, or a dynamic jump. A branch has a condition (such as the zero flag of the ALU is


set) and two addresses; a static jump has one address; and a dynamic jump has no information in the N-ISA word (the destination address is passed through the interconnect). At the end of the block, there may be a tail, which is another sequence of N-ISA instructions. These instructions are merged at run-time with the instructions of the next basic block by bitwise or operations. The tail is used to model configurations where each AS-ISA instruction may span several cycles.

How soon the next block can start to execute after a jump can be calculated using the instruction-level optimization methods on the tail of the first block and the code of the second block. To do this, all possible paths in the program must be calculated. This is easy to do for branches and static jumps. The possible destinations of dynamic jump operations can be found by keeping track of all code addresses that are stored in data registers (i.e., all jal and jalr instructions). The delay of branches and static jumps can be stored in the block, while the delay of dynamic jumps is better represented as a number of nops in the successor block (assuming that only one function can return to a specific address, while a specific function can return to several different addresses).
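The bitwise-or merging of a tail with the next block can be sketched directly on control words (a minimal model, assuming all-zero words act as NOPs):

```python
def merge_tail(tail, block, nop=0):
    """OR the trailing N-ISA words of one basic block into the leading
    words of the next, padding the shorter side with NOP words."""
    n = max(len(tail), len(block))
    tail = tail + [nop] * (n - len(tail))
    block = block + [nop] * (n - len(block))
    return [t | b for t, b in zip(tail, block)]

# A 1-word tail merged into a 2-word successor block:
merged = merge_tail([0b0011], [0b0100, 0b1000])   # [0b0111, 0b1000]
```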

IV. EXPERIMENTS AND RESULTS

The initial benchmarks for the FlexCore are taken from the embedded domain. The first benchmark is the matrix operation sum of absolute differences (SAD), a common kernel in many media applications such as MPEG-2 video encoding. The benchmark takes two 3x3 matrices and returns the sum of the absolute differences between the corresponding matrix elements. In the other benchmark, matrix convolution, a 3x3 filter matrix is applied to each pixel of a 4x4 image. The image also has a border of zeros around it in order to handle the pixels on the edges correctly. Equation 1 shows the operation on each pixel, where F is the filter matrix and X the original image. The computed result is rounded down to 255 or up to 0 if necessary.

Y[x, y] = Σ_{i=0}^{2} Σ_{j=0}^{2} F[i, j] · X[x + i − 1, y + j − 1]    (1)
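Read literally, Equation 1 with the saturation step amounts to the following sketch (the function name and toy matrices are illustrative only):

```python
def convolve_pixel(F, X, x, y):
    """Apply the 3x3 filter F at pixel (x, y) of image X, which is assumed
    to carry a zero border, saturating the result to the range [0, 255]."""
    acc = sum(F[i][j] * X[x + i - 1][y + j - 1]
              for i in range(3) for j in range(3))
    return max(0, min(255, acc))

# An identity filter leaves the centre pixel of a zero-bordered image intact:
F = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
X = [[0, 0, 0], [0, 7, 0], [0, 0, 0]]
centre = convolve_pixel(F, X, 1, 1)
```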

In order to compare the FlexCore against a five-stage GPP, we have used WINDLX³, a simulator for the DLX architecture. To distinguish between the performance gains achieved by the exposed datapath and by the flexible interconnect, a FlexCore with only the interconnect paths present in the GPP pipeline has also been simulated; it is identified as "Exposed GPP" in the tables. Each benchmark has been manually optimized for the three architectures, and both the static and dynamic instruction counts have been analyzed.

For the convolution benchmark, we have added a multiplier to the FlexCore architecture. For comparison with the GPP implementation, we have used the same 4-cycle delay as in WINDLX in our architecture.

³ WINDLX: developed by Herbert Grünbacher, University of Technology Vienna, Institut für Technische Informatik.

In the FlexCore datapath, all functional units, including the multiplier, can work directly with data in the register file. This is not true for the original DLX, which has special-purpose multiplication registers and associated move instructions to move data to and from these registers. The WINDLX simulator does not model these special-purpose registers for integer multiplication and will have better performance compared to an exact model. This difference works in favor of the GPP, and our results might be slightly pessimistic because of it.

A. Performance Evaluation

The metrics used in the experiments are static code size and dynamic instruction count. The results are presented in Table I. For both benchmarks, the FlexCore architecture managed to perform the same task using only half the cycles of the GPP. The speedup of a factor of two is clearly achieved by the more efficient use of the available functional units. The exposed datapath together with the flexible interconnect yields a substantial performance boost without resorting to co-design or adding dedicated accelerators.

TABLE I
SIMULATION RESULTS FOR THE BENCHMARKS SAD AND CONVOLUTION.

                 SAD                       Convolution
             Code Size  Cycle Count    Code Size  Cycle Count
GPP          17 inst    152 (100%)     40 inst    2735 (100%)
Exposed GPP  20 inst     85 ( 56%)     68 inst    1774 ( 65%)
FlexCore     18 inst     76 ( 50%)     49 inst    1423 ( 52%)

The code sizes presented in Table I show that the FlexCore programs are of the same size as or slightly larger than their GPP counterparts. This, together with the fact that each FlexCore instruction is almost three times larger than a GPP instruction, clearly shows that both the static code size and the instruction bandwidth need to be addressed.

As can be seen in Table I, a FlexCore with a fully connected network achieves a speedup of 11% to 20% compared to only utilizing an exposed datapath, as in the Exposed GPP example. This is because several interconnect paths not available in the GPP were used. Table II shows a list of those paths. However, out of the 90 paths available in the interconnect, only 12 were used for SAD, and 17 for convolution. The interconnect is clearly underutilized for these benchmarks.

TABLE II
NON-GPP INTERCONNECT PATHS USED IN FLEXCORE BENCHMARKS.

From           To
Register bank  Register bank
ALU            Register bank
Immediate      Register bank
Multiplier     ALU
Register bank  Load/Store
Load/Store     Multiplier

In a GPP, all calculated values are written to the register file. Nevertheless, all written values that are used in the next instruction will be routed through the by-pass network; in

Workshop on Reconfigurable Computing

20

Page 23: W R C Workshop on Reconfigurable Computing · W R C Workshop on Reconfigurable Computing Workshop Proceedings January 28, 2007 Sofitel, Ghent, Belgium

cases where the value is never again read from the register file, the write is unnecessary. In the FlexCore architecture, a compiler could potentially skip the generation of such writes, as long as it can statically find such locations in the program. In this particular case, the number of register writes was reduced from 76 to 57 (by 25%) for the SAD benchmark and from 1438 to 921 (by 36%) for the convolution benchmark by manual scheduling of the instructions. This reduces the contention on the register file and saves power⁴.

B. Implementation

To evaluate the performance in terms of delay, power, and area, VHDL implementations have been created for the different processors. Note that no control logic has been implemented for the processors in our comparison. The exposed GPP is a traditional GPP from which the control logic has been removed. Therefore, when disregarding control logic and instruction fetch, the exposed GPP and the traditional GPP are equivalent. The impact of control logic and instruction fetch on the performance of the different processors is addressed within the FlexSoC project; however, it is not the topic of this paper.

The VHDL has been synthesized (Synopsys Physical Compiler [7]) and placed and routed (Cadence NanoEncounter [8]) using a commercially available 0.13 µm technology. Timing and power estimations (Synopsys PrimeTime [9] and PrimePower [10]) were done on the placed and routed netlists using back-annotated capacitances. The power estimations are for the maximum clock frequency of each of the processors and with PrimePower's default values for activities on the inputs. Table III shows the results of the timing and power estimations for the baseline FlexCore, a FlexCore extended with a multiplier, the exposed GPP, and a traditional GPP. Both the GPP and the exposed GPP are equipped with a multiplier.

TABLE III
TIMING, POWER, AND AREA ESTIMATIONS.

                 Timing (ns)  Power (mW)     Area (mm²)
GPP (MULT)       4.4 (100%)   35.70 (100%)   0.426 (100%)
Exposed (MULT)   4.4 (100%)   35.70 (100%)   0.426 (100%)
FlexCore (MULT)  5.5 (125%)   34.47 ( 97%)   0.444 (104%)
FlexCore         5.1 (116%)   34.39 ( 96%)   0.275 ( 65%)

The FlexCore implementations perform worse in terms of delay in comparison to the GPP and exposed GPP pipelines. This is most likely due to the more flexible interconnect, which incurs a performance penalty. On the other hand, power and area are competitive with those of the GPP and exposed GPP pipelines.

The FlexCore implementation shows more promising results when considering execution time and energy dissipation for a whole application. The convolution benchmark shows comparable execution time and slightly lower energy dissipation compared to the exposed GPP pipeline (Table IV). For the much shorter SAD benchmark, the full potential is not shown for the

⁴ It also complicates exception handling, which is, however, not the topic of this paper.

FlexCore, since the difference in cycle count is much smaller. Compared to a traditional GPP, the FlexCore has both improved execution time and lower energy dissipation. This is due to the much smaller number of cycles needed to execute the two benchmarks.

TABLE IV: EXECUTION TIME AND ENERGY DISSIPATION FOR SAD AND CONVOLUTION.

                SAD                         Convolution
          Time (ns)    Energy (nJ)    Time (ns)      Energy (nJ)
GPP       669 (100%)   23.88 (100%)   12034 (100%)   430 (100%)
Exposed   374 ( 56%)   13.35 ( 56%)    7806 ( 65%)   279 ( 65%)
FlexCore  418 ( 62%)   14.41 ( 60%)    7827 ( 65%)   270 ( 63%)
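The relative numbers in Table IV follow directly from the absolute measurements. A quick check in plain Python, reproducing the table's rounding, confirms the FlexCore row:

```python
def rel_pct(value, baseline):
    """Percentage relative to the GPP baseline, rounded as in Table IV."""
    return round(100.0 * value / baseline)

# Absolute values from Table IV: SAD time (ns) and energy (nJ),
# Convolution time (ns) and energy (nJ).
gpp      = {"sad_t": 669, "sad_e": 23.88, "conv_t": 12034, "conv_e": 430}
flexcore = {"sad_t": 418, "sad_e": 14.41, "conv_t": 7827,  "conv_e": 270}

print({k: rel_pct(flexcore[k], gpp[k]) for k in gpp})
# FlexCore row of Table IV: 62%, 60%, 65%, 63%
```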

The longer delay of the FlexCore implementation can be decreased by restricting the flexibility of the interconnect. As shown in the previous section, only a limited number of the available paths are being utilized.

V. RELATED WORK

Reconfigurable architectures are an active area of research. Dedicated hardware is becoming less attractive because of huge initial costs, long time to market, and inability to adapt to new and changing standards. Reconfigurable hardware is a promising approach to address these problems without forsaking the performance of dedicated hardware. Hartenstein has compiled a thorough survey of reconfigurable architectures [11]. Many modes of reconfigurability have been proposed: reconfigurable accelerators may be connected to a standard pipeline [12], or reconfigurable tiles may be orchestrated to solve given problems [13], [14]. In contrast, the FlexSoC approach employs reconfigurability only in the instruction decoding hardware, leaving the actual data processing to highly efficient dedicated hardware.

The exposed datapath concept has recently been used in the No Instruction Set Computer (NISC) project [15], [2], [6], where the control pipeline is removed and the controller emits a wide instruction word each cycle. Co-design refinement of hardware and software is used to reach the desired performance. The reported speedups are comparable to those we see for the FlexCore example. However, the static code size of a NISC program is claimed to be comparable to that of a GPP. While this might be true for a co-design approach where common complex operations can be implemented with few control bits, we have not seen the same results for the FlexCore architecture. In FlexSoC, we rely on compression and run-time expansion to solve the code size problem. Increased controllability of the datapath has also been motivated by the reduction in hardware complexity, as in the Transport Triggered Architecture [16].

Liang et al. [17] propose an architecture based on a reconfigurable interconnect and show good performance for some domain-specific computations. It is, however, not clear how the results translate to a wider domain of applications.

A common way to accelerate multimedia applications is to add sub-word parallelism within the functional units (SIMD).


This technique is used both in modern general purpose computers and in specific media processors. For a 5-stage DLX implementation, Nia and Fatemi report a speedup of more than a factor of 3 with only minor growth in chip area [18]. The approach is orthogonal to those proposed here and would seem to make a fruitful addition to a FlexSoC core.

Similarly to FlexSoC, the FITS project [19], [20], [21] also envisions the use of flexible instruction decoders. Application profiling allows the selection of a 16-bit application-specific ISA that gives the same performance as the 32-bit baseline case. FlexSoC combines a similar application-specific ISA approach with the performance gains offered by the exposed datapath and the flexible interconnect.

The translation envisioned in the FlexSoC project is somewhat similar to microcode processing, where a complex ISA is broken down into micro-operations that are executed on the pipeline. The main purpose of microcode is to separate the architecture from the implementation, and the microcode is usually derived from the already given ISA. In FlexSoC, this constraint is relaxed and the AS-ISA can be created by the compiler to fit the needs of the applications.

VI. CONCLUSION

The exposed pipeline of the FlexCore offers distinct performance benefits when compared to a GPP with corresponding datapath hardware. The flexible interconnect network further improves performance, and also allows special-purpose datapath elements to be integrated while maintaining a uniform programming interface. With knowledge of the datapath structure, a compiler can realize these performance benefits. Additionally, it is always possible to execute programs compiled for the DLX at cycle counts comparable to a standard DLX implementation.

We have analyzed two media kernels and shown that the FlexCore achieves a speedup of a factor of two in terms of cycle count, compared to a traditional 5-stage GPP with the same functional units. Using cycle times obtained from placed and routed layouts, we show that this cycle count translates into a 35 to 38% total execution-time improvement. This performance boost comes at a cost in both instruction bandwidth and static code size. The FlexCore instructions are about three times as wide as a standard GPP instruction. Since the number of static instructions does not improve for the FlexCore, the static code size is also larger. This clearly motivates the FlexSoC scheme of introducing a reconfigurable instruction decoder to reduce both the static code size and the instruction bandwidth.

Future work includes incorporating a reconfigurable instruction decoding framework in the FlexCore. Different compression schemes can be expected to be more or less suited to the ISA transformations needed, and to carry different implementation costs. Configuration of the instruction decoder could be a one-time event; but run-time, on-demand reconfiguration offers intriguing possibilities, where several tasks, each with a distinct AS-ISA, could share the same hardware.

ACKNOWLEDGMENT

The FlexSoC project is sponsored by the Swedish Foundation for Strategic Research.

REFERENCES

[1] J. Hughes, K. Jeppson, P. Larsson-Edefors, M. Sheeran, P. Stenstrom, and L. J. Svensson, "FlexSoC: Combining flexibility and efficiency in SoC designs," in Proceedings of the IEEE NorChip Conference, 2003.

[2] M. Reshadi, B. Gorjiara, and D. Gajski, "Utilizing horizontal and vertical parallelism with no-instruction-set compiler for custom datapaths," in International Conference on Computer Design (ICCD), October 2005.

[3] M. Bjork, J. Hughes, K. Jeppson, J. Karlsson, P. Larsson-Edefors, M. Sheeran, M. Sjalander, P. Stenstrom, L. Svensson, and M. Thuresson, "FlexSoC technical report Q1 2006," Computer Science and Engineering, Chalmers University of Technology, Tech. Rep. 2006-8, 2006.

[4] J. Marts and T. Carlqvist, "A Hardware Audio Decoder Using Flexible Datapaths," MSc Thesis, Chalmers University of Technology, March 2006.

[5] D. A. Patterson and J. L. Hennessy, Computer Organization & Design: The Hardware/Software Interface, 2nd ed. Morgan Kaufmann Publishers Inc., 1998.

[6] M. Reshadi and D. Gajski, "A cycle-accurate compilation algorithm for custom pipelined datapaths," in International Symposium on Hardware/Software Codesign and System Synthesis (CODES+ISSS), September 2005.

[7] Physical Compiler User Guide, Version W-2004.12.

[8] Encounter User Guide, Version 4.1.

[9] PrimeTime X-2005.06, Synopsys Online Documentation.

[10] PrimePower Manual, Version W-2004.12.

[11] R. Hartenstein, "A decade of reconfigurable computing: a visionary retrospective," in Proceedings of Design, Automation and Test in Europe 2001, March 2001, pp. 642–649.

[12] Z. A. Ye, A. Moshovos, S. Hauck, and P. Banerjee, "CHIMAERA: a high-performance architecture with a tightly-coupled reconfigurable functional unit," in ISCA '00: Proceedings of the 27th Annual International Symposium on Computer Architecture. New York, NY, USA: ACM Press, 2000, pp. 225–235.

[13] M. B. Taylor et al., "Evaluation of the RAW microprocessor: An exposed-wire-delay architecture for ILP and streams," in ISCA '04: Proceedings of the 31st Annual International Symposium on Computer Architecture. Washington, DC, USA: IEEE Computer Society, 2004, p. 2.

[14] K. Sankaralingam et al., "TRIPS: A polymorphous architecture for exploiting ILP, TLP, and DLP," ACM Trans. Archit. Code Optim., vol. 1, no. 1, pp. 62–93, 2004.

[15] B. Gorjiara, M. Reshadi, and D. Gajski, "Designing a custom architecture for DCT using NISC design flow," in ASP-DAC'06 Design Contest, 2006.

[16] H. Corporaal, "TTAs: missing the ILP complexity wall," J. Syst. Archit., vol. 45, no. 12-13, pp. 949–973, 1999.

[17] X. Liang, A. Athalye, and S. Hong, "Dynamic coarse grain dataflow reconfiguration technique for real-time systems design," in The 2005 IEEE International Symposium on Circuits and Systems. IEEE Computer Society, May 2005, pp. 3511–3514.

[18] E. Nia and O. Fatemi, "Multimedia extensions for DLX processor," in Proceedings of the 10th IEEE International Conference on Electronics, Circuits and Systems, Dec 2003, pp. 1010–1013.

[19] A. Cheng, G. Tyson, and T. Mudge, "FITS: framework-based instruction-set tuning synthesis for embedded application specific processors," in DAC '04: Proceedings of the 41st Annual Conference on Design Automation. ACM Press, 2004, pp. 920–923.

[20] ——, "PowerFITS: Reduce dynamic and static i-cache power using application specific instruction set synthesis," in Performance Analysis of Systems and Software, 2005. ISPASS 2005. IEEE International Symposium on, 2005, pp. 32–41.

[21] A. C. Cheng and G. S. Tyson, "High-quality ISA synthesis for low-power cache designs in embedded microprocessors," IBM J. Res. Dev., vol. 50, no. 2, pp. 299–309, 2006.


Design and System Level Evaluation of a High Performance Memory System for Reconfigurable SoC Platforms

Holger Lange∗,1, Andreas Koch∗,1

∗ Tech. Univ. Darmstadt
Embedded Systems and Applications Group (ESA)
Darmstadt, Germany

ABSTRACT

We present a high performance memory attachment for custom hardware accelerators on reconfigurable SoC platforms. By selectively replacing the conventional on-chip bus wrappers by point-to-point connections, the full bandwidth of the DDR DRAM-based main memory is made available to the accelerator data path. At the same time, the system continues to run software at an almost undiminished execution speed, even when using a full-scale Linux as virtual-memory multi-tasking operating system. Despite the performance gains, the approach also reduces chip area by sharing the already existing bus interface to the DDR-DRAM controller with the hardware accelerator. System-level experiments measuring both software and accelerator performance on a representative rSoC platform show a speedup of two to four over the vendor-provided means of attaching accelerator cores, while still allowing the use of standard design flows. The results are expected to be portable to other platforms, as similar on-chip buses, protocols, and interface wrappers are employed in a great number of variations.

1 Introduction

Building Systems-on-Chip (SoCs) is complicated by their inherently heterogeneous nature. This necessitates efficient communication of their components both with external peripherals as well as with each other. In this context, the memory subsystem plays a crucial role, as a growing number of SoC components require master-mode access (self-initiated transfers) to memories, and memory transfers account for the vast majority of the overall bus traffic. Low-latency, high-bandwidth memory interfacing is thus highly desirable, especially in application domains where excess clock cycles and the associated power consumption, incurred due to waiting for the completion of memory accesses, are not acceptable. For high-performance embedded systems, the CPU(s) alone already demand a significant share of the memory bandwidth to keep the cache(s) filled, but the memory bottleneck becomes even tighter when custom hardware accelerators (abbreviated here as HW) are integrated into the system. These blocks, often realized by IP cores, frequently rely on a fast memory path to realize their efficiency advantages, both in speed and power, over software (SW) running on conventional CPUs.

1 E-mail: {lange,koch}@esa.cs.tu-darmstadt.de

For interconnecting SoC components, a variety of standards have been proposed [1], [2].

In the context of this work, we examined the IBM CoreConnect [3] approach more closely, which is also used in the widespread reconfigurable SoCs (rSoC) based on Xilinx Virtex II-pro (V2P) platform FPGAs.

These popular devices allow the creation of SoCs in a reconfigurable fashion, by embedding efficient hardwired core components such as processors into the reconfigurable resources. However, as we will demonstrate, they implement an even less performant subset of CoreConnect than the specification itself would allow. The supposedly high-performance means of attaching accelerators directly to the processor local bus (PLB) for memory access, as recommended by the Xilinx development tools (Embedded Development Kit, EDK) [5], leads to a cumbersome high-latency, low-bandwidth interface instead.

Interestingly, our observations are not limited to the domain of reconfigurable SoCs, but also apply to hardwired ASIC SoCs such as the IBM Blue Gene/L compute chip [6]. Here, even the more advanced version 4 of the PLB had to be operated beyond the specification in order to provide a sufficiently fast link between the L2 and L1 caches, the latter feeding the PowerPC 440 used as processor on the chip at 700 MHz. In addition to the "overclocking" of the interface, it was also used as a dedicated point-to-point connection, instead of in the bus fashion it was originally intended for (it would not have achieved the required performance then).

2 Base Platform Xilinx ML310

Since the simulation of an entire system comprising one or more processors, custom accelerators, memories, and I/O peripherals is both difficult and often inaccurate, we employ an actual HW platform for our experiments.

The Xilinx ML310 [4] is an embedded system development platform which resembles a standard PC main board (see Fig. 1). It features a variety of peripherals (USB, NIC, IDE, UART, AC97 audio, etc.) attached via a Southbridge ASIC to a PCI bus, which provides a realistic environment for later system-level evaluation. In contrast to a standard PC, the CPU and the usual Northbridge ASIC have been replaced by a V2P FPGA [7], which comprises two PowerPC 405 processor cores that may be clocked at up to 300 MHz. They are embedded in an array of reconfigurable logic (RA). Thus, the "heart" of the compute system (CPUs, accelerators, buses, memory interface) is now reconfigurable and thus suitable for architecture experiments. With sufficient care, this rSoC can implement even complex designs at a clock frequency of 100 MHz (a third of the embedded CPU cores' clock frequency).

The ML310 is shipped with a V2P PCI reference design (shown on a gray background in Fig. 1). This design consists of several on-chip peripherals, which are attached to a single PowerPC core by CoreConnect buses. These peripherals comprise memory controllers (DDR DRAM and Block-RAM), I/O (PCI bridge, UART, the System ACE compact-flash-based boot controller, GPIO, etc.), an interrupt controller, and bridges between the different CoreConnect buses.

On the SW side of the system, we run the platform under Linux. While the use of such a full-scale multi-tasking, virtual-memory OS might seem overkill for the embedded area, the market share of Linux variants in that field is roughly 20% ([8], VxWorks 9%, Windows 12%). Furthermore, Linux stresses the on-chip communication network more than a light-weight RTOS, and is thus better suited for load-testing our architecture.

Figure 1: ML310 system (from Xilinx manuals)

3 Vendor Flow for rSoC Composition

Regardless of the target technology, actually composing an SoC is a non-trivial endeavor. In addition to the sheer number of components (e.g., as demonstrated by the ML310), different components also have different interface requirements (e.g., a UART vs. a processor core) or may not be available with one of the supported standard interfaces at all. For our platform, standard interfaces would be either the PLB interface already mentioned above, or the simpler On-chip Peripheral Bus (OPB), which will be described below. Non-standard interfaces are common for HW accelerator blocks, which often have application-specific attachments; these are then connected to standard buses by means of so-called wrappers, which convert between the internal and external interfaces and protocols. In some cases, the different protocols are fundamentally incompatible and can only be connected with additional latencies or even loss of features (such as degraded burst performance).

Figure 2: HW integration via OPB

Figure 3: HW integration via PLB

The Xilinx EDK [5] SoC composition tool flow supports two standard means for integrating custom accelerators into the reference design. The simpler way is the attachment via OPB [3], shown in Figure 2. The idea here is to isolate slower peripherals from the faster processor bus by putting them on a dedicated bus with less complex protocols. OPB attachments thus implement a relatively simple bus protocol, with a 32-bit transfer width at a 100 MHz clock. The most important OPB operation modes are:

• Single beat transfers (one data item per transaction)
• Burst transfers of up to 16 data words (limited by processor bus)
• Master/slave (self/peer initiated) transfers

However, the simplicity comes at the price of higher access latencies compared to a PLB attachment. These are due both to the OPB wrapper of the block and to the PLB-OPB bridge, which must be traversed when communicating with high-speed blocks on the PLB (such as the processor and the main memory).

The PLB attachment is the second proposed way of integrating HW accelerators, shown in Figure 3. Its protocol has more features aiming for high-performance operation, but is also more complex than the OPB one. Hence, bus wrappers are nearly always needed when connecting to the PLB. The most important PLB operation modes are:

• Single beat transfers (one data item per transaction)
• Burst transfers of up to 16 data words
• Cacheline transfers (one cacheline in 4 data beats, cache-missed word first)
• Master/slave (self/peer initiated) transfers
• Atomic transactions (bus locking)
• Split transactions (separate masters/slaves performing simultaneous reads/writes)
• Central arbiter, but master is responsible for timely bus release

Additionally, the PLB interface operates on 64 bits of data at 100 MHz (twice the bandwidth of OPB), accompanied by lower access latencies, with just two bus wrappers left between the HW and the DDR-DRAM controller for the main memory. Note that the controller itself also requires a wrapper. However, since the memory never initiates transactions by itself, a slave-mode wrapper suffices for this purpose.
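The peak-bandwidth relationships between the buses follow from width and clock rate alone. A back-of-the-envelope sketch (MB here means 10^6 bytes, matching the 1600 MB/s DDR-200 figure quoted later):

```python
def peak_mb_per_s(width_bits, clock_mhz, transfers_per_cycle=1):
    """Theoretical peak bandwidth: bytes per transfer x transfers per second."""
    return width_bits // 8 * clock_mhz * transfers_per_cycle

opb = peak_mb_per_s(32, 100)         # 32-bit OPB at 100 MHz
plb = peak_mb_per_s(64, 100)         # 64-bit PLB at 100 MHz
ddr200 = peak_mb_per_s(64, 100, 2)   # 64-bit DDR-200: two transfers per cycle

print(opb, plb, ddr200)  # 400 800 1600 -> the PLB offers twice the OPB bandwidth
```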

4 Practical Limitations

As discussed in the previous section, attaching a HW accelerator to the PLB offers increased performance over an OPB attachment. However, there are still practical limitations: The original CoreConnect specification [3] allows for unlimited PLB burst lengths, with transaction latencies being as short as a single cycle (if the addressed slave allows it). Unfortunately, as shown in Table 1, the actual implementation of the specification on the V2P series of devices does not support all of the specified capabilities.

                             IBM PLB Spec v3   Xilinx V2P PLB
Clock                        133 MHz           100 MHz
Single-cycle transactions    yes               no (wrappers)
Burst length                 unlimited         16 words

Table 1: PLB specification vs. implementation

Wrapper for   Min. Size [Slices]   Max. Size [Slices]
              (slave only)         (full master-slave)
OPB           27                   544
PLB           180                  2593

Table 2: Area overhead for bus wrappers in vendor flow

In addition to being clocked at only 100 MHz, the PLB is further hindered by its relatively complex protocol and an arbitration-based access scheme, both leading to long initial latencies. Furthermore, in the Xilinx V2P implementation, the maximum burst length is limited to just 16 words. The bus wrappers employed in the vendor tool flow impose additional latency and can also require considerable chip area (see Table 2). Even the performance-critical controller for the DDR-DRAM main memory is connected to the PLB by means of a wrapper (see Figure 3). This combination of restrictions renders the memory subsystem insufficient for 64-bit DDR-200 operation (1600 MB/s theoretical peak performance, the maximum supported by the actual DDR DRAM memory chips used on the ML310 main board).

5 FastLane Memory System

However, since we are experimenting with a reconfigurable SoC, we can choose an alternate architecture. To this end, we designed and implemented a new approach to interfacing the processor, custom HW accelerators, and the main memory.

The main concept behind the FastLane high performance memory interface is the direct connection of the memory-intensive accelerator cores to the central memory controller without an intervening PLB. By also using a specialized, light-weight protocol, we can avoid the arbitration as well as the protocol overhead associated with the PLB. This leads to a greatly reduced latency, with no wrapper left between the accelerator core and the RAM controller, as opposed to two wrappers in the Xilinx reference design. We can now also make the full bandwidth of the RAM controller available to the accelerator, eventually enabling true 64-bit double data-rate operation. Figure 4 shows the new memory subsystem layout.

Figure 4: FastLane: Attaching HW accelerator directly to DDR controller

For even further savings, the accelerator(s) now also share the PLB slave attachment of the DDR controller wrapper. Thus, no additional chip area is wasted on wrappers (which have now become redundant). The master-mode side of the accelerator is connected via FastLane directly to the interface of the DDR controller, but can accept data transfers from the processor (e.g., commands and parameters) via the shared PLB slave. Both interfaces internally use a simple double-handshake protocol, streamlined for low latency and fast burst transfers.
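The double-handshake exchange can be pictured with a small behavioral model. This is our own sketch: the req/ack/data signal names and the cycle-by-cycle behavior are illustrative only, not taken from the FastLane implementation:

```python
def double_handshake_transfer(words):
    """Toy model of a double (four-phase) handshake moving a list of words.

    Sender: raise req with data; drop req once ack is seen.
    Receiver: latch data and raise ack on req; drop ack once req drops.
    """
    req = ack = False
    data = None
    pending = list(words)
    received = []
    while pending or req or ack:
        # Sender side of the handshake
        if not req and not ack and pending:
            data = pending.pop(0)
            req = True                 # present data, assert request
        elif req and ack:
            req = False                # acknowledged: release request
        # Receiver side of the handshake
        if req and not ack:
            received.append(data)      # latch the presented word
            ack = True                 # acknowledge reception
        elif not req and ack:
            ack = False                # request released: return to idle
    return received

print(double_handshake_transfer([0xDEAD, 0xBEEF]))
```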

Many algorithms implemented on HW accelerators can profit from higher-level abstractions in their memory systems, such as FIFOs/prefetching for streaming data and caches for irregular accesses. To this end, the Memory Architecture for Reconfigurable Computers (MARC, shown in Figure 5) can be inserted seamlessly into the FastLane, using a compatible protocol. MARC offers a multi-port memory environment with emphasis on caching and streaming services for (ir)regular access patterns. MARC, which is described in greater detail in [9], consists of three parts:

• The core encapsulates the functionality for the caching and streaming services, the cache tag CAM (Content Addressable Memory), the cacheline RAM, and the stream FIFOs. The core also arbitrates the back ends and front ends, aiming to keep all of them working concurrently while resolving conflicts when they access the same resource.

• The front ends provide standardized, simple interface ports for both streaming and caching using a simple double-handshake protocol.

• The back ends adapt the core to several memory and bus technologies. New back ends can easily be added as required.

While the FastLane approach aims to provide optimal conditions for the compute-intense HW accelerators, it must also consider that the rest of the system, specifically the processor(s), also requires access to the main memory and may be intolerant to longer delays in the answers to its requests. For example, interrupts, timers, and the process scheduler cause memory traffic even on an idle system, and a too-slow response leads to system instabilities. Bus master devices (capable of initiating transfers on the bus) may experience buffer over- or underruns if a transfer is not completed in time due to bus contention caused by a HW accelerator.

This implies that the CPU and other bus master devices must always have priority over the accelerator block (which can be explicitly designed to tolerate access delays). The required arbitration logic is completely hidden from the accelerator within the FastLane interface. The CPU (and other bus master devices) may interrupt master accesses of the accelerator at any time, while the accelerator cannot interrupt the CPU and has to wait for the completion of a CPU-initiated transfer.



Figure 5: MARC overview

6 Operating System Integration

When considering accelerators at the system level, it is obvious that these HW cores must also be made accessible to application SW in an efficient manner. Given their master-mode access to main memory, this is non-trivial in an OS environment supporting virtual memory. The memory management unit (MMU) translates the virtual userspace addresses as seen by SW applications into physical bus addresses, which are sent out from the processor via the PLB. Address translations and the resolution of page faults are transparent to SW. Since the accelerators do not have access to the processor MMU with its page-address remapping tables, this implies that hardware and software communication in a virtual memory environment must use both userspace and physical addresses. Furthermore, since the HW is neither aware of virtual addresses nor can it handle page faults, the memory pages accessed by the accelerator must be present in RAM before starting the accelerator.

The solution to this requirement is a so-called Direct Memory Access (DMA) buffer. In the Linux virtual memory environment, a DMA buffer is guaranteed to consist of contiguous physical memory pages that are locked down and always present in physical RAM; they can never be swapped out to disk. As described previously, there are now two addresses pointing to the buffer, the first being the physical bus address as seen by the accelerator, the second being the virtual userspace address representing the same memory area for application SW. In the example given in Figure 6, a SW program has allocated a DMA buffer and passes its physical address to the accelerator. The SW can access this buffer via userspace address 0x01004000, which is translated to physical address 0x12345678 by the MMU. The accelerator directly uses this physical address to access the same DMA buffer.
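The dual addressing of the Figure 6 example can be modeled in a few lines. This is a toy illustration using the addresses from the text; the buffer size and contents are made up, and the translation is a stand-in for the real MMU:

```python
# Two views of one locked-down DMA buffer (addresses from the Figure 6 example).
DMA_VIRT, DMA_PHYS, DMA_SIZE = 0x01004000, 0x12345678, 0x1000

ram = {}  # model of physical memory: address -> byte value

def virt_to_phys(vaddr):
    """MMU stand-in: translate a userspace address inside the DMA buffer."""
    assert DMA_VIRT <= vaddr < DMA_VIRT + DMA_SIZE, "outside the DMA buffer"
    return DMA_PHYS + (vaddr - DMA_VIRT)

def sw_write(vaddr, byte):     # application SW goes through the MMU
    ram[virt_to_phys(vaddr)] = byte

def hw_read(paddr):            # the accelerator uses physical addresses directly
    return ram[paddr]

sw_write(0x01004000, 0xAB)     # SW fills the buffer via its virtual view
print(hex(hw_read(0x12345678)))  # the HW sees the same byte: 0xab
```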

Figure 6: HW and SW addressing of DMA memory

The algorithms running on the accelerator often require a set of parameters for their operation, which are transferred from the SW application by performing writes to the memory-mapped accelerator registers. These are actually handled by the PLB slave shared with the memory controller and forwarded to the accelerator HW. From the SW perspective, the memory-mapped registers are simply accessed via a pointer to a suitable data structure. Bulk data (e.g., image pixmaps etc.) is also prepared by the SW within the previously allocated DMA buffer (e.g., read from a file), which can be manipulated by SW as any other dynamically allocated memory block. Then, offsets from the start of the DMA buffer where various data structures are located (both for input and output) are simply passed as parameters to the accelerator. After starting the accelerator by writing to its command register, the accelerator fetches the bulk data directly from memory, preferably in efficient burst reads, without involvement of SW running on the processor. In a similar fashion, it also deposits the output data in the DMA buffer, and then indicates completion of the operation via an interrupt to the processor. Now SW can take over again.
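From the SW side, such a memory-mapped register file is just a structure overlaid on the mapped address range. The register names and layout below are purely hypothetical (the paper does not specify one); ctypes is used here to make the byte offsets that a driver mapping would expose explicit:

```python
import ctypes

class AccelRegs(ctypes.Structure):
    """Hypothetical accelerator register block (32-bit registers).

    Register names and their order are invented for illustration only;
    they are not FastLane's actual layout.
    """
    _fields_ = [
        ("command",  ctypes.c_uint32),  # writing here starts the accelerator
        ("status",   ctypes.c_uint32),  # completion / error flags
        ("dma_phys", ctypes.c_uint32),  # physical base address of the DMA buffer
        ("in_off",   ctypes.c_uint32),  # input data offset within the buffer
        ("out_off",  ctypes.c_uint32),  # output data offset within the buffer
    ]

# In a real driver, a pointer to the mmap'ed register aperture would be cast
# to this structure; here we only inspect the resulting byte offsets.
for name, _ in AccelRegs._fields_:
    print(name, getattr(AccelRegs, name).offset)
```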

On the OS side, this functionality is completely encapsulated in a character device driver that performs the appropriate memory mappings for the slave-mode registers and the DMA buffer.

7 Experimental Results

To demonstrate the effectiveness of our approach, we exercised several system load scenarios. The basic setup is identical in all cases: We mimic the actions of an actual HW accelerator by a hardware block that simply and repeatedly copies a 2 MB buffer from one memory location to another as quickly as possible, totaling 4 MB of reads and writes per turn, and measure the transfer rate in MB/s. In addition to the accelerator, we run different software programs, chosen for their specific load characteristics on the processor. We then measure the HW execution time (the time it takes to copy the memory data at full speed) and the SW execution time (the time it takes for a given program to execute on the processor), for both the original vendor-provided and our FastLane memory interface. The extreme cases (HW accelerator and processor idle) are also considered.
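The load generator can be pictured as the following host-side sketch. This is our own software approximation of the measurement loop, not the actual hardware block:

```python
import time

def copy_turn(buf_mb=2):
    """One benchmark turn: copy a buf_mb buffer once, i.e. buf_mb of reads
    plus buf_mb of writes, and report the achieved transfer rate in MB/s."""
    src = bytearray(buf_mb << 20)            # 2 MB source buffer by default
    t0 = time.perf_counter()
    dst = bytes(src)                         # read src, write dst
    elapsed = max(time.perf_counter() - t0, 1e-9)
    bytes_moved = len(src) + len(dst)        # reads + writes: 4 MB per turn
    return bytes_moved, bytes_moved / elapsed / 2**20

moved, rate = copy_turn()
print(moved, round(rate))  # 4194304 bytes per turn; the rate depends on the host
```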

The suite of software programs was chosen to represent an everyday mix of typical applications that also perform I/O and calculations in main memory (instead of just running entirely within the CPU caches). The scp program from the OpenSSH suite [10] was instructed to copy a local file, sized 4 MB, via network to a remote system. The same is done without encryption by netcat [11]. The GNU gcc [12] C compiler was evaluated while compiling the netcat sources.

To also cover the embedded system domain where SoC platforms similar to V2P are oftenemployed, the ETSI GSM enhanced full rate speech codec [13] and an image processingpipeline as often found in printers were included ([14], JPEG RGB to CMYK conversion aspart of the HP Labs Vliw EXample development kit), both representing typical embedded


Application    V2P ref design                        FastLane
               Exec. Time [ms]  Mem. Rate [MB/s]     Exec. Time [ms]  Mem. Rate [MB/s]
idle system    18.81            213                  5.67             705
scp            55.11            73                   12.82            312
netcat         53.07            75                   19.27            208
gcc            32.14            124                  17.42            230
GSM            19.05            210                  6.35             630
imgpipe        44.67            90                   19.33            207

Table 3: HW accelerator run times and available bandwidth using original and FastLane memory subsystem implementations

Application    HW inactive [ms]  V2P ref design [ms]  FastLane [ms]
scp            4831              61052                5828
netcat         3130              55938                3901
gcc            40686             166655               52908
GSM            25981             40045                27767
imgpipe        3545              5109                 4018

Table 4: SW run times on idle system and using accelerator attached by original and FastLane memory subsystem implementations

applications. The various programs can be characterized as follows:

• scp provides a mix of CPU- and I/O load
• netcat exercises network I/O exclusively
• gcc interleaves short I/O- and long calculation phases
• GSM provides codec stream data processing
• imgpipe implements multi-stage image processing

The first set of measurements, shown in Table 3, considers the memory throughput of the HW accelerator under different CPU load scenarios. Here, we show the time for a single 2 MB block copy (four mega-transfers), as well as the resulting memory throughput, when using the original vendor-provided PLB interface as well as our FastLane for connecting the HW accelerator. It is obvious that FastLane significantly increases the throughput in all load scenarios, in some cases by a factor of up to 4.3.
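As a sanity check, the throughput and speedup figures follow directly from the measured execution times, since each turn moves 4 MB. A small sketch of the arithmetic for the scp scenario of Table 3:

```c
/* Memory rate in MB/s for moving `mb` megabytes in `ms` milliseconds. */
static double rate_mbps(double mb, double ms)
{
    return mb / (ms / 1000.0);
}

/* Speedup of FastLane over the PLB reference attachment; for the scp
 * scenario of Table 3 this is 55.11 / 12.82, roughly 4.3. */
static double speedup(double t_ref_ms, double t_fast_ms)
{
    return t_ref_ms / t_fast_ms;
}
```

With 4 MB per turn, `rate_mbps(4.0, 55.11)` reproduces the 73 MB/s entry and `rate_mbps(4.0, 12.82)` the 312 MB/s entry of Table 3.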

The set of measurements shown in Table 4 quantifies the influence of the different memory attachments on the execution time of software running on the processor that also accesses main memory in different load patterns. The results show that, despite its high throughput to the HW accelerator, FastLane does not significantly impair the processor: SW execution times are almost unaffected by the HW memory transfer, owing to the absolute priority of the CPU (cf. Section 5) over the HW accelerator. In contrast, the original vendor-provided reference design exhibits a steep SW performance decline, increasing execution times by a factor of up to 14x over that of SW running with the FastLane-attached accelerator. FastLane thus enables the accelerator to access memory bandwidth that appears to be completely unused by the original memory interface.

One might assume that the FastLane approach of giving the CPU override priority for bus access will cancel out the theoretical performance gains of FastLane over the original


PLB-based accelerator attachment. However, we measured that even under these conditions, FastLane is able to provide the accelerator with roughly half of the theoretically available memory bandwidth (which is 800 MB/s in the single-data-rate mode used here): practically achievable are 32b data words at a rate of 352 MB/s and 64b words at 705 MB/s, yielding a bus efficiency of 88%. Going to double-data-rate mode (planned as a refinement) would double these rates again. At the same time, the multi-tasking OS and the SW application continue to run at almost full speed.
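The quoted 88% bus efficiency is simply the achieved 64b rate over the theoretical single-data-rate peak:

```c
/* Bus efficiency: achieved rate relative to the theoretical peak.
 * Here 705 MB/s achieved against the 800 MB/s SDR peak gives ~0.88. */
static double bus_efficiency(double achieved_mbps, double peak_mbps)
{
    return achieved_mbps / peak_mbps;
}
```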

In scenarios where fast SW interrupt response is not required and it is possible to freeze the processor entirely (e.g., by stopping the clock signal), FastLane makes the full memory bandwidth available to the accelerator. This is not achievable using the original PLB attachment, which even with a frozen processor is only able to exploit 25% of the theoretically available read bandwidth and 33% when writing.

8 Conclusion and Future Work

We presented a high performance memory attachment for custom HW accelerators. Our approach can increase the usable memory throughput by more than 4x over the original vendor-provided PLB attachment (included in the Xilinx EDK [5] design suite). Additionally, it required less chip area and left the performance of SW applications running on the on-chip processors almost unaffected.

From a practical perspective, FastLane integrates into the standard EDK design flow and is as easy to use as the original attachment. Although the results are not directly transferable to platforms other than the Virtex 2 Pro system FPGA, it is clear that reduced bus and wrapper overhead will always result in smaller logic and lower latencies. Hence, other platforms may also benefit from this approach.

Our current and future work focuses on improving the OS support (transparent addresses between HW and SW) as well as supporting full 64b double-data-rate operation in FastLane.

References

[1] AMBA home page, http://www.amba.com, 2006
[2] Open Core Protocol International Partnership, http://www.ocpip.org, 2006
[3] IBM, "The CoreConnect Bus Architecture", White Paper, 1999
[4] Xilinx, "ML310 User Guide" (UG068), 2005
[5] Xilinx, "Embedded System Tools Reference Manual" (UG111), 2006
[6] M. Ohmacht et al., "Blue Gene/L compute chip: Memory and Ethernet subsystem", IBM Journal of Research and Development, Vol. 49, No. 2/3, pp. 255-264, 2005
[7] Xilinx, "Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete Data Sheet" (DS083), 2005
[8] J. Turley, "Operating systems on the rise", http://www.embedded.com, 2006
[9] H. Lange, A. Koch, "Memory Access Schemes for Configurable Processors", Proc. Workshop on Field-Programmable Logic and Applications (FPL), Villach, 2000
[10] http://www.openssh.com, 2006
[11] http://netcat.sourceforge.net, 2006
[12] http://gcc.gnu.org, 2006
[13] ETSI EN 300 724 V8.0.1 (2000-11), http://www.etsi.org, 2006
[14] J. Fisher, P. Faraboschi, C. Young, "Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools", chapter 11.1, Elsevier, 2005


An FPGA-based System for Development of Real-time Embedded Vision Applications

Hongying Meng1, Nick Pears1, Chris Bailey1
Department of Computer Science, The University of York, York YO10 5DD, UK

ABSTRACT

In this paper, an FPGA-based system for real-time video processing is proposed. In this system, real-time video is obtained through a digital camera and video/image processing algorithms are implemented on FPGA hardware. The original image frames and processed video can be displayed simultaneously on the screen of a PC, which acts as a system monitor. It is a flexible and easily implemented system, used as a quick and simple platform for FPGA-based video processing algorithm development and evaluation. Furthermore, it is a cheap solution, especially for low-complexity real-time video processing on small, low-power devices such as mobile or ubiquitous devices.

KEYWORDS: FPGA; video processing; computer vision; real time

1 Introduction

Due to the rapid development of video cameras and mobile video, video processing has attracted more and more attention in recent years. Real-time video processing is among the most demanding computation tasks [Atha95]. One solution is to use special-purpose DSP processors, designed to execute certain low-level image processing operations extremely quickly. However, this approach can be over-specialized when one has a large range of possible applications in mind, since this requires a commensurately large range of low- and medium-level operations. An alternative to a software implementation (running on either general or specialized hardware) is the design of specific hardware for specific computer vision processes, in order to achieve a high rate of operations per second. If only a small number of specific processes are required for a particular application, then only those processes need to be implemented in hardware. Continuing growth in silicon chip capability is rapidly reducing the number of chips in a typical system, and increasing the size, performance and power benefits of System-on-Chip (SoC) integration. However, the design and test of a miniaturized vision system can be quite strenuous, owing to the many technological and financial constraints that often restrict the developer's pool of resources.

1Email: {hongying,nep,chrisb}@cs.york.ac.uk


Advances in programmable logic devices have resulted in Field Programmable Gate Arrays (FPGAs) that allow integration of large numbers of programmable logic elements on a single chip. The size and speed of FPGAs are comparable to ASICs, but FPGAs are more flexible and their design cycle is shorter. It is possible that FPGA architectures will allow generic real-time image processing, computer vision and pattern recognition techniques to be packaged with a relatively low-power CPU and an image sensor. Indeed, FPGAs have been used frequently in computer vision systems [Schm04] [Sold99] [McCr00] [Fern05] [Masr06]. Schmittler et al. [Schm04] used a single FPGA chip for real-time ray tracing of dynamic scenes. Soldek and Mantiuk [Sold99] proposed an FPGA-based system for pattern recognition, processing, analysis and synthesis of images. McCready and Rose [McCr00] used an FPGA-based system for real-time, frame-rate face detection. Vargas-Martín et al. [Fern05] proposed a generic FPGA-based real-time video processing system for applications in the field of electronic visual aids. Masrani and MacLean [Masr06] used an FPGA for a real-time large disparity range stereo system.

Based on the wide application of FPGAs in the area of computer vision, several groups have developed systems or design methodologies for this kind of design [Flei98] [Dray99] [Hayn00] [Arri02] [Sen05]. Fleischmann et al. [Flei98] proposed a hardware/software prototyping environment for networked embedded systems. Drayer and Araman [Dray99] proposed a development system for creating real-time machine vision hardware designs on FPGAs. Haynes et al. [Hayn00] proposed the Sonic architecture, a configurable computing system performing real-time video image processing. Arribas and Macia [Arri02] proposed an FPGA-based system for the development and testing of vision algorithms such as image compression and optical flow calculation. Sen et al. [Sen05] developed a design methodology for generating efficient target hardware description language code from an algorithm.

Increasingly, commercial FPGA development boards such as the Xilinx Virtex-4 Video Starter Kit are available with a video I/O solution. However, many of them were designed for a specific chip or class of algorithms, for example DSP-based algorithms, and their price is also high. In most cases, such as intelligent surveillance, traffic management or object detection in mobile video, a very simple low-cost FPGA development board is all that is needed. Despite the progress described above, it is still helpful to have a cheap and convenient tool for developing computer vision algorithms for FPGA chips. In this paper, a novel PC-FPGA video development platform is presented. It was designed for quick development of ubiquitous computing applications with a demand for real-time video processing. The rest of this paper is organized as follows: In section 2, we give a brief introduction to the architecture. In section 3, Huffman coding is explained. In section 4, the FPGA design of the components is presented. In section 5, experimental results are shown. Finally, we present the conclusions.

2 System architecture

Figure 1 shows the basic outline architecture of the system for the development of real-time video applications. It is a very cheap solution for FPGA-based video processing algorithm development. A PC is connected to a digital camera by a USB cable and is also connected to an FPGA board. The PC grabs the video frames from the camera and, when the FPGA board is ready, the frames can be sent directly to the FPGA board and processed. Meanwhile, the results can be sent back and


displayed on the screen of the PC along with the original frame. Obviously, in a real ubiquitous/embedded application, we would remove the PC and connect the camera directly to the FPGA board, which would then implement the appropriate camera driver logic.

Figure 1: Architecture of the PC-FPGA System

2.1 Hardware construction

From figure 1, it can be clearly seen that this is an easily implemented system, and plenty of FPGA boards are suitable for it. Some FPGA development boards have a parallel connection to the PC, while others have a USB, PCI or other high-speed connection; some of these are rather expensive. A simple FPGA development board connected to the PC by a parallel cable is all that is needed here. The parallel cable can be used for downloading the FPGA design from the PC to the FPGA board, and it can be switched to transfer image signals between the PC and the FPGA board.

Figure 2: Our PC-FPGA System

In our experiment, a simple BurchED FPGA prototyping board (BX-300 FPGA) was selected. It carries a Xilinx SPARTAN 2E chip with 300K gates and is connected to a PC by a parallel cable through a JTAG connector, through which the FPGA design can be downloaded to the board. For the digital camera, many digital or web cameras could be used; in our system, the DC1300 from BenQ was chosen. It has a USB connection to the PC and many functions, such as capturing digital still images and image frames and recording video clips, but here it was used only as a web camera. Figure 2 shows our cheap PC-FPGA system.

2.2 Software components

The software components include those on both the PC and FPGA sides, as described in figure 3. On the PC side, we have developed a friendly interface in the Visual C++ environment. The original video or image sequences are captured and transferred to the PC by the software through a USB connection; the software can capture the video frames and display them on the screen. Huffman algorithm code was written to compress and decompress the images, and there is also some other code, such as image format transformation between RGB and YUV. The software components on the PC side consist of an operating-system windowing/message-processing layer, image acquisition, a parallel port interface and image display. A basic WIN32 message loop handles tasks like startup and shutdown and passes repaint and input calls on to the main code. The image acquisition code either supplies images from files or captures data from a device such as a digital camera; this uses the Microsoft Vision SDK API and assumes a Video for Windows compatible device has been set up. The image display code uses OpenGL. The PC-side code also does Huffman encoding and decoding of the images, as described earlier. Meanwhile, there is a module designed for the parallel port interface using the DLPortIO library.

Figure 3: Software components in the system

On the FPGA side, six modules were designed for the FPGA chip: the parallel port interface, Huffman decoder, Huffman encoder, image decoder, image encoder and image processing modules. The image processing module is the key element that we wish to develop, whereas the other elements are essentially 'test harness' components of the system for image acquisition, display and evaluation of FPGA-based image processing algorithm performance. The image processing algorithm can be any algorithm required by the application. For each module, VHDL code was synthesized and implemented with the Xilinx ISE 7.1i (WebPACK) tools and then downloaded to the SPARTAN 2E FPGA chip using the iMPACT tool. Finally, the parallel cable was switched to data communication mode and the FPGA board was reset. When the software on the PC side starts to run, the real-time video can be processed and displayed on the PC.

3 Huffman coding design

Huffman coding [Huff52] is a well-known entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source

symbol (such as a character in a file), where the variable-length code table has been derived in a particular way based on the estimated probability of occurrence for each possible value of the source symbol. Instead of determining symbol frequencies at run time, the code table used here is fixed and used for every frame; we used a few sample images to determine typical symbol frequencies. The image is first transformed into three channels: the brightness and two colour channels. The brightness is stored to a higher accuracy than the colour channels because it is more noticeable to the eye. Next, each pixel p_{i,j} is examined and represented as the difference d_{i,j} from its neighboring pixels:

d_{i,j} = p_{i,j} - (p_{i-1,j} + p_{i,j-1}) / 2        (1)
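In code, the predictor of Eq. (1) and its exact inverse can be sketched as follows. This is a minimal sketch; that both sides use the same integer division is an assumption about the implementation, but it is what makes the transform exactly reversible.

```c
/* Forward transform of Eq. (1): predict each pixel from its upper and
 * left neighbours and keep only the (usually small) difference. */
static int forward_diff(int p, int up, int left)
{
    return p - (up + left) / 2;
}

/* Exact inverse used on the decoding side: add the same prediction back.
 * Because both directions use identical integer division, the roundtrip
 * reproduces the original pixel value exactly. */
static int inverse_diff(int d, int up, int left)
{
    return d + (up + left) / 2;
}
```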

That way, large areas of similar brightness/colour are represented as low values. The final step is to encode this data with the Huffman code table. There are actually two code tables: one for the brightness channel and another for the colour channels. Encoding and decoding also use different table forms. The encoder is just a lookup table that transforms one fixed-width symbol into one variable-width symbol. The decoder needs to analyse a stream of bits, delineate the symbol boundaries and then translate the variable-width symbols into fixed-width ones. When the bit stream is read, the matching prefix is found in the table and the symbol length is examined; that many bits are then removed from the stream and translated to a fixed-width (un-encoded) symbol. Finally, the reverse of the original image transformation is applied to the pixel differences and the colour channels are recombined.
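The table-driven decoding loop described above can be sketched as follows. The three-symbol prefix code here is a toy example, not one of the paper's actual tables:

```c
#include <stddef.h>

/* Toy prefix code for illustration (not the paper's tables):
 * A = 0, B = 10, C = 11. */
struct code_entry { unsigned bits; int len; char symbol; };

static const struct code_entry table[] = {
    { 0x0, 1, 'A' }, { 0x2, 2, 'B' }, { 0x3, 2, 'C' },
};

/* Decode `nbits` bits of `stream` (MSB first) into `out`;
 * returns the number of fixed-width symbols emitted. */
static size_t prefix_decode(unsigned stream, int nbits, char *out)
{
    size_t n = 0;
    while (nbits > 0) {
        int matched = 0;
        for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
            int len = table[i].len;
            /* Compare the leading `len` bits against this table entry. */
            if (len <= nbits &&
                ((stream >> (nbits - len)) & ((1u << len) - 1u)) == table[i].bits) {
                out[n++] = table[i].symbol;  /* emit fixed-width symbol */
                nbits -= len;                /* consume the matched bits */
                matched = 1;
                break;
            }
        }
        if (!matched)  /* truncated or invalid stream: stop decoding */
            break;
    }
    return n;
}
```

For example, the 5-bit stream 01011 decodes to "ABC" under this toy table. The hardware decoder works the same way in principle, but matches all prefixes of a symbol in parallel rather than scanning a table sequentially.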

4 FPGA components design

For our current system, five basic FPGA modules have been designed, as shown in figure 3. The parallel port interface module was used for data communication with the PC. The image encoder was used to convert the image format from RGB to YUV, while the image decoder did the reverse process. The Huffman encoder and decoder were used for compression and decompression.

Table 1. The sizes of FPGA components.

Components                       Slices   %
Parallel port interface          14       0
Huffman decoder                  168      5
Image decoder                    306      9
Sobel edge detector algorithm    189      6
SUSAN edge detector algorithm    848      27
SUSAN corner detector algorithm  1010     32
Image encoder                    34       1
Huffman encoder                  78       2


The image processing algorithms that have been constructed are Sobel edge detection [Gonz92] and SUSAN edge and corner detection [Smit95]. The sizes of these components, when synthesized, are listed in table 1. Pipelining was introduced to components that had lower maximum clock speeds, as the slowest component limits the maximum clock speed of the design as a whole. When the design occupies more of the chip, the system clock speed can become significantly lower than that of the slowest component (if it were synthesized on its own). This is due to the longer signal pathways, which can represent more than half the delay compared to the actual delay from the logic itself. By segmenting the component and pipelining, a speedup can be achieved for very little area cost, at the expense of increased latency between input and output.
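The trade-off can be illustrated with a hypothetical delay budget; the figures below are invented for illustration and do not appear in the paper. Splitting a long combinational path into stages shortens the critical path (and so raises the achievable clock) at the cost of a small register overhead per stage and extra cycles of latency:

```c
/* Achievable clock in MHz for a given critical-path delay in ns. */
static double max_clock_mhz(double path_delay_ns)
{
    return 1000.0 / path_delay_ns;
}

/* Splitting a path into `stages` roughly divides its combinational
 * delay, but each stage pays a fixed pipeline-register overhead.
 * All delay figures are hypothetical. */
static double pipelined_delay_ns(double path_delay_ns, int stages,
                                 double reg_overhead_ns)
{
    return path_delay_ns / stages + reg_overhead_ns;
}
```

For a hypothetical 20 ns path split into two stages with 1 ns of register overhead, the critical path drops to 11 ns, nearly doubling the achievable clock while adding one cycle of latency.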

5 Experimental results

Figure 4 demonstrates an experimental result obtained from our system. Real-time video frames were captured through the camera; in this example, the camera was fixed and the hand was moving. The resulting edge images extracted by the Sobel edge detector are displayed at the same time as the real-time video. The sampling rate of the frames is estimated and controlled automatically by the software itself. In our experiments, a rate of 5 frames per second can be reached; this limit is due to the bandwidth of the parallel port rather than to the image processing algorithm itself. Currently we are developing more FPGA-based processing modules using this development system in order to track people through video sequences.

Figure 4: Screen shot of real-time video edge extraction with 12 successive frames, in which the camera was fixed and the hand was moving.

6 Conclusions


In this paper, an FPGA-based real-time video processing development system was introduced. This system can be used for the development and evaluation of real-time video processing algorithms on FPGA chips and is especially useful for developing prototypes for mobile, ubiquitous, visually-driven applications. It is an easily implemented and extensible system with great flexibility and low cost.

7 Acknowledgements

The authors would like to thank the DTI and Broadcom Ltd. for their financial support of this research. We would also like to thank Mr. Peter Stock for his previous work on this project.

References

[Atha95] P.M. Athanas and A.L. Abbott, Real-time image processing on a custom computing platform, IEEE Computer, Feb. 1995.
[Arri02] P. Arribas and F. Macia, FPGA board for real time vision development systems, In Fourth International Caracas Conference on Devices, Circuits and Systems, Aruba, Dutch Caribbean, 2002.
[Dray99] T. Drayer and P. Araman, A Development System for Creating Real-time Machine Vision Hardware using Field Programmable Gate Arrays, In 32nd Annual Hawaii International Conference on System Sciences (HICSS-32), Maui, Hawaii, 1999.
[Fern05] F. Vargas-Martín, M. D. Peláez Coca, E. Ros, J. Diaz, S. Mota, A generic real-time video processing unit for low vision, In International Congress Series, Vol. 1282, pages 1075-1079, 2005.
[Flei98] J. Fleischmann, K. Buchenrieder and R. Kress, A hardware/software prototyping environment for dynamically reconfigurable embedded systems, In Proceedings of the Sixth International Workshop on Hardware/Software Codesign (CODES), Seattle, Washington, USA, pages 105-109, 1998.
[Gonz92] R. Gonzalez and R. Woods, Digital image processing, Addison Wesley, pages 414-428, 1992.
[Hayn00] S. D. Haynes, J. Stone, P. Cheung, W. Luk, Video Image Processing with the Sonic Architecture, IEEE Computer, vol. 33, no. 4, pages 50-57, 2000.
[Huff52] D.A. Huffman, A Method for the Construction of Minimum-Redundancy Codes, In Proceedings of the IRE, pages 1098-1101, 1952.
[Masr06] D. Masrani and W. MacLean, A Real-Time Large Disparity Range Stereo-System Using FPGAs, In Lecture Notes in Computer Science, 3852, pages 42-51, Springer, 2006.
[McCr00] R. McCready and J. Rose, Real-time, frame-rate face detection on a configurable hardware system, In FPGA '00: Proceedings of the 2000 ACM/SIGDA Eighth International Symposium on Field Programmable Gate Arrays, page 221, New York, NY, USA, ACM Press, 2000.
[Schm04] J. Schmittler, S. Woop, D. Wagner, W. J. Paul and P. Slusallek, Realtime ray tracing of dynamic scenes on an FPGA chip, In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, ACM Press, New York, NY, pages 95-106, 2004.
[Sen05] M. Sen, I. Corretjer, F. Haim, S. Saha, J. Schlessman, S. Bhattacharyya, and W. Wolf, Computer Vision on FPGAs: Design Methodology and its Application to Gesture Recognition, In The First IEEE Workshop on Embedded Computer Vision, San Diego, USA, 2005.
[Smit95] S. Smith and J. Brady, SUSAN – A new approach to low level image processing, Technical Report, Oxford Centre for Functional Magnetic Resonance Imaging of the Brain (FMRIB), 1995.
[Sold99] J. Soldek and R. Mantiuk, A reconfigurable processor based on FPGAs for pattern recognition, processing, analysis and synthesis of images, In Pattern Recognition Letters, 20(7), pages 667-674, 1999.
