Software Synthesis and Code Generation for Signal Processing Systems

Shuvra S. Bhattacharyya
University of Maryland, Department of Electrical and Computer Engineering
College Park, MD 20742, USA

Rainer Leupers, Peter Marwedel
University of Dortmund, Department of Computer Science 12
44221 Dortmund, Germany
ABSTRACT
The role of software is becoming increasingly important in the implementation of DSP applications. As this trend intensifies, and the complexity of applications escalates, we are seeing an increased need for automated tools to aid in the development of DSP software. This paper reviews the state of the art in programming language and compiler technology for DSP software implementation. In particular, we review techniques for high level, block-diagram-based modeling of DSP applications; the translation of block diagram specifications into efficient C programs using global, target-independent optimization techniques; and the compilation of C programs into streamlined machine code for programmable DSP processors, using architecture-specific and retargetable back-end optimizations. In our review, we also point out some important directions for further investigation.
1 Introduction
Although dedicated hardware can provide significant speed and power consumption advantages for signal processing applications [1], extensive programmability is becoming an increasingly desirable feature of implementation platforms for VLSI signal processing. The trend towards programmable platforms is fueled by tight time-to-market windows, which in turn result from intense competition among DSP product vendors, and from the rapid evolution of technology, which shrinks the life cycle of consumer products. As a result of short time-to-market windows, designers are often forced to begin architecture design and system implementation before the specification of a product is fully completed. For example, a portable communication product is often designed before the signal transmission standards under which it will operate are finalized, or before the full range of standards that will be supported by the product is agreed upon. In such an environment, late changes in the design cycle are mandatory. The need to quickly make such late changes requires the use of software. Furthermore, whether or not the product specification is fixed beforehand, software-based implementations using off-the-shelf processors take significantly less verification effort compared to custom hardware solutions.
Although the flexibility offered by software is critical in DSP applications, the implementation of production quality DSP software is an extremely complex task. The complexity arises from the diversity of critical constraints that must be satisfied; typically these constraints involve stringent requirements on metrics such as latency, throughput, power consumption, code size, and data storage requirements. Additional constraints include the need to ensure key implementation properties such as bounded memory requirements and deadlock-free operation. As a result, unlike developers of software for general-purpose platforms, DSP software developers routinely engage in meticulous tuning and simulation of program code at the assembly language level.

[Technical report UMIACS-TR-99-57, Institute for Advanced Computer Studies, University of Maryland, College Park, 20742, September 1999. S. S. Bhattacharyya was supported in this work by the US National Science Foundation (CAREER, MIP9734275) and Northrop Grumman Corp. R. Leupers and P. Marwedel were supported by HP EESof, California.]
Important industry-wide trends at both the programming language level and the processor architecture level have had a significant impact on the complexity of DSP software development. At the architectural level, a specialized class of microprocessors has evolved that is streamlined to the needs of DSP applications. These DSP-oriented processors, called programmable digital signal processors (PDSPs), employ a variety of special-purpose architectural features that support common DSP operations such as digital filtering and fast Fourier transforms [2, 3, 4]. At the same time, they often exclude features of general purpose processors, such as extensive memory management support, that are not important for many DSP applications.
Due to various architectural irregularities in PDSPs, which are required for their exceptional cost/performance and power/performance trade-offs [2], compiler techniques for general-purpose processors have proven to be inadequate for exploiting the power of PDSP architectures from high level languages [5]. As a result, the code quality of high-level procedural language (such as C) compilers for PDSPs has been several hundred percent worse than manually-written assembly language code [6, 52]. This situation has necessitated the widespread use of assembly-language coding, and tedious performance tuning, in DSP software development. However, in recent years, a significant research community has evolved that is centered around the development of compiler technology for PDSPs. This community has begun to narrow the gap between compiler-generated code and manually optimized code.
It is expected that innovative processor-specific compilation techniques for PDSPs will provide a significant productivity boost in DSP software development, since such techniques will allow us to take the step from assembly programming of PDSPs to the use of high-level programming languages. The key approach to reducing the overhead of compiler-generated code is the development of DSP-specific compiler optimization techniques. While classical compiler technology is often based on the assumption of a regular processor architecture, DSP-specific techniques are designed to be capable of exploiting the special architectural features of PDSPs. These include special purpose registers in the data path, dedicated memory address generation units, and a moderate degree of instruction-level parallelism.
To illustrate this, consider the architecture of a popular fixed-point DSP (TI TMS320C25) in fig. 1. Its data path comprises the registers TR, PR, and ACCU, each of which plays a specific role in communicating values between the functional units of the processor. This structure allows for a very efficient implementation of DSP algorithms (e.g. filtering algorithms). More regular architectures (e.g. with general-purpose registers) would, for instance, require more instruction bits for addressing the registers and more power for reading and writing the register file.
From a compiler viewpoint, the mapping of operations, program variables, and intermediate results to the data path in fig. 1 must be done in such a way that the number of data transfer instructions between the registers is minimized. The address generation unit (AGU) comprises a special ALU and is capable of performing address arithmetic in parallel to the central data path. In particular, it provides parallel auto-increment instructions for address registers. As we will show later, exploitation of this feature in a compiler demands an appropriate memory layout of program variables. Besides the AGU, the data path also offers a certain degree of instruction-level parallelism. For instance, loading a memory value into register TR and accumulating a product stored in PR can be performed in parallel within a single machine instruction. Since such parallelism cannot be explicitly described in programming languages like C, compilers need to carefully schedule the generated machine instructions, so as to exploit the potential parallelism and to generate fast and dense code.
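As a concrete illustration, the C fragment below shows the kind of inner loop a DSP compiler must map onto such a datapath. The register comments indicate one plausible mapping onto TR, PR, and ACCU; this is a sketch for exposition, not actual compiler output for any particular tool.

```c
#include <stddef.h>

/* FIR-style inner product: on a TMS320C25-class datapath, each iteration
 * ideally becomes a single load-multiply-accumulate machine instruction in
 * which the memory operand is loaded into TR, the previous product held in
 * PR is added to ACCU, and the AGU post-increments the address registers
 * in parallel with the central data path. */
long fir(const int *x, const int *h, size_t ntaps)
{
    long acc = 0;                      /* would live in ACCU */
    for (size_t i = 0; i < ntaps; i++)
        acc += (long)x[i] * h[i];      /* TR * operand -> PR; PR -> ACCU */
    return acc;
}
```

Extracting this instruction-level parallelism from the sequential C source is precisely the scheduling problem discussed above.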
Further architectural features frequently present in PDSPs include parallel memory banks (providing higher memory access bandwidth), chained operations (such as multiply-accumulate), special arithmetic operations (such as addition with saturation), and mode registers (for switching between different arithmetic modes).
For most of the architectural features mentioned above, dedicated code optimization techniques have been developed recently, an overview of which will be given in section 3. Many of these optimizations are computationally complex, resulting in a comparatively low compilation speed. This is intensified by the fact that compilers for PDSPs, besides the need for specific optimization techniques, have to deal with the phase coupling problem. The compilation process is traditionally divided into the phases of code selection, register allocation, and instruction scheduling, which have to be executed in a certain order. For all possible phase orders, the approach of separate compilation phases results in a code quality overhead, since each phase may impose obstructing constraints on subsequent phases which would not have been necessary from a global viewpoint. While for regular processor architectures like RISCs this overhead is moderate and thus tolerable, it is typically much higher for irregular processor architectures as found in PDSPs. Therefore, it is desirable to perform the compilation phases in a coupled fashion, where the different phases mutually exchange information so as to achieve a global optimum.
Even though phase-coupled compiler techniques lead to a further increase in compilation time, it is widely agreed in the DSP software developer community that high compilation speed is of much lower concern than high code quality. Thus, compilation times of minutes or even hours may be perfectly acceptable in many cases. This fact gives good opportunities for novel computation-intensive approaches to compiling high level languages for PDSPs, which would not be acceptable in general-purpose computing.
Besides pure code optimization issues, the large variety of PDSPs (both standard "off-the-shelf" processors and application specific processors) currently in use creates a problem of economic feasibility of compiler construction. Since code optimization techniques for PDSPs are highly architecture-specific by nature, a huge number of different optimization techniques would be required to build efficient compilers for all PDSPs available on the market. Therefore, in this paper we will also briefly discuss techniques for retargetable compilation. Retargetable compilers are capable of generating code not only for a single target processor but for a class of processors, thereby reducing the number of compilers required. This is achieved by providing the compiler with a description of the machine for which code is to be generated, instead of hard-coding the machine description in the compiler. We will mention different approaches to processor modeling for retargetable compilation. Retargetability permits compilers for new processors to be generated quickly. If the processor description formalism is flexible enough, then retargetable compilers may also assist in customizing an only partially predefined processor architecture for a given application.
At the system specification level, the past several years have seen increased use of block-diagram based, graphical programming environments for digital signal processing. Such graphical programming environments, which enable DSP systems to be specified as hierarchies of block diagrams, offer several important advantages. Perhaps the most obvious of these advantages is their intuitive appeal. Although visual programming languages have seen limited use in many application domains, DSP system designers are used to thinking of systems in terms of graphical abstractions, such as signal flow diagrams, and thus, block diagram specification via a graphical user interface is a convenient and natural programming interface for DSP design tools.
An illustration of a block diagram DSP system, developed using the Ptolemy design environment [7], is shown in fig. 2. This is an implementation of a discrete wavelet transform [8] application. The top part of the figure shows the highest level of the block diagram specification hierarchy. Many of the blocks in the specification are hierarchical, which means that the internal functionality of the blocks is also specified as block diagrams ("nested" block diagrams). Blocks at the lowest level of the specification hierarchy, such as the individual FIR filters, are specified in a meta-C language (C augmented with special constructs for specifying block parameters and interface information).
In addition to offering intuitive appeal, the specification of systems in terms of connections between pre-defined, encapsulated functional blocks naturally promotes desirable software engineering practices such as modularity and code reuse. As the complexity of applications increases continually while time-to-market pressures remain intense, reuse of design effort across multiple products is becoming more and more crucial to meeting development schedules.
In addition to their syntactic and software engineering appeal, there are a number of more technical advantages of graphical DSP tools. These advantages hinge on the use of appropriate models of computation to provide the precise underlying block diagram semantics. In particular, the use of dataflow models of computation can enable the application of powerful verification and synthesis techniques. Broadly speaking, dataflow modeling involves representing an application as a directed graph in which the graph vertices represent computations and edges represent logical communication channels between computations. Dataflow-based graphical specification formats are used widely in commercial DSP design tools such as COSSAP by Synopsys, the Signal Processing Worksystem by Cadence, and the Advanced Design System by Hewlett-Packard. These three commercial tools all employ the synchronous dataflow model [9], the most popular variant of dataflow in existing DSP design tools. Synchronous dataflow specification allows bounded memory determination and deadlock detection to be performed comprehensively and efficiently at compile time. In contrast, both of these verification problems are in general impossible to solve (in finite time) for general purpose programming languages such as C.
Potentially the most useful benefit of dataflow-based graphical programming environments for DSP is that carefully-specified graphical programs can expose coarse-grain structure of the underlying algorithm, and this structure can be exploited to improve the quality of synthesized implementations in a wide variety of ways. For example, the process of scheduling — determining the order in which the computations in an application will execute — typically has a large impact on all of the key implementation metrics of a DSP system. A dataflow-based system specification exposes high-level scheduling flexibility that is often not possible to deduce manually or automatically from an assembly language or high-level procedural language specification. This scheduling flexibility can be exploited by a synthesis tool to streamline an implementation based on the given set of performance and cost constraints. We will elaborate on dataflow-based scheduling in sections 2.1.2 and 2.2.
Although graphical dataflow-based programming tools for DSP have become increasingly popular in recent years, the use of these tools in industry is largely limited to simulation and prototyping. The quality of today's graphical programming tools is not sufficient to consistently deliver production-quality implementations. As with procedural language compilation technology for PDSPs, synthesis from dataflow-based graphical specifications offers significant promise for the future, and is an important challenge confronting the DSP design and implementation research community today. Furthermore, these two forms of compiler technology are fully complementary to one another: the mixture of dataflow and C (or any other procedural language), as described in the example of fig. 2, is an especially attractive specification format. In this format, coarse-grain "subprogram" interactions are specified in dataflow, while the functionality of individual subprograms is specified in C. Thus, dataflow synthesis techniques optimize the final implementation at the inter-subprogram level, while C compiler technology is required to perform fine-grained optimization within subprograms.
This paper motivates the problem of compiler technology development for DSP software implementation, provides a tutorial overview of modeling and optimization issues that are involved in the compilation of DSP software, and provides a review of techniques that have been developed by various researchers to address some of these issues. The first part of our overview focuses on coarse-grain software modeling and optimization issues pertinent to the compilation of graphical dataflow programs, and the second part focuses on fine-grained issues that arise in the compilation of high level procedural languages such as C.
These two levels of compiler technology (coarse-grain and fine-grain) are commonly referred to as software synthesis and code generation, respectively. More specifically, by software synthesis, we mean the automated derivation of a software implementation (application program) in some programming language given a library of subprogram modules, a subset of selected modules from this library, and a specification of how these selected modules interact to implement the target application. Fig. 2 is an example of a program specification that is suitable for software synthesis. Here, synchronous dataflow semantics are used to specify subprogram interactions. In section 2.2, we explore software synthesis issues for DSP.
On the other hand, code generation refers to the mapping of a software implementation in some programming language to an equivalent machine program for a specific programmable processor. Thus, the mapping of a C program onto the specific resources of the data path in fig. 1 is an example of code generation. We explore DSP code generation technology in section 3.
2 Compilation of dataflow programs to application programs

2.1 Dataflow modeling of DSP systems
To perform simulation, formal verification, or any kind of compilation from block-diagram DSP specifications, a precise set of semantics is needed that defines the interactions between different computational blocks in a specification. Dataflow-based computational models have proven to provide block-diagram semantics that are both intuitive to DSP system designers and efficient from the point of view of verification and synthesis.
In the dataflow paradigm, a computational specification is represented as a directed graph. Vertices in the graph (called actors) correspond to the computational modules in the specification. In most dataflow-based DSP design environments, actors can be of arbitrary complexity. Typically, they range from elementary operations such as addition or multiplication to DSP subsystems such as FFT units or adaptive filters.
An edge e in a dataflow graph represents the communication of data from its source actor to its sink actor. More specifically, an edge e represents a FIFO (first-in-first-out) queue that buffers data samples (tokens) as they pass from the output of one actor to the input of another. If e is a dataflow edge, we write src(e) and snk(e) for its source and sink actors. When dataflow graphs are used to represent signal processing applications, a dataflow edge e has a non-negative integer delay del(e) associated with it. The delay of an edge gives the number of initial data values that are queued on the edge. Each unit of dataflow delay is functionally equivalent to the z^{-1} operator: the sequence of data values {y_n} generated at the input of the actor snk(e) is equal to the shifted sequence {x_{n - del(e)}}, where {x_n} is the data sequence generated at the output of the actor src(e).

2.1.1 Consistency
Under the dataflow model, an actor can execute at any time that it has sufficient data on all input edges. An attempt to execute an actor when this constraint is not satisfied is said to cause buffer underflow on all edges that do not contain sufficient data. For dataflow modeling to be useful for DSP systems, the execution of actors must also accommodate input data sequences of unbounded length. This is because DSP applications often involve operations that are applied repeatedly to samples in indefinitely long input signals. For an implementation of a dataflow specification to be practical, the execution of actors must be such that the number of tokens queued on each FIFO buffer (dataflow edge) remains bounded throughout the execution of the dataflow graph. In other words, there should not be unbounded data accumulation on any edge in the dataflow graph.
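These rules can be made concrete with a minimal executable model of a single dataflow edge; the struct and field names below are illustrative, not taken from any particular tool.

```c
/* Minimal model of a dataflow edge: a FIFO whose occupancy starts at the
 * edge delay del(e). Firing the source adds prd(e) tokens; firing the sink
 * requires, and then removes, cns(e) tokens. */
typedef struct {
    int tokens;   /* current occupancy; initialized to del(e) */
    int prd;      /* tokens produced per source firing        */
    int cns;      /* tokens consumed per sink firing          */
} Edge;

void fire_source(Edge *e) { e->tokens += e->prd; }

/* Returns 1 on success, 0 if firing the sink would cause buffer underflow. */
int fire_sink(Edge *e)
{
    if (e->tokens < e->cns)
        return 0;             /* insufficient data on the edge */
    e->tokens -= e->cns;
    return 1;
}
```

An execution order that keeps `tokens` bounded on every edge, while never letting `fire_sink` fail, is exactly what the consistency requirements below demand.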
In summary, executing a dataflow specification of a DSP system involves two fundamental, processor-independent requirements — avoiding buffer underflow and avoiding unbounded data accumulation (buffering). The dataflow model imposes no further constraints on the sequence in which computations (actors) are executed. On the other hand, in procedural languages, such as C and FORTRAN, the ordering of statements as well as the use of control-flow constructs imply sequencing constraints beyond those that are required to satisfy data dependencies. By avoiding the overspecification of execution ordering, dataflow specifications provide synthesis tools with full flexibility to streamline the execution order to match the relevant implementation constraints and optimization objectives. This feature of dataflow is of critical importance for DSP implementation since, as we will see throughout the rest of this section, the execution order has a large impact on most important implementation metrics, such as performance, memory requirements, and power consumption.
The term "consistency" refers to the two essential requirements of DSP dataflow specifications — the absence of buffer underflow and of unbounded data accumulation. We say that a consistent dataflow specification is one that can be implemented without any chance of buffer underflow or unbounded data accumulation (regardless of the input sequences that are applied to the system). If there exist one or more sets of input sequences for which underflow and unbounded buffering are avoided, and there also exist one or more sets for which underflow or unbounded buffering results, we say that a specification is partially consistent. A dataflow specification that is neither consistent nor partially consistent is called an inconsistent specification. More elaborate forms of consistency based on a probabilistic interpretation of token flow are explored in [10].
Clearly, consistency is a highly desirable property for DSP software implementation. For most consistent dataflow graphs, tight bounds can be derived on the numbers of data values that coexist (data that has been produced but not yet consumed) on the individual edges (buffers). For such graphs, all buffer memory allocation can be performed statically, and thus, the overhead of dynamic memory allocation can be avoided entirely. This is a valuable feature when attempting to derive a streamlined software implementation.
2.1.2 Scheduling
A fundamentaltaskin synthesizingsoftwarefrom anSDFspecificationis thatof scheduling, which refersto theprocessof
determiningtheorderin which theactorswill beexecuted.Schedulingis eitherdynamicor static. In staticscheduling, the
actorexecutionorderis specifiedat synthesistime,andis fixed– in particular, theorderis not data-dependent.To beuseful
in handlingindefinitely long input datasequences,a staticschedulemustbe periodic. A periodic,staticschedulecanbe
implementedin a finite amountof programmemoryspaceby encapsulatingtheprogramcodefor oneperiodof theschedule
within aninfinite loop. Indeed,this is how suchschedulesaremostoftenimplementedin practice.
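The sketch below shows the shape of such synthesized code for a hypothetical two-actor SDF graph in which actor A produces two tokens per firing and actor B consumes one, so that one schedule period is "A B B". The actors, rates, and buffer bound are assumptions for illustration, not output of any actual synthesis tool.

```c
/* Synthesized code for a periodic static schedule. The buffer bound (2)
 * is known at synthesis time, so the edge buffer is allocated statically. */
static int buf[2];          /* statically allocated edge buffer */
static int head, count;     /* FIFO state                        */
static int sink_sum;        /* observable effect of actor B      */

static void actor_A(void)   /* produces 2 tokens per firing */
{
    buf[(head + count) % 2] = 1; count++;
    buf[(head + count) % 2] = 2; count++;
}

static void actor_B(void)   /* consumes 1 token per firing */
{
    sink_sum += buf[head];
    head = (head + 1) % 2; count--;
}

/* One schedule period "A B B"; returns the buffer occupancy afterwards,
 * which returns to its initial value (0) so the period can repeat forever. */
int run_schedule_period(void)
{
    actor_A(); actor_B(); actor_B();
    return count;
}

/* In the synthesized program, the period is wrapped in an infinite loop:
 *     for (;;) run_schedule_period();
 */
```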
In dynamic scheduling, the sequence of actor executions (schedule) is not specified during synthesis, and run-time decision-making is required to ensure that actors are executed only when their respective input edges have sufficient data. Disadvantages of dynamic scheduling include the overhead (execution time and power consumption) of performing scheduling decisions at run-time, and decreased predictability, especially in determining whether or not any relevant real-time constraints will be satisfied. However, if the data production/consumption behavior of individual actors exhibits significant data-dependence, then dynamic scheduling may be required to avoid buffer underflow and unbounded data accumulation. Furthermore, if the performance characteristics of actors are impossible to estimate accurately, then effective dynamic scheduling leads to better performance by adaptively streamlining the schedule evolution to match the dynamic characteristics of the actors.
For most DSP applications, including the vast majority of applications that are amenable to the SDF model mentioned in section 1, actor behavior is highly predictable. For such applications, given the tight cost and power constraints that are typical of embedded DSP applications, it is highly desirable to avoid dynamic scheduling overhead as much as possible. The ultimate goal under such a high level of predictability is a (periodic) static schedule. If it is not possible to construct a static schedule, then it is desirable to identify "maximal" subsystems that can be scheduled statically, and use a small amount of dynamic decision-making to coordinate the execution of these statically-scheduled subsystems. Schedules that are constructed using such a hybrid, mostly static approach are called quasi-static schedules.
2.1.3 Synchronous dataflow

A dataflow computation model can be viewed as a subclass of dataflow graph specifications. A wide variety of dataflow computational models can be conceived depending on the restrictions that are imposed on the manner in which dataflow actors consume and produce data. For example, synchronous dataflow (SDF), which is the simplest and currently the most popular form of dataflow for DSP, imposes the restriction that the number of data values produced by an actor onto each output edge is constant, and similarly the number of data values consumed by an actor from each input edge is constant. Thus, an SDF edge e has two additional attributes — the number of data values produced onto e by each invocation of the source actor, denoted prd(e), and the number of data values consumed from e by each invocation of the sink actor, denoted cns(e).

The example shown in fig. 2 conforms to the SDF model. An SDF abstraction of a scaled-down and simplified version of this system is shown in fig. 3. Here each edge e is annotated with the values prd(e) and cns(e) — the numbers of data values produced and consumed by the source and sink actors, respectively.

The restrictions imposed by the SDF model offer a number of important advantages.
- Simplicity. Intuitively, when compared to more general types of dataflow actors, actors that produce and consume data in constant-sized packets are easier to understand, develop, interface to other actors, and maintain. This property is difficult to quantify; however, the rapid and extensive adoption of SDF in DSP design tools clearly indicates that designers can easily learn to think of functional specifications in terms of the SDF model.
- Static scheduling and memory allocation. For SDF graphs, there is no need to resort to dynamic scheduling, or even quasi-static scheduling. For a consistent SDF graph, underflow and unbounded data accumulation can always be avoided with a periodic, static schedule. Moreover, tight bounds on buffer occupancy can be computed efficiently. By avoiding the run-time overheads associated with dynamic scheduling and dynamic memory allocation, efficient SDF graph implementations offer significant advantages when cost, power, or performance constraints are severe.
- Consistency verification. A dataflow model of computation is a decidable dataflow model if it can be determined in finite time whether or not an arbitrary specification in the model is consistent. We say that a dataflow model is a binary-consistency model if every specification in the model is either consistent or inconsistent. In other words, a model is a binary-consistency model if it contains no partially consistent specifications. All of the decidable dataflow models that are used in practice today are binary-consistency models.

  Binary consistency is convenient from a verification point of view since consistency becomes an inherent property of a specification: whether or not buffer underflow or unbounded data accumulation arises is not dependent on the input sequences that are applied. Of course, such convenience comes at the expense of restricted applicability. A binary-consistency model cannot be used to specify all applications.

  The SDF model is a binary-consistency model, and efficient verification techniques exist for determining whether or not an SDF graph is consistent. Although SDF has limited expressive power in exchange for this verification efficiency, the model has proven to be of great practical value. SDF encompasses a broad and important class of signal processing and digital communications applications, including modems, multirate filter banks [8], and satellite receiver systems, just to name a few [9, 11, 12].
For SDF graphs, the mechanics of consistency verification are closely related to the mechanics of scheduling. The interrelated problems of verifying and scheduling SDF graphs are discussed in detail below.
2.1.4 Static scheduling of SDF graphs

The first step in constructing a static schedule for an SDF graph G = (V, E) is determining the number of times q(A) that each actor A ∈ V should be invoked in one period of the schedule. To ensure that the schedule period can be repeated indefinitely without unbounded data accumulation, the constraint

    q(src(e)) × prd(e) = q(snk(e)) × cns(e),  for every edge e ∈ E    (1)

must be satisfied. The system of equations (1) is called the set of balance equations for G.

Clearly, a useful periodic schedule can be constructed only if the balance equations have a positive integer solution q (q(A) > 0 for all A ∈ V). Lee and Messerschmitt have shown that for a general SDF graph G, exactly one of the following conditions holds [9]:

- The zero vector is the only solution to the balance equations, or
- There exists a minimal positive integer solution q to the balance equations, and thus every positive integer solution q′ satisfies q′(A) ≥ q(A) for all A. This minimal vector q is called the repetitions vector of G.
If the former condition holds, then G is inconsistent. Otherwise, a bounded buffer periodic schedule can be constructed provided that it is possible to construct a sequence of actor executions such that buffer underflow is avoided, and each actor A is executed exactly q(A) times. Given a consistent SDF graph, we refer to an execution sequence that satisfies these two properties as a valid schedule period, or simply a valid schedule. Clearly, a bounded memory static schedule can be implemented in software by encapsulating the implementation of any valid schedule within an infinite loop.

A linear-time (O(|V| + |E|)) algorithm to determine whether or not a repetitions vector exists, and to compute a repetitions vector whenever one does exist, can be found in [11].
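As a sketch of how the balance equations can be solved, the routine below propagates rational rates outward from an arbitrary seed actor and then scales by a common denominator. It assumes a connected graph with positive prd/cns values, and it is not the linear-time algorithm of [11] — just a compact illustration of the same computation.

```c
#define MAXV 16

typedef struct { int src, snk, prd, cns; } SdfEdge;

static long gcd(long a, long b) { return b ? gcd(b, a % b) : a; }

/* Fills q[0..nv-1] with the repetitions vector and returns 1 on success;
 * returns 0 if the graph is inconsistent (only the zero solution exists)
 * or not connected to actor 0. */
int repetitions_vector(int nv, int ne, const SdfEdge *e, long *q)
{
    long num[MAXV] = {0}, den[MAXV] = {0};
    int i, v, changed;
    num[0] = den[0] = 1;                 /* seed: q(actor 0) = 1/1 */
    do {                                 /* propagate rates to a fixpoint */
        changed = 0;
        for (i = 0; i < ne; i++) {
            int s = e[i].src, t = e[i].snk, u = -1;
            if (num[s] && !num[t]) {     /* q(t) = q(s) * prd(e) / cns(e) */
                num[t] = num[s] * e[i].prd;
                den[t] = den[s] * e[i].cns;
                u = t;
            } else if (num[t] && !num[s]) {
                num[s] = num[t] * e[i].cns;
                den[s] = den[t] * e[i].prd;
                u = s;
            }
            if (u >= 0) {
                long g = gcd(num[u], den[u]);
                num[u] /= g; den[u] /= g;
                changed = 1;
            }
        }
    } while (changed);
    for (v = 0; v < nv; v++)
        if (!num[v]) return 0;           /* actor unreachable from seed */
    for (i = 0; i < ne; i++)             /* check every balance equation */
        if (num[e[i].src] * e[i].prd * den[e[i].snk]
            != num[e[i].snk] * e[i].cns * den[e[i].src])
            return 0;                    /* inconsistent graph */
    long l = 1, g;
    for (v = 0; v < nv; v++) l = l / gcd(l, den[v]) * den[v];  /* lcm */
    for (v = 0; v < nv; v++) q[v] = num[v] * (l / den[v]);
    for (g = q[0], v = 1; v < nv; v++) g = gcd(g, q[v]);
    for (v = 0; v < nv; v++) q[v] /= g;  /* minimal positive solution */
    return 1;
}
```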
For example, consider the SDF graph shown in fig. 3. The repetitions vector for this graph takes on only the values 2 and 1: one group of the actors has a repetitions-vector component of 2, and each of the remaining actors has a component of 1. (2)
If arepetitionsvectorexistsfor anSDFgraph,but avalid scheduledoesnotexist, thenthegraphis saidto bedeadlocked.
Thus,anSDFgraphis consistentif andonly if arepetitionsvectorexists,andthegraphis notdeadlocked.In general,whether
or notagraphis deadlockeddependsontheedgedelays e ID # aswell theproductionandconsumptionparameters
# and "M # . An exampleof a deadlockedSDFgraphis givenin fig. 4. An annotationof theform D next to an
edgein thefigurerepresentsadelayof units.Notethattherepetitionsvectorfor thisgraphis givenby
` 6GY K ` 65 q< ` X89 :" (3)
Once a repetitions vector q has been computed, deadlock detection and the construction of a valid schedule can be performed concurrently. Premature termination of the scheduling procedure — termination before each actor A has been fully scheduled (scheduled q(A) times) — indicates deadlock. One simple approach is to schedule actor invocations one at a time and simulate the buffer activity in the dataflow graph accordingly until all actors are fully scheduled. The buffer simulation is necessary to ensure that buffer underflow is avoided. A pseudocode specification of this simple approach can be found in [11]. Lee and Messerschmitt show that this approach terminates prematurely if and only if the input graph is deadlocked, and otherwise, regardless of the specific order in which actors are selected for scheduling, a valid schedule is always constructed [13].
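This simulate-and-fire procedure can be sketched as follows; the graph encoding and fixed array bounds are illustrative, and the routine is not the published pseudocode of [11].

```c
#define NV 8
#define NE 16

typedef struct { int src, snk, prd, cns, del; } GEdge;

/* Builds a valid schedule by repeatedly firing any actor that still has
 * invocations left and sufficient input tokens. Returns the schedule
 * length (actor indices written to sched), or -1 on deadlock, i.e. when
 * the procedure terminates prematurely. q is the repetitions vector. */
int build_schedule(int nv, int ne, const GEdge *e, const int *q, int *sched)
{
    int tok[NE], left[NV], n = 0, i, v;
    for (i = 0; i < ne; i++) tok[i] = e[i].del;  /* initial tokens = delays */
    for (v = 0; v < nv; v++) left[v] = q[v];
    for (;;) {
        int fired = 0, pending = 0;
        for (v = 0; v < nv; v++) {
            int ok = 1;
            if (left[v] == 0) continue;          /* fully scheduled */
            pending = 1;
            for (i = 0; i < ne; i++)
                if (e[i].snk == v && tok[i] < e[i].cns)
                    ok = 0;                      /* firing v would underflow */
            if (!ok) continue;
            for (i = 0; i < ne; i++) {           /* simulate firing actor v */
                if (e[i].snk == v) tok[i] -= e[i].cns;
                if (e[i].src == v) tok[i] += e[i].prd;
            }
            sched[n++] = v; left[v]--; fired = 1;
        }
        if (!pending) return n;                  /* valid schedule complete */
        if (!fired) return -1;                   /* premature stop: deadlock */
    }
}
```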
In summary, SDF is currently the most widely-used dataflow model in commercial and research-oriented DSP design tools. Commercial tools that employ SDF semantics include Simulink by The MathWorks, SPW by Cadence, and HP Ptolemy by Hewlett-Packard. SDF-based research tools include Gabriel [14] and several key domains in Ptolemy [7], from U.C. Berkeley, and ASSIGN from Carnegie Mellon [15]. The SDF model offers efficient verification of consistency for arbitrary specifications, and efficient construction of static schedules for all consistent specifications. Our discussion above outlined a simple, systematic technique for constructing a static schedule whenever one exists. In practice, however, it is preferable to employ more intricate scheduling strategies that take careful account of the costs (performance, memory consumption, etc.) of the generated schedules. In section 2.2, we will discuss techniques for streamlined scheduling of SDF graphs based on the constraints and optimization objectives of the targeted implementation. In the remainder of this section, we discuss a number of useful extensions to the SDF model.
2.1.5 Cyclo-static dataflow

Cyclo-static dataflow (CSDF) and scalable synchronous dataflow (described in section 2.1.6) are presently the most widely-used extensions of SDF. In CSDF, the number of tokens produced and consumed by an actor is allowed to vary as long as the variation takes the form of a fixed, periodic pattern [16, 17]. More precisely, each actor A in a CSDF graph has associated with it a fundamental period τ(A) ∈ {1, 2, ...}, which specifies the number of phases in one minimal period of the cyclic production/consumption pattern of A. For each input edge e to A, the scalar SDF attribute cns(e) is replaced by a τ(A)-tuple (C_1(e), C_2(e), ..., C_{τ(A)}(e)), where each C_i(e) is a nonnegative integer that gives the number of data values consumed from e by A in the ith phase of each period of A. Similarly, for each output edge e of A, prd(e) is replaced by a τ(A)-tuple (P_1(e), P_2(e), ..., P_{τ(A)}(e)), which gives the numbers of data values produced onto e in successive phases of A.
A simpleexampleof a CSDFactor is illustratedin fig. 5(a). This actor is a conventionaldownsampleractor (with
downsamplingfactor3) from multiratesignalprocessing.Functionally, adownsampler, performsthefunction FB1q% v F: 1h : , wherefor ^: < 4 , and %' denotethe datavaluesproducedandconsumed,respectively. Thus,for every
inputvaluethatis copiedto theoutput,v : input valuesarediscarded.As shown in fig. 5(b) for | , this functionality
canbespecifiedby a CSDFactorthathasv
phases.A datavalueis consumedon the input for allv
phases,resultingin
thev
-componentconsumptiontuple : : 4 : ; however, a datavalueis producedonto theoutputedgeonly on thefirst
phase,resultingin theproductiontuple : \\ 4 \ .
Like SDF, CSDF is a binary consistency model, and it is possible to perform efficient verification of bounded memory requirements and buffer underflow avoidance for CSDF graphs [17]. Furthermore, static schedules can always be constructed for consistent CSDF graphs.
A CSDF actor A can easily be converted into an SDF actor A' such that if identical sequences of input data values are applied to A and A', then identical output data sequences result. Such a functionally equivalent SDF actor A' can be derived by having each invocation of A' implement one fundamental CSDF period of A (that is, tau(A) successive phases of A). Thus, for each input edge e' of A', the SDF consumption parameter is given by

cns(e') = C1(e) + C2(e) + ... + C_tau(A)(e),

and for each output edge e' of A', the SDF production parameter is given by

prd(e') = P1(e) + P2(e) + ... + P_tau(A)(e),

where e is the corresponding edge of the CSDF actor A. Applying this conversion to the downsampler example discussed above gives an "SDF equivalent" downsampler that consumes a block of N input data values on each invocation, and produces a single data value, which is a copy of the first value in the input block. The SDF equivalent for fig. 5(a) is illustrated in fig. 5(b).
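The conversion just described amounts to summing each per-phase rate tuple; a minimal sketch (the tuple representation and function name are our own):

```python
def csdf_to_sdf(consumption_tuples, production_tuples):
    """Collapse a CSDF actor's per-phase rate tuples into SDF rates.

    Each invocation of the resulting SDF actor implements one full
    fundamental period (all phases) of the CSDF actor, so each SDF
    rate is the sum of the corresponding phase tuple.
    """
    cns = {e: sum(t) for e, t in consumption_tuples.items()}
    prd = {e: sum(t) for e, t in production_tuples.items()}
    return cns, prd

# Downsampler with factor N = 3: consumes on every phase,
# produces only in the first phase.
cns, prd = csdf_to_sdf({"in": (1, 1, 1)}, {"out": (1, 0, 0)})
```

The resulting SDF actor consumes a 3-value block and produces one value per invocation, as in the downsampler example above.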
Since any CSDF actor can be converted to a functionally equivalent SDF actor, it follows that CSDF does not offer increased expressive power at the level of individual actor functionality (input-output mappings). However, the CSDF model can offer increased flexibility in compactly and efficiently representing interactions between actors.
As an example of increased flexibility in expressing actor interactions, consider the CSDF specification illustrated in fig. 6. This specification represents a recursive digital filter computation of the form

y[n] = k*k*y[n-1] + x[n].   (4)

In fig. 6, the two-phase CSDF actor labeled A represents a scaling (multiplication) by the constant factor k. In each of its two phases, actor A consumes a data value from one of its input edges, multiplies the data value by k, and produces the resulting value onto one of its output edges. The CSDF specification of fig. 6 thus exploits our ability to compute (4) using the equivalent formulation

y[n] = k*(k*y[n-1]) + x[n],   (5)

which requires only addition blocks and k-scaling blocks. Furthermore, the two k-scaling operations contained in (5) are consolidated into a single CSDF actor (actor A).
Such consolidation of distinct operations from different data streams offers two advantages. First, it leads to more compact representations, since fewer vertices are required in the CSDF graph. For large or complex applications, this can result in more intuitive representations, and can reduce the time required to perform various analysis and synthesis tasks. Second, it allows a precise modeling of resource sharing decisions — pre-specified bindings of multiple operations in a DSP application onto individual hardware resources (such as functional units) or software resources (such as subprograms) — within the framework of dataflow. Such pre-specified bindings may arise from constraints imposed by the designer, and from decisions taken during synthesis or design space exploration.

The ability to compactly and precisely model the sharing of actors in CSDF stems from the ability to selectively "turn off" data dependencies on arbitrary subsets of input edges in any given phase of an actor. In contrast, an SDF actor requires at least one data value on each input edge before it can be invoked. In the presence of feedback loops, this requirement may preclude a shared representation of an actor in SDF, even though it may be possible to achieve the desired sharing using a functionally equivalent CSDF actor. This is illustrated in fig. 7, which is derived from the CSDF specification of fig. 6 by replacing the "shared" CSDF actor with its functionally equivalent SDF counterpart. Since the graph of fig. 7 contains a delay-free cycle, clearly we can conclude that the graph is deadlocked, and thus a valid schedule does not exist. In other words, this is an inconsistent dataflow specification. In contrast, it is easily verified that a valid schedule exists for the CSDF specification of fig. 6, in which the first and second phases A1 and A2 of the CSDF actor A are fired in alternation with the remaining actors.

Similarly, an SDF model of a hierarchical actor may introduce deadlock in a system specification, and such deadlock can often be avoided by replacing the hierarchical SDF actor with a functionally equivalent hierarchical CSDF actor. Here, by a hierarchical SDF actor we mean an actor whose internal functionality is specified by an SDF graph. The utility of CSDF in constructing hierarchical specifications is illustrated in fig. 8.
CSDF also offers decreased buffering requirements for some applications. An illustration is shown in fig. 9. Fig. 9(a) depicts a system in which N-element blocks of data are alternately distributed from the data source to two processing modules P1 and P2. The actor that performs the distribution is modeled as a two-phase CSDF actor that inputs an N-element data block on each phase, sends the input block to P1 in the first phase, and sends the input block to P2 in the second phase. It is easily seen that the CSDF specification of fig. 9(a) can be implemented with a buffer of size N on each of the three edges. Thus, the total buffering requirement is 3N for this specification.

If we replace the CSDF "block-distributor" actor with its functionally equivalent SDF counterpart, then we obtain the pure SDF specification depicted in fig. 9(b). The SDF version of the distributor must process two blocks at a time to conform to SDF semantics. As a result, the edge that connects the data source to the distributor requires a buffer of size 2N. Thus, the total buffering requirement of the SDF graph of fig. 9(b) is 4N, which is 33% greater than that of the CSDF version of fig. 9(a).
Yet another advantage offered by CSDF is that by decomposing actors into a finer (phase-level) granularity of specification, basic behavioral optimizations such as constant propagation and dead code elimination [18, 54] are facilitated significantly [19]. As a simple example of dead code elimination with CSDF, consider the CSDF specification shown in fig. 10(a) of a multirate FIR filtering system that is expressed in terms of basic multirate building blocks. From this graph, the equivalent expanded homogeneous SDF graph, shown in fig. 10(b), can be derived using concepts discussed in [9, 17]. In the expanded graph, each actor corresponds to a single phase of a CSDF actor or a single invocation of an SDF actor within a single period of a periodic schedule. From fig. 10(b) it is apparent that the results of some computations (SDF invocations or CSDF phases) are never needed in the production of any of the system outputs. Such computations correspond to dead code and can be eliminated during synthesis without compromising correctness. For this example, the complete set of subgraphs that correspond to dead code is illustrated in fig. 10(b). Parks, Pino, and Lee show that such "dead subgraphs" can be detected with a straightforward algorithm [19].
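One straightforward way to detect such dead computations, in the spirit of [19], is a reverse reachability sweep over the expanded graph: any vertex from which no system output is reachable is dead. The graph encoding and function name below are our own illustration, not the algorithm of [19] verbatim.

```python
def find_dead_vertices(vertices, edges, outputs):
    """Return the set of vertices whose results never reach a system output.

    vertices: iterable of vertex names in the expanded graph.
    edges: iterable of (src, snk) pairs.
    outputs: vertices designated as system outputs.
    """
    # Build the predecessor map so we can walk the graph backwards.
    preds = {}
    for s, t in edges:
        preds.setdefault(t, []).append(s)
    live = set(outputs)
    stack = list(outputs)
    while stack:
        v = stack.pop()
        for p in preds.get(v, []):
            if p not in live:
                live.add(p)
                stack.append(p)
    return set(vertices) - live
```

For instance, with edges a -> c and b -> d and system output c, the sweep marks a and c live, so b and d are reported as dead and can be pruned during synthesis.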
In summary, CSDF is a useful generalization of SDF that maintains the properties of binary consistency, efficient verification, and static scheduling while offering a richer range of inter-actor communication patterns, improved support for hierarchical specifications, more economical data buffering, and improved support for basic behavioral optimizations. CSDF concepts are used in a number of commercial design tools such as DSP Canvas by Angeles Design Systems, and Virtuoso Synchro by Eonic Systems.
2.1.6 Scalable synchronous dataflow
The scalable synchronous dataflow (SSDF) model is an extension of SDF that enables software synthesis of vectorized implementations, which exploit the facility for efficient block processing in many DSP applications [20]. The internal (host language) specification of an SSDF actor A assumes that the actor will be executed in groups of N_A successive invocations, which operate on (N_A * cns(e))-unit blocks of data at a time from each input edge e. Such block processing reduces the rate of inter-actor context switching, and of context switching between successive code segments within complex actors, and it also may improve execution efficiency significantly on deeply pipelined architectures. The vectorization parameter N_A of each SSDF actor is selected carefully during synthesis. This selection should be based on constraints imposed by the SSDF graph structure; the memory constraints and performance requirements of the target application; and on the following extended version of the SDF balance equation (1) constraints:

(N_src(e) * prd(e)) * qv(src(e)) = (N_snk(e) * cns(e)) * qv(snk(e))   for every edge e in the SSDF graph,   (6)

where qv(A) = q(A)/N_A must be a positive integer for each actor A, and q is the repetitions vector of the SDF graph that results when the vectorization parameter of each actor is set to unity. Since the utility of SSDF is closely tied to optimized synthesis techniques, we defer detailed discussion of SSDF to section 2.2.4, which focuses on throughput-oriented optimization issues for software synthesis.

SSDF is a key specification model in the popular COSSAP design tool that was originally developed by Cadis and the Aachen University of Technology [21], and is now developed by Synopsys.
2.1.7 Other dataflow models
The SDF, CSDF, and SSDF models discussed above are all used in widely-distributed DSP design tools. A number of more experimental DSP dataflow models have also been proposed in recent years. Although these models all offer additional insight on dataflow modeling for DSP, further research and development is required before the practical utility of these models is clearly understood. In the remainder of this section, we briefly review some of these experimental models.

The multidimensional synchronous dataflow model (MDSDF), proposed by Lee [22], and explored further by Murthy [23], extends SDF concepts to applications that operate on multidimensional signals, such as those arising in image and video processing. In MDSDF, each actor produces and consumes data in units of n-dimensional cubes, where n can be arbitrary, and can differ from actor to actor. The "synchrony" requirement in MDSDF constrains each production and consumption n-cube to be of fixed size m1 x m2 x ... x mn, where each mi is a constant. For example, an image processing actor that expands a 512 x 512-pixel image segment into a 1024 x 1024 segment would have the MDSDF representation illustrated in fig. 11.
We say that a dataflow computation model is statically schedulable if a static schedule can always be constructed for a consistent specification in the model. For SDF, CSDF, and MDSDF, binary consistency and static schedulability both hold. The well-behaved dataflow (WBDF) model [24], proposed by Gao, Govindarajan, and Panangaden, is an example of a binary-consistency model that is not statically schedulable. The WBDF model permits the use of a limited set of data-dependent control-flow constructs, and thus requires dynamic scheduling, in general. However, the use of these constructs is restricted in such a way that the inter-related properties of binary consistency and efficient bounded memory verification are preserved, and the construction of efficient quasi-static schedules is facilitated.
The boolean dataflow (BDF) model [25] is an example of a DSP dataflow model for which binary consistency does not hold. BDF introduces the concept of control inputs, which are actor inputs that affect the number of tokens produced and consumed at other input/output ports. In BDF, the values of control inputs are restricted to the two-element boolean set {TRUE, FALSE}. The number of tokens consumed by an actor from a non-control input edge, or produced onto an output edge, is restricted to be constant, as in SDF, or a function of one or more data values consumed at control inputs. BDF attains greatly increased expressive power by allowing data-dependent production and consumption rates. In exchange, some of the intuitive simplicity and appeal of SDF is lost; static scheduling cannot always be employed; and the problems of bounded memory verification and deadlock detection become undecidable [26], which means that in general, they cannot be solved in finite time. However, heuristics have been developed for constructing efficient quasi-static schedules, and for attempting to verify bounded memory requirements. These heuristics have been shown to work well in practice [26]. A natural extension of BDF, called integer-controlled dataflow, that allows control tokens to take on arbitrary integer values has been explored in [27].
2.2 Optimized synthesis of DSP software from dataflow specifications
In section 2.1, we reviewed several dataflow models for high-level, block diagram specification of DSP systems. Among these models, SDF and the closely related SSDF model are the most mature. In this section we examine fundamental trade-offs and algorithms involved in the synthesis of DSP software from SDF and SSDF graphs. Except for the vectorization approaches discussed in section 2.2.4, the techniques discussed in this section apply equally well to both SDF and SSDF. For clarity, we present these techniques uniformly in the context of SDF.
2.2.1 Threaded implementation of dataflow graphs
A software synthesis tool generates application programs by piecing together code modules from a predefined library of software building blocks. These code modules are defined in terms of the target language of the synthesis tool. Most SDF-based design systems use a model of synthesis called threading. Given an SDF representation of a block-diagram program specification, a threaded synthesis tool begins by constructing a periodic schedule. The synthesis tool then steps through the schedule, and for each actor instance A that it encounters, it inserts the associated code module M_A from the given library (inline threading), or inserts a call to a subroutine that invokes M_A (subprogram threading). Threaded tools may employ purely inline threading, purely subroutine threading, or a mixture of inline and subprogram-based instantiation of actor functionality (hybrid threading). The sequence of code modules / subroutine calls that is generated from a dataflow graph is processed by a buffer management phase that inserts the necessary target program statements to route data appropriately between actors.
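Inline threading can be pictured as a simple traversal of the periodic schedule that splices in each actor's code module. The following sketch illustrates the idea; the library contents and the emitted C fragments are invented for illustration and do not come from any particular tool.

```python
def inline_thread(schedule, library):
    """Generate a target program body by inline threading.

    schedule: firing sequence of actor names, e.g. ["A", "B", "B"].
    library: dict mapping actor name -> target-language code module.
    """
    lines = ["/* begin schedule period */"]
    for actor in schedule:
        # Inline threading: splice the actor's code module in directly.
        lines.append(library[actor])
    lines.append("/* end schedule period */")
    return "\n".join(lines)

# Hypothetical two-actor library with invented C statements.
library = {"A": "a_out[0] = read_sample();",
           "B": "write_sample(a_out[0]);"}
program = inline_thread(["A", "B"], library)
```

A subprogram-threading variant would instead emit `actor_A();`-style calls, trading code size for call overhead; the buffer management phase would then rewrite placeholder buffer references such as `a_out` into concrete memory accesses.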
2.2.2 Scheduling tradeoffs
In this section, we provide a glimpse at the complex range of trade-offs that are involved during the scheduling phase of the synthesis process. At present, we consider only inline threading. Subprogram and hybrid threading are considered in section 2.2.5. Synthesis techniques that pertain to SSDF, which are discussed in section 2.2.4, can be applied with similar effectiveness to inline, subprogram, or hybrid threading.

Scheduling is a critical task in the synthesis process. In a software implementation, scheduling has a large impact on key metrics such as program and data memory requirements, performance, and power consumption. Even for a simple SDF graph, the underlying range of trade-offs may be very complex. For example, consider the SDF graph in fig. 12(a). The repetitions vector components for this graph are q(A) = 1, q(B) = q(C) = 10. One possible schedule for this graph is given by

S1 = BCBCBCBCBC A BCBCBCBCBC.   (7)

This schedule exploits the additional scheduling flexibility offered by the delays placed on edge (A, B). Recall that each delay results in an initial data value on the associated edge. Thus, in fig. 12, five executions of B can occur before A is invoked, which leads to a reduction in the amount of memory required for data buffering.

To discuss such reductions in buffering requirements precisely, we need a few definitions. Given a schedule, the buffer size of an SDF edge is the maximum number of live tokens (tokens that are produced but not yet consumed) that coexist on the edge throughout execution of the schedule. The buffer requirement of a schedule S, denoted buf(S), is the sum of the buffer sizes of all of the edges in the given SDF graph. For example, it is easily verified that buf(S1) = 11.

The quantity buf(S) is the number of memory locations required to implement the dataflow buffers in the input SDF graph assuming that each buffer is mapped to a separate segment of memory. This is a natural and convenient model of buffer implementation. It is used in SDF design tools such as Cadence's SPW and the SDF-related code generation domains of Ptolemy. Furthermore, scheduling techniques that employ this buffering model do not preclude the sharing of memory locations across multiple, non-interfering edges (edges whose lifetimes do not overlap): the resulting schedules can be post-processed by any general technique for array memory allocation, such as the well-known first-fit or best-fit algorithms. In this case, the scheduling techniques, which attempt to minimize the sum of the individual buffer sizes, employ a buffer memory metric that is an upper bound approximation to the final buffer memory cost.
One problem with the schedule S1 under the assumed inline threading model is that it consumes a relatively large amount of program memory. If c(A) denotes the code size (number of program memory words required) for an actor A, then the code size cost of S1 can be expressed as c(A) + 10c(B) + 10c(C).

By exploiting the repetitive subsequences in the schedule to organize compact looping structures, we can reduce the code size required for the actor execution sequence implemented by S1. The structure of the resulting software implementation can be represented by the looped schedule

S2 = (5 BC) A (5 BC).   (8)

Each parenthesized term (n T1 T2 ... Tm) (called a schedule loop) in such a looped schedule represents the n successive repetitions of the invocation sequence T1 T2 ... Tm. Each iterand Ti can be an instantiation (appearance) of an actor, or a looped subschedule. Thus, this notation naturally accommodates nested loops.
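Using a nested-tuple encoding of schedule loops (a representation of our own choosing), the firing sequence denoted by a looped schedule can be recovered by recursive expansion:

```python
def expand(looped):
    """Expand a looped schedule into its underlying firing sequence.

    A looped schedule is a list whose items are either actor names or
    tuples (n, body), meaning n successive repetitions of the
    sub-schedule body. Nested loops expand recursively.
    """
    seq = []
    for item in looped:
        if isinstance(item, tuple):
            n, body = item
            seq.extend(expand(body) * n)
        else:
            seq.append(item)
    return seq

# S2 = (5 BC) A (5 BC) from the example above.
s2 = [(5, ["B", "C"]), "A", (5, ["B", "C"])]
```

Expanding `s2` reproduces the 21-firing sequence of S1, which is exactly the sense in which S2 and S1 share the same underlying firing sequence.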
Given an arbitrary firing sequence F (that is, a schedule that contains no schedule loops), and a set of code size costs for all of the given actors, a looped schedule can be derived that minimizes the total code size (over all looped schedules that have F as the underlying firing sequence) using an efficient dynamic programming algorithm [28] called CDPPO. It is easily verified that the schedule S2 achieves the minimum total code size for the firing sequence S1 for any given values of c(A), c(B), and c(C). In general, however, the set of looped schedules that minimize the code size cost for a firing sequence may depend on the relative costs of the individual actors [28].

Schedules S1 and S2 both attain the minimum achievable buffer requirement of 11 for fig. 12; however, S2 will generally achieve a much lower code size cost. The code size cost of S2 can be approximated as c(A) + 2c(B) + 2c(C). This approximation neglects the code size overhead of implementing the schedule loops (parenthesized terms) within S2. In practice, this approximation rarely leads to misleading results. The looping overhead is typically very small compared to the code size saved by consolidating actor appearances in the schedule. This is especially true for the large number of DSP processors that employ so-called "zero-overhead looping" facilities [2]. Scheduling techniques that abandon this approximation, and incorporate looping overhead, are examined in section 2.2.5.
It is possible to reduce the code size cost below what is achievable by S2; however, this requires an increase in the buffering cost. For example, consider the schedule S3 = A(10B)(10C). Such a schedule is called a single appearance schedule since it contains only one instantiation of each actor. Clearly (under the approximation of negligible looping overhead), any single appearance schedule gives a minimal code size implementation of a dataflow graph. However, a penalty in the buffer requirement must usually be paid for such code size optimality. For example, the code size cost of S3 is c(B) + c(C) less than that of S2; however, buf(S3) = 25, while buf(S2) is only 11.
Beyond code size optimality, another potentially important benefit of schedule S3 is that it minimizes the average rate at which inter-actor context switching occurs. This schedule incurs 3 context switches (also called actor activations) per schedule period, while S1 and S2 both incur 21. Such minimization of context switching can significantly improve throughput and power consumption. The issue of context switching, and the systematic construction of minimum-context-switch schedules, are discussed further in section 2.2.4.

An alternative single appearance schedule for fig. 12 is S4 = A(10 BC). This schedule has the same optimal code size cost as S3. However, its buffer requirement of 16 is lower than that of S3, since execution of actors B and C is fully interleaved, which limits data accumulation on the edge (B, C). This interleaving, however, brings the average rate of context switches to 21; and thus, S3 is clearly advantageous in terms of this metric.
In summary, there is a wide, complex range of trade-offs involved in synthesizing an application program from a dataflow specification. This is true even when we restrict ourselves to inline implementations, which entirely avoid the (call/return/parameter passing) overhead of subroutines. In the remainder of this section, we review a number of techniques that have been developed for addressing some of these complex trade-offs. Sections 2.2.3 and 2.2.4 focus primarily on inline implementations. In section 2.2.5, we examine some recently-developed techniques that incorporate subroutine-based threading into the design space.
2.2.3 Minimization of memory requirements
Minimizing program and data memory requirements is critical in many embedded DSP applications. On-chip memory capacities are limited, and the speed, power, and financial cost penalties of employing off-chip memory may be prohibitive or highly undesirable. Three general avenues have been investigated for minimizing memory requirements — minimization of the buffer requirement, which usually forms a significant component of the overall data space cost; minimization of code size; and joint exploration of the trade-off involving code size and buffer requirements.

It has been shown that the problem of constructing a schedule that minimizes the buffer requirement over all valid schedules is NP-complete [11]. Thus, for practical, scalable algorithms, we must resort to heuristics. Ade [29] has developed techniques for computing tight lower bounds on the buffer requirement for a number of restricted subclasses of delayless, acyclic graphs, including arbitrary-length chain-structured graphs. Some of these bounds have been generalized to handle delays in [11]. Approximate lower bounds for general graphs are derived in [30]. Cubric and Panangaden have presented an algorithm that achieves optimum buffer requirements for acyclic SDF graphs that may have one or more independent, undirected cycles [31]. An effective heuristic for general graphs, which is employed in the Gabriel [14] and Ptolemy [7] systems, is given in [11]. Govindarajan, Gao, and Desai have developed an SDF buffer minimization algorithm for multiprocessor implementation [32]. This algorithm minimizes the buffer memory cost over all multiprocessor schedules that have optimal throughput.
For complex, multirate applications — which are the most challenging for memory management — the structure of minimum buffer schedules is in general highly irregular [33, 11]. Such schedules offer relatively few opportunities to organize compact loop structures, and thus have very high code size costs under inlined implementations. Thus, such schedules are often not useful even though they may achieve very low buffer requirements. Schedules at the extreme of minimum code size, on the other hand, typically exhibit a much more favorable trade-off between code and buffer memory costs [34].

These empirical observations motivate the problem of code size minimization. A central goal when attempting to minimize code size for inlined implementations is that of constructing a single appearance schedule whenever one exists. A valid single appearance schedule exists for any consistent, acyclic SDF graph. Furthermore, a valid single appearance schedule can be derived easily from any topological sort (a topological sort of a directed acyclic graph G is a linear ordering of all its vertices such that for each edge (x, y) in G, x appears before y in the ordering) of an acyclic graph G: if

(A1, A2, ..., Am)

is a topological sort of G, then it is easily seen that the single appearance schedule (q(A1) A1)(q(A2) A2) ... (q(Am) Am) is valid. For a cyclic graph, a single appearance schedule may or may not exist depending on the location and magnitude of delays in the graph. An efficient strategy, called the Loose Interdependence Algorithm Framework (LIAF), has been developed that constructs a single appearance schedule whenever one exists [35]. Furthermore, for general graphs, this approach guarantees that all actors that are not contained in a certain type of subgraph, called tightly interdependent subgraphs, will have only one appearance in the generated schedule [36]. In practice, tightly interdependent subgraphs arise only very rarely, and thus, the LIAF technique guarantees full code size optimality for most applications. Because of its flexibility and provable performance, the LIAF is employed in a number of widely used tools, including Ptolemy and Cadence's SPW.
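For an acyclic graph, the topological-sort construction described above is direct: sort the actors, then wrap each one in a loop iterated q(A) times. A minimal sketch (graph encoding our own):

```python
def topological_sort(vertices, edges):
    """Kahn's algorithm: return the vertices in dependency order."""
    indeg = {v: 0 for v in vertices}
    for _, t in edges:
        indeg[t] += 1
    order = []
    ready = [v for v in vertices if indeg[v] == 0]
    while ready:
        v = ready.pop()
        order.append(v)
        for s, t in edges:
            if s == v:
                indeg[t] -= 1
                if indeg[t] == 0:
                    ready.append(t)
    return order

def single_appearance_schedule(vertices, edges, q):
    """Build the looped schedule (q(A1) A1)(q(A2) A2)... as (count, actor) pairs."""
    return [(q[v], v) for v in topological_sort(vertices, edges)]

# The chain A -> B -> C with q = (1, 10, 10) from fig. 12.
sched = single_appearance_schedule(
    ["A", "B", "C"], [("A", "B"), ("B", "C")], {"A": 1, "B": 10, "C": 10})
```

For the fig. 12 chain this yields A(10B)(10C), i.e. the schedule S3 discussed in section 2.2.2; as noted above, different topological sorts of a wider graph yield distinct single appearance schedules with different buffer requirements.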
The LIAF constructs a single appearance schedule by decomposing the input graph into a hierarchy of acyclic subgraphs, which correspond to an outer-level hierarchy of nested loops in the generated schedule. The acyclic subgraphs in the hierarchy can be scheduled with any existing algorithm that constructs single appearance schedules for acyclic graphs. The particular algorithm that is used in a given implementation of the LIAF is called the acyclic scheduling algorithm. For example, the topological-sort-based approach described above could be used as the acyclic scheduling algorithm. However, this simple approach has been shown to lead to relatively large buffer requirements [11]. This motivates a key problem in the joint minimization of code and data for SDF specifications: the problem of constructing a single appearance schedule for an acyclic SDF graph that minimizes the buffer requirement over all valid single appearance schedules. Since any topological sort leads to a distinct schedule for an acyclic graph, and the number of topological sorts is not polynomially bounded in the graph size, exhaustive evaluation of single appearance schedules is not tractable. Thus, as with the (arbitrary appearance) buffer minimization problem, heuristics have been explored. Two complementary, low-complexity heuristics, called APGAN [37] and RPMC [38], have proven to be effective on practical applications when both are applied, and the best resulting schedule is selected. Furthermore, it has been formally shown that APGAN gives optimal results for a broad class of SDF systems. Thorough descriptions of APGAN, RPMC, and the LIAF, and their inter-relationships, can be found in [11, 34]. A scheduling framework for applying these techniques to multiprocessor implementations is described in [39]. Recently-developed techniques for efficient sharing of memory among multiple buffers from a single appearance schedule are developed in [40, 41].
Although APGAN and RPMC provide good performance on many applications, these heuristics can sometimes produce results that are far from optimal [42]. Furthermore, as discussed in section 1, DSP software tools are allowed to spend more time for optimization of code than what is required by low-complexity, deterministic algorithms such as APGAN and RPMC. Motivated by these observations, Zitzler, Teich, and Bhattacharyya have developed an effective stochastic optimization methodology, called GASAS, for constructing minimum buffer single appearance schedules [43, 44]. The GASAS approach is based on a genetic algorithm [45] formulation in which topological sorts are encoded as "chromosomes," which randomly "mutate" and "recombine" to explore the search space. Each topological sort in the evolution is optimized by the efficient local search algorithm CDPPO [28], which was mentioned earlier in section 2.2.2. Using dynamic programming, CDPPO computes a minimum memory single appearance schedule for a given topological sort. To exploit the valuable optimality property of APGAN whenever it applies, the solution generated by APGAN is included in the initial population, and an elitist evolution policy is enforced to ensure that the fittest individual always survives to the next generation.
2.2.4 Throughput optimization
At the Aachen University of Technology, as part of the COSSAP design environment (now developed by Synopsys) project, Ritz, Pankert, and Meyr have investigated the minimization of the context-switch overhead, or the average rate at which actor activations occur [20]. As discussed in section 2.2.2, an actor activation occurs whenever two distinct actors are invoked in succession; for example, the schedule (2(2B)(5A))(5C) for fig. 13 results in five activations per schedule period.

Activation overhead includes saving the contents of registers that are used by the next actor to invoke, if necessary, and loading state variables and buffer pointers into registers. The concept of grouping multiple invocations of the same actor together to reduce context-switch overhead is referred to as vectorization. The SSDF model, discussed in section 2.1.6, allows the benefits of vectorization to extend beyond the actor interface level (inter-actor context switching). For example, context switching between successive sub-functions of a complex actor can be amortized over N_A invocations of the sub-functions, where N_A is the given vectorization parameter.

Ritz estimates the average rate of activations for a periodic schedule S as the number of activations that occur in one iteration of S divided by the blocking factor1 of S. This quantity is denoted by N_act(S). For example, for fig. 13, N_act((2(2B)(5A))(5C)) = 5, and N_act((4(2B)(5A))(10C)) = 9/2 = 4.5. If for each actor, each invocation takes the same amount of time, and if we ignore the time spent on computation that is not directly associated with actor invocations (for example, schedule loops), then N_act(S) is directly proportional to the number of actor activations per unit time. For consistent acyclic SDF graphs, N_act clearly can be made arbitrarily small by increasing the blocking factor sufficiently; thus, as with the problem of constructing compact schedules, the extent to which the activation rate can be minimized is limited by the cyclic regions in the input SDF specification.
The technique developed in [20] attempts to find a valid single appearance schedule that minimizes N_act over all valid single appearance schedules. Note that minimizing the number of activations does not imply minimizing the number of appearances. As a simple example, consider the SDF graph in fig. 14. It can be verified that for this graph, the lowest value of N_act that is obtainable by a valid single appearance schedule is 0.75. However, valid schedules exist that are not single appearance schedules and that have values of N_act below 0.75; for example, a valid schedule that contains two appearances each of A and B attains N_act = 5/7, which is approximately 0.71.

Thus, since Ritz's vectorization approach focuses on single appearance schedules, the primary objective of the techniques in [20] is implicitly code size minimization. This is reasonable since, in practice, code size is often of critical concern. The overall objective in [20] is to construct a minimum activation implementation over all implementations that have minimum code size.
Ritz defines the relative vectorization degree of a simple cycle C (a cyclic path in the graph in which no proper sub-path is cyclic) in a consistent, connected SDF graph G by

N_G(C) = max({min({D(e') | e' in parallel(e)}) | e in edges(C)}),   (9)

where

D(e) = floor(delay(e) / (q(src(e)) * prd(e)))   (10)

is the delay on edge e normalized by the total number of tokens exchanged on e in a minimal schedule period of G, and

parallel(e) = {e' in edges(G) | src(e') = src(e) and snk(e') = snk(e)}

is the set of edges with the same source and sink as e. Here, edges(G) simply denotes the set of edges in the SDF graph G.

For example, if G denotes the SDF graph in fig. 13, and C denotes the cycle in G whose associated graph contains the actors A and B, then the normalized delay on the feedback edge gives N_G(C) = floor(10/20) = 0; and if G denotes the graph in fig. 14 and C denotes the cycle whose associated graph contains A and C, then N_G(C) = floor(7/1) = 7.
1Every periodic schedule invokes each actor A some multiple of q(A) times. This multiple, denoted by J, is called the blocking factor. A minimal periodic schedule is one that satisfies J = 1. For memory minimization, there is no penalty in restricting consideration to minimal schedules [11]. When attempting to minimize N_act, however, it is in general advantageous to consider J > 1.
Ritz et al. postulate that given a strongly connected SDF graph, a valid single appearance schedule that minimizes N_act can be constructed from a complete hierarchization, which is a cluster hierarchy such that only connected subgraphs are clustered, all cycles at a given level of the hierarchy have the same relative vectorization degree, and cycles in higher levels of the hierarchy have strictly higher relative vectorization degrees than cycles in lower levels. Fig. 15 depicts a complete hierarchization of an SDF graph. Fig. 15(a) shows the original SDF graph; here q(A, B, C, D) = (1, 2, 4, 8). Fig. 15(b) shows the top level of the cluster hierarchy. The hierarchical actor Omega1 represents the subgraph associated with {B, C, D}, and this subgraph is decomposed as shown in fig. 15(c), which gives the next level of the cluster hierarchy. Finally, fig. 15(d) shows that the subgraph associated with {C, D} corresponds to Omega2 and is the bottom level of the cluster hierarchy.

Now observe that the relative vectorization degree of the fundamental cycle in fig. 15(c) with respect to the original SDF graph is floor(16/8) = 2, while the relative vectorization degree of the fundamental cycle in fig. 15(b) is floor(12/2) = 6; and the relative vectorization degree of the fundamental cycle in fig. 15(d) is floor(12/8) = 1. We see that the relative vectorization degree decreases as we descend the hierarchy, and thus the hierarchization depicted in fig. 15 is complete. The hierarchization step defined by each of the SDF graphs in figs. 15(b)-(d) is called a component of the overall hierarchization.
Ritz's algorithm [20] constructs a complete hierarchization by first evaluating the relative vectorization degree of each fundamental cycle, determining the maximum vectorization degree, and then clustering the graphs associated with the fundamental cycles that do not achieve the maximum vectorization degree. This process is then repeated recursively on each of the clusters until no new clusters are produced. In general, this bottom-up construction process has unmanageable complexity. However, this normally does not create problems in practice, since the strongly connected components of useful signal processing systems are often small, particularly in large grain descriptions. Details on Ritz's technique for translating a complete hierarchization into a hierarchy of nested loops can be found in [20]. A general, optimal algorithm for vectorization of SSDF graphs based on the complete hierarchization concept discussed above is given in [20]. Joint minimization of vectorization and buffer memory cost is developed in [12], and adaptations of the retiming transformation to improve vectorization for SDF graphs are addressed in [46, 47].
2.2.5 Subroutine insertion

The techniques discussed above assume a fixed threading mode. In particular, they do not attempt to exploit the flexibility offered by hybrid threading. Sung, Kim, and Ha have developed an approach that employs hybrid threading to share code among different actors that have similar functionality [48]. For example, an application may contain several FIR filter blocks that differ only in the number of taps and the set of filter coefficients. These are called different instances of a parameterized FIR module in the actor library. Their approach decomposes the code associated with an actor instance into the actor context and actor reference code, and carefully weighs the benefit of each code sharing opportunity against the associated overhead. The overheads stem from the actor context component, which includes instance-specific state variables and buffer pointers. Code must be inserted to manage this context so that each invocation of the shared code block (the "reference code") is appropriately customized to the associated instance.
Also, the GASAS framework has been significantly extended to consider multiple appearance schedules, and to selectively apply hybrid threading to reduce the code size cost of highly irregular schedules, which cannot be accommodated by compact loop structures [49]. Such irregularity often arises when exploring the space of schedules whose buffer requirements are significantly lower than what is achievable by single appearance schedules [11]. The objective of this genetic-algorithm-based exploration of hybrid threading and loop scheduling is to efficiently compute Pareto fronts in the multidimensional design evaluation space of program memory cost, buffer requirement, and execution time overhead.
The intelligent use of hybrid threading and code sharing (subroutine insertion optimizations) can achieve lower code size costs than what is achievable with single appearance schedules that use conventional inlining. If an inlined single appearance schedule fits within the available on-chip memory, it is not worth incurring the overhead of subroutine insertion. However, if an inline implementation is too large to be held on-chip, then subroutine insertion optimizations can eliminate, or greatly reduce, the need for off-chip memory accesses. Since off-chip memory accesses involve significant execution time penalties and large power consumption costs, subroutine insertion enables embedded software developers to exploit an important part of the design space.
2.2.6 Summary

In this section we have reviewed a variety of algorithms for addressing optimization trade-offs during software synthesis. We have illustrated some of the analytical machinery used in SDF optimization algorithms by examining in some detail Ritz's algorithm for minimizing actor activations. Since CSDF, MDSDF, WBDF, and BDF are extensions of SDF, the techniques discussed in this section can also be applied in these more general models. In particular, they can be applied to any SDF subgraphs that are found. It is important to recognize this when developing or using a DSP design tool, since in DSP applications that are not fully amenable to SDF semantics, a significant subset of the functionality can usually be expressed in SDF. Thus the techniques discussed in this section remain useful even in DSP tools that employ more general dataflow semantics.

Beyond their application to SDF subsystems, however, the extension of most of the techniques developed in this section to more general dataflow models is a non-trivial matter. To achieve best results with these more general models, new synthesis approaches are required that take into account the distinguishing characteristics of the models. The most successful approaches will combine these new approaches for handling the full generality of the associated models with the techniques that exploit the structure of pure SDF subsystems.
3 Compilation of application programs to machine code

In this section, we will first outline the state of the art in the area of compilers for PDSPs. As indicated by several empirical studies, the major problem with current compilers is their inability to generate machine code of sufficient quality. Next, we will discuss a number of recently developed code generation and optimization techniques, which explicitly take into account DSP-specific architectures and requirements in order to improve code quality. Finally, we will mention key techniques developed for retargetable compilation.
3.1 State of the art

Today, the most widespread high-level programming language for PDSPs is ANSI C. Even though there are more DSP-specific languages, such as the dataflow language DFL [50], the popularity and high flexibility of C, as well as the large amount of existing "legacy code", have so far largely prevented the use of programming languages more suitable for DSP programming. C compilers are available for all important DSP families, such as the Texas Instruments TMS320xx, Motorola 56xxx, or Analog Devices 21xx. In most cases, the compilers are provided by the semiconductor vendors themselves.
Due to the large semantic gap between the C language and PDSP instruction sets, many of these compilers make extensions to the ANSI C standard by permitting the use of "compiler intrinsics", for instance in the form of compiler-known functions which are expanded like macros into specific assembly instructions. Intrinsics are used to manually guide the compiler in making the right decisions for generation of efficient code. However, such an ad-hoc approach has significant drawbacks. First, the source code deviates from the language standard and is no longer machine-independent. Thus, porting the software to another processor might be a very time-consuming task. Second, the programming abstraction level is lowered, and the efficient use of compiler intrinsics requires a deep knowledge of the internal PDSP architecture.
Unfortunately, machine-specific source code is today a must whenever the C language is used for programming PDSPs. The reason is the poor quality of code generated by compilers from plain ANSI C code. The overhead of compiler-generated code as compared to hand-written, heavily optimized assembly code has been quantified in the DSPStone benchmarking project [6]. In that project, both code size and performance of compiler-generated code have been evaluated for a number of DSP kernel routines and different PDSP architectures. The results showed that the compiler overhead typically ranges between 100 and 700 % (with the reference assembly code set to 0 % overhead). This is absolutely insufficient in the area of DSP, where real-time constraints as well as limitations on program memory size and power consumption demand an extremely high utilization of processor resources. Therefore, an overhead of compiler-generated code close or equal to zero is most desirable.
In another empirical study [51], DSP vendors have been asked to compile a set of C benchmark programs existing in two different versions, one being machine-independent and the other being tuned for the specific processor. Again, the results showed that using machine-independent code causes an unacceptable overhead in code quality in terms of code size and performance.
These results make the practical use of compilers for PDSP software development questionable. In the area of general purpose processors, such as RISCs, the compiler overhead typically does not exceed 100 %. Hence, even for DSP applications, using a RISC together with a good compiler may result in a more efficient implementation than using a PDSP (with potentially much higher performance) that wastes most of its time executing unnecessary instruction cycles due to a poor compiler. Similar arguments hold if code size or power consumption are of major concern.
As a consequence, the largest part of PDSP software is still written in assembly languages, which implies a lot of well-known drawbacks, such as high development costs, low portability, and high maintenance and debugging effort. This has been quantified in a study by Paulin [52], who found that for a certain set of DSP applications about 90 % of DSP code lines are written in assembly, while the use of C only accounts for 10 %.
As both DSP processors and DSP applications tend to become more and more complex, the lack of good C compilers implies a significant productivity bottleneck. About a decade ago, researchers started to analyze the reasons for the poor code quality of DSP compilers. A key observation was that classical code generation technology, mainly developed for RISC and CISC processor architectures, is hardly suitable for PDSPs, but that new DSP-specific code generation techniques were required. In the following, we will summarize a number of recent techniques. In order to put these techniques into context with each other, we will first give an overview of the main phases in compilation. Then, we will focus on techniques developed for particular problems in the different compilation phases.
3.2 Overview of the compilation process

The compilation of an application program into machine code, as illustrated in fig. 16, starts with several source code analysis phases.

Lexical analysis: The character strings denoting atomic elements of the source code (identifiers, keywords, operators, constants) are grouped into tokens, i.e. numerical identifiers, which are passed to the syntax analyzer. Lexical analysis is typically performed by a scanner, which is invoked by the syntax analyzer whenever a new token is required. Scanners can be automatically generated from a language specification with tools like "lex".
Syntax analysis: The structure of programming languages is mostly described by a context-free grammar, consisting of terminals (or tokens), nonterminals, and rules. The syntax analyzer, or parser, accepts tokens from the scanner until a matching grammar rule is detected. Each rule corresponds to a primitive element of the programming language, for instance an assignment. If a token sequence does not match any rule, a syntax error is emitted. The result of parsing a program is a syntax tree, which accounts for the structure of a given program. Parsers can be conveniently generated from grammar specifications with tools like "yacc".
Semantic analysis: During semantic analysis, a number of correctness tests are performed. For instance, all used identifiers must have been declared, and functions must be called with parameters in accordance with their interface specification. Failure of semantic analysis results in error messages. Additionally, a symbol table is built, which annotates each identifier with its type and purpose (e.g. type definition, global or local variable). Semantic analysis requires a traversal of the syntax tree. Frequently, semantic analysis is coupled with syntax analysis by means of attribute grammars. These grammars support the annotation of information like type or purpose to grammar symbols, and thus help to improve the modularity of analysis. Tools like "ox" [53] are available for automatic generation of combined syntax and semantic analyzers from grammar specifications.
The result of source code analysis is an intermediate representation (IR), which forms the basis for subsequent compilation phases. Both graph-based and statement-based IRs are in use. Graph-based IRs directly model the interdependencies between program operations, while statement-based IRs essentially consist of an assembly-like sequence of simple assignments (three-address code) and jumps.

In the next phase, several machine-independent optimizations are applied to the generated IR. A number of such IR optimizations have been developed in the area of compiler construction [54]. Important techniques include constant folding, common subexpression elimination, and loop-invariant code motion.
The techniques mentioned so far are largely machine-independent and may be used in any high-level language compiler. DSP-specific information comes into play only during the code generation phase, when the optimized IR is mapped to concrete machine instructions. Due to the specialized instruction sets of PDSPs, this is the most important phase with respect to code quality. For reasons of computational complexity, code generation is in turn subdivided into different phases. It is important to note that for PDSPs this phase structuring significantly differs from compilers for general purpose processors. For the latter, code generation is traditionally subdivided into the following phases.
Code selection: The selection of a minimum set of instructions for a given IR with respect to a cost metric like performance (execution cycles) or size (instruction words).

Register allocation: The mapping of variables and intermediate results to a limited set of available physical registers.

Instruction scheduling: The ordering of selected instructions in time, while minimizing the number of instructions required for temporarily moving register contents to memory (spill code) and minimizing execution delay due to instruction pipeline hazards.
Such a phase organization is not viable for PDSPs for several reasons. While general purpose processors often have a large, homogeneous register file, PDSPs tend to show a data path architecture with several distributed registers or register files of very limited capacity. An example has already been given in fig. 1. Therefore, classical register allocation techniques like [55] are not applicable; instead, register allocation has to be performed together with code selection in order to avoid large code quality overheads due to superfluous data moves between registers. Furthermore, instruction scheduling for PDSPs has to take into account the moderate degree of instruction-level parallelism (ILP) offered by such processors. In many cases, several mutually independent instructions may be grouped to be executed in parallel, thereby significantly increasing performance. This parallelization of instructions is frequently called code compaction. Another important area of code optimization for PDSPs concerns the memory accesses performed by a program. Both the exploitation of potentially available multiple memory banks and the efficient computation of memory addresses under certain restrictions imposed by the processor architecture have to be considered, which are hardly issues for general purpose processors. We will therefore discuss techniques using a different structure of code generation phases.
Sequential code generation: Even though PDSPs generally permit the execution of multiple instructions in parallel, it is often reasonable to temporarily consider a PDSP as a sequential machine, which executes instructions one by one. During sequential code generation, IR blocks (statement sequences) are mapped to sequential assembly code. These blocks are typically basic blocks, where control flow enters the block at its beginning and leaves the block at most once at its end with a jump. Sequential code generation aims at simultaneously minimizing the costs of instructions both for operations and for data moves between registers and memory, while neglecting ILP.
Memory access optimization: Generation of sequential code makes the order of memory accesses in a program known. This knowledge is exploited to optimize memory access bandwidth by partitioning the variables among multiple memory banks and to minimize the additional code needed for address computations.

Code compaction: This phase analyzes interdependencies between generated instructions and aims at exploiting potential parallelism between instructions under the resource constraints imposed by the processor architecture and the instruction format.
3.3 Sequential code generation

Basic blocks in the IR of a program are graphically represented by data flow graphs (DFGs). A DFG G = (V, E) is a directed acyclic graph, where the nodes in V represent operations (arithmetic, Boolean, shifts, etc.), memory accesses (loads and stores), and constants. The edge set E ⊆ V × V represents the data dependencies between DFG nodes. If an operation represented by a node v requires a value generated by an operation denoted by u, then (u, v) ∈ E. DFG nodes with more than one outgoing edge are called common subexpressions (CSEs). As an example, fig. 17 shows a piece of C source code, whose DFG representation (after detection of CSEs) is depicted in fig. 18.
Code generation for DFGs can be visualized as a process of covering a DFG by available instruction patterns. Let us consider a processor with instructions ADD, SUB, and MUL, to perform addition, subtraction, and multiplication, respectively. One of the operands is expected to reside in memory, while the other one has to be first loaded into a register by a LOAD instruction. Furthermore, writing back a result to memory requires a separate STORE instruction. Then, a valid covering of the example DFG is the one shown in fig. 19.
Available instruction patterns are usually annotated with a cost value reflecting their size or execution speed. The goal of code generation is to find a minimum cost covering of a given DFG by instruction patterns. The problem is that in general there exist numerous different alternative covers for a DFG. For instance, if the processor offers a MAC (multiply-accumulate) instruction, as found in most PDSPs, and the cost value of MAC is less than the sum of the costs of MUL and ADD, then it might be favorable to select that instruction (fig. 20).

However, using MAC for our example DFG would be less useful, because the multiply operation in this case is a CSE. Since the intermediate multiply result of a MAC is not stored anywhere, a potentially costly recomputation would be necessary.
3.3.1 Tree based code generation

Optimal code generation for DFGs is an exponential problem, even for very simple instruction sets [54]. A solution to this problem is to decompose a DFG into a set of data flow trees (DFTs) by cutting the DFG at its CSEs and inserting dedicated DFG nodes for communicating CSEs between the DFTs (fig. 21). This decomposition introduces scheduling precedences between the DFTs, since CSEs must be written before they are read (dashed arrows in fig. 21). For each of the DFTs, code can be generated separately and efficiently. Liem [57] has proposed a data structure for efficient tree pattern matching capable of handling complex operations like MAC.
For PDSPs, the allocation of special purpose registers during DFT covering is also extremely important, since merely covering the operators in a DFG by instruction patterns does not take into account the costs of instructions needed to move operands and results to their required locations. Wess [58] has proposed the use of trellis diagrams to also include data move costs during DFT covering.
Araujo and Malik [60] showed how the powerful standard technique of tree pattern matching with dynamic programming [56], widely used in compilers for general purpose processors, can be effectively applied also to PDSPs with irregular data paths. Tree pattern matching with dynamic programming solves the code generation problem by parsing a given DFT with respect to an instruction-set specification given as a tree grammar. Each rule in such a tree grammar is attributed with a cost value and corresponds to one instruction pattern. Optimal DFT covers are obtained by computing an optimal derivation of a given DFT according to the grammar rules. This requires only two passes (bottom-up and top-down) over the nodes of the input DFT, so that the runtime is linear in the number of DFT nodes. Code generators based on this paradigm can be automatically generated with tools like "twig" [56] and "iburg" [59].
The key idea in the approach by Araujo and Malik is the use of register-specific instruction patterns or grammar rules. Instead of separating detailed register allocation from code selection as in classical compiler construction, the instruction patterns contain implicit information on the mapping of operands and results to special purpose registers. In order to illustrate this, we consider an instruction subset of the TI TMS320C25 DSP already mentioned in section 1 (see also fig. 1). This PDSP offers two types of instructions for addition. The first one (ADD) adds a memory value to the accumulator register ACCU, while the second one (APAC) adds the value of the product register PR to ACCU. In compilers for general purpose processors, a distinction of storage components is made only between (general purpose) registers and memory. In a grammar model used for tree pattern matching with dynamic programming, the above two instructions would thus be modeled as follows:
reg: PLUS(reg,mem)
reg: PLUS(reg,reg)
The symbols "reg" and "mem" are grammar nonterminals, while "PLUS" is a grammar terminal symbol representing an addition. The semantics of such rules is that the corresponding instruction computes the expression on the right hand side and stores the result in a storage component represented by the left hand side. When parsing a DFT with respect to these patterns, it would be impossible to incorporate the costs of moving values to/from ACCU and PR; the detailed mapping of "reg" to physical registers would be left to a later code generation phase, possibly at the expense of code quality losses. However, when using register-specific patterns, the instructions ADD and APAC would be modeled as:
accu: PLUS(accu,mem)
accu: PLUS(accu,pr)
Using a separate nonterminal for each special purpose register permits modeling instructions for pure data moves, which in turn allows the code generator to simultaneously minimize the costs of such instructions. As an example, consider the TMS320C25 instruction PAC, which moves a value from PR to ACCU. In the tree grammar, the following rule (a so-called chain rule) for PAC would be included:

accu: pr

Since using the PAC rule for derivation of a DFT would incur additional costs, the code generator implicitly minimizes the data moves when constructing the optimal DFT derivation.
Generation of sequential assembly code also requires determining a total ordering of the selected instructions in time. DFGs and DFTs typically impose only a partial ordering, and the remaining scheduling freedom must be exploited carefully. This is due to the fact that special purpose registers generally have very limited storage capacity. On the TMS320C25, for instance, each register may hold only a single value, so that unfavorable scheduling decisions may require spilling and reloading register contents to/from memory, thereby introducing additional code. In order to illustrate the problem, consider a DFT T whose root node represents an addition, for which the above APAC instruction has been selected. Thus, the addition operands must reside in registers ACCU and PR, so that the left and right subtrees T_l and T_r of T must deliver their results in these registers. When generating sequential code for T, it must be decided whether T_l or T_r should be evaluated first. If some instruction in T_l writes its result to PR, then T_l should be evaluated first in order to avoid a spill instruction, because T_r writes its result to PR as well and this value is "live" until the APAC instruction for the root of T is emitted. Conversely, if some instruction for T_r writes register ACCU, then T_r should be scheduled first in order to avoid a register contention for ACCU. In [60], Araujo and Malik formalized this observation and provided a formal criterion for the existence of a spill-free schedule for a given DFT. This criterion refers to the structure of the instruction set and, for instance, holds for the TMS320C25. When using an appropriate scheduling algorithm, which immediately follows from that criterion, optimal spill-free sequential assembly code can be generated for any DFT.
3.3.2 Graph based code generation

Unfortunately, the DFT-based approach to code generation may affect code quality, because it performs only a local optimization of the code for a DFG within the scope of the single DFTs. Therefore, researchers have investigated techniques aiming at optimal or near-optimal code generation for full DFGs. Liao [61] has presented a branch-and-bound algorithm minimizing the number of spills in accumulator-based machines, i.e. processors where most computed values have to pass a dedicated accumulator register. In addition, his algorithm minimizes the number of instructions needed for switching between different computation modes. These modes (e.g. sign extension or product shift modes) are special control codes stored in dedicated mode registers in order to reduce the instruction word length. If the operations within a DFG have to be executed with different modes, the sequential schedule has a strong impact on the number of instructions for mode switching. Liao's algorithm simultaneously minimizes accumulator spills and mode switching instructions. However, due to the time-intensive optimization algorithm, optimality cannot be achieved for large basic blocks. The code generation technique in [62] additionally performs code selection for DFGs, but also requires high compilation times for large blocks.
A faster heuristic approach has been given in [63]. It also relies on the decomposition of DFGs into DFTs, but takes into account architectural information when cutting the CSEs in a DFG. In some cases, the machine instruction set itself enforces that CSEs have to pass through memory anyway, which again is a consequence of the irregular data paths of PDSPs. The proposed technique exploits this observation by assigning those CSEs to memory with highest priority, while others might be kept in a register, resulting in more efficient code.
Kolson et al. [64] have focused on the problem of code generation for irregular data paths in the context of program loops. While the above techniques deal well with special purpose registers in basic blocks, they do not take into account the data moves required between different iterations of a loop body. This may require the execution of a number of data moves between those registers holding the results at the end of one iteration and those registers where operands are expected at the beginning of the next iteration. Both an optimal and a heuristic algorithm have been proposed for minimizing the data moves between loop iterations.
3.4 Memory access optimization

During sequential code generation, memory accesses are usually treated only "symbolically", without particular reference to a certain memory bank or to concrete memory addresses. The detailed implementation of memory accesses is typically left to a separate code generation phase.
3.4.1 Memory bank partitioning

There exist several PDSP families having the memory organized in two different banks (typically called X and Y memory), which are accessible in parallel. Examples are the Motorola 56xxx and Analog Devices 21xx. Such an architecture makes it possible to simultaneously load two values from memory into registers and is therefore very important for DSP applications like digital filtering or FFT, which involve component-wise access to different data arrays. Exploiting this feature in a compiler means that symbolic memory accesses have to be partitioned into X and Y memory accesses in such a way that potential parallelism is maximized. Sudarsanam [65] has proposed a technique to perform this optimization. There is a strong mutual dependence between memory bank partitioning and register allocation, because values from a certain memory bank can only be loaded into certain registers. The proposed technique starts from symbolic sequential assembly code and uses a constraint graph model to represent these interdependencies. Memory bank partitioning and register allocation are performed simultaneously by labeling the constraint graph with valid assignments. Due to the use of simulated annealing, the optimization is rather time-intensive, but may result in significant code size improvements, as indicated by experimental data.
3.4.2 Memory layout optimization

As one cost metric, Sudarsanam's technique also captures the cost of instructions needed for address computations. For PDSPs, which typically show very restricted address generation capabilities, address computations are another important area of code optimization. Fig. 22 shows the architecture of an address generation unit (AGU) as it is frequently found in PDSPs.

Such an AGU operates in parallel to the central data path and contains a separate adder/subtractor for performing operations on address registers (ARs). ARs store the effective addresses for all indirect memory accesses, except for global variables, which are typically addressed in direct mode. Modify registers (MRs) are used to store frequently required address modify values. ARs and MRs are in turn addressed by AR and MR pointers. Since typical AR or MR file sizes are 4 or 8, these pointers are short indices of 2 or 3 bits, either stored in the instruction word itself or in special small registers.
There are different means for address computation, i.e., for changing the value of AGU registers.

AR load: Loading an AR with an immediate constant (from the instruction word).

MR load: Loading an MR with an immediate constant.

AR modify: Adding or subtracting an immediate constant to/from an AR.

Auto-increment and auto-decrement: Adding or subtracting the constant 1 to/from an AR.

Auto-modify: Adding or subtracting the contents of one MR to/from an AR.
While details like the size of the AR and MR files or the signedness of modify values may vary for different processors, the general AGU architecture from fig. 22 is actually found in a large number of PDSPs. It is important to note that performing address computations using the AGU in parallel to other instructions is generally only possible if the AGU does not use the instruction word as a resource. The wide immediate operand for AR and MR load and AR modify operations usually leaves no space to encode further instructions within the same instruction word, so that these types of AGU operations require a separate non-parallel instruction. On the other hand, those AGU operations not using the instruction word can mostly be executed in parallel to other instructions, since only internal AGU resources are occupied. We call these address computations zero-cost operations. In order to maximize code quality in terms of performance and size, it is obviously necessary to maximize the utilization of zero-cost operations.
A number of techniques have been developed which solve this problem for the scalar variables in a program. They exploit the fact that, once the sequence of variable accesses is known after sequential code generation, a good memory layout for the variables can still be determined. In order to illustrate this, suppose a program block containing accesses to the variables V = {a, b, c, d} is given, and the variable access sequence is

S = (b, d, a, c, d, a, c, b, a, d, a, c, d).

Furthermore, let the address space reserved for V be {0, 1, 2, 3}, and let one AR be available to compute the addresses according to the sequence S. Consider a memory layout where V is mapped to {0, 1, 2, 3} in lexicographic order (fig. 23 a). First, the AR needs to be loaded with the address 1 of the first element b of S. The next access takes place to d, which is mapped to address 3. Therefore, the AR must be modified with a value of +2. The next access refers to a, which requires subtracting 3 from the AR, and so forth. The complete AGU operation sequence for S is given in fig. 23 a). According to our cost metric, only 4 out of 13 AGU operations happen to be zero-cost operations (auto-increment or auto-decrement), so that a cost of 9 extra instructions for address computations is incurred. However, one can find a better memory layout for V (fig. 23 b), which leads to only 5 extra instructions, due to a better utilization of zero-cost operations. An even better addressing scheme is possible if a modify register MR is available. Since the address modifier 2 is required three times in the AGU operation sequence of fig. 23 b), one can assign the value 2 to the MR (one extra instruction) and reuse this value three times at zero cost (fig. 23 c), resulting in a total cost value of only 3.
How can such "low cost" memory layouts be constructed? A first approach has been proposed by Bartley [66] and has later been refined by Liao [67]. Both use an access graph to model the problem.

The nodes of the edge-weighted access graph G = (V, E, w) correspond to the variable set, while the edges represent transitions between variable pairs in the access sequence S. An edge e = {u, v} ∈ E is assigned an integer weight n if there are n transitions u → v or v → u in S. Fig. 24 shows the access graph for our example. Since any memory layout for V implies a linear order of V, and vice versa, any memory layout corresponds to a Hamiltonian path in G, i.e., a path touching each node exactly once. Informally, a "good" Hamiltonian path obviously should contain as many edges of high weight as possible, because including these edges in the path implies that the corresponding variable pairs will be adjacent in the memory layout, which in turn makes auto-increment/decrement addressing possible. In other words, a maximum weight Hamiltonian path in G has to be found in order to obtain an optimal memory layout, which unfortunately is an exponential problem.
While Bartley [66] first proposed the access graph model, Liao [67] provided an efficient heuristic algorithm to find maximum-weight paths in the access graph. Furthermore, Liao proposed a generalization of the algorithm for the case of an arbitrary number k of ARs. By partitioning the variable set V into k groups, the k-AR problem is reduced to k different 1-AR problems, each being solvable by the original algorithm.
Triggered by this work, a number of improvements and generalizations have been found. Leupers [68] improved the heuristic for the 1-AR case and proposed a more effective partitioning for the k-AR problem. Furthermore, he provided a first algorithm for the exploitation of MRs to reduce addressing costs. Wess' algorithm [69] constructs memory layouts for AGUs with an auto-increment range of 2 instead of 1, while in [70] a generalization for an arbitrary integer auto-increment range was presented. The genetic-algorithm-based optimization given in [71] generalizes these techniques for arbitrary register file sizes and auto-increment ranges while also incorporating MRs into memory layout construction.
3.5 Code compaction
Code compaction is typically executed as the last phase in code generation. At this point, all instructions required to implement a given application program have been generated, and the goal of code compaction is to schedule the generated sequential code into a minimum number of parallel machine instructions, or control steps, under the constraints imposed by the PDSP architecture and instruction set. Thus, code compaction is a variant of the resource-constrained scheduling problem. Input to the code compaction phase is usually a dependency graph G = (V, E), whose nodes represent the instructions selected for a basic block, while edges denote scheduling precedences. There are three types of such precedences:
Data dependencies: Two instructions i1 and i2 are data dependent if i1 generates a value read by i2. Thus, i1 must be scheduled before i2.

Anti dependencies: Two instructions i1 and i2 are anti dependent if i2 potentially overwrites a value still needed by i1. Thus, i2 must not be scheduled before i1.

Output dependencies: Two instructions i1 and i2 are output dependent if i1 and i2 write their results to the same location (register or memory cell). Thus, i1 and i2 must be scheduled in different control steps.
Additionally, incompatibility constraints between instruction pairs (i1, i2) have to be obeyed. These constraints arise either from processor resource limitations (e.g., only one multiplier available) or from the instruction format, which may prevent the parallel scheduling of instructions even without a resource conflict. In either case, if i1 and i2 are incompatible, then i1 and i2 must be scheduled in different control steps.
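As a sketch of how a compaction algorithm can test these conditions, the following C fragment (a hypothetical data structure of ours, with read/write/resource sets encoded as bitmasks) checks whether two generated instructions may share one control step. It is deliberately conservative: any dependence forbids pairing, although some architectures would allow anti-dependent instructions in the same step.

```c
#include <stdbool.h>

/* Hypothetical instruction record: bitmasks over storage locations
   (registers/memory cells) and over exclusive resources or
   instruction-format fields. */
typedef struct { unsigned reads, writes, resources; } Instr;

/* data dependence: i1 generates a value read by i2 */
static bool data_dep(Instr i1, Instr i2)   { return (i1.writes & i2.reads)  != 0; }
/* anti dependence: i2 overwrites a value still needed by i1 */
static bool anti_dep(Instr i1, Instr i2)   { return (i1.reads  & i2.writes) != 0; }
/* output dependence: both write the same location */
static bool output_dep(Instr i1, Instr i2) { return (i1.writes & i2.writes) != 0; }

/* May i1 and i2 be placed into the same control step?  Conservative:
   any dependence or resource/format incompatibility forbids pairing. */
bool compactable(Instr i1, Instr i2)
{
    return !data_dep(i1, i2) && !data_dep(i2, i1) &&
           !anti_dep(i1, i2) && !anti_dep(i2, i1) &&
           !output_dep(i1, i2) &&
           (i1.resources & i2.resources) == 0;  /* e.g. only one multiplier */
}
```

With PR as bit 0 and ACCU as bit 1, an MPY (writes PR) and an accumulate (reads PR) are correctly rejected as data dependent, while instructions on disjoint locations pair freely.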
The code compaction problem has already been studied in the early eighties within the context of very long instruction word (VLIW) processors, which show a large degree of parallelism at the instruction level. A number of different compaction heuristics have been developed for VLIW machines [73]. However, even though PDSPs resemble VLIW machines to a certain extent, VLIW compaction techniques are not directly applicable to PDSPs. The reason is that instruction-level parallelism (ILP) is typically much more constrained in PDSPs than in VLIWs, because using very long instruction words for PDSPs would lead to extremely high code sizes. Furthermore, PDSP instruction sets frequently show alternative opcodes to perform a certain machine instruction.
As an example, consider the TI TMS320C25 instruction set. This PDSP offers instructions ADD and MPY to perform addition and multiplication. However, there is also a multiply-accumulate instruction MPYA, which performs both operations in parallel and thus faster. Instruction MPYA may be considered as an alternative opcode both for ADD and for MPY, but its use is strongly context dependent. MPYA may be used only if an addition and a multiplication can be scheduled in parallel for a given dependency graph. Otherwise, using MPYA instead of either ADD or MPY could lead to incorrect program behavior after compaction, because MPYA overwrites two registers (PR and ACCU), thus potentially causing undesired side effects.
In addition, code running on PDSPs in most cases has to meet real-time constraints, which cannot be guaranteed by heuristics. Due to these special circumstances, DSP-specific code compaction techniques have been developed. In Timmer's approach [74], both resource and timing constraints are considered during code compaction. A bipartite graph is used to model possible assignments of instructions to control steps. An important feature of Timmer's technique is that timing constraints are exploited in order to quickly find exact solutions for compaction problem instances. The mobility of an instruction is the interval of control steps to which the instruction may be assigned. Trivial bounds on mobility can be obtained by performing an ASAP/ALAP analysis on the dependency graph, which determines the earliest and the latest control step in which an instruction may be scheduled without violating dependencies. An additional execution interval analysis, based on both timing and resource constraints, is performed to further restrict the mobility of instructions. The remaining mobility on average is low, and a schedule meeting all constraints can be determined quickly by a branch-and-bound search.
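The ASAP/ALAP bounds mentioned above are straightforward to compute. The sketch below is our own illustration (unit latency, nodes assumed to be numbered in topological order): a forward pass yields the earliest step of each instruction, a backward pass the latest, and the difference is its mobility.

```c
#define N 4  /* number of instructions in the basic block */

/* dep[i][j] != 0 means instruction i must precede instruction j.
   asap[i]/alap[i] receive the earliest/latest control step of i for a
   schedule of L control steps, assuming unit latency.  The mobility of
   instruction i is then the interval [asap[i], alap[i]]. */
void asap_alap(int dep[N][N], int L, int asap[N], int alap[N])
{
    for (int i = 0; i < N; i++) asap[i] = 0;
    for (int i = 0; i < N; i++)          /* forward pass (topological order) */
        for (int j = 0; j < N; j++)
            if (dep[i][j] && asap[i] + 1 > asap[j])
                asap[j] = asap[i] + 1;

    for (int i = 0; i < N; i++) alap[i] = L - 1;
    for (int i = N - 1; i >= 0; i--)     /* backward pass */
        for (int j = 0; j < N; j++)
            if (dep[i][j] && alap[j] - 1 < alap[i])
                alap[i] = alap[j] - 1;
}
```

For a dependence chain 0 → 1 → 2 plus an independent instruction 3 and L = 3, this yields asap = (0, 1, 2, 0) and alap = (0, 1, 2, 2): the chain has zero mobility, while instruction 3 may be placed in any of the three control steps.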
Another DSP-specific code compaction technique was presented in [75], which also exploits the existence of alternative instruction opcodes. The code compaction problem is transformed into an Integer Linear Programming problem. In this formulation, a set of integer solution variables accounts for the detailed scheduling of instructions, while all precedences and constraints are modeled as linear equations and inequalities on the solution variables. The Integer Linear Program is then solved optimally using a standard solver, such as "lp_solve" [76]. Since Integer Linear Programming is an exponential problem, the applicability of this technique is restricted to small to moderate size basic blocks, which however is sufficient in most practical cases.
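Such a formulation can be sketched in simplified form (this is our condensed illustration; it omits the alternative-opcode variables and register constraints of the actual paper). With binary variables $x_{i,t} = 1$ iff instruction $i$ is assigned to control step $t \in \{1,\dots,T\}$:

```latex
\begin{align*}
\min\; & T_{\max} \\
\text{s.t.}\;
  & \sum_{t=1}^{T} x_{i,t} = 1
      && \text{every instruction gets exactly one control step,} \\
  & \sum_{t=1}^{T} t\, x_{j,t} \;\ge\; \sum_{t=1}^{T} t\, x_{i,t} + 1
      && \text{for each data dependence } i \to j, \\
  & x_{i,t} + x_{j,t} \;\le\; 1
      && \text{for each incompatible pair } (i,j) \text{ and each } t, \\
  & \sum_{t=1}^{T} t\, x_{i,t} \;\le\; T_{\max}
      && \text{for every instruction } i.
\end{align*}
```

Anti and output dependences add analogous linear constraints on the same variables.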
In order to illustrate the impact of code compaction on code quality, as well as its cooperation with other code generation phases, we use a small C program for complex number multiplication as an example.
int ar,ai,br,bi,cr,ci;
cr = ar * br - ai * bi ;
ci = ar * bi + ai * br ;
For the TI TMS320C25, the sequential assembly code, as generated by the techniques mentioned in section 3.3, would be the following.
LT ar // TR = ar
MPY br // PR = TR * br
PAC // ACCU = PR
LT ai // TR = ai
MPY bi // PR = TR * bi
SPAC // ACCU = ACCU - PR
SACL cr // cr = ACCU
LT ar // TR = ar
MPY bi // PR = TR * bi
PAC // ACCU = PR
LT ai // TR = ai
MPY br // PR = TR * br
APAC // ACCU = ACCU + PR
SACL ci // ci = ACCU
This sequential code shows the following (symbolic) variable access sequence:

S = (ar, br, ai, bi, cr, ar, bi, ai, br, ci)

Suppose one address register AR is available for computing the memory addresses according to S. Then, the memory layout optimization mentioned in section 3.4.2 would compute the following address mapping of the variables to the address space {0, ..., 5}:

0: ci
1: br
2: ai
3: bi
4: cr
5: ar
We can now insert the corresponding AGU operations into the sequential code and invoke code compaction. The resulting parallel assembly code makes use of parallelism both within the data path itself and with respect to parallel AGU operations (auto-increment and decrement).
LARK 5 // load AR with &ar
LT * // TR = ar
SBRK 4 // AR -= 4 (&br)
MPY *+ // PR = TR * br, AR++ (&ai)
LTP *+ // TR = ai, ACCU = PR, AR++ (&bi)
MPY *+ // PR = TR * bi, AR++ (&cr)
SPAC // ACCU = ACCU - PR
SACL *+ // cr = ACCU, AR++ (&ar)
LT * // TR = ar
SBRK 2 // AR -= 2
MPY *- // PR = TR * bi, AR-- (&ai)
LTP *- // TR = ai, ACCU = PR, AR-- (&br)
MPY *- // PR = TR * br, AR-- (&ci)
APAC // ACCU = ACCU + PR
SACL * // ci = ACCU
Even though address computations for the variables have been inserted, the resulting code is only one instruction larger than the original symbolic sequential code. This is achieved by a high utilization of zero-cost address computations (only two extra SBRK instructions) as well as by the parallel LTP instructions, which perform two data moves in parallel. This would not have been possible without memory layout optimization and code compaction.
3.6 Phase coupling
Even though code compaction is a powerful code optimization technique, only the direct coupling of sequential and parallel code generation phases can yield globally optimal results. Phase-coupled techniques frequently have to resort to heuristics due to extremely large search spaces. However, heuristics for phase-coupled code generation still may outperform exact techniques solving only parts of the code generation problem. In this section we therefore summarize important approaches to phase-coupled code generation for PDSPs.
Early work [77, 78] combined instruction scheduling with a data routing phase. In each step of scheduling, data routing performs detailed register allocation based on resource availability and in accordance with the partial schedule constructed so far. In this way, the scheduling freedom (mobility) of instructions is not obstructed by unfavorable register allocation decisions made earlier during code generation. However, significant effort has to be spent on the avoidance of scheduling deadlocks, which restricts the applicability of such techniques to simple PDSP architectures.
Wilson's approach to phase-coupled code generation [79] is also based on Integer Linear Programming. In his formulation, the complete search space, including register allocation, code selection, and code compaction, is explored at once. While this approach permits the generation of provably optimal code for basic blocks, the high problem complexity also imposes heavy restrictions on applicability for realistic programs and PDSPs.
An alternative Integer Linear Programming formulation has been given in [80]. By better taking into account the detailed processor architecture, optimal code could be generated for small size examples for the TI TMS320C25 DSP.
A more practical phase coupling technique is Mutation Scheduling [81]. During instruction scheduling, a set of mutations is maintained for each program value. Each mutation represents an alternative implementation of the value computation. For instance, mutations for a common subexpression (CSE) in a DFG may include storing the CSE in some special purpose register or recomputing it multiple times. For other values, mutations are generated by the application of algebraic rules like commutativity or associativity. In each scheduling step, the best mutation for each value to be scheduled is chosen. While Mutation Scheduling represents an "ideal" approach to phase coupling, its efficacy critically depends on the scheduling algorithm used as well as on the number of mutations considered for each value.
A constraint-driven approach to phase-coupled code generation for PDSPs is presented in [82]. In that approach, alternatives with respect to code selection, register allocation, and scheduling are retained as long as possible during code generation. Restrictions imposed by the processor architecture are explicitly modeled in the form of constraints, which ensure correctness of the generated code. The implementation makes use of a constraint logic programming environment. For several examples it has been demonstrated that the quality of the generated code is equal to that of hand-written assembly code.
3.7 Retargetable compilation
As systems based on PDSPs mostly have to be very cost-efficient, a comparatively large number of different standard ("off-the-shelf") PDSPs are available on the semiconductor market at the same time. From this variety, a PDSP user may select the processor architecture which matches his requirements at minimum cost. In spite of the large variety of standard DSPs, however, it is still unlikely that a customer will find a processor ideally matching one given application. In particular, using standard processors in the form of cores (layout macro cells) for systems-on-a-chip may lead to a waste of silicon area. For mobile applications, the electrical power consumed by a standard processor may also be too high.

As a consequence, there is a trend towards the use of a new class of PDSPs, called application specific signal processors (ASSPs). The architecture of such ASSPs is still programmable, but is customized for restricted application areas. A well-known example is the EPICS architecture [83]. A number of further ASSPs are mentioned in [52].
The increasing use of ASSPs for implementing embedded DSP systems leads to an even larger variety of PDSPs. While the code optimization techniques mentioned in the previous sections help to improve the practical applicability of compilers for DSP software development, they do not answer the question: who will write compilers for all these different PDSP architectures? Developing a compiler for each new ASSP, possibly having a low production volume and product lifetime, is not economically feasible. Nevertheless, the use of compilers for ASSPs instead of assembly programming is still highly desirable.
Therefore, researchers have looked at technology for developing retargetable compilers. Such compilers are not restricted to generating code for a single target processor, but are sufficiently flexible to be reused for a whole class of PDSPs. More specifically, we call a compiler retargetable if adapting the compiler to a new target processor does not involve rewriting a large part of the compiler source code. This can be achieved by using external processor models. While in a classical, target-specific compiler the processor model is hard-coded in the compiler source code, a retargetable compiler can read an external processor model as an additional input specified by the user and generate code for the target processor specified by the model.
3.7.1 The RECORD compiler system
An example of a retargetable compiler for PDSPs is the RECORD system [84], a coarse overview of which is given in fig. 25. In RECORD, processor models are given in the hardware description language (HDL) MIMOLA, which resembles structural VHDL. A MIMOLA processor model captures the register transfer level structure of a PDSP, including controller, data path, and address generation units. Alternatively, the pure instruction set can be described, while hiding the internal structure. Using HDL models is a natural way of describing processor hardware, with a large amount of modeling flexibility. Furthermore, the use of HDL models reduces the number of different processor models required during the design process, since HDL models can also be used for hardware synthesis and simulation.
Sequential code generation in RECORD is based on the data flow tree (DFT) model explained in section 3.3.1. The source program, given in the programming language DFL, is first transformed into an intermediate representation consisting of DFTs. The code generator is automatically generated from the HDL processor model by means of the iburg tool [59]. Since iburg requires a tree grammar model of the target instruction set, some preprocessing of the HDL model is necessary. RECORD uses an instruction set extraction phase to transform the structural HDL model into an internal model of the machine instruction set. This internal model captures the behavior of available machine instructions as well as the constraints on instruction-level parallelism.
During sequential code generation, the code generator generated by means of iburg is used to map DFTs into target-specific machine code. While mapping, RECORD exploits algebraic rules like commutativity and associativity of operators to increase code quality. The resulting sequential assembly code is further optimized by means of memory access optimization (section 3.4) and code compaction (section 3.5). An experimental evaluation for the TI TMS320C25 DSP showed that, thanks to these optimizations, RECORD on average generates significantly denser code than a commercial target-specific compiler, however at the expense of lower compilation speed. Furthermore, RECORD is easily retargetable to different processor architectures. If an HDL model is available, then generation of the processor-specific compiler components typically takes less than one workstation CPU minute. This short turnaround time permits the use of a retargetable compiler also for quickly exploring different architectural options for an ASSP, e.g., with respect to the number of functional units, register file sizes, or interconnect structure.
3.7.2 Further retargetable compilers
A widespread example of a retargetable compiler is the GNU compiler "gcc" [85]. Since gcc has been mainly designed for CISC and RISC processor architectures, it is based on the assumption of regular processor architectures and thus is hardly applicable to PDSPs.
29
The MSSQ compiler [86] has been an early approach to retargetable compilation based on HDL models, however without specific optimizations for PDSPs.
In the CodeSyn compiler [57], specifically designed for ASSPs, the target processor is described heterogeneously by the set of available instruction patterns, a graph model representing the data path, and a resource classification that accounts for special purpose registers.
The CHESS compiler [87] uses a specific language called nML for describing target processor architectures. It generates code for a specific ASSP architectural style and therefore employs special code generation and optimization techniques [88]. The nML language has also been used in a retargetable compiler project at Cadence [89].
Several code optimizations mentioned in this paper [61, 62, 60, 63] have been implemented in the SPAM compiler at Princeton University and MIT. Although SPAM can be classified as a retargetable compiler, it is based on exchangeable software modules performing specific optimizations rather than on an external target processor model.
Another approach to retargetable code generation for PDSPs is the AVIV compiler [90], which uses a special language (ISDL [91]) for modeling VLIW-like processor architectures.
As compilers for standard DSPs and ASSPs become more important and retargetable compiler technology gets more mature, several companies have started to sell commercial retargetable compilers with special emphasis on PDSPs. Examples are the CoSy compiler development system by ACE, the commercial version of the CHESS compiler, as well as Archelon's retargetable compiler system. Detailed information about these recent software products is available on the World Wide Web [92, 93, 94].
4 Conclusions
This paper has reviewed the state of the art in front- and back-end design automation technology for DSP software implementation. We have motivated a design flow that begins with a high-level, hierarchical block diagram specification; synthesizes a C-language application program or subsystem from this specification; and then compiles the C program into optimized machine code for the given target processor. We have reviewed several useful computational models that provide efficient semantics for the block diagram specifications at the front end of this design flow. We then examined the vast space of implementation trade-offs one encounters when synthesizing software from these computational models, in particular from the closely-related synchronous dataflow (SDF) and scalable synchronous dataflow (SSDF) models, which can be viewed as key "common denominators" of the other models. Subsequently, we examined a variety of useful software synthesis techniques that address important subsets and prioritizations of the relevant optimization metrics.
Complementary to software synthesis issues, we have outlined the state of the art in the compilation of efficient machine code from application source programs. Taking the step from assembly-level to C-level programming of DSPs demands special code generation techniques beyond the scope of classical compiler technology. In particular, this concerns code generation, memory access optimization, and exploitation of instruction-level parallelism. Recently, the problem of tightly coupling these different compilation phases in order to generate very efficient code has also gained significant research interest. In addition, we have motivated the use of retargetable compilers, which are important for programming application-specific DSPs.
In our overview, we have highlighted useful directions for further study. A particularly interesting and promising direction, which remains largely unexplored, is the investigation of the interaction between software synthesis and code generation: that is, the development of synthesis techniques that explicitly aid the code generation process, and code generation techniques that incorporate high-level application structure that is exposed during synthesis.
Figure 1: Simplified architecture of the Texas Instruments TMS320C25 DSP.
Figure 2: The top-level block diagram specification of a discrete wavelet transform application implemented in Ptolemy [7].
Figure 3: An illustration of an explicit SDF specification.
Figure 4: A deadlocked SDF graph.
Figure 5: CSDF and SDF versions of a downsampler block.
Figure 6: An example that illustrates the compact modeling of resource sharing using CSDF. The actor labeled F denotes a dataflow fork, which simply replicates its input tokens on all of its output edges. The lower portion of the figure gives a valid schedule for this CSDF specification. Here, G1 and G2 denote the first and second phases of the CSDF actor G.
Figure 7: The SDF version of the specification in fig. 6; it contains a delay-free SDF cycle and is therefore deadlocked.
Figure 8: An example that illustrates the utility of cyclo-static dataflow in constructing hierarchical specifications. Grouping the actors A and B into the hierarchical SDF actor Ω, as shown in (b), results in a deadlocked SDF graph. In contrast, an appropriate CSDF model of the hierarchical grouping, illustrated in (c), avoids deadlock. The two phases of the hierarchical CSDF actor Ω in (c) are specified in the lower right corner of the figure along with a valid schedule for the CSDF specification.
Figure 9: An example of the use of CSDF to decrease buffering requirements.
Figure 10: An example of efficient dead code elimination using CSDF.
Figure 11: An example of an MDSDF actor (an image expander from 512x512 to 1024x1024).
Figure 12: A simple example that we use to illustrate trade-offs involved in compiling SDF specifications.
Figure 13: An example that we use to illustrate the buffer memory metric.
Figure 14: This example illustrates that minimizing actor activations does not imply minimizing actor appearances.
Figure 15: An illustration of a complete hierarchization.
Figure 16: Compilation phases (source code analyses, machine-independent IR optimizations, sequential code generation, memory access optimization, and code compaction).
int a,b,c,d,x,y,z;
void f()
{
  x = a + b;
  y = a + b - c * d;
  z = c * d;
}

Figure 17: Example C source code
Figure 18: DFG representation of the code from fig. 17
Figure 19: DFG from fig. 18 covered by instruction patterns (LOAD, STORE, ADD, SUB, MUL)
Figure 20: Using MAC for DFG covering
Figure 21: Decomposition of a DFG into DFTs
Figure 22: Address generation unit (address register file, modify register file, AR/MR pointers, and a +/- unit producing the effective address)
Access sequence: b, d, a, c, d, a, c, b, a, d, a, c, d

a) layout (0: a, 1: b, 2: c, 3: d), cost 9:
LOAD AR,1; AR += 2; AR -= 3; AR += 2; AR ++; AR -= 3; AR += 2; AR --; AR --; AR += 3; AR -= 3; AR += 2; AR ++

b) layout (0: c, 1: a, 2: d, 3: b), cost 5:
LOAD AR,3; AR --; AR --; AR --; AR += 2; AR --; AR --; AR += 3; AR -= 2; AR ++; AR --; AR --; AR += 2

c) layout (0: c, 1: a, 2: d, 3: b) with modify register, cost 3:
LOAD AR,3; AR --; AR --; AR --; LOAD MR,2; AR += MR; AR --; AR --; AR += 3; AR -= MR; AR ++; AR --; AR --; AR += MR

Figure 23: Alternative memory layouts and AGU operation sequences
Figure 24: Access graph model and maximum weighted path. (Nodes a, b, c, d; edge weights a-d: 4, a-c: 3, c-d: 2, a-b: 1, b-c: 1, b-d: 1. The maximum weighted path covers the edges of weight 4, 3, and 1.)
Figure 25: Coarse architecture of the RECORD system (MIMOLA HDL processor model, instruction set extraction, code generator generation with iburg, mapping of the DFL source program to DFTs, sequential code generation, memory access optimization and code compaction, yielding parallel assembly code)
References

[1] The Design and Implementation of Signal Processing Systems Technical Committee. VLSI design and implementation fuels the signal processing revolution. IEEE Signal Processing Magazine, 15(1):22-37, January 1998.

[2] P. Lapsley, J. Bier, A. Shoham, and E. A. Lee. DSP Processor Fundamentals. Berkeley Design Technology, Inc., 1994.

[3] E. A. Lee. Programmable DSP architectures - Part I. IEEE ASSP Magazine, 5(4), October 1988.

[4] E. A. Lee. Programmable DSP architectures - Part II. IEEE ASSP Magazine, 6(1), January 1989.

[5] P. Marwedel and G. Goossens, editors. Code Generation for Embedded Processors. Kluwer Academic Publishers, 1995.

[6] V. Zivojnovic, H. Schraut, M. Willems, and H. Meyr. DSPs, GPPs, and multimedia applications - an evaluation using DSPstone. In Proceedings of the International Conference on Signal Processing Applications and Technology, November 1995.

[7] J. T. Buck, S. Ha, E. A. Lee, and D. G. Messerschmitt. Ptolemy: A framework for simulating and prototyping heterogeneous systems. International Journal of Computer Simulation, January 1994.

[8] P. P. Vaidyanathan. Multirate Systems and Filter Banks. Prentice Hall, 1993.

[9] E. A. Lee and D. G. Messerschmitt. Synchronous dataflow. Proceedings of the IEEE, 75(9):1235-1245, September 1987.

[10] E. A. Lee. Consistency in dataflow graphs. IEEE Transactions on Parallel and Distributed Systems, 2(2), April 1991.

[11] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee. Software Synthesis from Dataflow Graphs. Kluwer Academic Publishers, 1996.

[12] S. Ritz, M. Willems, and H. Meyr. Scheduling for optimum data memory compaction in block diagram oriented software synthesis. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, May 1995.

[13] E. A. Lee and D. G. Messerschmitt. Static scheduling of synchronous dataflow programs for digital signal processing. IEEE Transactions on Computers, February 1987.

[14] E. A. Lee, W. H. Ho, E. Goei, J. Bier, and S. S. Bhattacharyya. Gabriel: A design environment for DSP. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(11), November 1989.

[15] D. R. O'Hallaron. The ASSIGN parallel program generator. Technical report, School of Computer Science, Carnegie Mellon University, May 1991.

[16] G. Bilsen, M. Engels, R. Lauwereins, and J. A. Peperstraete. Cyclo-static dataflow. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pages 3255-3258, May 1995.

[17] G. Bilsen, M. Engels, R. Lauwereins, and J. A. Peperstraete. Cyclo-static dataflow. IEEE Transactions on Signal Processing, 44(2):397-408, February 1996.

[18] G. De Micheli. Synthesis and Optimization of Digital Circuits. McGraw-Hill, 1994.

[19] T. M. Parks, J. L. Pino, and E. A. Lee. A comparison of synchronous and cyclo-static dataflow. In Proceedings of the IEEE Asilomar Conference on Signals, Systems, and Computers, November 1995.
[20] S. Ritz, M. Pankert, and H. Meyr. Optimum vectorization of scalable synchronous dataflow graphs. In Proceedings of the International Conference on Application Specific Array Processors, October 1993.

[21] S. Ritz, M. Pankert, and H. Meyr. High level software synthesis for signal processing systems. In Proceedings of the International Conference on Application Specific Array Processors, August 1992.

[22] E. A. Lee. Representing and exploiting data parallelism using multidimensional dataflow diagrams. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pages 453-456, April 1993.

[23] P. K. Murthy and E. A. Lee. An extension of multidimensional synchronous dataflow to handle arbitrary sampling lattices. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pages 3306-3309, May 1996.

[24] G. R. Gao, R. Govindarajan, and P. Panangaden. Well-behaved programs for DSP computation. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, March 1992.

[25] J. T. Buck and E. A. Lee. Scheduling dynamic dataflow graphs using the token flow model. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, April 1993.

[26] J. T. Buck. Scheduling Dynamic Dataflow Graphs with Bounded Memory using the Token Flow Model. PhD thesis, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, September 1993.

[27] J. T. Buck. Static scheduling and code generation from dynamic dataflow graphs with integer-valued control streams. In Proceedings of the IEEE Asilomar Conference on Signals, Systems, and Computers, October 1994.

[28] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee. Optimal parenthesization of lexical orderings for DSP block diagrams. In Proceedings of the International Workshop on VLSI Signal Processing, IEEE Press, October 1995. Sakai, Osaka, Japan.

[29] M. Ade, R. Lauwereins, and J. A. Peperstraete. Buffer memory requirements in DSP applications. In Proceedings of the IEEE Workshop on Rapid System Prototyping, pages 198-123, June 1994.

[30] M. Ade, R. Lauwereins, and J. A. Peperstraete. Data memory minimisation for synchronous dataflow graphs emulated on DSP-FPGA targets. In Proceedings of the Design Automation Conference, pages 64-69, June 1994.

[31] M. Cubric and P. Panangaden. Minimal memory schedules for dataflow networks. In CONCUR '93, August 1993.

[32] R. Govindarajan, G. R. Gao, and P. Desai. Minimizing memory requirements in rate-optimal schedules. In Proceedings of the International Conference on Application Specific Array Processors, August 1994.

[33] S. How. Code generation for multirate DSP systems in Gabriel. Master's thesis, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, May 1990.

[34] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee. Synthesis of embedded software from synchronous dataflow specifications. Journal of VLSI Signal Processing Systems, 21(2):151-166, June 1999.

[35] S. S. Bhattacharyya, J. T. Buck, S. Ha, and E. A. Lee. A scheduling framework for minimizing memory requirements of multirate DSP systems represented as dataflow graphs. In Proceedings of the International Workshop on VLSI Signal Processing, October 1993. Veldhoven, The Netherlands.

[36] S. S. Bhattacharyya, J. T. Buck, S. Ha, and E. A. Lee. Generating compact code from dataflow specifications of multirate signal processing algorithms. IEEE Transactions on Circuits and Systems - I: Fundamental Theory and Applications, 42(3):138-150, March 1995.
[37] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee. APGAN and RPMC: Complementary heuristics for translating DSP block diagrams into efficient software implementations. Journal of Design Automation for Embedded Systems, January 1997.
[38] P. K. Murthy, S. S. Bhattacharyya, and E. A. Lee. Joint minimization of code and data for synchronous dataflow programs. Journal of Formal Methods in System Design, 11(1):41–70, July 1997.
[39] J. L. Pino, S. S. Bhattacharyya, and E. A. Lee. A hierarchical multiprocessor scheduling system for DSP applications. In Proceedings of the IEEE Asilomar Conference on Signals, Systems, and Computers, November 1995.
[40] P. K. Murthy and S. S. Bhattacharyya. Shared memory implementations of synchronous dataflow specifications using lifetime analysis techniques. Technical Report UMIACS-TR-99-32, Institute for Advanced Computer Studies, University of Maryland at College Park, June 1999.
[41] P. K. Murthy and S. S. Bhattacharyya. A buffer merging technique for reducing memory requirements of synchronous dataflow specifications. In Proceedings of the International Symposium on Systems Synthesis, 1999. San Jose, California, to appear.
[42] E. Zitzler, J. Teich, and S. S. Bhattacharyya. Optimized software synthesis for DSP using randomization techniques. Technical report, Computer Engineering and Communication Networks Laboratory, Swiss Federal Institute of Technology, Zurich, July 1999. Revised version of teic1998x1.
[43] J. Teich, E. Zitzler, and S. S. Bhattacharyya. Optimized software synthesis for digital signal processing algorithms – an evolutionary approach. In Proceedings of the IEEE Workshop on Signal Processing Systems, October 1998. Boston, Massachusetts.
[44] E. Zitzler, J. Teich, and S. S. Bhattacharyya. Evolutionary algorithms for the synthesis of embedded software. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 1999. Accepted for publication; to appear.
[45] T. Back, U. Hammel, and H.-P. Schwefel. Evolutionary computation: Comments on the history and current state. IEEE Transactions on Evolutionary Computation, 1(1):3–17, 1997.
[46] V. Zivojnovic, S. Ritz, and H. Meyr. Multirate retiming: A powerful tool for hardware/software codesign. Technical report, Aachen University of Technology, 1993.
[47] V. Zivojnovic, S. Ritz, and H. Meyr. Retiming of DSP programs for optimum vectorization. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, April 1994.
[48] W. Sung, J. Kim, and S. Ha. Memory efficient synthesis from dataflow graphs. In Proceedings of the International Symposium on Systems Synthesis, 1998.
[49] E. Zitzler, J. Teich, and S. S. Bhattacharyya. Multidimensional exploration of software implementations for DSP algorithms. Journal of VLSI Signal Processing Systems, 1999. Accepted for publication; to appear.
[50] Mentor Graphics Corporation. DSP Architect DFL User's and Reference Manual, V 8.2 6. 1993.
[51] M. Levy. C compilers for DSPs flex their muscles. EDN Access, issue 12, June 1997. http://www.ednmag.com
[52] P. Paulin, M. Cornero, C. Liem, et al. Trends in Embedded Systems Technology. In: M. G. Sami, G. De Micheli (eds.): Hardware/Software Codesign, Kluwer Academic Publishers, 1996.
[53] K. M. Bischoff. Ox User's Manual. Technical Report #92-31, Iowa State University, 1992.
[54] A. V. Aho, R. Sethi, J. D. Ullman. Compilers – Principles, Techniques, and Tools. Addison-Wesley, 1986.
[55] G. J. Chaitin. Register Allocation and Spilling via Graph Coloring. ACM SIGPLAN Symp. on Compiler Construction, 1982, pp. 98–105.
[56] A. V. Aho, M. Ganapathi, S. W. K. Tjiang. Code Generation Using Tree Matching and Dynamic Programming. ACM Trans. on Programming Languages and Systems, vol. 11, no. 4, 1989, pp. 491–516.
[57] C. Liem, T. May, P. Paulin. Instruction-Set Matching and Selection for DSP and ASIP Code Generation. European Design and Test Conference (ED & TC), 1994, pp. 31–37.
[58] B. Wess. Automatic Instruction Code Generation based on Trellis Diagrams. IEEE Int. Symp. on Circuits and Systems (ISCAS), 1992, pp. 645–648.
[59] C. W. Fraser, D. R. Hanson, T. A. Proebsting. Engineering a Simple, Efficient Code Generator Generator. ACM Letters on Programming Languages and Systems, vol. 1, no. 3, 1992, pp. 213–226.
[60] G. Araujo, S. Malik. Optimal Code Generation for Embedded Memory Non-Homogeneous Register Architectures. 8th Int. Symp. on System Synthesis (ISSS), 1995, pp. 36–41.
[61] S. Liao, S. Devadas, K. Keutzer, S. Tjiang, A. Wang. Code Optimization Techniques for Embedded DSP Microprocessors. 32nd Design Automation Conference (DAC), 1995, pp. 599–604.
[62] S. Liao, S. Devadas, K. Keutzer, S. Tjiang. Instruction Selection Using Binate Covering for Code Size Optimization. Int. Conf. on Computer-Aided Design (ICCAD), 1995, pp. 393–399.
[63] G. Araujo, S. Malik, M. Lee. Using Register Transfer Paths in Code Generation for Heterogeneous Memory-Register Architectures. 33rd Design Automation Conference (DAC), 1996.
[64] D. J. Kolson, A. Nicolau, N. Dutt, K. Kennedy. Optimal Register Assignment for Loops for Embedded Code Generation. 8th Int. Symp. on System Synthesis (ISSS), 1995.
[65] A. Sudarsanam, S. Malik. Memory Bank and Register Allocation in Software Synthesis for ASIPs. Int. Conf. on Computer-Aided Design (ICCAD), 1995, pp. 388–392.
[66] D. H. Bartley. Optimizing Stack Frame Accesses for Processors with Restricted Addressing Modes. Software – Practice and Experience, vol. 22(2), 1992, pp. 101–110.
[67] S. Liao, S. Devadas, K. Keutzer, S. Tjiang, A. Wang. Storage Assignment to Decrease Code Size. ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 1995.
[68] R. Leupers, P. Marwedel. Algorithms for Address Assignment in DSP Code Generation. Int. Conf. on Computer-Aided Design (ICCAD), 1996.
[69] B. Wess, M. Gotschlich. Optimal DSP Memory Layout Generation as a Quadratic Assignment Problem. Int. Symp. on Circuits and Systems (ISCAS), 1997.
[70] A. Sudarsanam, S. Liao, S. Devadas. Analysis and Evaluation of Address Arithmetic Capabilities in Custom DSP Architectures. Design Automation Conference (DAC), 1997.
[71] R. Leupers, F. David. A Uniform Optimization Technique for Offset Assignment Problems. 11th Int. Symp. on System Synthesis (ISSS), 1998.
[72] C. Liem, P. Paulin, A. Jerraya. Address Calculation for Retargetable Compilation and Exploration of Instruction-Set Architectures. 33rd Design Automation Conference (DAC), 1996.
[73] S. Davidson, D. Landskov, B. D. Shriver, P. W. Mallett. Some Experiments in Local Microcode Compaction for Horizontal Machines. IEEE Trans. on Computers, vol. 30, no. 7, 1981, pp. 460–477.
[74] A. Timmer, M. Strik, J. van Meerbergen, J. Jess. Conflict Modelling and Instruction Scheduling in Code Generation for In-House DSP Cores. 32nd Design Automation Conference (DAC), 1995, pp. 593–598.
[75] R. Leupers, P. Marwedel. Time-Constrained Code Compaction for DSPs. IEEE Trans. on VLSI Systems, vol. 5, no. 1, 1997.
[76] M. Berkelaar. Eindhoven University of Technology. Available at ftp.es.ele.tue.nl/pub/lpsolve/
[77] K. Rimey, P. N. Hilfinger. Lazy Data Routing and Greedy Scheduling for Application-Specific Signal Processors. 21st Annual Workshop on Microprogramming and Microarchitecture (MICRO-21), 1988, pp. 111–115.
[78] R. Hartmann. Combined Scheduling and Data Routing for Programmable ASIC Systems. European Conference on Design Automation (EDAC), 1992, pp. 486–490.
[79] T. Wilson, G. Grewal, B. Halley, D. Banerji. An Integrated Approach to Retargetable Code Generation. 7th Int. Symp. on High-Level Synthesis (HLSS), 1994, pp. 70–75.
[80] C. H. Gebotys. An Efficient Model for DSP Code Generation: Performance, Code Size, Estimated Energy. 10th Int. Symp. on System Synthesis (ISSS), 1997.
[81] S. Novack, A. Nicolau, N. Dutt. A Unified Code Generation Approach using Mutation Scheduling. Chapter 12 in [5].
[82] S. Bashford, R. Leupers. Constraint Driven Code Selection for Fixed-Point DSPs. 36th Design Automation Conference (DAC), 1999.
[83] R. Woudsma. EPICS: A Flexible Approach to Embedded DSP Cores. Int. Conf. on Signal Processing Applications and Technology (ICSPAT), 1994.
[84] R. Leupers. Retargetable Code Generation for Digital Signal Processors. Kluwer Academic Publishers, ISBN 0-7923-9958-7, 1997.
[85] R. M. Stallman. Using and Porting GNU CC V2.4. Free Software Foundation, Cambridge, Massachusetts, 1993.
[86] L. Nowak. Graph Based Retargetable Microcode Compilation in the MIMOLA Design System. 20th Ann. Workshop on Microprogramming (MICRO-20), 1987, pp. 126–132.
[87] D. Lanneer, J. Van Praet, A. Kifli, K. Schoofs, W. Geurts, F. Thoen, G. Goossens. CHESS: Retargetable Code Generation for Embedded DSP Processors. Chapter 5 in [5].
[88] J. Van Praet, D. Lanneer, G. Goossens, W. Geurts, H. De Man. A Graph Based Processor Model for Retargetable Code Generation. European Design and Test Conference (ED & TC), 1996.
[89] M. R. Hartoog, J. A. Rowson, P. D. Reddy, et al. Generation of Software Tools from Processor Descriptions for Hardware/Software Codesign. 34th Design Automation Conference (DAC), 1997.
[90] S. Hanono, S. Devadas. Instruction Selection, Resource Allocation, and Scheduling in the AVIV Retargetable Code Generator. 35th Design Automation Conference (DAC), 1998.
[91] G. Hadjiyiannis, S. Hanono, S. Devadas. ISDL: An Instruction-Set Description Language for Retargetability. 34th Design Automation Conference (DAC), 1997.
[92] ACE Associated Compiler Experts. http://www.ace.nl
[93] Target Compiler Technologies. http://www.retarget.com
[94] Archelon Inc. http://www.archelon.com
Biographical sketches of the authors

Shuvra S. Bhattacharyya

Shuvra S. Bhattacharyya received the Ph.D. degree in Electrical Engineering and Computer Sciences from the University of California at Berkeley in 1994. Since July 1997, he has been an Assistant Professor in the Department of Electrical and Computer Engineering at the University of Maryland at College Park. He holds a joint appointment with the University of Maryland Institute for Advanced Computer Studies (UMIACS).

Dr. Bhattacharyya's research interests center around computer-aided design for embedded systems, with emphasis on synthesis and optimization of hardware and software for digital signal/image/video processing (DSP) applications.

From 1991 to 1992, he was at Kuck and Associates, Inc. in Champaign, Illinois, where he was involved in the research and development of program transformations for performance improvement in C and Fortran compilers. From 1994 to 1997, he was a Researcher at the Semiconductor Research Laboratory of Hitachi America, Ltd., in San Jose, California. At Hitachi, he was involved in research on software optimization techniques for embedded DSP applications.

Dr. Bhattacharyya is a recipient of the NSF CAREER award (1997), and is co-author of Software Synthesis from Dataflow Graphs (Kluwer Academic Publishers, 1996) and Embedded Multiprocessors: Scheduling and Synchronization (Marcel Dekker, to be published in 2000).
Rainer Leupers

Rainer Leupers received his Diploma and Ph.D. degrees in Computer Science with distinction from the University of Dortmund, Germany, in 1992 and 1997, respectively. He received the Hans Uhde Award and the best dissertation award from the University of Dortmund for outstanding theses. Since 1993, he has been working as a researcher at the Computer Science Department at Dortmund, where he is currently heading the DSP compiler group. Dr. Leupers is the author of the book Retargetable Code Generation for Digital Signal Processors, published by Kluwer Academic Publishers in 1997. His research interests include design automation and compilers for embedded systems.
Peter Marwedel

Peter Marwedel received his Ph.D. in Physics from the University of Kiel (Germany) in 1974. He worked at the Computer Science Department of that university from 1974 until 1989. In 1987, he received the Dr. habil. degree (a degree required for becoming a professor) for his work on high-level synthesis and retargetable code generation based on the hardware description language MIMOLA. Since 1989, he has been a professor at the Computer Science Department of the University of Dortmund (Germany). He served as the Dean of that department between 1992 and 1995. Currently, he is the president of the technology transfer institute ICD, located at Dortmund. His research areas include hardware/software codesign, high-level test generation, high-level synthesis, and code generation for embedded processors. He is one of the editors of the book Code Generation for Embedded Processors, published by Kluwer Academic Publishers in 1995. Dr. Marwedel is a member of the IEEE Computer Society, the ACM, and the Gesellschaft für Informatik (GI).