Software Synthesis and Code Generation for Signal Processing Systems

Shuvra S. Bhattacharyya
University of Maryland, Department of Electrical and Computer Engineering
College Park, MD 20742, USA

Rainer Leupers, Peter Marwedel
University of Dortmund, Department of Computer Science 12
44221 Dortmund, Germany
ABSTRACT
The role of software is becoming increasingly important in the implementation of DSP applications. As this trend intensifies, and the complexity of applications escalates, we are seeing an increased need for automated tools to aid in the development of DSP software. This paper reviews the state of the art in programming language and compiler technology for DSP software implementation. In particular, we review techniques for high level, block-diagram-based modeling of DSP applications; the translation of block diagram specifications into efficient C programs using global, target-independent optimization techniques; and the compilation of C programs into streamlined machine code for programmable DSP processors, using architecture-specific and retargetable back-end optimizations. In our review, we also point out some important directions for further investigation.
1 Introduction
Although dedicated hardware can provide significant speed and power consumption advantages for signal processing applications [1], extensive programmability is becoming an increasingly desirable feature of implementation platforms for VLSI signal processing. The trend towards programmable platforms is fueled by tight time-to-market windows, which in turn result from intense competition among DSP product vendors, and from the rapid evolution of technology, which shrinks the life cycle of consumer products. As a result of short time-to-market windows, designers are often forced to begin architecture design and system implementation before the specification of a product is fully completed. For example, a portable communication product is often designed before the signal transmission standards under which it will operate are finalized, or before the full range of standards that will be supported by the product is agreed upon. In such an environment, late changes in the design cycle are mandatory. The need to quickly make such late changes requires the use of software. Furthermore, whether or not the product specification is fixed beforehand, software-based implementations using off-the-shelf processors take significantly less verification effort compared to custom hardware solutions.
Although the flexibility offered by software is critical in DSP applications, the implementation of production quality DSP software is an extremely complex task. The complexity arises from the diversity of critical constraints that must be satisfied; typically these constraints involve stringent requirements on metrics such as latency, throughput, power consumption, code size, and data storage requirements. Additional constraints include the need to ensure key implementation properties such as bounded memory requirements and deadlock-free operation. As a result, unlike developers of software for general-purpose platforms, DSP software developers routinely engage in meticulous tuning and simulation of program code at the assembly language level.

[Technical report UMIACS-TR-99-57, Institute for Advanced Computer Studies, University of Maryland, College Park, 20742, September 1999. S. S. Bhattacharyya was supported in this work by the US National Science Foundation (CAREER, MIP9734275) and Northrop Grumman Corp. R. Leupers and P. Marwedel were supported by HP EESof, California.]
Important industry-wide trends at both the programming language level and the processor architecture level have had a significant impact on the complexity of DSP software development. At the architectural level, a specialized class of microprocessors has evolved that is streamlined to the needs of DSP applications. These DSP-oriented processors, called programmable digital signal processors (PDSPs), employ a variety of special-purpose architectural features that support common DSP operations such as digital filtering and fast Fourier transforms [2, 3, 4]. At the same time, they often exclude features of general purpose processors, such as extensive memory management support, that are not important for many DSP applications.
Due to various architectural irregularities in PDSPs, which are required for their exceptional cost/performance and power/performance trade-offs [2], compiler techniques for general-purpose processors have proven to be inadequate for exploiting the power of PDSP architectures from high level languages [5]. As a result, the code quality of high-level procedural language (such as C) compilers for PDSPs has been several hundred percent worse than manually-written assembly language code [6, 52]. This situation has necessitated the widespread use of assembly-language coding, and tedious performance tuning, in DSP software development. However, in recent years, a significant research community has evolved that is centered around the development of compiler technology for PDSPs. This community has begun to narrow the gap between compiler-generated code and manually optimized code.
It is expected that innovative processor-specific compilation techniques for PDSPs will provide a significant productivity boost in DSP software development, since such techniques will allow us to take the step from assembly programming of PDSPs to the use of high-level programming languages. The key approach to reducing the overhead of compiler-generated code is the development of DSP-specific compiler optimization techniques. While classical compiler technology is often based on the assumption of a regular processor architecture, DSP-specific techniques are designed to be capable of exploiting the special architectural features of PDSPs. These include special purpose registers in the data path, dedicated memory address generation units, and a moderate degree of instruction-level parallelism.
To illustrate this, consider the architecture of a popular fixed-point DSP (TI TMS320C25) in fig. 1. Its data path comprises the registers TR, PR, and ACCU, each of which plays a specific role in communicating values between the functional units of the processor. This structure allows for a very efficient implementation of DSP algorithms (e.g. filtering algorithms). More regular architectures (e.g. with general-purpose registers) would, for instance, require more instruction bits for addressing the registers and more power for reading and writing the register file.
From a compiler viewpoint, the mapping of operations, program variables, and intermediate results to the data path in fig. 1 must be done in such a way that the number of data transfer instructions between the registers is minimized. The address generation unit (AGU) comprises a special ALU and is capable of performing address arithmetic in parallel to the central data path. In particular, it provides parallel auto-increment instructions for address registers. As we will show later, exploitation of this feature in a compiler demands an appropriate memory layout of program variables. Besides the AGU, the data path also offers a certain degree of instruction-level parallelism. For instance, loading a memory value into register TR and accumulating a product stored in PR can be performed in parallel within a single machine instruction. Since such parallelism cannot be explicitly described in programming languages like C, compilers need to carefully schedule the generated machine instructions, so as to exploit the potential parallelism and to generate fast and dense code.
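As a concrete illustration, the C fragment below shows the kind of inner loop a DSP compiler must map onto such a datapath. The register comments indicate one plausible mapping onto TR, PR, and ACCU; this is a sketch for exposition, not actual compiler output for any particular tool.

```c
#include <stddef.h>

/* FIR-style inner product: on a TMS320C25-class datapath, each iteration
 * ideally becomes a single load-multiply-accumulate machine instruction in
 * which the memory operand is loaded into TR, the previous product held in
 * PR is added to ACCU, and the AGU post-increments the address registers
 * in parallel with the central data path. */
long fir(const int *x, const int *h, size_t ntaps)
{
    long acc = 0;                      /* would live in ACCU */
    for (size_t i = 0; i < ntaps; i++)
        acc += (long)x[i] * h[i];      /* TR * operand -> PR; PR -> ACCU */
    return acc;
}
```

Extracting this instruction-level parallelism from the sequential C source is precisely the scheduling problem discussed above.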
Further architectural features frequently present in PDSPs include parallel memory banks (providing higher memory access bandwidth), chained operations (such as multiply-accumulate), special arithmetic operations (such as addition with saturation), and mode registers (for switching between different arithmetic modes).
For most of the architectural features mentioned above, dedicated code optimization techniques have been developed recently, an overview of which will be given in section 3. Many of these optimizations are computationally complex, resulting in a comparatively low compilation speed. This is intensified by the fact that compilers for PDSPs, besides the need for specific optimization techniques, have to deal with the phase coupling problem. The compilation process is traditionally divided into the phases of code selection, register allocation, and instruction scheduling, which have to be executed in a certain order. For all possible phase orders, the approach of separate compilation phases results in a code quality overhead, since each phase may impose obstructing constraints on subsequent phases which would not have been necessary from a global viewpoint. While for regular processor architectures like RISCs this overhead is moderate and thus tolerable, it is typically much higher for irregular processor architectures as found in PDSPs. Therefore, it is desirable to perform the compilation phases in a coupled fashion, where the different phases mutually exchange information so as to achieve a global optimum.
Even though phase-coupled compiler techniques lead to a further increase in compilation time, it is widely agreed in the DSP software developer community that high compilation speed is of much lower concern than high code quality. Thus, compilation times of minutes or even hours may be perfectly acceptable in many cases. This fact gives good opportunities for novel computation-intensive approaches to compiling high level languages for PDSPs, which would not be acceptable in general-purpose computing.
Besides pure code optimization issues, the large variety of PDSPs (both standard "off-the-shelf" processors and application specific processors) currently in use creates a problem of economic feasibility of compiler construction. Since code optimization techniques for PDSPs are highly architecture-specific by nature, a huge number of different optimization techniques would be required to build efficient compilers for all PDSPs available on the market. Therefore, in this paper we will also briefly discuss techniques for retargetable compilation. Retargetable compilers are capable of generating code not only for a single target processor but for a class of processors, thereby reducing the number of compilers required. This is achieved by providing the compiler with a description of the machine for which code is to be generated, instead of hard-coding the machine description in the compiler. We will mention different approaches to processor modeling for retargetable compilation. Retargetability permits compilers for new processors to be generated quickly. If the processor description formalism is flexible enough, then retargetable compilers may also assist in customizing an only partially predefined processor architecture for a given application.
At the system specification level, the past several years have seen increased use of block-diagram based, graphical programming environments for digital signal processing. Such graphical programming environments, which enable DSP systems to be specified as hierarchies of block diagrams, offer several important advantages. Perhaps the most obvious of these advantages is their intuitive appeal. Although visual programming languages have seen limited use in many application domains, DSP system designers are used to thinking of systems in terms of graphical abstractions, such as signal flow diagrams, and thus, block diagram specification via a graphical user interface is a convenient and natural programming interface for DSP design tools.
An illustration of a block diagram DSP system, developed using the Ptolemy design environment [7], is shown in fig. 2. This is an implementation of a discrete wavelet transform [8] application. The top part of the figure shows the highest level of the block diagram specification hierarchy. Many of the blocks in the specification are hierarchical, which means that the internal functionality of the blocks is also specified as block diagrams ("nested" block diagrams). Blocks at the lowest level of the specification hierarchy, such as the individual FIR filters, are specified in a meta-C language (C augmented with special constructs for specifying block parameters and interface information).
In addition to offering intuitive appeal, the specification of systems in terms of connections between pre-defined, encapsulated functional blocks naturally promotes desirable software engineering practices such as modularity and code reuse. As the complexity of applications increases continually while time-to-market pressures remain intense, reuse of design effort across multiple products is becoming more and more crucial to meeting development schedules.
In addition to their syntactic and software engineering appeal, there are a number of more technical advantages of graphical DSP tools. These advantages hinge on the use of appropriate models of computation to provide the precise underlying block diagram semantics. In particular, the use of dataflow models of computation can enable the application of powerful verification and synthesis techniques. Broadly speaking, dataflow modeling involves representing an application as a directed graph in which the graph vertices represent computations and edges represent logical communication channels between computations. Dataflow-based graphical specification formats are used widely in commercial DSP design tools such as COSSAP by Synopsys, the Signal Processing Worksystem by Cadence, and the Advanced Design System by Hewlett-Packard. These three commercial tools all employ the synchronous dataflow model [9], the most popular variant of dataflow in existing DSP design tools. Synchronous dataflow specification allows bounded memory determination and deadlock detection to be performed comprehensively and efficiently at compile time. In contrast, both of these verification problems are in general impossible to solve (in finite time) for general purpose programming languages such as C.
Potentially the most useful benefit of dataflow-based graphical programming environments for DSP is that carefully-specified graphical programs can expose coarse-grain structure of the underlying algorithm, and this structure can be exploited to improve the quality of synthesized implementations in a wide variety of ways. For example, the process of scheduling — determining the order in which the computations in an application will execute — typically has a large impact on all of the key implementation metrics of a DSP system. A dataflow-based system specification exposes high-level scheduling flexibility that is often not possible to deduce manually or automatically from an assembly language or high-level procedural language specification. This scheduling flexibility can be exploited by a synthesis tool to streamline an implementation based on the given set of performance and cost constraints. We will elaborate on dataflow-based scheduling in sections 2.1.2 and 2.2.
Although graphical dataflow-based programming tools for DSP have become increasingly popular in recent years, the use of these tools in industry is largely limited to simulation and prototyping. The quality of today's graphical programming tools is not sufficient to consistently deliver production-quality implementations. As with procedural language compilation technology for PDSPs, synthesis from dataflow-based graphical specifications offers significant promise for the future, and is an important challenge confronting the DSP design and implementation research community today. Furthermore, these two forms of compiler technology are fully complementary to one another: the mixture of dataflow and C (or any other procedural language), as described in the example of fig. 2, is an especially attractive specification format. In this format, coarse-grain "subprogram" interactions are specified in dataflow, while the functionality of individual subprograms is specified in C. Thus, dataflow synthesis techniques optimize the final implementation at the inter-subprogram level, while C compiler technology is required to perform fine-grained optimization within subprograms.
This paper motivates the problem of compiler technology development for DSP software implementation, provides a tutorial overview of modeling and optimization issues that are involved in the compilation of DSP software, and provides a review of techniques that have been developed by various researchers to address some of these issues. The first part of our overview focuses on coarse-grain software modeling and optimization issues pertinent to the compilation of graphical dataflow programs, and the second part focuses on fine-grained issues that arise in the compilation of high level procedural languages such as C.
These two levels of compiler technology (coarse-grain and fine-grain) are commonly referred to as software synthesis and code generation, respectively. More specifically, by software synthesis, we mean the automated derivation of a software implementation (application program) in some programming language given a library of subprogram modules, a subset of selected modules from this library, and a specification of how these selected modules interact to implement the target application. Fig. 2 is an example of a program specification that is suitable for software synthesis. Here, synchronous dataflow semantics are used to specify subprogram interactions. In section 2.2, we explore software synthesis issues for DSP.
On the other hand, code generation refers to the mapping of a software implementation in some programming language to an equivalent machine program for a specific programmable processor. Thus, the mapping of a C program onto the specific resources of the data path in fig. 1 is an example of code generation. We explore DSP code generation technology in section 3.
2 Compilation of dataflow programs to application programs

2.1 Dataflow modeling of DSP systems
To perform simulation, formal verification, or any kind of compilation from block-diagram DSP specifications, a precise set of semantics is needed that defines the interactions between different computational blocks in a specification. Dataflow-based computational models have proven to provide block-diagram semantics that are both intuitive to DSP system designers and efficient from the point of view of verification and synthesis.
In the dataflow paradigm, a computational specification is represented as a directed graph. Vertices in the graph (called actors) correspond to the computational modules in the specification. In most dataflow-based DSP design environments, actors can be of arbitrary complexity. Typically, they range from elementary operations such as addition or multiplication to DSP subsystems such as FFT units or adaptive filters.
An edge e in a dataflow graph represents the communication of data from its source actor to its sink actor. More specifically, an edge e represents a FIFO (first-in-first-out) queue that buffers data samples (tokens) as they pass from the output of one actor to the input of another. If e is a dataflow edge, we write src(e) and snk(e) for its source and sink actors. When dataflow graphs are used to represent signal processing applications, a dataflow edge e has a non-negative integer delay del(e) associated with it. The delay of an edge gives the number of initial data values that are queued on the edge. Each unit of dataflow delay is functionally equivalent to the z^{-1} operator: the sequence of data values {y_n} generated at the input of the actor snk(e) is equal to the shifted sequence {x_{n - del(e)}}, where {x_n} is the data sequence generated at the output of the actor src(e).

2.1.1 Consistency
Under the dataflow model, an actor can execute at any time that it has sufficient data on all input edges. An attempt to execute an actor when this constraint is not satisfied is said to cause buffer underflow on all edges that do not contain sufficient data. For dataflow modeling to be useful for DSP systems, the execution of actors must also accommodate input data sequences of unbounded length. This is because DSP applications often involve operations that are applied repeatedly to samples in indefinitely long input signals. For an implementation of a dataflow specification to be practical, the execution of actors must be such that the number of tokens queued on each FIFO buffer (dataflow edge) remains bounded throughout the execution of the dataflow graph. In other words, there should not be unbounded data accumulation on any edge in the dataflow graph.
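These rules can be made concrete with a minimal executable model of a single dataflow edge; the struct and field names below are illustrative, not taken from any particular tool.

```c
/* Minimal model of a dataflow edge: a FIFO whose occupancy starts at the
 * edge delay del(e). Firing the source adds prd(e) tokens; firing the sink
 * requires, and then removes, cns(e) tokens. */
typedef struct {
    int tokens;   /* current occupancy; initialized to del(e) */
    int prd;      /* tokens produced per source firing        */
    int cns;      /* tokens consumed per sink firing          */
} Edge;

void fire_source(Edge *e) { e->tokens += e->prd; }

/* Returns 1 on success, 0 if firing the sink would cause buffer underflow. */
int fire_sink(Edge *e)
{
    if (e->tokens < e->cns)
        return 0;             /* insufficient data on the edge */
    e->tokens -= e->cns;
    return 1;
}
```

An execution order that keeps `tokens` bounded on every edge, while never letting `fire_sink` fail, is exactly what the consistency requirements below demand.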
In summary, executing a dataflow specification of a DSP system involves two fundamental, processor-independent requirements — avoiding buffer underflow and avoiding unbounded data accumulation (buffering). The dataflow model imposes no further constraints on the sequence in which computations (actors) are executed. On the other hand, in procedural languages, such as C and FORTRAN, the ordering of statements as well as the use of control-flow constructs imply sequencing constraints beyond those that are required to satisfy data dependencies. By avoiding the overspecification of execution ordering, dataflow specifications provide synthesis tools with full flexibility to streamline the execution order to match the relevant implementation constraints and optimization objectives. This feature of dataflow is of critical importance for DSP implementation since, as we will see throughout the rest of this section, the execution order has a large impact on most important implementation metrics, such as performance, memory requirements, and power consumption.
The term "consistency" refers to the two essential requirements of DSP dataflow specifications — the absence of buffer underflow and of unbounded data accumulation. We say that a consistent dataflow specification is one that can be implemented without any chance of buffer underflow or unbounded data accumulation (regardless of the input sequences that are applied to the system). If there exist one or more sets of input sequences for which underflow and unbounded buffering are avoided, and there also exist one or more sets for which underflow or unbounded buffering results, we say that a specification is partially consistent. A dataflow specification that is neither consistent nor partially consistent is called an inconsistent specification. More elaborate forms of consistency based on a probabilistic interpretation of token flow are explored in [10].
Clearly, consistency is a highly desirable property for DSP software implementation. For most consistent dataflow graphs, tight bounds can be derived on the numbers of data values that coexist (data that has been produced but not yet consumed) on the individual edges (buffers). For such graphs, all buffer memory allocation can be performed statically, and thus, the overhead of dynamic memory allocation can be avoided entirely. This is a valuable feature when attempting to derive a streamlined software implementation.
2.1.2 Scheduling
A fundamentaltaskin synthesizingsoftwarefrom anSDFspecificationis thatof scheduling, which refersto theprocessof
determiningtheorderin which theactorswill beexecuted.Schedulingis eitherdynamicor static. In staticscheduling, the
actorexecutionorderis specifiedat synthesistime,andis fixed– in particular, theorderis not data-dependent.To beuseful
in handlingindefinitely long input datasequences,a staticschedulemustbe periodic. A periodic,staticschedulecanbe
implementedin a finite amountof programmemoryspaceby encapsulatingtheprogramcodefor oneperiodof theschedule
within aninfinite loop. Indeed,this is how suchschedulesaremostoftenimplementedin practice.
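The sketch below shows the shape of such synthesized code for a hypothetical two-actor SDF graph in which actor A produces two tokens per firing and actor B consumes one, so that one schedule period is "A B B". The actors, rates, and buffer bound are assumptions for illustration, not output of any actual synthesis tool.

```c
/* Synthesized code for a periodic static schedule. The buffer bound (2)
 * is known at synthesis time, so the edge buffer is allocated statically. */
static int buf[2];          /* statically allocated edge buffer */
static int head, count;     /* FIFO state                        */
static int sink_sum;        /* observable effect of actor B      */

static void actor_A(void)   /* produces 2 tokens per firing */
{
    buf[(head + count) % 2] = 1; count++;
    buf[(head + count) % 2] = 2; count++;
}

static void actor_B(void)   /* consumes 1 token per firing */
{
    sink_sum += buf[head];
    head = (head + 1) % 2; count--;
}

/* One schedule period "A B B"; returns the buffer occupancy afterwards,
 * which returns to its initial value (0) so the period can repeat forever. */
int run_schedule_period(void)
{
    actor_A(); actor_B(); actor_B();
    return count;
}

/* In the synthesized program, the period is wrapped in an infinite loop:
 *     for (;;) run_schedule_period();
 */
```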
In dynamic scheduling, the sequence of actor executions (schedule) is not specified during synthesis, and run-time decision-making is required to ensure that actors are executed only when their respective input edges have sufficient data. Disadvantages of dynamic scheduling include the overhead (execution time and power consumption) of performing scheduling decisions at run-time, and decreased predictability, especially in determining whether or not any relevant real-time constraints will be satisfied. However, if the data production/consumption behavior of individual actors exhibits significant data-dependence, then dynamic scheduling may be required to avoid buffer underflow and unbounded data accumulation. Furthermore, if the performance characteristics of actors are impossible to estimate accurately, then effective dynamic scheduling leads to better performance by adaptively streamlining the schedule evolution to match the dynamic characteristics of the actors.
For most DSP applications, including the vast majority of applications that are amenable to the SDF model mentioned in section 1, actor behavior is highly predictable. For such applications, given the tight cost and power constraints that are typical of embedded DSP applications, it is highly desirable to avoid dynamic scheduling overhead as much as possible. The ultimate goal under such a high level of predictability is a (periodic) static schedule. If it is not possible to construct a static schedule, then it is desirable to identify "maximal" subsystems that can be scheduled statically, and use a small amount of dynamic decision-making to coordinate the execution of these statically-scheduled subsystems. Schedules that are constructed using such a hybrid, mostly static approach are called quasi-static schedules.
2.1.3 Synchronous dataflow

A dataflow computation model can be viewed as a subclass of dataflow graph specifications. A wide variety of dataflow computational models can be conceived depending on the restrictions that are imposed on the manner in which dataflow actors consume and produce data. For example, synchronous dataflow (SDF), which is the simplest and currently the most popular form of dataflow for DSP, imposes the restriction that the number of data values produced by an actor onto each output edge is constant, and similarly the number of data values consumed by an actor from each input edge is constant. Thus, an SDF edge e has two additional attributes — the number of data values produced onto e by each invocation of the source actor, denoted prd(e), and the number of data values consumed from e by each invocation of the sink actor, denoted cns(e).

The example shown in fig. 2 conforms to the SDF model. An SDF abstraction of a scaled-down and simplified version of this system is shown in fig. 3. Here each edge e is annotated with the values prd(e) and cns(e) — the numbers of data values produced and consumed by the source and sink actors, respectively.

The restrictions imposed by the SDF model offer a number of important advantages.
- Simplicity. Intuitively, when compared to more general types of dataflow actors, actors that produce and consume data in constant-sized packets are easier to understand, develop, interface to other actors, and maintain. This property is difficult to quantify; however, the rapid and extensive adoption of SDF in DSP design tools clearly indicates that designers can easily learn to think of functional specifications in terms of the SDF model.
- Static scheduling and memory allocation. For SDF graphs, there is no need to resort to dynamic scheduling, or even quasi-static scheduling. For a consistent SDF graph, underflow and unbounded data accumulation can always be avoided with a periodic, static schedule. Moreover, tight bounds on buffer occupancy can be computed efficiently. By avoiding the run-time overheads associated with dynamic scheduling and dynamic memory allocation, efficient SDF graph implementations offer significant advantages when cost, power, or performance constraints are severe.
- Consistency verification. A dataflow model of computation is a decidable dataflow model if it can be determined in finite time whether or not an arbitrary specification in the model is consistent. We say that a dataflow model is a binary-consistency model if every specification in the model is either consistent or inconsistent. In other words, a model is a binary-consistency model if it contains no partially consistent specifications. All of the decidable dataflow models that are used in practice today are binary-consistency models.

  Binary consistency is convenient from a verification point of view since consistency becomes an inherent property of a specification: whether or not buffer underflow or unbounded data accumulation arises is not dependent on the input sequences that are applied. Of course, such convenience comes at the expense of restricted applicability. A binary-consistency model cannot be used to specify all applications.

  The SDF model is a binary-consistency model, and efficient verification techniques exist for determining whether or not an SDF graph is consistent. Although SDF has limited expressive power in exchange for this verification efficiency, the model has proven to be of great practical value. SDF encompasses a broad and important class of signal processing and digital communications applications, including modems, multirate filter banks [8], and satellite receiver systems, just to name a few [9, 11, 12].
For SDF graphs, the mechanics of consistency verification are closely related to the mechanics of scheduling. The interrelated problems of verifying and scheduling SDF graphs are discussed in detail below.
2.1.4 Static scheduling of SDF graphs

The first step in constructing a static schedule for an SDF graph G = (V, E) is determining the number of times q(A) that each actor A ∈ V should be invoked in one period of the schedule. To ensure that the schedule period can be repeated indefinitely without unbounded data accumulation, the constraint

    q(src(e)) × prd(e) = q(snk(e)) × cns(e),  for every edge e ∈ E    (1)

must be satisfied. The system of equations (1) is called the set of balance equations for G.

Clearly, a useful periodic schedule can be constructed only if the balance equations have a positive integer solution q (q(A) > 0 for all A ∈ V). Lee and Messerschmitt have shown that for a general SDF graph G, exactly one of the following conditions holds [9]:

- The zero vector is the only solution to the balance equations, or
- There exists a minimal positive integer solution q to the balance equations, and thus every positive integer solution q′ satisfies q′(A) ≥ q(A) for all A. This minimal vector q is called the repetitions vector of G.
If the former condition holds, then G is inconsistent. Otherwise, a bounded buffer periodic schedule can be constructed provided that it is possible to construct a sequence of actor executions such that buffer underflow is avoided, and each actor A is executed exactly q(A) times. Given a consistent SDF graph, we refer to an execution sequence that satisfies these two properties as a valid schedule period, or simply a valid schedule. Clearly, a bounded memory static schedule can be implemented in software by encapsulating the implementation of any valid schedule within an infinite loop.

A linear-time (O(|V| + |E|)) algorithm to determine whether or not a repetitions vector exists, and to compute a repetitions vector whenever one does exist, can be found in [11].
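As a sketch of how the balance equations can be solved, the routine below propagates rational rates outward from an arbitrary seed actor and then scales by a common denominator. It assumes a connected graph with positive prd/cns values, and it is not the linear-time algorithm of [11] — just a compact illustration of the same computation.

```c
#define MAXV 16

typedef struct { int src, snk, prd, cns; } SdfEdge;

static long gcd(long a, long b) { return b ? gcd(b, a % b) : a; }

/* Fills q[0..nv-1] with the repetitions vector and returns 1 on success;
 * returns 0 if the graph is inconsistent (only the zero solution exists)
 * or not connected to actor 0. */
int repetitions_vector(int nv, int ne, const SdfEdge *e, long *q)
{
    long num[MAXV] = {0}, den[MAXV] = {0};
    int i, v, changed;
    num[0] = den[0] = 1;                 /* seed: q(actor 0) = 1/1 */
    do {                                 /* propagate rates to a fixpoint */
        changed = 0;
        for (i = 0; i < ne; i++) {
            int s = e[i].src, t = e[i].snk, u = -1;
            if (num[s] && !num[t]) {     /* q(t) = q(s) * prd(e) / cns(e) */
                num[t] = num[s] * e[i].prd;
                den[t] = den[s] * e[i].cns;
                u = t;
            } else if (num[t] && !num[s]) {
                num[s] = num[t] * e[i].cns;
                den[s] = den[t] * e[i].prd;
                u = s;
            }
            if (u >= 0) {
                long g = gcd(num[u], den[u]);
                num[u] /= g; den[u] /= g;
                changed = 1;
            }
        }
    } while (changed);
    for (v = 0; v < nv; v++)
        if (!num[v]) return 0;           /* actor unreachable from seed */
    for (i = 0; i < ne; i++)             /* check every balance equation */
        if (num[e[i].src] * e[i].prd * den[e[i].snk]
            != num[e[i].snk] * e[i].cns * den[e[i].src])
            return 0;                    /* inconsistent graph */
    long l = 1, g;
    for (v = 0; v < nv; v++) l = l / gcd(l, den[v]) * den[v];  /* lcm */
    for (v = 0; v < nv; v++) q[v] = num[v] * (l / den[v]);
    for (g = q[0], v = 1; v < nv; v++) g = gcd(g, q[v]);
    for (v = 0; v < nv; v++) q[v] /= g;  /* minimal positive solution */
    return 1;
}
```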
For example, consider the SDF graph shown in fig. 3. The repetitions vector for this graph takes on only the values 2 and 1: one group of the actors has a repetitions-vector component of 2, and each of the remaining actors has a component of 1. (2)
If arepetitionsvectorexistsfor anSDFgraph,but avalid scheduledoesnotexist, thenthegraphis saidto bedeadlocked.
Thus,anSDFgraphis consistentif andonly if arepetitionsvectorexists,andthegraphis notdeadlocked.In general,whether
or notagraphis deadlockeddependsontheedgedelays e ID # aswell theproductionandconsumptionparameters
# and "M # . An exampleof a deadlockedSDFgraphis givenin fig. 4. An annotationof theform D next to an
edgein thefigurerepresentsadelayof units.Notethattherepetitionsvectorfor thisgraphis givenby
` 6GY K ` 65 q< ` X89 :" (3)
Once a repetitions vector q has been computed, deadlock detection and the construction of a valid schedule can be performed concurrently. Premature termination of the scheduling procedure — termination before each actor A has been fully scheduled (scheduled q(A) times) — indicates deadlock. One simple approach is to schedule actor invocations one at a time and simulate the buffer activity in the dataflow graph accordingly until all actors are fully scheduled. The buffer simulation is necessary to ensure that buffer underflow is avoided. A pseudocode specification of this simple approach can be found in [11]. Lee and Messerschmitt show that this approach terminates prematurely if and only if the input graph is deadlocked, and otherwise, regardless of the specific order in which actors are selected for scheduling, a valid schedule is always constructed [13].
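This simulate-and-fire procedure can be sketched as follows; the graph encoding and fixed array bounds are illustrative, and the routine is not the published pseudocode of [11].

```c
#define NV 8
#define NE 16

typedef struct { int src, snk, prd, cns, del; } GEdge;

/* Builds a valid schedule by repeatedly firing any actor that still has
 * invocations left and sufficient input tokens. Returns the schedule
 * length (actor indices written to sched), or -1 on deadlock, i.e. when
 * the procedure terminates prematurely. q is the repetitions vector. */
int build_schedule(int nv, int ne, const GEdge *e, const int *q, int *sched)
{
    int tok[NE], left[NV], n = 0, i, v;
    for (i = 0; i < ne; i++) tok[i] = e[i].del;  /* initial tokens = delays */
    for (v = 0; v < nv; v++) left[v] = q[v];
    for (;;) {
        int fired = 0, pending = 0;
        for (v = 0; v < nv; v++) {
            int ok = 1;
            if (left[v] == 0) continue;          /* fully scheduled */
            pending = 1;
            for (i = 0; i < ne; i++)
                if (e[i].snk == v && tok[i] < e[i].cns)
                    ok = 0;                      /* firing v would underflow */
            if (!ok) continue;
            for (i = 0; i < ne; i++) {           /* simulate firing actor v */
                if (e[i].snk == v) tok[i] -= e[i].cns;
                if (e[i].src == v) tok[i] += e[i].prd;
            }
            sched[n++] = v; left[v]--; fired = 1;
        }
        if (!pending) return n;                  /* valid schedule complete */
        if (!fired) return -1;                   /* premature stop: deadlock */
    }
}
```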
In summary, SDF is currently the most widely-used dataflow model in commercial and research-oriented DSP design tools. Commercial tools that employ SDF semantics include Simulink by The MathWorks, SPW by Cadence, and HP Ptolemy by Hewlett-Packard. SDF-based research tools include Gabriel [14] and several key domains in Ptolemy [7], from U.C. Berkeley, and ASSIGN from Carnegie Mellon [15]. The SDF model offers efficient verification of consistency for arbitrary specifications, and efficient construction of static schedules for all consistent specifications. Our discussion above outlined a simple, systematic technique for constructing a static schedule whenever one exists. In practice, however, it is preferable to employ more intricate scheduling strategies that take careful account of the costs (performance, memory consumption, etc.) of the generated schedules. In section 2.2, we will discuss techniques for streamlined scheduling of SDF graphs based on the constraints and optimization objectives of the targeted implementation. In the remainder of this section, we discuss a number of useful extensions to the SDF model.
2.1.5 Cyclo-static dataflow

Cyclo-static dataflow (CSDF) and scalable synchronous dataflow (described in section 2.1.6) are presently the most widely-used extensions of SDF. In CSDF, the number of tokens produced and consumed by an actor is allowed to vary as long as the variation takes the form of a fixed, periodic pattern [16, 17]. More precisely, each actor A in a CSDF graph has associated with it a fundamental period τ(A) ∈ {1, 2, ...}, which specifies the number of phases in one minimal period of the cyclic production/consumption pattern of A. For each input edge e to A, the scalar SDF attribute cns(e) is replaced by a τ(A)-tuple (C_1(e), C_2(e), ..., C_{τ(A)}(e)), where each C_i(e) is a nonnegative integer that gives the number of data values consumed from e by A in the ith phase of each period of A. Similarly, for each output edge e of A, prd(e) is replaced by a τ(A)-tuple (P_1(e), P_2(e), ..., P_{τ(A)}(e)), which gives the numbers of data values produced onto e in successive phases of A.
A simpleexampleof a CSDFactor is illustratedin fig. 5(a). This actor is a conventionaldownsampleractor (with
downsamplingfactor3) from multiratesignalprocessing.Functionally, adownsampler, performsthefunction FB1q% v F: 1h : , wherefor ^: < 4 , and %' denotethe datavaluesproducedandconsumed,respectively. Thus,for every
inputvaluethatis copiedto theoutput,v : input valuesarediscarded.As shown in fig. 5(b) for | , this functionality
canbespecifiedby a CSDFactorthathasv
phases.A datavalueis consumedon the input for allv
phases,resultingin
thev
-componentconsumptiontuple : : 4 : ; however, a datavalueis producedonto theoutputedgeonly on thefirst
phase,resultingin theproductiontuple : \\ 4 \ .
Like SDF, CSDF is a binary consistency model, and it is possible to perform efficient verification of bounded memory requirements and buffer underflow avoidance for CSDF graphs [17]. Furthermore, static schedules can always be constructed for consistent CSDF graphs.
A CSDF actor A can easily be converted into an SDF actor A' such that if identical sequences of input data values are applied to A and A', then identical output data sequences result. Such a functionally equivalent SDF actor A' can be derived by having each invocation of A' implement one fundamental CSDF period of A (that is, tau(A) successive phases of A). Thus, for each input edge e' of A', the SDF consumption parameter is given by

cns(e') = C1(e) + C2(e) + ... + C_tau(A)(e),

and for each output edge e' of A', the SDF production parameter is given by

prd(e') = P1(e) + P2(e) + ... + P_tau(A)(e),

where e is the corresponding edge of the CSDF actor A. Applying this conversion to the downsampler example discussed above gives an "SDF equivalent" downsampler that consumes a block of N input data values on each invocation, and produces a single data value, which is a copy of the first value in the input block. The SDF equivalent for fig. 5(a) is illustrated in fig. 5(b).
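The conversion just described amounts to summing each per-phase rate tuple; a minimal sketch (the tuple representation and function name are our own):

```python
def csdf_to_sdf(consumption_tuples, production_tuples):
    """Collapse a CSDF actor's per-phase rate tuples into SDF rates.

    Each invocation of the resulting SDF actor implements one full
    fundamental period (all phases) of the CSDF actor, so each SDF
    rate is the sum of the corresponding phase tuple.
    """
    cns = {e: sum(t) for e, t in consumption_tuples.items()}
    prd = {e: sum(t) for e, t in production_tuples.items()}
    return cns, prd

# Downsampler with factor N = 3: consumes on every phase,
# produces only in the first phase.
cns, prd = csdf_to_sdf({"in": (1, 1, 1)}, {"out": (1, 0, 0)})
```

The resulting SDF actor consumes a 3-value block and produces one value per invocation, as in the downsampler example above.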
Since any CSDF actor can be converted to a functionally equivalent SDF actor, it follows that CSDF does not offer increased expressive power at the level of individual actor functionality (input-output mappings). However, the CSDF model can offer increased flexibility in compactly and efficiently representing interactions between actors.
As an example of increased flexibility in expressing actor interactions, consider the CSDF specification illustrated in fig. 6. This specification represents a recursive digital filter computation of the form

y[n] = k*k*y[n-1] + x[n].   (4)

In fig. 6, the two-phase CSDF actor labeled A represents a scaling (multiplication) by the constant factor k. In each of its two phases, actor A consumes a data value from one of its input edges, multiplies the data value by k, and produces the resulting value onto one of its output edges. The CSDF specification of fig. 6 thus exploits our ability to compute (4) using the equivalent formulation

y[n] = k*(k*y[n-1]) + x[n],   (5)

which requires only addition blocks and k-scaling blocks. Furthermore, the two k-scaling operations contained in (5) are consolidated into a single CSDF actor (actor A).
Such consolidation of distinct operations from different data streams offers two advantages. First, it leads to more compact representations, since fewer vertices are required in the CSDF graph. For large or complex applications, this can result in more intuitive representations, and can reduce the time required to perform various analysis and synthesis tasks. Second, it allows a precise modeling of resource sharing decisions — pre-specified bindings of multiple operations in a DSP application onto individual hardware resources (such as functional units) or software resources (such as subprograms) — within the framework of dataflow. Such pre-specified bindings may arise from constraints imposed by the designer, and from decisions taken during synthesis or design space exploration.

The ability to compactly and precisely model the sharing of actors in CSDF stems from the ability to selectively "turn off" data dependencies on arbitrary subsets of input edges in any given phase of an actor. In contrast, an SDF actor requires at least one data value on each input edge before it can be invoked. In the presence of feedback loops, this requirement may preclude a shared representation of an actor in SDF, even though it may be possible to achieve the desired sharing using a functionally equivalent CSDF actor. This is illustrated in fig. 7, which is derived from the CSDF specification of fig. 6 by replacing the "shared" CSDF actor with its functionally equivalent SDF counterpart. Since the graph of fig. 7 contains a delay-free cycle, clearly we can conclude that the graph is deadlocked, and thus a valid schedule does not exist. In other words, this is an inconsistent dataflow specification. In contrast, it is easily verified that a valid schedule exists for the CSDF specification of fig. 6, in which the first and second phases A1 and A2 of the CSDF actor A are fired in alternation with the remaining actors.

Similarly, an SDF model of a hierarchical actor may introduce deadlock in a system specification, and such deadlock can often be avoided by replacing the hierarchical SDF actor with a functionally equivalent hierarchical CSDF actor. Here, by a hierarchical SDF actor we mean an actor whose internal functionality is specified by an SDF graph. The utility of CSDF in constructing hierarchical specifications is illustrated in fig. 8.
CSDF also offers decreased buffering requirements for some applications. An illustration is shown in fig. 9. Fig. 9(a) depicts a system in which N-element blocks of data are alternately distributed from the data source to two processing modules P1 and P2. The actor that performs the distribution is modeled as a two-phase CSDF actor that inputs an N-element data block on each phase, sends the input block to P1 in the first phase, and sends the input block to P2 in the second phase. It is easily seen that the CSDF specification of fig. 9(a) can be implemented with a buffer of size N on each of the three edges. Thus, the total buffering requirement is 3N for this specification.

If we replace the CSDF "block-distributor" actor with its functionally equivalent SDF counterpart, then we obtain the pure SDF specification depicted in fig. 9(b). The SDF version of the distributor must process two blocks at a time to conform to SDF semantics. As a result, the edge that connects the data source to the distributor requires a buffer of size 2N. Thus, the total buffering requirement of the SDF graph of fig. 9(b) is 4N, which is 33% greater than that of the CSDF version of fig. 9(a).
Yet another advantage offered by CSDF is that by decomposing actors into a finer (phase-level) granularity of specification, basic behavioral optimizations such as constant propagation and dead code elimination [18, 54] are facilitated significantly [19]. As a simple example of dead code elimination with CSDF, consider the CSDF specification shown in fig. 10(a) of a multirate FIR filtering system that is expressed in terms of basic multirate building blocks. From this graph, the equivalent expanded homogeneous SDF graph, shown in fig. 10(b), can be derived using concepts discussed in [9, 17]. In the expanded graph, each actor corresponds to a single phase of a CSDF actor or a single invocation of an SDF actor within a single period of a periodic schedule. From fig. 10(b) it is apparent that the results of some computations (SDF invocations or CSDF phases) are never needed in the production of any of the system outputs. Such computations correspond to dead code and can be eliminated during synthesis without compromising correctness. For this example, the complete set of subgraphs that correspond to dead code is illustrated in fig. 10(b). Parks, Pino, and Lee show that such "dead subgraphs" can be detected with a straightforward algorithm [19].
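One straightforward way to detect such dead computations, in the spirit of [19], is a reverse reachability sweep over the expanded graph: any vertex from which no system output is reachable is dead. The graph encoding and function name below are our own illustration, not the algorithm of [19] verbatim.

```python
def find_dead_vertices(vertices, edges, outputs):
    """Return the set of vertices whose results never reach a system output.

    vertices: iterable of vertex names in the expanded graph.
    edges: iterable of (src, snk) pairs.
    outputs: vertices designated as system outputs.
    """
    # Build the predecessor map so we can walk the graph backwards.
    preds = {}
    for s, t in edges:
        preds.setdefault(t, []).append(s)
    live = set(outputs)
    stack = list(outputs)
    while stack:
        v = stack.pop()
        for p in preds.get(v, []):
            if p not in live:
                live.add(p)
                stack.append(p)
    return set(vertices) - live
```

For instance, with edges a -> c and b -> d and system output c, the sweep marks a and c live, so b and d are reported as dead and can be pruned during synthesis.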
In summary, CSDF is a useful generalization of SDF that maintains the properties of binary consistency, efficient verification, and static scheduling while offering a richer range of inter-actor communication patterns, improved support for hierarchical specifications, more economical data buffering, and improved support for basic behavioral optimizations. CSDF concepts are used in a number of commercial design tools such as DSP Canvas by Angeles Design Systems, and Virtuoso Synchro by Eonic Systems.
2.1.6 Scalable synchronous dataflow
The scalable synchronous dataflow (SSDF) model is an extension of SDF that enables software synthesis of vectorized implementations, which exploit the facility for efficient block processing in many DSP applications [20]. The internal (host language) specification of an SSDF actor A assumes that the actor will be executed in groups of N_A successive invocations, which operate on (N_A * cns(e))-unit blocks of data at a time from each input edge e. Such block processing reduces the rate of inter-actor context switching, and of context switching between successive code segments within complex actors, and it also may improve execution efficiency significantly on deeply pipelined architectures. The vectorization parameter N_A of each SSDF actor is selected carefully during synthesis. This selection should be based on constraints imposed by the SSDF graph structure; the memory constraints and performance requirements of the target application; and on the following extended version of the SDF balance equation (1) constraints:

(N_src(e) * prd(e)) * qv(src(e)) = (N_snk(e) * cns(e)) * qv(snk(e))   for every edge e in the SSDF graph,   (6)

where qv(A) = q(A)/N_A must be a positive integer for each actor A, and q is the repetitions vector of the SDF graph that results when the vectorization parameter of each actor is set to unity. Since the utility of SSDF is closely tied to optimized synthesis techniques, we defer detailed discussion of SSDF to section 2.2.4, which focuses on throughput-oriented optimization issues for software synthesis.

SSDF is a key specification model in the popular COSSAP design tool that was originally developed by Cadis and the Aachen University of Technology [21], and is now developed by Synopsys.
2.1.7 Other dataflow models
The SDF, CSDF, and SSDF models discussed above are all used in widely-distributed DSP design tools. A number of more experimental DSP dataflow models have also been proposed in recent years. Although these models all offer additional insight on dataflow modeling for DSP, further research and development is required before the practical utility of these models is clearly understood. In the remainder of this section, we briefly review some of these experimental models.

The multidimensional synchronous dataflow model (MDSDF), proposed by Lee [22], and explored further by Murthy [23], extends SDF concepts to applications that operate on multidimensional signals, such as those arising in image and video processing. In MDSDF, each actor produces and consumes data in units of n-dimensional cubes, where n can be arbitrary, and can differ from actor to actor. The "synchrony" requirement in MDSDF constrains each production and consumption n-cube to be of fixed size m1 x m2 x ... x mn, where each mi is a constant. For example, an image processing actor that expands a 512 x 512-pixel image segment into a 1024 x 1024 segment would have the MDSDF representation illustrated in fig. 11.
We say that a dataflow computation model is statically schedulable if a static schedule can always be constructed for a consistent specification in the model. For SDF, CSDF, and MDSDF, binary consistency and static schedulability both hold. The well-behaved dataflow (WBDF) model [24], proposed by Gao, Govindarajan, and Panangaden, is an example of a binary-consistency model that is not statically schedulable. The WBDF model permits the use of a limited set of data-dependent control-flow constructs, and thus requires dynamic scheduling, in general. However, the use of these constructs is restricted in such a way that the inter-related properties of binary consistency and efficient bounded memory verification are preserved, and the construction of efficient quasi-static schedules is facilitated.
The boolean dataflow (BDF) model [25] is an example of a DSP dataflow model for which binary consistency does not hold. BDF introduces the concept of control inputs, which are actor inputs that affect the number of tokens produced and consumed at other input/output ports. In BDF, the values of control inputs are restricted to the two-element boolean set {TRUE, FALSE}. The number of tokens consumed by an actor from a non-control input edge, or produced onto an output edge, is restricted to be constant, as in SDF, or a function of one or more data values consumed at control inputs. BDF attains greatly increased expressive power by allowing data-dependent production and consumption rates. In exchange, some of the intuitive simplicity and appeal of SDF is lost; static scheduling cannot always be employed; and the problems of bounded memory verification and deadlock detection become undecidable [26], which means that in general, they cannot be solved in finite time. However, heuristics have been developed for constructing efficient quasi-static schedules, and for attempting to verify bounded memory requirements. These heuristics have been shown to work well in practice [26]. A natural extension of BDF, called integer-controlled dataflow, that allows control tokens to take on arbitrary integer values has been explored in [27].
2.2 Optimized synthesis of DSP software from dataflow specifications
In section 2.1, we reviewed several dataflow models for high-level, block diagram specification of DSP systems. Among these models, SDF and the closely related SSDF model are the most mature. In this section we examine fundamental trade-offs and algorithms involved in the synthesis of DSP software from SDF and SSDF graphs. Except for the vectorization approaches discussed in section 2.2.4, the techniques discussed in this section apply equally well to both SDF and SSDF. For clarity, we present these techniques uniformly in the context of SDF.
2.2.1 Threaded implementation of dataflow graphs
A software synthesis tool generates application programs by piecing together code modules from a predefined library of software building blocks. These code modules are defined in terms of the target language of the synthesis tool. Most SDF-based design systems use a model of synthesis called threading. Given an SDF representation of a block-diagram program specification, a threaded synthesis tool begins by constructing a periodic schedule. The synthesis tool then steps through the schedule, and for each actor instance A that it encounters, it inserts the associated code module M_A from the given library (inline threading), or inserts a call to a subroutine that invokes M_A (subprogram threading). Threaded tools may employ purely inline threading, purely subroutine threading, or a mixture of inline and subprogram-based instantiation of actor functionality (hybrid threading). The sequence of code modules / subroutine calls that is generated from a dataflow graph is processed by a buffer management phase that inserts the necessary target program statements to route data appropriately between actors.
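Inline threading can be pictured as a simple traversal of the periodic schedule that splices in each actor's code module. The following sketch illustrates the idea; the library contents and the emitted C fragments are invented for illustration and do not come from any particular tool.

```python
def inline_thread(schedule, library):
    """Generate a target program body by inline threading.

    schedule: firing sequence of actor names, e.g. ["A", "B", "B"].
    library: dict mapping actor name -> target-language code module.
    """
    lines = ["/* begin schedule period */"]
    for actor in schedule:
        # Inline threading: splice the actor's code module in directly.
        lines.append(library[actor])
    lines.append("/* end schedule period */")
    return "\n".join(lines)

# Hypothetical two-actor library with invented C statements.
library = {"A": "a_out[0] = read_sample();",
           "B": "write_sample(a_out[0]);"}
program = inline_thread(["A", "B"], library)
```

A subprogram-threading variant would instead emit `actor_A();`-style calls, trading code size for call overhead; the buffer management phase would then rewrite placeholder buffer references such as `a_out` into concrete memory accesses.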
2.2.2 Scheduling tradeoffs
In this section, we provide a glimpse at the complex range of trade-offs that are involved during the scheduling phase of the synthesis process. At present, we consider only inline threading. Subprogram and hybrid threading are considered in section 2.2.5. Synthesis techniques that pertain to SSDF, which are discussed in section 2.2.4, can be applied with similar effectiveness to inline, subprogram, or hybrid threading.

Scheduling is a critical task in the synthesis process. In a software implementation, scheduling has a large impact on key metrics such as program and data memory requirements, performance, and power consumption. Even for a simple SDF graph, the underlying range of trade-offs may be very complex. For example, consider the SDF graph in fig. 12(a). The repetitions vector components for this graph are q(A) = 1, q(B) = q(C) = 10. One possible schedule for this graph is given by

S1 = BCBCBCBCBC A BCBCBCBCBC.   (7)

This schedule exploits the additional scheduling flexibility offered by the delays placed on edge (A, B). Recall that each delay results in an initial data value on the associated edge. Thus, in fig. 12, five executions of B can occur before A is invoked, which leads to a reduction in the amount of memory required for data buffering.

To discuss such reductions in buffering requirements precisely, we need a few definitions. Given a schedule, the buffer size of an SDF edge is the maximum number of live tokens (tokens that are produced but not yet consumed) that coexist on the edge throughout execution of the schedule. The buffer requirement of a schedule S, denoted buf(S), is the sum of the buffer sizes of all of the edges in the given SDF graph. For example, it is easily verified that buf(S1) = 11.

The quantity buf(S) is the number of memory locations required to implement the dataflow buffers in the input SDF graph assuming that each buffer is mapped to a separate segment of memory. This is a natural and convenient model of buffer implementation. It is used in SDF design tools such as Cadence's SPW and the SDF-related code generation domains of Ptolemy. Furthermore, scheduling techniques that employ this buffering model do not preclude the sharing of memory locations across multiple, non-interfering edges (edges whose lifetimes do not overlap): the resulting schedules can be post-processed by any general technique for array memory allocation, such as the well-known first-fit or best-fit algorithms. In this case, the scheduling techniques, which attempt to minimize the sum of the individual buffer sizes, employ a buffer memory metric that is an upper bound approximation to the final buffer memory cost.
One problem with the schedule S1 under the assumed inline threading model is that it consumes a relatively large amount of program memory. If c(A) denotes the code size (number of program memory words required) for an actor A, then the code size cost of S1 can be expressed as c(A) + 10c(B) + 10c(C).

By exploiting the repetitive subsequences in the schedule to organize compact looping structures, we can reduce the code size required for the actor execution sequence implemented by S1. The structure of the resulting software implementation can be represented by the looped schedule

S2 = (5 BC) A (5 BC).   (8)

Each parenthesized term (n T1 T2 ... Tm) (called a schedule loop) in such a looped schedule represents the n successive repetitions of the invocation sequence T1 T2 ... Tm. Each iterand Ti can be an instantiation (appearance) of an actor, or a looped subschedule. Thus, this notation naturally accommodates nested loops.
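Using a nested-tuple encoding of schedule loops (a representation of our own choosing), the firing sequence denoted by a looped schedule can be recovered by recursive expansion:

```python
def expand(looped):
    """Expand a looped schedule into its underlying firing sequence.

    A looped schedule is a list whose items are either actor names or
    tuples (n, body), meaning n successive repetitions of the
    sub-schedule body. Nested loops expand recursively.
    """
    seq = []
    for item in looped:
        if isinstance(item, tuple):
            n, body = item
            seq.extend(expand(body) * n)
        else:
            seq.append(item)
    return seq

# S2 = (5 BC) A (5 BC) from the example above.
s2 = [(5, ["B", "C"]), "A", (5, ["B", "C"])]
```

Expanding `s2` reproduces the 21-firing sequence of S1, which is exactly the sense in which S2 and S1 share the same underlying firing sequence.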
Given an arbitrary firing sequence F (that is, a schedule that contains no schedule loops), and a set of code size costs for all of the given actors, a looped schedule can be derived that minimizes the total code size (over all looped schedules that have F as the underlying firing sequence) using an efficient dynamic programming algorithm [28] called CDPPO. It is easily verified that the schedule S2 achieves the minimum total code size for the firing sequence S1 for any given values of c(A), c(B), and c(C). In general, however, the set of looped schedules that minimize the code size cost for a firing sequence may depend on the relative costs of the individual actors [28].

Schedules S1 and S2 both attain the minimum achievable buffer requirement of 11 for fig. 12; however, S2 will generally achieve a much lower code size cost. The code size cost of S2 can be approximated as c(A) + 2c(B) + 2c(C). This approximation neglects the code size overhead of implementing the schedule loops (parenthesized terms) within S2. In practice, this approximation rarely leads to misleading results. The looping overhead is typically very small compared to the code size saved by consolidating actor appearances in the schedule. This is especially true for the large number of DSP processors that employ so-called "zero-overhead looping" facilities [2]. Scheduling techniques that abandon this approximation, and incorporate looping overhead, are examined in section 2.2.5.
It is possible to reduce the code size cost below what is achievable by S2; however, this requires an increase in the buffering cost. For example, consider the schedule S3 = A(10B)(10C). Such a schedule is called a single appearance schedule since it contains only one instantiation of each actor. Clearly (under the approximation of negligible looping overhead), any single appearance schedule gives a minimal code size implementation of a dataflow graph. However, a penalty in the buffer requirement must usually be paid for such code size optimality. For example, the code size cost of S3 is c(B) + c(C) less than that of S2; however, buf(S3) = 25, while buf(S2) is only 11.
Beyond code size optimality, another potentially important benefit of schedule S3 is that it minimizes the average rate at which inter-actor context switching occurs. This schedule incurs 3 context switches (also called actor activations) per schedule period, while S1 and S2 both incur 21. Such minimization of context switching can significantly improve throughput and power consumption. The issue of context switching, and the systematic construction of minimum-context-switch schedules, are discussed further in section 2.2.4.

An alternative single appearance schedule for fig. 12 is S4 = A(10 BC). This schedule has the same optimal code size cost as S3. However, its buffer requirement of 16 is lower than that of S3, since execution of actors B and C is fully interleaved, which limits data accumulation on the edge (B, C). This interleaving, however, brings the average rate of context switches to 21; and thus, S3 is clearly advantageous in terms of this metric.
In summary, there is a wide, complex range of trade-offs involved in synthesizing an application program from a dataflow specification. This is true even when we restrict ourselves to inline implementations, which entirely avoid the (call/return/parameter passing) overhead of subroutines. In the remainder of this section, we review a number of techniques that have been developed for addressing some of these complex trade-offs. Sections 2.2.3 and 2.2.4 focus primarily on inline implementations. In section 2.2.5, we examine some recently-developed techniques that incorporate subroutine-based threading into the design space.
2.2.3 Minimization of memory requirements
Minimizing program and data memory requirements is critical in many embedded DSP applications. On-chip memory capacities are limited, and the speed, power, and financial cost penalties of employing off-chip memory may be prohibitive or highly undesirable. Three general avenues have been investigated for minimizing memory requirements — minimization of the buffer requirement, which usually forms a significant component of the overall data space cost; minimization of code size; and joint exploration of the trade-off involving code size and buffer requirements.

It has been shown that the problem of constructing a schedule that minimizes the buffer requirement over all valid schedules is NP-complete [11]. Thus, for practical, scalable algorithms, we must resort to heuristics. Ade [29] has developed techniques for computing tight lower bounds on the buffer requirement for a number of restricted subclasses of delayless, acyclic graphs, including arbitrary-length chain-structured graphs. Some of these bounds have been generalized to handle delays in [11]. Approximate lower bounds for general graphs are derived in [30]. Cubric and Panangaden have presented an algorithm that achieves optimum buffer requirements for acyclic SDF graphs that may have one or more independent, undirected cycles [31]. An effective heuristic for general graphs, which is employed in the Gabriel [14] and Ptolemy [7] systems, is given in [11]. Govindarajan, Gao, and Desai have developed an SDF buffer minimization algorithm for multiprocessor implementation [32]. This algorithm minimizes the buffer memory cost over all multiprocessor schedules that have optimal throughput.
For complex, multirate applications — which are the most challenging for memory management — the structure of minimum buffer schedules is in general highly irregular [33, 11]. Such schedules offer relatively few opportunities to organize compact loop structures, and thus have very high code size costs under inlined implementations. Thus, such schedules are often not useful even though they may achieve very low buffer requirements. Schedules at the extreme of minimum code size, on the other hand, typically exhibit a much more favorable trade-off between code and buffer memory costs [34].

These empirical observations motivate the problem of code size minimization. A central goal when attempting to minimize code size for inlined implementations is that of constructing a single appearance schedule whenever one exists. A valid single appearance schedule exists for any consistent, acyclic SDF graph. Furthermore, a valid single appearance schedule can be derived easily from any topological sort (a topological sort of a directed acyclic graph G is a linear ordering of all its vertices such that for each edge (x, y) in G, x appears before y in the ordering) of an acyclic graph G: if

(A1, A2, ..., Am)

is a topological sort of G, then it is easily seen that the single appearance schedule (q(A1) A1)(q(A2) A2) ... (q(Am) Am) is valid. For a cyclic graph, a single appearance schedule may or may not exist depending on the location and magnitude of delays in the graph. An efficient strategy, called the Loose Interdependence Algorithm Framework (LIAF), has been developed that constructs a single appearance schedule whenever one exists [35]. Furthermore, for general graphs, this approach guarantees that all actors that are not contained in a certain type of subgraph, called tightly interdependent subgraphs, will have only one appearance in the generated schedule [36]. In practice, tightly interdependent subgraphs arise only very rarely, and thus, the LIAF technique guarantees full code size optimality for most applications. Because of its flexibility and provable performance, the LIAF is employed in a number of widely used tools, including Ptolemy and Cadence's SPW.
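For an acyclic graph, the topological-sort construction described above is direct: sort the actors, then wrap each one in a loop iterated q(A) times. A minimal sketch (graph encoding our own):

```python
def topological_sort(vertices, edges):
    """Kahn's algorithm: return the vertices in dependency order."""
    indeg = {v: 0 for v in vertices}
    for _, t in edges:
        indeg[t] += 1
    order = []
    ready = [v for v in vertices if indeg[v] == 0]
    while ready:
        v = ready.pop()
        order.append(v)
        for s, t in edges:
            if s == v:
                indeg[t] -= 1
                if indeg[t] == 0:
                    ready.append(t)
    return order

def single_appearance_schedule(vertices, edges, q):
    """Build the looped schedule (q(A1) A1)(q(A2) A2)... as (count, actor) pairs."""
    return [(q[v], v) for v in topological_sort(vertices, edges)]

# The chain A -> B -> C with q = (1, 10, 10) from fig. 12.
sched = single_appearance_schedule(
    ["A", "B", "C"], [("A", "B"), ("B", "C")], {"A": 1, "B": 10, "C": 10})
```

For the fig. 12 chain this yields A(10B)(10C), i.e. the schedule S3 discussed in section 2.2.2; as noted above, different topological sorts of a wider graph yield distinct single appearance schedules with different buffer requirements.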
The LIAF constructs a single appearance schedule by decomposing the input graph into a hierarchy of acyclic subgraphs, which correspond to an outer-level hierarchy of nested loops in the generated schedule. The acyclic subgraphs in the hierarchy can be scheduled with any existing algorithm that constructs single appearance schedules for acyclic graphs. The particular algorithm that is used in a given implementation of the LIAF is called the acyclic scheduling algorithm. For example, the topological-sort-based approach described above could be used as the acyclic scheduling algorithm. However, this simple approach has been shown to lead to relatively large buffer requirements [11]. This motivates a key problem in the joint minimization of code and data for SDF specifications: the problem of constructing a single appearance schedule for an acyclic SDF graph that minimizes the buffer requirement over all valid single appearance schedules. Since any topological sort leads to a distinct schedule for an acyclic graph, and the number of topological sorts is not polynomially bounded in the graph size, exhaustive evaluation of single appearance schedules is not tractable. Thus, as with the (arbitrary appearance) buffer minimization problem, heuristics have been explored. Two complementary, low-complexity heuristics, called APGAN [37] and RPMC [38], have proven to be effective on practical applications when both are applied, and the best resulting schedule is selected. Furthermore, it has been formally shown that APGAN gives optimal results for a broad class of SDF systems. Thorough descriptions of APGAN, RPMC, and the LIAF, and their inter-relationships, can be found in [11, 34]. A scheduling framework for applying these techniques to multiprocessor implementations is described in [39]. Recently-developed techniques for efficient sharing of memory among multiple buffers from a single appearance schedule are developed in [40, 41].
Although APGAN and RPMC provide good performance on many applications, these heuristics can sometimes produce results that are far from optimal [42]. Furthermore, as discussed in section 1, DSP software tools are allowed to spend more time for optimization of code than what is required by low-complexity, deterministic algorithms such as APGAN and RPMC. Motivated by these observations, Zitzler, Teich, and Bhattacharyya have developed an effective stochastic optimization methodology, called GASAS, for constructing minimum buffer single appearance schedules [43, 44]. The GASAS approach is based on a genetic algorithm [45] formulation in which topological sorts are encoded as "chromosomes," which randomly "mutate" and "recombine" to explore the search space. Each topological sort in the evolution is optimized by the efficient local search algorithm CDPPO [28], which was mentioned earlier in section 2.2.2. Using dynamic programming, CDPPO computes a minimum memory single appearance schedule for a given topological sort. To exploit the valuable optimality property of APGAN whenever it applies, the solution generated by APGAN is included in the initial population, and an elitist evolution policy is enforced to ensure that the fittest individual always survives to the next generation.
2.2.4 Throughput optimization
At the Aachen University of Technology, as part of the COSSAP design environment (now developed by Synopsys) project, Ritz, Pankert, and Meyr have investigated the minimization of the context-switch overhead, or the average rate at which actor activations occur [20]. As discussed in section 2.2.2, an actor activation occurs whenever two distinct actors are invoked in succession; for example, the schedule (2(2B)(5A))(5C) for fig. 13 results in five activations per schedule period.

Activation overhead includes saving the contents of registers that are used by the next actor to invoke, if necessary, and loading state variables and buffer pointers into registers. The concept of grouping multiple invocations of the same actor together to reduce context-switch overhead is referred to as vectorization. The SSDF model, discussed in section 2.1.6, allows the benefits of vectorization to extend beyond the actor interface level (inter-actor context switching). For example, context switching between successive sub-functions of a complex actor can be amortized over N_A invocations of the sub-functions, where N_A is the given vectorization parameter.

Ritz estimates the average rate of activations for a periodic schedule S as the number of activations that occur in one iteration of S divided by the blocking factor1 of S. This quantity is denoted by N_act(S). For example, for fig. 13, N_act((2(2B)(5A))(5C)) = 5, and N_act((4(2B)(5A))(10C)) = 9/2 = 4.5. If for each actor, each invocation takes the same amount of time, and if we ignore the time spent on computation that is not directly associated with actor invocations (for example, schedule loops), then N_act(S) is directly proportional to the number of actor activations per unit time. For consistent acyclic SDF graphs, N_act clearly can be made arbitrarily small by increasing the blocking factor sufficiently; thus, as with the problem of constructing compact schedules, the extent to which the activation rate can be minimized is limited by the cyclic regions in the input SDF specification.
The technique developed in [20] attempts to find a valid single appearance schedule that minimizes N_act over all valid single appearance schedules. Note that minimizing the number of activations does not imply minimizing the number of appearances. As a simple example, consider the SDF graph in fig. 14. It can be verified that for this graph, the lowest value of N_act that is obtainable by a valid single appearance schedule is 0.75. However, valid schedules exist that are not single appearance schedules and that have values of N_act below 0.75; for example, a valid schedule that contains two appearances each of A and B attains N_act = 5/7, which is approximately 0.71.

Thus, since Ritz's vectorization approach focuses on single appearance schedules, the primary objective of the techniques in [20] is implicitly code size minimization. This is reasonable since, in practice, code size is often of critical concern. The overall objective in [20] is to construct a minimum activation implementation over all implementations that have minimum code size.
Ritz defines the relative vectorization degree of a simple cycle C (a cyclic path in the graph in which no proper sub-path is cyclic) in a consistent, connected SDF graph G by

N_G(C) = max({min({D(e') | e' in parallel(e)}) | e in edges(C)}),   (9)

where

D(e) = floor(delay(e) / (q(src(e)) * prd(e)))   (10)

is the delay on edge e normalized by the total number of tokens exchanged on e in a minimal schedule period of G, and

parallel(e) = {e' in edges(G) | src(e') = src(e) and snk(e') = snk(e)}

is the set of edges with the same source and sink as e. Here, edges(G) simply denotes the set of edges in the SDF graph G.

For example, if G denotes the SDF graph in fig. 13, and C denotes the cycle in G whose associated graph contains the actors A and B, then the normalized delay on the feedback edge gives N_G(C) = floor(10/20) = 0; and if G denotes the graph in fig. 14 and C denotes the cycle whose associated graph contains A and C, then N_G(C) = floor(7/1) = 7.
1Every periodic schedule invokes each actor A some multiple of q(A) times. This multiple, denoted by J, is called the blocking factor. A minimal periodic schedule is one that satisfies J = 1. For memory minimization, there is no penalty in restricting consideration to minimal schedules [11]. When attempting to minimize N_act, however, it is in general advantageous to consider J > 1.
Ritz et al. postulate that given a strongly connected SDF graph, a valid single appearance schedule that minimizes N_act can be constructed from a complete hierarchization, which is a cluster hierarchy such that only connected subgraphs are clustered, all cycles at a given level of the hierarchy have the same relative vectorization degree, and cycles in higher levels of the hierarchy have strictly higher relative vectorization degrees than cycles in lower levels. Fig. 15 depicts a complete hierarchization of an SDF graph. Fig. 15(a) shows the original SDF graph; here q(A, B, C, D) = (1, 2, 4, 8). Fig. 15(b) shows the top level of the cluster hierarchy. The hierarchical actor Omega1 represents the subgraph associated with {B, C, D}, and this subgraph is decomposed as shown in fig. 15(c), which gives the next level of the cluster hierarchy. Finally, fig. 15(d) shows that the subgraph associated with {C, D} corresponds to Omega2 and is the bottom level of the cluster hierarchy.

Now observe that the relative vectorization degree of the fundamental cycle in fig. 15(c) with respect to the original SDF graph is floor(16/8) = 2, while the relative vectorization degree of the fundamental cycle in fig. 15(b) is floor(12/2) = 6; and the relative vectorization degree of the fundamental cycle in fig. 15(d) is floor(12/8) = 1. We see that the relative vectorization degree decreases as we descend the hierarchy, and thus the hierarchization depicted in fig. 15 is complete. The hierarchization step defined by each of the SDF graphs in figs. 15(b)-(d) is called a component of the overall hierarchization.
Ritz's algorithm [20] constructs a complete hierarchization by first evaluating the relative vectorization degree of each fundamental cycle, determining the maximum vectorization degree, and then clustering the graphs associated with the fundamental cycles that do not achieve the maximum vectorization degree. This process is then repeated recursively on each of the clusters until no new clusters are produced. In general, this bottom-up construction process has unmanageable complexity. However, this normally does not create problems in practice, since the strongly connected components of useful signal processing systems are often small, particularly in large grain descriptions. Details on Ritz's technique for translating a complete hierarchization into a hierarchy of nested loops can be found in [20]. A general, optimal algorithm for vectorization of SSDF graphs based on the complete hierarchization concept discussed above is given in [20]. Joint minimization of vectorization and buffer memory cost is developed in [12], and adaptations of the retiming transformation to improve vectorization for SDF graphs are addressed in [46, 47].
2.2.5 Subroutine insertion

The techniques discussed above assume a fixed threading mode. In particular, they do not attempt to exploit the flexibility offered by hybrid threading. Sung, Kim, and Ha have developed an approach that employs hybrid threading to share code among different actors that have similar functionality [48]. For example, an application may contain several FIR filter blocks that differ only in the number of taps and the set of filter coefficients. These are called different instances of a parameterized FIR module in the actor library. Their approach decomposes the code associated with an actor instance into the actor context and actor reference code, and carefully weighs the benefit of each code sharing opportunity against the associated overhead. The overheads stem from the actor context component, which includes instance-specific state variables and buffer pointers. Code must be inserted to manage this context so that each invocation of the shared code block (the "reference code") is appropriately customized to the associated instance.
Also, the GASAS framework has been significantly extended to consider multiple appearance schedules, and to selectively apply hybrid threading to reduce the code size cost of highly irregular schedules, which cannot be accommodated by compact loop structures [49]. Such irregularity often arises when exploring the space of schedules whose buffer requirements are significantly lower than what is achievable by single appearance schedules [11]. The objective of this genetic-algorithm-based exploration of hybrid threading and loop scheduling is to efficiently compute Pareto fronts in the multidimensional design evaluation space of program memory cost, buffer requirement, and execution time overhead.
The intelligent use of hybrid threading and code sharing (subroutine insertion optimizations) can achieve lower code size costs than what is achievable with single appearance schedules that use conventional inlining. If an inlined single appearance schedule fits within the available on-chip memory, it is not worth incurring the overhead of subroutine insertion. However, if an inline implementation is too large to be held on-chip, then subroutine insertion optimizations can eliminate, or greatly reduce, the need for off-chip memory accesses. Since off-chip memory accesses involve significant execution time penalties and large power consumption costs, subroutine insertion enables embedded software developers to exploit an important part of the design space.
2.2.6 Summary

In this section we have reviewed a variety of algorithms for addressing optimization trade-offs during software synthesis. We have illustrated some of the analytical machinery used in SDF optimization algorithms by examining in some detail Ritz's algorithm for minimizing actor activations. Since CSDF, MDSDF, WBDF, and BDF are extensions of SDF, the techniques discussed in this section can also be applied in these more general models. In particular, they can be applied to any SDF subgraphs that are found. It is important to recognize this when developing or using a DSP design tool, since in DSP applications that are not fully amenable to SDF semantics, a significant subset of the functionality can usually be expressed in SDF. Thus the techniques discussed in this section remain useful even in DSP tools that employ more general dataflow semantics.

Beyond their application to SDF subsystems, however, the extension of most of the techniques developed in this section to more general dataflow models is a non-trivial matter. To achieve best results with these more general models, new synthesis approaches are required that take into account the distinguishing characteristics of the models. The most successful approaches will combine these new approaches for handling the full generality of the associated models with the techniques that exploit the structure of pure SDF subsystems.
3 Compilation of application programs to machine code

In this section, we will first outline the state of the art in the area of compilers for PDSPs. As indicated by several empirical studies, the major problem with current compilers is their inability to generate machine code of sufficient quality. Next, we will discuss a number of recently developed code generation and optimization techniques, which explicitly take into account DSP-specific architectures and requirements in order to improve code quality. Finally, we will mention key techniques developed for retargetable compilation.
3.1 State of the art

Today, the most widespread high-level programming language for PDSPs is ANSI C. Even though there are more DSP-specific languages, such as the dataflow language DFL [50], the popularity and high flexibility of C, as well as the large amount of existing "legacy code", have so far largely prevented the use of programming languages more suitable for DSP programming. C compilers are available for all important DSP families, such as the Texas Instruments TMS320xx, Motorola 56xxx, or Analog Devices 21xx. In most cases, the compilers are provided by the semiconductor vendors themselves.
Due to the large semantic gap between the C language and PDSP instruction sets, many of these compilers make extensions to the ANSI C standard by permitting the use of "compiler intrinsics", for instance in the form of compiler-known functions which are expanded like macros into specific assembly instructions. Intrinsics are used to manually guide the compiler in making the right decisions for generation of efficient code. However, such an ad-hoc approach has significant drawbacks. First, the source code deviates from the language standard and is no longer machine-independent. Thus, porting the software to another processor might be a very time-consuming task. Second, the programming abstraction level is lowered, and the efficient use of compiler intrinsics requires a deep knowledge of the internal PDSP architecture.
Unfortunately, machine-specific source code is today a must whenever the C language is used for programming PDSPs. The reason is the poor quality of code generated by compilers from plain ANSI C code. The overhead of compiler-generated code as compared to hand-written, heavily optimized assembly code has been quantified in the DSPStone benchmarking project [6]. In that project, both code size and performance of compiler-generated code have been evaluated for a number of DSP kernel routines and different PDSP architectures. The results showed that the compiler overhead typically ranges between 100 and 700 % (with the reference assembly code set to 0 % overhead). This is absolutely insufficient in the area of DSP, where real-time constraints as well as limitations on program memory size and power consumption demand an extremely high utilization of processor resources. Therefore, an overhead of compiler-generated code close or equal to zero is most desirable.
In another empirical study [51], DSP vendors have been asked to compile a set of C benchmark programs existing in two different versions, one being machine-independent and the other being tuned for the specific processor. Again, the results showed that using machine-independent code causes an unacceptable overhead in code quality in terms of code size and performance.
These results make the practical use of compilers for PDSP software development questionable. In the area of general purpose processors, such as RISCs, the compiler overhead typically does not exceed 100 %. Hence, even for DSP applications, using a RISC together with a good compiler may result in a more efficient implementation than using a PDSP (with potentially much higher performance) that wastes most of its time executing unnecessary instruction cycles due to a poor compiler. Similar arguments hold if code size or power consumption are of major concern.
As a consequence, the largest part of PDSP software is still written in assembly languages, which implies a lot of well-known drawbacks, such as high development costs, low portability, and high maintenance and debugging effort. This has been quantified in a study by Paulin [52], who found that for a certain set of DSP applications about 90 % of DSP code lines are written in assembly, while the use of C only accounts for 10 %.
As both DSP processors and DSP applications tend to become more and more complex, the lack of good C compilers implies a significant productivity bottleneck. About a decade ago, researchers started to analyze the reasons for the poor code quality of DSP compilers. A key observation was that classical code generation technology, mainly developed for RISC and CISC processor architectures, is hardly suitable for PDSPs, but that new DSP-specific code generation techniques were required. In the following, we will summarize a number of recent techniques. In order to put these techniques into context with each other, we will first give an overview of the main phases in compilation. Then, we will focus on techniques developed for particular problems in the different compilation phases.
3.2 Overview of the compilation process

The compilation of an application program into machine code, as illustrated in fig. 16, starts with several source code analysis phases.

Lexical analysis: The character strings denoting atomic elements of the source code (identifiers, keywords, operators, constants) are grouped into tokens, i.e. numerical identifiers, which are passed to the syntax analyzer. Lexical analysis is typically performed by a scanner, which is invoked by the syntax analyzer whenever a new token is required. Scanners can be automatically generated from a language specification with tools like "lex".
Syntax analysis: The structure of programming languages is mostly described by a context-free grammar, consisting of terminals (or tokens), nonterminals, and rules. The syntax analyzer, or parser, accepts tokens from the scanner until a matching grammar rule is detected. Each rule corresponds to a primitive element of the programming language, for instance an assignment. If a token sequence does not match any rule, a syntax error is emitted. The result of parsing a program is a syntax tree, which accounts for the structure of a given program. Parsers can be conveniently generated from grammar specifications with tools like "yacc".
Semantic analysis: During semantic analysis, a number of correctness tests are performed. For instance, all used identifiers must have been declared, and functions must be called with parameters in accordance with their interface specification. Failure of semantic analysis results in error messages. Additionally, a symbol table is built, which annotates each identifier with its type and purpose (e.g. type definition, global or local variable). Semantic analysis requires a traversal of the syntax tree. Frequently, semantic analysis is coupled with syntax analysis by means of attribute grammars. These grammars support the annotation of information like type or purpose to grammar symbols, and thus help to improve the modularity of analysis. Tools like "ox" [53] are available for automatic generation of combined syntax and semantic analyzers from grammar specifications.
The result of source code analysis is an intermediate representation (IR), which forms the basis for subsequent compilation phases. Both graph-based and statement-based IRs are in use. Graph-based IRs directly model the interdependencies between program operations, while statement-based IRs essentially consist of an assembly-like sequence of simple assignments (three-address code) and jumps.

In the next phase, several machine-independent optimizations are applied to the generated IR. A number of such IR optimizations have been developed in the area of compiler construction [54]. Important techniques include constant folding, common subexpression elimination, and loop-invariant code motion.
The techniques mentioned so far are largely machine-independent and may be used in any high-level language compiler. DSP-specific information comes into play only during the code generation phase, when the optimized IR is mapped to concrete machine instructions. Due to the specialized instruction sets of PDSPs, this is the most important phase with respect to code quality. For reasons of computational complexity, code generation is in turn subdivided into different phases. It is important to note that for PDSPs this phase structuring significantly differs from compilers for general purpose processors. For the latter, code generation is traditionally subdivided into the following phases.
Code selection: The selection of a minimum set of instructions for a given IR with respect to a cost metric like performance (execution cycles) or size (instruction words).

Register allocation: The mapping of variables and intermediate results to a limited set of available physical registers.

Instruction scheduling: The ordering of selected instructions in time, while minimizing the number of instructions required for temporarily moving register contents to memory (spill code) and minimizing execution delay due to instruction pipeline hazards.
Such a phase organization is not viable for PDSPs for several reasons. While general purpose processors often have a large, homogeneous register file, PDSPs tend to show a data path architecture with several distributed registers or register files of very limited capacity. An example has already been given in fig. 1. Therefore, classical register allocation techniques like [55] are not applicable; instead, register allocation has to be performed together with code selection in order to avoid large code quality overheads due to superfluous data moves between registers. Furthermore, instruction scheduling for PDSPs has to take into account the moderate degree of instruction-level parallelism (ILP) offered by such processors. In many cases, several mutually independent instructions may be grouped to be executed in parallel, thereby significantly increasing performance. This parallelization of instructions is frequently called code compaction. Another important area of code optimization for PDSPs concerns the memory accesses performed by a program. Both the exploitation of potentially available multiple memory banks and the efficient computation of memory addresses under certain restrictions imposed by the processor architecture have to be considered, which are hardly issues for general purpose processors. We will therefore discuss techniques using a different structure of code generation phases.
Sequential code generation: Even though PDSPs generally permit the execution of multiple instructions in parallel, it is often reasonable to temporarily consider a PDSP as a sequential machine, which executes instructions one by one. During sequential code generation, IR blocks (statement sequences) are mapped to sequential assembly code. These blocks are typically basic blocks, where control flow enters the block at its beginning and leaves the block at most once at its end with a jump. Sequential code generation aims at simultaneously minimizing the costs of instructions both for operations and for data moves between registers and memory, while neglecting ILP.
Memory access optimization: Generation of sequential code makes the order of memory accesses in a program known. This knowledge is exploited to optimize memory access bandwidth by partitioning the variables among multiple memory banks and to minimize the additional code needed for address computations.

Code compaction: This phase analyzes interdependencies between generated instructions and aims at exploiting potential parallelism between instructions under the resource constraints imposed by the processor architecture and the instruction format.
3.3 Sequential code generation

Basic blocks in the IR of a program are graphically represented by data flow graphs (DFGs). A DFG G = (V, E) is a directed acyclic graph, where the nodes in V represent operations (arithmetic, Boolean, shifts, etc.), memory accesses (loads and stores), and constants. The edge set E ⊆ V × V represents the data dependencies between DFG nodes. If an operation represented by a node v requires a value generated by an operation denoted by u, then (u, v) ∈ E. DFG nodes with more than one outgoing edge are called common subexpressions (CSEs). As an example, fig. 17 shows a piece of C source code, whose DFG representation (after detection of CSEs) is depicted in fig. 18.
Code generation for DFGs can be visualized as a process of covering a DFG by available instruction patterns. Let us consider a processor with instructions ADD, SUB, and MUL, to perform addition, subtraction, and multiplication, respectively. One of the operands is expected to reside in memory, while the other one has to be first loaded into a register by a LOAD instruction. Furthermore, writing back a result to memory requires a separate STORE instruction. Then, a valid covering of the example DFG is the one shown in fig. 19.
Available instruction patterns are usually annotated with a cost value reflecting their size or execution speed. The goal of code generation is to find a minimum cost covering of a given DFG by instruction patterns. The problem is that in general there exist numerous different alternative covers for a DFG. For instance, if the processor offers a MAC (multiply-accumulate) instruction, as found in most PDSPs, and the cost value of MAC is less than the sum of the costs of MUL and ADD, then it might be favorable to select that instruction (fig. 20).

However, using MAC for our example DFG would be less useful, because the multiply operation in this case is a CSE. Since the intermediate multiply result of a MAC is not stored anywhere, a potentially costly recomputation would be necessary.
3.3.1 Tree based code generation

Optimal code generation for DFGs is an exponential problem, even for very simple instruction sets [54]. A solution to this problem is to decompose a DFG into a set of data flow trees (DFTs) by cutting the DFG at its CSEs and inserting dedicated DFG nodes for communicating CSEs between the DFTs (fig. 21). This decomposition introduces scheduling precedences between the DFTs, since CSEs must be written before they are read (dashed arrows in fig. 21). For each of the DFTs, code can be generated separately and efficiently. Liem [57] has proposed a data structure for efficient tree pattern matching capable of handling complex operations like MAC.
For PDSPs, the allocation of special purpose registers during DFT covering is also extremely important, since merely covering the operators in a DFG by instruction patterns does not take into account the costs of instructions needed to move operands and results to their required locations. Wess [58] has proposed the use of trellis diagrams to also include data move costs during DFT covering.
Araujo and Malik [60] showed how the powerful standard technique of tree pattern matching with dynamic programming [56], widely used in compilers for general purpose processors, can be effectively applied also to PDSPs with irregular data paths. Tree pattern matching with dynamic programming solves the code generation problem by parsing a given DFT with respect to an instruction-set specification given as a tree grammar. Each rule in such a tree grammar is attributed with a cost value and corresponds to one instruction pattern. Optimal DFT covers are obtained by computing an optimal derivation of a given DFT according to the grammar rules. This requires only two passes (bottom-up and top-down) over the nodes of the input DFT, so that the runtime is linear in the number of DFT nodes. Code generators based on this paradigm can be automatically generated with tools like "twig" [56] and "iburg" [59].
The key idea in the approach by Araujo and Malik is the use of register-specific instruction patterns or grammar rules. Instead of separating detailed register allocation from code selection as in classical compiler construction, the instruction patterns contain implicit information on the mapping of operands and results to special purpose registers. In order to illustrate this, we consider an instruction subset of the TI TMS320C25 DSP already mentioned in section 1 (see also fig. 1). This PDSP offers two types of instructions for addition. The first one (ADD) adds a memory value to the accumulator register ACCU, while the second one (APAC) adds the value of the product register PR to ACCU. In compilers for general purpose processors, a distinction of storage components is made only between (general purpose) registers and memory. In a grammar model used for tree pattern matching with dynamic programming, the above two instructions would thus be modeled as follows:
reg: PLUS(reg,mem)
reg: PLUS(reg,reg)
The symbols "reg" and "mem" are grammar nonterminals, while "PLUS" is a grammar terminal symbol representing an addition. The semantics of such rules is that the corresponding instruction computes the expression on the right hand side and stores the result in a storage component represented by the left hand side. When parsing a DFT with respect to these patterns, it would be impossible to incorporate the costs of moving values to/from ACCU and PR; the detailed mapping of "reg" to physical registers would be left to a later code generation phase, possibly at the expense of code quality losses. However, when using register-specific patterns, the instructions ADD and APAC would be modeled as:
accu: PLUS(accu,mem)
accu: PLUS(accu,pr)
Using a separate nonterminal for each special purpose register permits modeling instructions for pure data moves, which in turn allows the code generator to simultaneously minimize the costs of such instructions. As an example, consider the TMS320C25 instruction PAC, which moves a value from PR to ACCU. In the tree grammar, the following rule (a so-called chain rule) for PAC would be included:

accu: pr

Since using the PAC rule for derivation of a DFT would incur additional costs, the code generator implicitly minimizes the data moves when constructing the optimal DFT derivation.
Generation of sequential assembly code also requires determining a total ordering of the selected instructions in time. DFGs and DFTs typically impose only a partial ordering, and the remaining scheduling freedom must be exploited carefully. This is due to the fact that special purpose registers generally have very limited storage capacity. On the TMS320C25, for instance, each register may hold only a single value, so that unfavorable scheduling decisions may require spilling and reloading register contents to/from memory, thereby introducing additional code. In order to illustrate the problem, consider a DFT T whose root node represents an addition, for which the above APAC instruction has been selected. Thus, the addition operands must reside in registers ACCU and PR, so that the left and right subtrees T_l and T_r of T must deliver their results in these registers. When generating sequential code for T, it must be decided whether T_l or T_r should be evaluated first. If some instruction in T_l writes its result to PR, then T_l should be evaluated first in order to avoid a spill instruction, because T_r writes its result to PR as well and this value is "live" until the APAC instruction for the root of T is emitted. Conversely, if some instruction for T_r writes register ACCU, then T_r should be scheduled first in order to avoid a register contention for ACCU. In [60], Araujo and Malik formalized this observation and provided a formal criterion for the existence of a spill-free schedule for a given DFT. This criterion refers to the structure of the instruction set and, for instance, holds for the TMS320C25. When using an appropriate scheduling algorithm, which immediately follows from that criterion, optimal spill-free sequential assembly code can be generated for any DFT.
3.3.2 Graph based code generation

Unfortunately, the DFT-based approach to code generation may affect code quality, because it performs only a local optimization of the code for a DFG within the scope of the single DFTs. Therefore, researchers have investigated techniques aiming at optimal or near-optimal code generation for full DFGs. Liao [61] has presented a branch-and-bound algorithm minimizing the number of spills in accumulator-based machines, i.e. processors where most computed values have to pass a dedicated accumulator register. In addition, his algorithm minimizes the number of instructions needed for switching between different computation modes. These modes (e.g. sign extension or product shift modes) are special control codes stored in dedicated mode registers in order to reduce the instruction word length. If the operations within a DFG have to be executed with different modes, the sequential schedule has a strong impact on the number of instructions for mode switching. Liao's algorithm simultaneously minimizes accumulator spills and mode switching instructions. However, due to the time-intensive optimization algorithm, optimality cannot be achieved for large basic blocks. The code generation technique in [62] additionally performs code selection for DFGs, but also requires high compilation times for large blocks.
A faster heuristic approach has been given in [63]. It also relies on the decomposition of DFGs into DFTs, but takes into account architectural information when cutting the CSEs in a DFG. In some cases, the machine instruction set itself enforces that CSEs have to pass through memory anyway, which again is a consequence of the irregular data paths of PDSPs. The proposed technique exploits this observation by assigning those CSEs to memory with highest priority, while others might be kept in a register, resulting in more efficient code.
Kolson et al. [64] have focused on the problem of code generation for irregular data paths in the context of program loops. While the above techniques deal well with special purpose registers in basic blocks, they do not take into account the data moves required between different iterations of a loop body. This may require the execution of a number of data moves between those registers holding the results at the end of one iteration and those registers where operands are expected at the beginning of the next iteration. Both an optimal and a heuristic algorithm have been proposed for minimizing the data moves between loop iterations.
3.4 Memory access optimization

During sequential code generation, memory accesses are usually treated only "symbolically", without particular reference to a certain memory bank or to concrete memory addresses. The detailed implementation of memory accesses is typically left to a separate code generation phase.
3.4.1 Memory bank partitioning

There exist several PDSP families having the memory organized in two different banks (typically called X and Y memory), which are accessible in parallel. Examples are the Motorola 56xxx and Analog Devices 21xx. Such an architecture makes it possible to simultaneously load two values from memory into registers and is therefore very important for DSP applications like digital filtering or FFT, which involve component-wise access to different data arrays. Exploiting this feature in a compiler means that symbolic memory accesses have to be partitioned into X and Y memory accesses in such a way that potential parallelism is maximized. Sudarsanam [65] has proposed a technique to perform this optimization. There is a strong mutual dependence between memory bank partitioning and register allocation, because values from a certain memory bank can only be loaded into certain registers. The proposed technique starts from symbolic sequential assembly code and uses a constraint graph model to represent these interdependencies. Memory bank partitioning and register allocation are performed simultaneously by labeling the constraint graph with valid assignments. Due to the use of simulated annealing, the optimization is rather time-intensive, but may result in significant code size improvements, as indicated by experimental data.
3.4.2 Memory layout optimization

As one cost metric, Sudarsanam's technique also captures the cost of instructions needed for address computations. For PDSPs, which typically show very restricted address generation capabilities, address computations are another important area of code optimization. Fig. 22 shows the architecture of an address generation unit (AGU) as it is frequently found in PDSPs.

Such an AGU operates in parallel to the central data path and contains a separate adder/subtractor for performing operations on address registers (ARs). ARs store the effective addresses for all indirect memory accesses, except for global variables, which are typically addressed in direct mode. Modify registers (MRs) are used to store frequently required address modify values. ARs and MRs are in turn addressed by AR and MR pointers. Since typical AR or MR file sizes are 4 or 8, these pointers are short indices of 2 or 3 bits, either stored in the instruction word itself or in special small registers.
There are different means for address computation, i.e., for changing the value of AGU registers.

AR load: Loading an AR with an immediate constant (from the instruction word).

MR load: Loading an MR with an immediate constant.

AR modify: Adding or subtracting an immediate constant to/from an AR.

Auto-increment and auto-decrement: Adding or subtracting the constant 1 to/from an AR.

Auto-modify: Adding or subtracting the contents of one MR to/from an AR.
While details like the size of the AR and MR files or the signedness of modify values may vary for different processors, the general AGU architecture from fig. 22 is actually found in a large number of PDSPs. It is important to note that performing address computations using the AGU in parallel to other instructions is generally only possible if the AGU does not use the instruction word as a resource. The wide immediate operand for AR and MR load and AR modify operations usually leaves no space to encode further instructions within the same instruction word, so that these types of AGU operations require a separate non-parallel instruction. On the other hand, those AGU operations not using the instruction word can mostly be executed in parallel to other instructions, since only internal AGU resources are occupied. We call these address computations zero-cost operations. In order to maximize code quality in terms of performance and size, it is obviously necessary to maximize the utilization of zero-cost operations.
A number of techniques have been developed which solve this problem for the scalar variables in a program. They exploit the fact that, once the sequence of variable accesses is known after sequential code generation, a good memory layout for the variables can still be determined. In order to illustrate this, suppose a program block containing accesses to the variables V = {a, b, c, d} is given, and the variable access sequence is

S = (b, d, a, c, d, a, c, b, a, d, a, c, d).

Furthermore, let the address space reserved for V be {0, 1, 2, 3}, and let one AR be available to compute the addresses according to the sequence S. Consider a memory layout where V is mapped to {0, 1, 2, 3} in lexicographic order (fig. 23 a). First, the AR needs to be loaded with the address 1 of the first element b of S. The next access takes place to d, which is mapped to address 3. Therefore, the AR must be modified with a value of +2. The next access refers to a, which requires subtracting 3 from the AR, and so forth. The complete AGU operation sequence for S is given in fig. 23 a). According to our cost metric, only 4 out of 13 AGU operations happen to be zero-cost operations (auto-increment or auto-decrement), so that a cost of 9 extra instructions for address computations is incurred. However, one can find a better memory layout for V (fig. 23 b), which leads to only 5 extra instructions, due to a better utilization of zero-cost operations. An even better addressing scheme is possible if a modify register MR is available. Since the address modifier 2 is required three times in the AGU operation sequence of fig. 23 b), one can assign the value 2 to the MR (one extra instruction) and reuse this value three times at zero cost (fig. 23 c), resulting in a total cost value of only 3.
How can such "low cost" memory layouts be constructed? A first approach has been proposed by Bartley [66] and has later been refined by Liao [67]. Both use an access graph to model the problem.

The nodes of the edge-weighted access graph G = (V, E, w) correspond to the variable set, while the edges represent transitions between variable pairs in the access sequence S. An edge e = {u, v} ∈ E is assigned an integer weight n if there are n transitions u → v or v → u in S. Fig. 24 shows the access graph for our example. Since any memory layout for V implies a linear order of V, and vice versa, any memory layout corresponds to a Hamiltonian path in G, i.e., a path touching each node exactly once. Informally, a "good" Hamiltonian path obviously should contain as many edges of high weight as possible, because including these edges in the path implies that the corresponding variable pairs will be adjacent in the memory layout, which in turn makes auto-increment/decrement addressing possible. In other words, a maximum weight Hamiltonian path in G has to be found in order to obtain an optimal memory layout, which unfortunately is an exponential problem.
While Bartley [66] first proposed the access graph model, Liao [67] provided an efficient heuristic algorithm to find maximum-weight paths in the access graph. Furthermore, Liao proposed a generalization of the algorithm for the case of an arbitrary number k of ARs. By partitioning the variable set V into k groups, the k-AR problem is reduced to k different 1-AR problems, each being solvable by the original algorithm.
Triggered by this work, a number of improvements and generalizations have been found. Leupers [68] improved the heuristic for the 1-AR case and proposed a more effective partitioning for the k-AR problem. Furthermore, he provided a first algorithm for the exploitation of MRs to reduce addressing costs. Wess' algorithm [69] constructs memory layouts for AGUs with an auto-increment range of 2 instead of 1, while in [70] a generalization for an arbitrary integer auto-increment range was presented. The genetic-algorithm-based optimization given in [71] generalizes these techniques for arbitrary register file sizes and auto-increment ranges while also incorporating MRs into memory layout construction.
3.5 Code compaction
Code compaction is typically executed as the last phase in code generation. At this point, all instructions required to implement a given application program have been generated, and the goal of code compaction is to schedule the generated sequential code into a minimum number of parallel machine instructions, or control steps, under the constraints imposed by the PDSP architecture and instruction set. Thus, code compaction is a variant of the resource-constrained scheduling problem. Input to the code compaction phase is usually a dependency graph G = (V, E), whose nodes represent the instructions selected for a basic block, while edges denote scheduling precedences. There are three types of such precedences:
Data dependencies: Two instructions i1 and i2 are data dependent if i1 generates a value read by i2. Thus, i1 must be scheduled before i2.

Anti dependencies: Two instructions i1 and i2 are anti dependent if i2 potentially overwrites a value still needed by i1. Thus, i2 must not be scheduled before i1.

Output dependencies: Two instructions i1 and i2 are output dependent if i1 and i2 write their results to the same location (register or memory cell). Thus, i1 and i2 must be scheduled in different control steps.
Additionally, incompatibility constraints between instruction pairs (i1, i2) have to be obeyed. These constraints arise either from processor resource limitations (e.g., only one multiplier available) or from the instruction format, which may prevent the parallel scheduling of instructions even without a resource conflict. In either case, if i1 and i2 are incompatible, then i1 and i2 must be scheduled in different control steps.
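As a sketch of how a compaction algorithm can test these conditions, the following C fragment (a hypothetical data structure of ours, with read/write/resource sets encoded as bitmasks) checks whether two generated instructions may share one control step. It is deliberately conservative: any dependence forbids pairing, although some architectures would allow anti-dependent instructions in the same step.

```c
#include <stdbool.h>

/* Hypothetical instruction record: bitmasks over storage locations
   (registers/memory cells) and over exclusive resources or
   instruction-format fields. */
typedef struct { unsigned reads, writes, resources; } Instr;

/* data dependence: i1 generates a value read by i2 */
static bool data_dep(Instr i1, Instr i2)   { return (i1.writes & i2.reads)  != 0; }
/* anti dependence: i2 overwrites a value still needed by i1 */
static bool anti_dep(Instr i1, Instr i2)   { return (i1.reads  & i2.writes) != 0; }
/* output dependence: both write the same location */
static bool output_dep(Instr i1, Instr i2) { return (i1.writes & i2.writes) != 0; }

/* May i1 and i2 be placed into the same control step?  Conservative:
   any dependence or resource/format incompatibility forbids pairing. */
bool compactable(Instr i1, Instr i2)
{
    return !data_dep(i1, i2) && !data_dep(i2, i1) &&
           !anti_dep(i1, i2) && !anti_dep(i2, i1) &&
           !output_dep(i1, i2) &&
           (i1.resources & i2.resources) == 0;  /* e.g. only one multiplier */
}
```

With PR as bit 0 and ACCU as bit 1, an MPY (writes PR) and an accumulate (reads PR) are correctly rejected as data dependent, while instructions on disjoint locations pair freely.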
The code compaction problem has already been studied in the early eighties within the context of very long instruction word (VLIW) processors, which show a large degree of parallelism at the instruction level. A number of different compaction heuristics have been developed for VLIW machines [73]. However, even though PDSPs resemble VLIW machines to a certain extent, VLIW compaction techniques are not directly applicable to PDSPs. The reason is that instruction-level parallelism (ILP) is typically much more constrained in PDSPs than in VLIWs, because using very long instruction words for PDSPs would lead to extremely high code sizes. Furthermore, PDSP instruction sets frequently show alternative opcodes to perform a certain machine instruction.
As an example, consider the TI TMS320C25 instruction set. This PDSP offers instructions ADD and MPY to perform addition and multiplication. However, there is also a multiply-accumulate instruction MPYA, which performs both operations in parallel and thus faster. Instruction MPYA may be considered as an alternative opcode both for ADD and for MPY, but its use is strongly context dependent. MPYA may be used only if an addition and a multiplication can be scheduled in parallel for a given dependency graph. Otherwise, using MPYA instead of either ADD or MPY could lead to incorrect program behavior after compaction, because MPYA overwrites two registers (PR and ACCU), thus potentially causing undesired side effects.
In addition, code running on PDSPs in most cases has to meet real-time constraints, which cannot be guaranteed by heuristics. Due to these special circumstances, DSP-specific code compaction techniques have been developed. In Timmer's approach [74], both resource and timing constraints are considered during code compaction. A bipartite graph is used to model possible assignments of instructions to control steps. An important feature of Timmer's technique is that timing constraints are exploited in order to quickly find exact solutions for compaction problem instances. The mobility of an instruction is the interval of control steps to which the instruction may be assigned. Trivial bounds on mobility can be obtained by performing an ASAP/ALAP analysis on the dependency graph, which determines the earliest and the latest control step in which an instruction may be scheduled without violating dependencies. An additional execution interval analysis, based on both timing and resource constraints, is performed to further restrict the mobility of instructions. The remaining mobility on average is low, and a schedule meeting all constraints can be determined quickly by a branch-and-bound search.
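The ASAP/ALAP bounds mentioned above are straightforward to compute. The sketch below is our own illustration (unit latency, nodes assumed to be numbered in topological order): a forward pass yields the earliest step of each instruction, a backward pass the latest, and the difference is its mobility.

```c
#define N 4  /* number of instructions in the basic block */

/* dep[i][j] != 0 means instruction i must precede instruction j.
   asap[i]/alap[i] receive the earliest/latest control step of i for a
   schedule of L control steps, assuming unit latency.  The mobility of
   instruction i is then the interval [asap[i], alap[i]]. */
void asap_alap(int dep[N][N], int L, int asap[N], int alap[N])
{
    for (int i = 0; i < N; i++) asap[i] = 0;
    for (int i = 0; i < N; i++)          /* forward pass (topological order) */
        for (int j = 0; j < N; j++)
            if (dep[i][j] && asap[i] + 1 > asap[j])
                asap[j] = asap[i] + 1;

    for (int i = 0; i < N; i++) alap[i] = L - 1;
    for (int i = N - 1; i >= 0; i--)     /* backward pass */
        for (int j = 0; j < N; j++)
            if (dep[i][j] && alap[j] - 1 < alap[i])
                alap[i] = alap[j] - 1;
}
```

For a dependence chain 0 → 1 → 2 plus an independent instruction 3 and L = 3, this yields asap = (0, 1, 2, 0) and alap = (0, 1, 2, 2): the chain has zero mobility, while instruction 3 may be placed in any of the three control steps.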
Another DSP-specific code compaction technique was presented in [75], which also exploits the existence of alternative instruction opcodes. The code compaction problem is transformed into an Integer Linear Programming problem. In this formulation, a set of integer solution variables accounts for the detailed scheduling of instructions, while all precedences and constraints are modeled as linear equations and inequalities on the solution variables. The Integer Linear Program is then solved optimally using a standard solver, such as "lp_solve" [76]. Since Integer Linear Programming is an exponential problem, the applicability of this technique is restricted to small to moderate size basic blocks, which however is sufficient in most practical cases.
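Such a formulation can be sketched in simplified form (this is our condensed illustration; it omits the alternative-opcode variables and register constraints of the actual paper). With binary variables $x_{i,t} = 1$ iff instruction $i$ is assigned to control step $t \in \{1,\dots,T\}$:

```latex
\begin{align*}
\min\; & T_{\max} \\
\text{s.t.}\;
  & \sum_{t=1}^{T} x_{i,t} = 1
      && \text{every instruction gets exactly one control step,} \\
  & \sum_{t=1}^{T} t\, x_{j,t} \;\ge\; \sum_{t=1}^{T} t\, x_{i,t} + 1
      && \text{for each data dependence } i \to j, \\
  & x_{i,t} + x_{j,t} \;\le\; 1
      && \text{for each incompatible pair } (i,j) \text{ and each } t, \\
  & \sum_{t=1}^{T} t\, x_{i,t} \;\le\; T_{\max}
      && \text{for every instruction } i.
\end{align*}
```

Anti and output dependences add analogous linear constraints on the same variables.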
In order to illustrate the impact of code compaction on code quality, as well as its cooperation with other code generation phases, we use a small C program for complex number multiplication as an example.
int ar,ai,br,bi,cr,ci;
cr = ar * br - ai * bi ;
ci = ar * bi + ai * br ;
For the TI TMS320C25, the sequential assembly code, as generated by the techniques mentioned in section 3.3, would be the following.
LT ar // TR = ar
MPY br // PR = TR * br
PAC // ACCU = PR
LT ai // TR = ai
MPY bi // PR = TR * bi
SPAC // ACCU = ACCU - PR
SACL cr // cr = ACCU
LT ar // TR = ar
MPY bi // PR = TR * bi
PAC // ACCU = PR
LT ai // TR = ai
MPY br // PR = TR * br
APAC // ACCU = ACCU + PR
SACL ci // ci = ACCU
This sequential code shows the following (symbolic) variable access sequence:

S = (ar, br, ai, bi, cr, ar, bi, ai, br, ci)

Suppose one address register AR is available for computing the memory addresses according to S. Then, the memory layout optimization mentioned in section 3.4.2 would compute the following address mapping of the variables to the address space {0, ..., 5}:

0: ci
1: br
2: ai
3: bi
4: cr
5: ar
We can now insert the corresponding AGU operations into the sequential code and invoke code compaction. The resulting parallel assembly code makes use of parallelism both within the data path itself and with respect to parallel AGU operations (auto-increment and decrement).
LARK 5 // load AR with &ar
LT * // TR = ar
SBRK 4 // AR -= 4 (&br)
MPY *+ // PR = TR * br, AR++ (&ai)
LTP *+ // TR = ai, ACCU = PR, AR++ (&bi)
MPY *+ // PR = TR * bi, AR++ (&cr)
SPAC // ACCU = ACCU - PR
SACL *+ // cr = ACCU, AR++ (&ar)
LT * // TR = ar
SBRK 2 // AR -= 2
MPY *- // PR = TR * bi, AR-- (&ai)
LTP *- // TR = ai, ACCU = PR, AR-- (&br)
MPY *- // PR = TR * br, AR-- (&ci)
APAC // ACCU = ACCU + PR
SACL * // ci = ACCU
Even though address computations for the variables have been inserted, the resulting code is only one instruction larger than the original symbolic sequential code. This is achieved by a high utilization of zero-cost address computations (only two extra SBRK instructions) as well as by the parallel LTP instructions, which perform two data moves in parallel. This would not have been possible without memory layout optimization and code compaction.
3.6 Phase coupling
Even though code compaction is a powerful code optimization technique, only the direct coupling of sequential and parallel code generation phases can yield globally optimal results. Phase-coupled techniques frequently have to resort to heuristics due to extremely large search spaces. However, heuristics for phase-coupled code generation still may outperform exact techniques solving only parts of the code generation problem. In this section we therefore summarize important approaches to phase-coupled code generation for PDSPs.
Early work [77, 78] combined instruction scheduling with a data routing phase. In each step of scheduling, data routing performs detailed register allocation based on resource availability and in accordance with the partial schedule constructed so far. In this way, the scheduling freedom (mobility) of instructions is not obstructed by unfavorable register allocation decisions made earlier during code generation. However, significant effort has to be spent on the avoidance of scheduling deadlocks, which restricts the applicability of such techniques to simple PDSP architectures.
Wilson's approach to phase-coupled code generation [79] is also based on Integer Linear Programming. In his formulation, the complete search space, including register allocation, code selection, and code compaction, is explored at once. While this approach permits the generation of provably optimal code for basic blocks, the high problem complexity also imposes heavy restrictions on applicability for realistic programs and PDSPs.
An alternative Integer Linear Programming formulation has been given in [80]. By better taking into account the detailed processor architecture, optimal code could be generated for small size examples for the TI TMS320C25 DSP.
A more practical phase coupling technique is Mutation Scheduling [81]. During instruction scheduling, a set of mutations is maintained for each program value. Each mutation represents an alternative implementation of the value computation. For instance, mutations for a common subexpression (CSE) in a DFG may include storing the CSE in some special purpose register or recomputing it multiple times. For other values, mutations are generated by the application of algebraic rules like commutativity or associativity. In each scheduling step, the best mutation for each value to be scheduled is chosen. While Mutation Scheduling represents an "ideal" approach to phase coupling, its efficacy critically depends on the scheduling algorithm used as well as on the number of mutations considered for each value.
A constraint-driven approach to phase-coupled code generation for PDSPs is presented in [82]. In that approach, alternatives with respect to code selection, register allocation, and scheduling are retained as long as possible during code generation. Restrictions imposed by the processor architecture are explicitly modeled in the form of constraints, which ensure correctness of the generated code. The implementation makes use of a constraint logic programming environment. For several examples it has been demonstrated that the quality of the generated code is equal to that of hand-written assembly code.
3.7 Retargetable compilation
As systems based on PDSPs mostly have to be very cost-efficient, a comparatively large number of different standard ("off-the-shelf") PDSPs are available on the semiconductor market at the same time. From this variety, a PDSP user may select the processor architecture which matches his requirements at minimum cost. In spite of the large variety of standard DSPs, however, it is still unlikely that a customer will find a processor ideally matching one given application. In particular, using standard processors in the form of cores (layout macro cells) for systems-on-a-chip may lead to a waste of silicon area. For mobile applications, the electrical power consumed by a standard processor may also be too high.

As a consequence, there is a trend towards the use of a new class of PDSPs, called application specific signal processors (ASSPs). The architecture of such ASSPs is still programmable, but is customized for restricted application areas. A well-known example is the EPICS architecture [83]. A number of further ASSPs are mentioned in [52].
The increasing use of ASSPs for implementing embedded DSP systems leads to an even larger variety of PDSPs. While the code optimization techniques mentioned in the previous sections help to improve the practical applicability of compilers for DSP software development, they do not answer the question: who will write compilers for all these different PDSP architectures? Developing a compiler for each new ASSP, possibly having a low production volume and product lifetime, is not economically feasible. Nevertheless, the use of compilers for ASSPs instead of assembly programming is still highly desirable.
Therefore, researchers have looked at technology for developing retargetable compilers. Such compilers are not restricted to generating code for a single target processor, but are sufficiently flexible to be reused for a whole class of PDSPs. More specifically, we call a compiler retargetable if adapting the compiler to a new target processor does not involve rewriting a large part of the compiler source code. This can be achieved by using external processor models. While in a classical, target-specific compiler the processor model is hard-coded in the compiler source code, a retargetable compiler can read an external processor model as an additional input specified by the user and generate code for the target processor specified by the model.
3.7.1 The RECORD compiler system
An example of a retargetable compiler for PDSPs is the RECORD system [84], a coarse overview of which is given in fig. 25. In RECORD, processor models are given in the hardware description language (HDL) MIMOLA, which resembles structural VHDL. A MIMOLA processor model captures the register transfer level structure of a PDSP, including controller, data path, and address generation units. Alternatively, the pure instruction set can be described, while hiding the internal structure. Using HDL models is a natural way of describing processor hardware, with a large amount of modeling flexibility. Furthermore, the use of HDL models reduces the number of different processor models required during the design process, since HDL models can also be used for hardware synthesis and simulation.
Sequential code generation in RECORD is based on the data flow tree (DFT) model explained in section 3.3.1. The source program, given in the programming language DFL, is first transformed into an intermediate representation consisting of DFTs. The code generator is automatically generated from the HDL processor model by means of the iburg tool [59]. Since iburg requires a tree grammar model of the target instruction set, some preprocessing of the HDL model is necessary. RECORD uses an instruction set extraction phase to transform the structural HDL model into an internal model of the machine instruction set. This internal model captures the behavior of available machine instructions as well as the constraints on instruction-level parallelism.
During sequential code generation, the code generator generated by means of iburg is used to map DFTs into target-specific machine code. While mapping, RECORD exploits algebraic rules like commutativity and associativity of operators to increase code quality. The resulting sequential assembly code is further optimized by means of memory access optimization (section 3.4) and code compaction (section 3.5). An experimental evaluation for the TI TMS320C25 DSP showed that, thanks to these optimizations, RECORD on average generates significantly denser code than a commercial target-specific compiler, however at the expense of lower compilation speed. Furthermore, RECORD is easily retargetable to different processor architectures. If an HDL model is available, then generation of the processor-specific compiler components typically takes less than one workstation CPU minute. This short turnaround time permits the use of a retargetable compiler also for quickly exploring different architectural options for an ASSP, e.g., with respect to the number of functional units, register file sizes, or interconnect structure.
3.7.2 Further retargetable compilers
A widespread example of a retargetable compiler is the GNU compiler "gcc" [85]. Since gcc has been mainly designed for CISC and RISC processor architectures, it is based on the assumption of regular processor architectures and thus is hardly applicable to PDSPs.
29
The MSSQ compiler [86] has been an early approach to retargetable compilation based on HDL models, however without specific optimizations for PDSPs.
In the CodeSyn compiler [57], specifically designed for ASSPs, the target processor is described heterogeneously by the set of available instruction patterns, a graph model representing the data path, and a resource classification that accounts for special purpose registers.
The CHESS compiler [87] uses a specific language called nML for describing target processor architectures. It generates code for a specific ASSP architectural style and therefore employs special code generation and optimization techniques [88]. The nML language has also been used in a retargetable compiler project at Cadence [89].
Several code optimizations mentioned in this paper [61, 62, 60, 63] have been implemented in the SPAM compiler at Princeton University and MIT. Although SPAM can be classified as a retargetable compiler, it is based on exchangeable software modules performing specific optimizations rather than on an external target processor model.
Another approach to retargetable code generation for PDSPs is the AVIV compiler [90], which uses a special language (ISDL [91]) for modeling VLIW-like processor architectures.
As compilers for standard DSPs and ASSPs become more important and retargetable compiler technology gets more mature, several companies have started to sell commercial retargetable compilers with special emphasis on PDSPs. Examples are the CoSy compiler development system by ACE, the commercial version of the CHESS compiler, as well as Archelon's retargetable compiler system. Detailed information about these recent software products is available on the World Wide Web [92, 93, 94].
4 Conclusions
This paper has reviewed the state of the art in front- and back-end design automation technology for DSP software implementation. We have motivated a design flow that begins with a high-level, hierarchical block diagram specification; synthesizes a C-language application program or subsystem from this specification; and then compiles the C program into optimized machine code for the given target processor. We have reviewed several useful computational models that provide efficient semantics for the block diagram specifications at the front end of this design flow. We then examined the vast space of implementation trade-offs one encounters when synthesizing software from these computational models, in particular from the closely-related synchronous dataflow (SDF) and scalable synchronous dataflow (SSDF) models, which can be viewed as key "common denominators" of the other models. Subsequently, we examined a variety of useful software synthesis techniques that address important subsets and prioritizations of the relevant optimization metrics.
Complementary to software synthesis issues, we have outlined the state of the art in the compilation of efficient machine code from application source programs. Taking the step from assembly-level to C-level programming of DSPs demands special code generation techniques beyond the scope of classical compiler technology. In particular, this concerns code generation, memory access optimization, and exploitation of instruction-level parallelism. Recently, the problem of tightly coupling these different compilation phases in order to generate very efficient code has also gained significant research interest. In addition, we have motivated the use of retargetable compilers, which are important for programming application-specific DSPs.
In our overview, we have highlighted useful directions for further study. A particularly interesting and promising direction, which remains largely unexplored, is the investigation of the interaction between software synthesis and code generation: that is, the development of synthesis techniques that explicitly aid the code generation process, and code generation techniques that incorporate high-level application structure that is exposed during synthesis.
Figure 1: Simplified architecture of the Texas Instruments TMS320C25 DSP.
Figure 2: The top-level block diagram specification of a discrete wavelet transform application implemented in Ptolemy [7].
Figure 3: An illustration of an explicit SDF specification.
Figure 4: A deadlocked SDF graph.
Figure 5: CSDF and SDF versions of a downsampler block.
Figure 6: An example that illustrates the compact modeling of resource sharing using CSDF. The actor labeled F denotes a dataflow fork, which simply replicates its input tokens on all of its output edges. The lower portion of the figure gives a valid schedule for this CSDF specification. Here, G1 and G2 denote the first and second phases of the CSDF actor G.
Figure 7: The SDF version of the specification in fig. 6; it contains a delay-free SDF cycle and is therefore deadlocked.
Figure 8: An example that illustrates the utility of cyclo-static dataflow in constructing hierarchical specifications. Grouping the actors A and B into the hierarchical SDF actor Ω, as shown in (b), results in a deadlocked SDF graph. In contrast, an appropriate CSDF model of the hierarchical grouping, illustrated in (c), avoids deadlock. The two phases of the hierarchical CSDF actor Ω in (c) are specified in the lower right corner of the figure along with a valid schedule for the CSDF specification.
Figure 9: An example of the use of CSDF to decrease buffering requirements.
Figure 10: An example of efficient dead code elimination using CSDF.
Figure 11: An example of an MDSDF actor (an image expander from 512x512 to 1024x1024).
Figure 12: A simple example that we use to illustrate trade-offs involved in compiling SDF specifications.
Figure 13: An example that we use to illustrate the buffer memory metric.
Figure 14: This example illustrates that minimizing actor activations does not imply minimizing actor appearances.
Figure 15: An illustration of a complete hierarchization.
Figure 16: Compilation phases (source code analyses, machine-independent IR optimizations, sequential code generation, memory access optimization, and code compaction).
int a,b,c,d,x,y,z;
void f()
{
  x = a + b;
  y = a + b - c * d;
  z = c * d;
}

Figure 17: Example C source code
Figure 18: DFG representation of the code from fig. 17
Figure 19: DFG from fig. 18 covered by instruction patterns (LOAD, STORE, ADD, SUB, MUL)
Figure 20: Using MAC for DFG covering
Figure 21: Decomposition of a DFG into DFTs
Figure 22: Address generation unit (address register file, modify register file, AR/MR pointers, and a +/- unit producing the effective address)
Access sequence: b, d, a, c, d, a, c, b, a, d, a, c, d

a) layout (0: a, 1: b, 2: c, 3: d), cost 9:
LOAD AR,1; AR += 2; AR -= 3; AR += 2; AR ++; AR -= 3; AR += 2; AR --; AR --; AR += 3; AR -= 3; AR += 2; AR ++

b) layout (0: c, 1: a, 2: d, 3: b), cost 5:
LOAD AR,3; AR --; AR --; AR --; AR += 2; AR --; AR --; AR += 3; AR -= 2; AR ++; AR --; AR --; AR += 2

c) layout (0: c, 1: a, 2: d, 3: b) with modify register, cost 3:
LOAD AR,3; AR --; AR --; AR --; LOAD MR,2; AR += MR; AR --; AR --; AR += 3; AR -= MR; AR ++; AR --; AR --; AR += MR

Figure 23: Alternative memory layouts and AGU operation sequences
Figure 24: Access graph model and maximum weighted path. (Nodes a, b, c, d; edge weights a-d: 4, a-c: 3, c-d: 2, a-b: 1, b-c: 1, b-d: 1. The maximum weighted path covers the edges of weight 4, 3, and 1.)
Figure 25: Coarse architecture of the RECORD system (MIMOLA HDL processor model, instruction set extraction, code generator generation with iburg, mapping of the DFL source program to DFTs, sequential code generation, memory access optimization and code compaction, yielding parallel assembly code)
References

[1] The Design and Implementation of Signal Processing Systems Technical Committee. VLSI design and implementation fuels the signal processing revolution. IEEE Signal Processing Magazine, 15(1):22-37, January 1998.

[2] P. Lapsley, J. Bier, A. Shoham, and E. A. Lee. DSP Processor Fundamentals. Berkeley Design Technology, Inc., 1994.

[3] E. A. Lee. Programmable DSP architectures - Part I. IEEE ASSP Magazine, 5(4), October 1988.

[4] E. A. Lee. Programmable DSP architectures - Part II. IEEE ASSP Magazine, 6(1), January 1989.

[5] P. Marwedel and G. Goossens, editors. Code Generation for Embedded Processors. Kluwer Academic Publishers, 1995.

[6] V. Zivojnovic, H. Schraut, M. Willems, and H. Meyr. DSPs, GPPs, and multimedia applications - an evaluation using DSPstone. In Proceedings of the International Conference on Signal Processing Applications and Technology, November 1995.

[7] J. T. Buck, S. Ha, E. A. Lee, and D. G. Messerschmitt. Ptolemy: A framework for simulating and prototyping heterogeneous systems. International Journal of Computer Simulation, January 1994.

[8] P. P. Vaidyanathan. Multirate Systems and Filter Banks. Prentice Hall, 1993.

[9] E. A. Lee and D. G. Messerschmitt. Synchronous dataflow. Proceedings of the IEEE, 75(9):1235-1245, September 1987.

[10] E. A. Lee. Consistency in dataflow graphs. IEEE Transactions on Parallel and Distributed Systems, 2(2), April 1991.

[11] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee. Software Synthesis from Dataflow Graphs. Kluwer Academic Publishers, 1996.

[12] S. Ritz, M. Willems, and H. Meyr. Scheduling for optimum data memory compaction in block diagram oriented software synthesis. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, May 1995.

[13] E. A. Lee and D. G. Messerschmitt. Static scheduling of synchronous dataflow programs for digital signal processing. IEEE Transactions on Computers, February 1987.

[14] E. A. Lee, W. H. Ho, E. Goei, J. Bier, and S. S. Bhattacharyya. Gabriel: A design environment for DSP. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(11), November 1989.

[15] D. R. O'Hallaron. The ASSIGN parallel program generator. Technical report, School of Computer Science, Carnegie Mellon University, May 1991.

[16] G. Bilsen, M. Engels, R. Lauwereins, and J. A. Peperstraete. Cyclo-static dataflow. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pages 3255-3258, May 1995.

[17] G. Bilsen, M. Engels, R. Lauwereins, and J. A. Peperstraete. Cyclo-static dataflow. IEEE Transactions on Signal Processing, 44(2):397-408, February 1996.

[18] G. De Micheli. Synthesis and Optimization of Digital Circuits. McGraw-Hill, 1994.

[19] T. M. Parks, J. L. Pino, and E. A. Lee. A comparison of synchronous and cyclo-static dataflow. In Proceedings of the IEEE Asilomar Conference on Signals, Systems, and Computers, November 1995.
[20] S. Ritz, M. Pankert, and H. Meyr. Optimum vectorization of scalable synchronous dataflow graphs. In Proceedings of the International Conference on Application Specific Array Processors, October 1993.

[21] S. Ritz, M. Pankert, and H. Meyr. High level software synthesis for signal processing systems. In Proceedings of the International Conference on Application Specific Array Processors, August 1992.

[22] E. A. Lee. Representing and exploiting data parallelism using multidimensional dataflow diagrams. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pages 453-456, April 1993.

[23] P. K. Murthy and E. A. Lee. An extension of multidimensional synchronous dataflow to handle arbitrary sampling lattices. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pages 3306-3309, May 1996.

[24] G. R. Gao, R. Govindarajan, and P. Panangaden. Well-behaved programs for DSP computation. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, March 1992.

[25] J. T. Buck and E. A. Lee. Scheduling dynamic dataflow graphs using the token flow model. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, April 1993.

[26] J. T. Buck. Scheduling Dynamic Dataflow Graphs with Bounded Memory using the Token Flow Model. PhD thesis, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, September 1993.

[27] J. T. Buck. Static scheduling and code generation from dynamic dataflow graphs with integer-valued control streams. In Proceedings of the IEEE Asilomar Conference on Signals, Systems, and Computers, October 1994.

[28] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee. Optimal parenthesization of lexical orderings for DSP block diagrams. In Proceedings of the International Workshop on VLSI Signal Processing, IEEE Press, October 1995. Sakai, Osaka, Japan.

[29] M. Ade, R. Lauwereins, and J. A. Peperstraete. Buffer memory requirements in DSP applications. In Proceedings of the IEEE Workshop on Rapid System Prototyping, pages 198-123, June 1994.

[30] M. Ade, R. Lauwereins, and J. A. Peperstraete. Data memory minimisation for synchronous dataflow graphs emulated on DSP-FPGA targets. In Proceedings of the Design Automation Conference, pages 64-69, June 1994.

[31] M. Cubric and P. Panangaden. Minimal memory schedules for dataflow networks. In CONCUR '93, August 1993.

[32] R. Govindarajan, G. R. Gao, and P. Desai. Minimizing memory requirements in rate-optimal schedules. In Proceedings of the International Conference on Application Specific Array Processors, August 1994.

[33] S. How. Code generation for multirate DSP systems in Gabriel. Master's thesis, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, May 1990.

[34] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee. Synthesis of embedded software from synchronous dataflow specifications. Journal of VLSI Signal Processing Systems, 21(2):151-166, June 1999.

[35] S. S. Bhattacharyya, J. T. Buck, S. Ha, and E. A. Lee. A scheduling framework for minimizing memory requirements of multirate DSP systems represented as dataflow graphs. In Proceedings of the International Workshop on VLSI Signal Processing, October 1993. Veldhoven, The Netherlands.

[36] S. S. Bhattacharyya, J. T. Buck, S. Ha, and E. A. Lee. Generating compact code from dataflow specifications of multirate signal processing algorithms. IEEE Transactions on Circuits and Systems - I: Fundamental Theory and Applications, 42(3):138-150, March 1995.
[37] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee. APGAN and RPMC: Complementary heuristics for translating DSP block diagrams into efficient software implementations. Journal of Design Automation for Embedded Systems, January 1997.
[38] P. K. Murthy, S. S. Bhattacharyya, and E. A. Lee. Joint minimization of code and data for synchronous dataflow programs. Journal of Formal Methods in System Design, 11(1):41–70, July 1997.
[39] J. L. Pino, S. S. Bhattacharyya, and E. A. Lee. A hierarchical multiprocessor scheduling system for DSP applications. In Proceedings of the IEEE Asilomar Conference on Signals, Systems, and Computers, November 1995.
[40] P. K. Murthy and S. S. Bhattacharyya. Shared memory implementations of synchronous dataflow specifications using lifetime analysis techniques. Technical Report UMIACS-TR-99-32, Institute for Advanced Computer Studies, University of Maryland at College Park, June 1999.
[41] P. K. Murthy and S. S. Bhattacharyya. A buffer merging technique for reducing memory requirements of synchronous dataflow specifications. In Proceedings of the International Symposium on Systems Synthesis, 1999. San Jose, California, to appear.
[42] E. Zitzler, J. Teich, and S. S. Bhattacharyya. Optimized software synthesis for DSP using randomization techniques. Technical report, Computer Engineering and Communication Networks Laboratory, Swiss Federal Institute of Technology, Zurich, July 1999. Revised version of teic1998x1.
[43] J. Teich, E. Zitzler, and S. S. Bhattacharyya. Optimized software synthesis for digital signal processing algorithms – an evolutionary approach. In Proceedings of the IEEE Workshop on Signal Processing Systems, October 1998. Boston, Massachusetts.
[44] E. Zitzler, J. Teich, and S. S. Bhattacharyya. Evolutionary algorithms for the synthesis of embedded software. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 1999. Accepted for publication; to appear.
[45] T. Back, U. Hammel, and H.-P. Schwefel. Evolutionary computation: Comments on the history and current state. IEEE Transactions on Evolutionary Computation, 1(1):3–17, 1997.
[46] V. Zivojnovic, S. Ritz, and H. Meyr. Multirate retiming: A powerful tool for hardware/software codesign. Technical report, Aachen University of Technology, 1993.
[47] V. Zivojnovic, S. Ritz, and H. Meyr. Retiming of DSP programs for optimum vectorization. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, April 1994.
[48] W. Sung, J. Kim, and S. Ha. Memory efficient synthesis from dataflow graphs. In Proceedings of the International Symposium on Systems Synthesis, 1998.
[49] E. Zitzler, J. Teich, and S. S. Bhattacharyya. Multidimensional exploration of software implementations for DSP algorithms. Journal of VLSI Signal Processing Systems, 1999. Accepted for publication; to appear.
[50] Mentor Graphics Corporation. DSP Architect DFL User's and Reference Manual, V 8.2 6. 1993.
[51] M. Levy. C compilers for DSPs flex their muscles. EDN Access, issue 12, June 1997. http://www.ednmag.com
[52] P. Paulin, M. Cornero, C. Liem, et al. Trends in Embedded Systems Technology. In: M. G. Sami, G. De Micheli (eds.): Hardware/Software Codesign, Kluwer Academic Publishers, 1996.
[53] K. M. Bischoff. Ox User's Manual. Technical Report #92-31, Iowa State University, 1992.
[54] A. V. Aho, R. Sethi, J. D. Ullman. Compilers – Principles, Techniques, and Tools. Addison-Wesley, 1986.
[55] G. J. Chaitin. Register Allocation and Spilling via Graph Coloring. ACM SIGPLAN Symp. on Compiler Construction, 1982, pp. 98–105.
[56] A. V. Aho, M. Ganapathi, S. W. K. Tjiang. Code Generation Using Tree Matching and Dynamic Programming. ACM Trans. on Programming Languages and Systems, vol. 11, no. 4, 1989, pp. 491–516.
[57] C. Liem, T. May, P. Paulin. Instruction-Set Matching and Selection for DSP and ASIP Code Generation. European Design and Test Conference (ED & TC), 1994, pp. 31–37.
[58] B. Wess. Automatic Instruction Code Generation based on Trellis Diagrams. IEEE Int. Symp. on Circuits and Systems (ISCAS), 1992, pp. 645–648.
[59] C. W. Fraser, D. R. Hanson, T. A. Proebsting. Engineering a Simple, Efficient Code Generator Generator. ACM Letters on Programming Languages and Systems, vol. 1, no. 3, 1992, pp. 213–226.
[60] G. Araujo, S. Malik. Optimal Code Generation for Embedded Memory Non-Homogeneous Register Architectures. 8th Int. Symp. on System Synthesis (ISSS), 1995, pp. 36–41.
[61] S. Liao, S. Devadas, K. Keutzer, S. Tjiang, A. Wang. Code Optimization Techniques for Embedded DSP Microprocessors. 32nd Design Automation Conference (DAC), 1995, pp. 599–604.
[62] S. Liao, S. Devadas, K. Keutzer, S. Tjiang. Instruction Selection Using Binate Covering for Code Size Optimization. Int. Conf. on Computer-Aided Design (ICCAD), 1995, pp. 393–399.
[63] G. Araujo, S. Malik, M. Lee. Using Register Transfer Paths in Code Generation for Heterogeneous Memory-Register Architectures. 33rd Design Automation Conference (DAC), 1996.
[64] D. J. Kolson, A. Nicolau, N. Dutt, K. Kennedy. Optimal Register Assignment for Loops for Embedded Code Generation. 8th Int. Symp. on System Synthesis (ISSS), 1995.
[65] A. Sudarsanam, S. Malik. Memory Bank and Register Allocation in Software Synthesis for ASIPs. Int. Conf. on Computer-Aided Design (ICCAD), 1995, pp. 388–392.
[66] D. H. Bartley. Optimizing Stack Frame Accesses for Processors with Restricted Addressing Modes. Software – Practice and Experience, vol. 22(2), 1992, pp. 101–110.
[67] S. Liao, S. Devadas, K. Keutzer, S. Tjiang, A. Wang. Storage Assignment to Decrease Code Size. ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 1995.
[68] R. Leupers, P. Marwedel. Algorithms for Address Assignment in DSP Code Generation. Int. Conf. on Computer-Aided Design (ICCAD), 1996.
[69] B. Wess, M. Gotschlich. Optimal DSP Memory Layout Generation as a Quadratic Assignment Problem. Int. Symp. on Circuits and Systems (ISCAS), 1997.
[70] A. Sudarsanam, S. Liao, S. Devadas. Analysis and Evaluation of Address Arithmetic Capabilities in Custom DSP Architectures. Design Automation Conference (DAC), 1997.
[71] R. Leupers, F. David. A Uniform Optimization Technique for Offset Assignment Problems. 11th Int. Symp. on System Synthesis (ISSS), 1998.
[72] C. Liem, P. Paulin, A. Jerraya. Address Calculation for Retargetable Compilation and Exploration of Instruction-Set Architectures. 33rd Design Automation Conference (DAC), 1996.
[73] S. Davidson, D. Landskov, B. D. Shriver, P. W. Mallett. Some Experiments in Local Microcode Compaction for Horizontal Machines. IEEE Trans. on Computers, vol. 30, no. 7, 1981, pp. 460–477.
[74] A. Timmer, M. Strik, J. van Meerbergen, J. Jess. Conflict Modelling and Instruction Scheduling in Code Generation for In-House DSP Cores. 32nd Design Automation Conference (DAC), 1995, pp. 593–598.
[75] R. Leupers, P. Marwedel. Time-Constrained Code Compaction for DSPs. IEEE Trans. on VLSI Systems, vol. 5, no. 1, 1997.
[76] M. Berkelaar. Eindhoven University of Technology. Available at ftp.es.ele.tue.nl/pub/lpsolve/
[77] K. Rimey, P. N. Hilfinger. Lazy Data Routing and Greedy Scheduling for Application-Specific Signal Processors. 21st Annual Workshop on Microprogramming and Microarchitecture (MICRO-21), 1988, pp. 111–115.
[78] R. Hartmann. Combined Scheduling and Data Routing for Programmable ASIC Systems. European Conference on Design Automation (EDAC), 1992, pp. 486–490.
[79] T. Wilson, G. Grewal, B. Halley, D. Banerji. An Integrated Approach to Retargetable Code Generation. 7th Int. Symp. on High-Level Synthesis (HLSS), 1994, pp. 70–75.
[80] C. H. Gebotys. An Efficient Model for DSP Code Generation: Performance, Code Size, Estimated Energy. 10th Int. Symp. on System Synthesis (ISSS), 1997.
[81] S. Novack, A. Nicolau, N. Dutt. A Unified Code Generation Approach using Mutation Scheduling. Chapter 12 in [5].
[82] S. Bashford, R. Leupers. Constraint Driven Code Selection for Fixed-Point DSPs. 36th Design Automation Conference (DAC), 1999.
[83] R. Woudsma. EPICS: A Flexible Approach to Embedded DSP Cores. Int. Conf. on Signal Processing Applications and Technology (ICSPAT), 1994.
[84] R. Leupers. Retargetable Code Generation for Digital Signal Processors. Kluwer Academic Publishers, ISBN 0-7923-9958-7, 1997.
[85] R. M. Stallman. Using and Porting GNU CC V2.4. Free Software Foundation, Cambridge, Massachusetts, 1993.
[86] L. Nowak. Graph Based Retargetable Microcode Compilation in the MIMOLA Design System. 20th Ann. Workshop on Microprogramming (MICRO-20), 1987, pp. 126–132.
[87] D. Lanneer, J. Van Praet, A. Kifli, K. Schoofs, W. Geurts, F. Thoen, G. Goossens. CHESS: Retargetable Code Generation for Embedded DSP Processors. Chapter 5 in [5].
[88] J. Van Praet, D. Lanneer, G. Goossens, W. Geurts, H. De Man. A Graph Based Processor Model for Retargetable Code Generation. European Design and Test Conference (ED & TC), 1996.
[89] M. R. Hartoog, J. A. Rowson, P. D. Reddy, et al. Generation of Software Tools from Processor Descriptions for Hardware/Software Codesign. 34th Design Automation Conference (DAC), 1997.
[90] S. Hanono, S. Devadas. Instruction Selection, Resource Allocation, and Scheduling in the AVIV Retargetable Code Generator. 35th Design Automation Conference (DAC), 1998.
[91] G. Hadjiyiannis, S. Hanono, S. Devadas. ISDL: An Instruction-Set Description Language for Retargetability. 34th Design Automation Conference (DAC), 1997.
[92] ACE Associated Compiler Experts. http://www.ace.nl
[93] Target Compiler Technologies. http://www.retarget.com
[94] Archelon Inc. http://www.archelon.com
Biographical sketches of the authors

Shuvra S. Bhattacharyya

Shuvra S. Bhattacharyya received the Ph.D. degree in Electrical Engineering and Computer Sciences from the University of California at Berkeley in 1994. Since July 1997, he has been an Assistant Professor in the Department of Electrical and Computer Engineering at the University of Maryland at College Park. He holds a joint appointment with the University of Maryland Institute for Advanced Computer Studies (UMIACS).

Dr. Bhattacharyya's research interests center around computer-aided design for embedded systems, with emphasis on synthesis and optimization of hardware and software for digital signal/image/video processing (DSP) applications.

From 1991 to 1992, he was at Kuck and Associates, Inc. in Champaign, Illinois, where he was involved in the research and development of program transformations for performance improvement in C and Fortran compilers. From 1994 to 1997, he was a Researcher at the Semiconductor Research Laboratory of Hitachi America, Ltd., in San Jose, California. At Hitachi, he was involved in research on software optimization techniques for embedded DSP applications.

Dr. Bhattacharyya is a recipient of the NSF CAREER award (1997), and is co-author of Software Synthesis from Dataflow Graphs (Kluwer Academic Publishers, 1996) and Embedded Multiprocessors: Scheduling and Synchronization (Marcel Dekker, to be published in 2000).
Rainer Leupers

Rainer Leupers received his Diploma and Ph.D. degrees in Computer Science with distinction from the University of Dortmund, Germany, in 1992 and 1997, respectively. He received the Hans Uhde Award and the best dissertation award from the University of Dortmund for outstanding theses. Since 1993, he has been working as a researcher at the Computer Science Department at Dortmund, where he is currently heading the DSP compiler group. Dr. Leupers is the author of the book Retargetable Code Generation for Digital Signal Processors, published by Kluwer Academic Publishers in 1997. His research interests include design automation and compilers for embedded systems.
Peter Marwedel

Peter Marwedel received his Ph.D. in Physics from the University of Kiel (Germany) in 1974. He worked at the Computer Science Department of that university from 1974 until 1989. In 1987, he received the Dr. habil. degree (a degree required for becoming a professor) for his work on high-level synthesis and retargetable code generation based on the hardware description language MIMOLA. Since 1989, he has been a professor at the Computer Science Department of the University of Dortmund (Germany). He served as the Dean of that department between 1992 and 1995. Currently, he is the president of the technology transfer institute ICD, located at Dortmund. His research areas include hardware/software codesign, high-level test generation, high-level synthesis, and code generation for embedded processors. He is one of the editors of the book Code Generation for Embedded Processors, published by Kluwer Academic Publishers in 1995. Dr. Marwedel is a member of the IEEE Computer Society, the ACM, and the Gesellschaft für Informatik (GI).