19
1 Advanced Computer Architecture Federico Dominguez Ward Van Heddeghem Literature study: the TriMedia, C6000, SODA and EVP processor architectures December 19, 2008 1 Introduction This report describes and compares the micro‐architectural aspects and ISA rationale of 4 processors: the TriMedia mediaprocessor (NXP, formerly Philips Semiconductor), the C6000 platform (Texas Instruments), the SODA architecture (University of Michigan) and the EVP (NXP). These 4 processors can all be broadly categorized as application specific digital signal processors (DSPs). The first two processors (Trimedia and TI’s C6000) are targeted as media processors, the latter two (SODA and EVP) are processors developed almost exclusively for Software Defined Radio (SDR). Media processors are typically found on DVD players, digital TV, IP TV, video security systems, video editing systems, etc. SDR processors try to replace the RF processing hardware by software, and are aimed mainly at consumer handheld devices such as PDAs, Smart Phones, Mobile Phones etc. 2 General comparison of media and SDR processors 2.1 Multimedia Processing Multimedia processing is the handling of video and audio data in electronic devices. This is sometimes accomplished by specific purpose integrated circuits, but the lack of flexibility and high cost of this approach has led to the development of multimedia processors, programmable processors specially designed to efficiently execute all tasks related to multimedia processing. Multimedia processors are typically found on DVD players, digital TV, IP TV, video security systems, video editing systems, etc. [5] The most common tasks for a multimedia processor are [10] : Image, video and audio decoding and encoding Image, video and audio compression Image, video and audio transmission Image, video and audio enhancement, filtering Pattern recognition Motion detection Since multimedia processors are most often found on battery operated embedded devices and since the nature of multimedia processing requires real time deadlines, the design choices in this family of microprocessors aim to achieve high performance processing of multimedia data streams with minimum power consumption and unit cost. [5] Multimedia processors achieve these goals by optimizing the execution of the multimedia tasks mentioned above. This implies: High level of parallelism: VLIW, SIMD to boost performance and to allow for a low cost silicon implementation. Large cache memories: To accommodate typically large images and video frames close to the processor saving costly memory access and bandwidth.

Literature study: the TriMedia, C6000, SODA and EVP ...wardvh.fastmail.fm/docs/2008_media_and_sdr_processor_architectures.pdf · Literature study: TriMedia, C6000, SODA and EVP processor

  • Upload
    others

  • View
    11

  • Download
    0

Embed Size (px)

Citation preview

1

AdvancedComputerArchitecture

FedericoDominguez

WardVanHeddeghem

Literaturestudy:theTriMedia,C6000,SODAandEVPprocessorarchitectures

December19,2008

1 IntroductionThisreportdescribesandcomparesthemicro‐architecturalaspectsandISArationaleof4processors:

• theTriMediamediaprocessor(NXP,formerlyPhilipsSemiconductor),

• theC6000platform(TexasInstruments),

• theSODAarchitecture(UniversityofMichigan)

• andtheEVP(NXP).

These4processorscanallbebroadlycategorizedasapplicationspecificdigitalsignalprocessors(DSPs).Thefirsttwoprocessors(TrimediaandTI’sC6000)aretargetedasmediaprocessors,thelattertwo(SODAandEVP)areprocessorsdevelopedalmostexclusivelyforSoftwareDefinedRadio(SDR).

MediaprocessorsaretypicallyfoundonDVDplayers,digitalTV,IPTV,videosecuritysystems,videoeditingsystems,etc.SDRprocessorstrytoreplacetheRFprocessinghardwarebysoftware,andareaimedmainlyatconsumerhandhelddevicessuchasPDAs,SmartPhones,MobilePhonesetc.

2 GeneralcomparisonofmediaandSDRprocessors

2.1 MultimediaProcessing

Multimediaprocessingisthehandlingofvideoandaudiodatainelectronicdevices.Thisissometimesaccomplishedbyspecificpurposeintegratedcircuits,butthelackofflexibilityandhighcostofthisapproachhasledtothedevelopmentofmultimediaprocessors,programmableprocessorsspeciallydesignedtoefficientlyexecutealltasksrelatedtomultimediaprocessing.MultimediaprocessorsaretypicallyfoundonDVDplayers,digitalTV,IPTV,videosecuritysystems,videoeditingsystems,etc.[5]

Themostcommontasksforamultimediaprocessorare[10]:

• Image,videoandaudiodecodingandencoding• Image,videoandaudiocompression• Image,videoandaudiotransmission• Image,videoandaudioenhancement,filtering• Patternrecognition• Motiondetection

Sincemultimediaprocessorsaremostoftenfoundonbatteryoperatedembeddeddevicesandsincethenatureofmultimediaprocessingrequiresrealtimedeadlines,thedesignchoicesinthisfamilyofmicroprocessorsaimtoachievehighperformanceprocessingofmultimediadatastreamswithminimumpowerconsumptionandunitcost.[5]

Multimediaprocessorsachievethesegoalsbyoptimizingtheexecutionofthemultimediatasksmentionedabove.Thisimplies:

• Highlevelofparallelism:VLIW,SIMDtoboostperformanceandtoallowforalowcostsiliconimplementation.

• Largecachememories:Toaccommodatetypicallylargeimagesandvideoframesclosetotheprocessorsavingcostlymemoryaccessandbandwidth.

2

• Penaltyfreeun­alignedmemoryaccess:Processingalgorithmstypicallyaccessimageblocksinanun‐alignedmanner.

• Datapre­fetching:Multimediaprocessingalgorithmsaccessmemorylocationsinpredictablestridesandblocks.

• Largeregisterfiles:Largedataworkingsetscanbekeptinregisters,preventingcostlyloadandstoreoperations.

• DMA­stylememorytransfers:Increasesoverallsystemperformance.• Applicationspecificinstructions:Typicalmultimediaoperationscanbedoneinoneora

fewclockcycles,savingenergyandgainingperformance.• Smalldatawords:8and16‐bitdatawordsaretypicallysufficientformostmultimedia

processingtask.Limitingthesizeofdatawordsallowsforhighermemorydensityandparallelism.

TriMediaandTIC6000aretwocompetingfamiliesofmultimediamicroprocessorsthatinonewayoranotherimplementthearchitecturaldesignchoiceslistedabove.

2.2 Software‐definedradio(SDR)

Thephysicallayerofmostwirelessprotocolsistraditionallyimplementedincustomhardwaretosatisfytheheavycomputationalrequirementswhilekeepingpowerconsumptiontoaminimum.Theseimplementationsaretimeconsumingtodesignanddifficulttoverify.Aprogrammablehardwareplatformcapableofsupportingsoftwareimplementationsofthephysicallayer,orsoftware­definedradio(SDR),hasanumberofadvantages:supportformultipleprotocols(i.e.multimodeoperation),fastertime‐to‐market,higherchipvolumes,andsupportforlateimplementationchanges.SDRcanbeconsideredasahigh‐enddigitalsignalprocessing(DSP)application.[13]

ThedigitalbasebandprocessingforSDRcanbedividedintothreestages:filterstage,modemstage,andcodecstage.

• Thefilterstagerequiresahighcomputationalloadanditsimplementationisuniformbetweendifferentstandardsandalgorithms,aprogrammablesolutionwouldnotbethemostefficientone.

• Themodemstageisdiverseamongdifferentstandards,inthisstagethereisplentyofspacefordifferentvendorstodifferentiatetheirproductsbyapplyingdifferentalgorithmsandstandards.Thisstageisidealforafullyprogrammablesolution.

• Thecodecstageisimplementedbysimilaralgorithmsacrossseveralstandardsmakingaprogrammablesolutionnotdesirable.[17]

ThemodemstageofdigitalbasebandprocessingisanidealapplicationforaprogrammableSDRprocessor.SODAandEVParetwosuchprocessors.

ThemaindesigngoalsoftheseSDRprocessorsaretypicallyhighperformance(imposedbythehighthroughputrequirementsofcurrentwirelessprotocols),energyefficiency(batteryoperateddevices)andprogrammability(multi‐protocolsupport,higherchipvolumes).

2.3 Commoncharacteristics

Asisclearfromthediscussionabove,bothmediaandSDRprocessorsarecharacterizedbyanumberofcommonproperties:theyshouldachievehighperformanceprocessingofdatastreams,mustmeetreal­timedeadlines,shouldbeenergyefficientsincetheyaretypicallyusedonmobile(batteryoperated)devices,andtheyshouldbehighlyprogrammableastoprovidesupportfordifferentandnewmediaandradioprotocols.Thedatastreamstheyoperateonarerelativelysmall(8or16bit).

Figure1showsthethroughputandpowerrequirementsofanumberofwirelessprotocols,aswellastheperformanceoftwoprocessorsdiscussedinthispaper:theTIC6000andtheSODAprocessor.Thisfiguregivesagoodnotiononthepowerefficiencytheseprocessorstargettoachieve.

3

Figure1–Throughputandpowerrequirementsoftypical3Gwirelessprotocols[13]

2.4 DifferencesbetweenmediaandSDRprocessors

Asmuchastheyhaveincommon,thesetwogroupsofprocessorsalsoaredistinctlycharacterizedbytheirintendedapplication:mediaandSDR.Inthenextparagraphswewillhighlightsomeofthemajordesigndistinctionsbetweenthesetwoprocessorgroups.

Differencesbetweenthetwomediaprocessors,anddifferencesbetweenthetwoSDRprocessorswillbediscussedin§3.3and§4.3respectively.Table1givesthespecificationofeachofthefourprocessors.

SIMDwidth—Sincemediaprocessorstypicallyoperateonsmallmacroblocks(amacroblockisausedinvideocompressingandrepresentsablockof16by16pixels[1]),smallervectorssuffice.ThistranslatesinnarrowSIMDdesignsformediaprocessors.ForSDR,ontheotherhand,mostofthecomputationallyintensivealgorithmshaveabundantdatalevelparallelism,muchmorethaninstructionlevelparallelism,sowideSIMDisveryefficient.[13]ItshouldalsobenotedthatSIMDingeneralispowerefficientcomparedtootherarchitectures,sincethe‘overhead’ofaddresscalculation,addressdecoding,instructionfetchingetc.issharedperpoperations.[16]

Predication—Predicationisavailableonbothmediaprocessors.Thisseemslogicalsinceinvideocompressing,acommonthingtodoiscomparedatawithpreviousdata(e.g.inmotionestimation).Predicationcouldthenbeapowerefficientwaytomakedecisions,asopposedtocostlybranching.Wehavenotfoundanyhardreferencestobackupthisassumption.However,onedocumentexplicitlymentionspredicationastechniqueforincreasingperformanceinmediaprocessors.[2]ForSDRprocessorswedidn’tfindanyindicationsthatpredicationisused.

Datawidth—Bothmediaprocessorshaveadatawidthof32bits,incontrastwiththeSDRprocessorswhichhaveonlyhalfofthat.Thisprovidesmediaprocessorswithamuchlargeraddressablememoryrange,notinvaingivennewvideostandardslikeHDTV,whichcurrentlyreachesuptoresolutionsof1920x1080pixels(i.e.atotalof2.073.600pixels,farbeyondwhatisaddressablewith16bits).[1]AlgorithmsrunningonSDRprocessorstypicallyoperateonvariableswithsmallvalues.Analysisoftwotypicalwirelessprotocolsshowsthatthereshouldbestrongsupportfor8and16‐bitoperations.[13]Also,wirelesspacketshandledbytheseprocessorsdon’tpotentiallyconsumeasmuchmemoryspaceasthemediaprocessors.

Unalignedloads—Mediaprocessingalgorithmstypicallyaccessimageblocksinanun‐alignedmanner.Asalreadymentioned,whendealingwithvideocompression,mediaprocessorshandlemacroblocks.Thelocationofthesemacroblocksisnotfixed,neitheronscreen,norinmemory.Itseemslogicalthatprovidingforunalignedloadsallowstheseblockstobefetchedfaster.We

4

don’thaveanyreferencestobackupthisassumption.ForSDRprocessorswedidn’tfindanyindicationsthatunalignedloadsaresupported.

Cache—Onmediaprocessors,cachesareimplemented(oroptional),toaccommodatefortypicallylargeimagesandvideoframesclosetotheprocessor,savingcostlymemoryaccessandbandwidth.SDRprocessorsdon’timplementcachessincetheyoperateonstreamingdata(cachingisofusethen),andtheyrequirereal‐timebehavior.[13]

Delayslots—Delayslotsareavailableonmediaprocessorstoimproveperformance.OnSDRprocessors,delayslotsarenotusedsincebranchpredictionisn’tuseeither.

Table1–Processorspecifications

Mediaprocessors SDRprocessorsFeature

TriMedia C6000 SODA EVP

Architecture Uni‐processor5issueslotVLIWguardedRISC‐like

operations

Uni‐processor 1ARMprocessor+4PEs(processingelements)staticscheduled

Uni‐processorSIMDthroughvector

processingcapabilities.13issueslotVLIWscalarand

vectoroperationsVLIW Yes Yes Yes YesPipelinedepth 7–12stages 11stages 5stages ‐‐Datawidth 32bits 32bits 16bits 16bitsRegisterfile Unified,12832‐bit

registersClustered,1632‐bitregisters

PerPE:32x16SIMD16‐bitregisters,16scalar16‐bitregisters

16vectorregisters(eachregistercontains16

words).32scalarregisters.

Functionalunits

31 8 PerPE:32+1 6vector,3scalar

SIMDcapabilities

1x32‐bit,2x16‐bit,4x8‐bit

C62/67:noC64:4x8‐bit,2x

16‐bit

PerPE:32x16‐bit

16x16‐bits

Memorystructure

Cache Optionalcache Clustered,ScratchpadGlobal:64KB

PerPE:4KBscalar,8KBSIMD

Scratchpad

• Instructioncache

64Kbyte,128‐bytelines,8wayset‐associative,LRUreplacementpolicy

(modeldependent)

No No

• Datacache 128Kbyte,128bytelines4wayset‐associative,LRU

replacementpolicy,Allocate‐on‐writemiss

policy

(modeldependent)

No

No

Frequency 240MHz(300MHz) 300MHz(1GHz)

380MHz 300MHz

Delayslots 5forbranchinstructions

1formultiplyinstruction,4foraloadinstructionand5forabranch

instruction

No No

Compression Yes(10bittemplate)

Yes(usingstopbits)

NDA NDA

Predication Yes Yes NDA NDA

Forwarding Yes Yes No NDAUnalignedloads

Yes Yes DNA DNA

Compiling/programming

Hard.C,C++withspecial

extensions.

Hard.C,C++,assembly.

Hard.SPIRontopofhighlevellanguage

Hard.EVP‐C

NDA:nodataavailable

5

3 Multimediaprocessors:TriMediaandTIC6000

3.1 TriMedia

Remark:unlessotherwisespecified,theinformationinthissectionistakenfromthemainTriMediapaperbyVandeWaerdt,etal.[3]

3.1.1 Introduction

TriMediaisafamilyofmicroprocessorsdevelopedbyNXP.ThemainapplicationoftheTriMediamicroprocessorsismultimediadataprocessinginembeddedsystems.ThisparticularapplicationdomainshapestheTriMediamicroprocessordesign,divergingdrasticallyfromgeneralpurposeprocessors.

TriMediaprocessorsaredeployedonconsumerdevicesasaSystemOnChipsolution.TriMediaisprogrammable,sothatitcanbeadaptedtomanydifferentapplicationsandsothatproductscanbechangedthroughsoftwaretomeetevolvingstandardsrequirements[5].

TriMediaprocessorsemployVLIWarchitecture.Thelatestmodel’sCPUcore(TM3270)has31functionalunitswithfiveissueslots.ThisfamilyofprocessorsalsosupportsSIMDoperations.VLIWandSIMDprovideahighdegreeofinstructionparallelism,optimizingtheoverallperformanceofthesystem.

3.1.2 Architectureoverview

TriMediainstructionanddatalevelparallelism

TriMediaimplementsinstructionanddatalevelparallelismthrougha5‐issueslotVLIWandthroughSIMD,respectively.VLIWsupportsupto5operationsperinstructionword.SIMDinstructionscanworkontwo16‐bitdatawordsorfour8‐bitdatawords.

Figure2showsthefunctionalunitsoftheTM1000,thefirstmemberoftheTriMediafamily.TheTM1000has28functionalunitsdividedamongfiveissueslots.TheTM3270hasaddedthreenewfunctionalunits(totaling31units)butmaintainsthesamebasicdesignstructureastheTM1000.Ascanbeseeninthefigure,eachissueslothasmultipletypesoffunctionalunits,allowingthecompilermorefreedomwhenschedulingtheVLIWinstructions.

TheTM3270implementssuper­operationsortwo‐slotoperations.Theseoperationsareexecutedbytwofunctionalunitsthatoccupytwoneighboringissueslots.Theseoperationscantakeuptofouroperandsanduptwodestinationoperands.Two‐slotoperationscanimproveoverallsystemperformancesincetheequivalentseparateoperationsrequiremorecyclestocomplete.

Figure2­TriMediaTM1000FunctionalUnits[5]

6

Cachepoliciesandmemoryprefetching

TheTM3270hasa128Kbytedatacachewith128cachelines.Thecachesupportspenalty‐freeunalignedaccesses;imageprocessingtypicallyrequiresfetchingblocksofunalignedimagedatafrommemory.Thecachehasawrite‐backpolicyandanallocate‐on‐write‐misspolicy.Thecombinationofthesetwopoliciesreducestherequiredbandwidthtooff‐chipmemory.

TheTM3270supportsprefetchingandisbasedonmemoryregions.Theprefetchingpatternandregioncanbespecifiedbytheprogrammerandisdesignedtomatchthememoryaccesspatternoftypicalmultimediacodecalgorithms(typicallyunalignedandnon‐sequential).TheTM3270supportsfourseparatememoryregions.Thepatternisdefinedbyastartandendaddressandastride.Theregionspecifiedbythestartandendaddressisusuallyacompleteimageorframe,andthestridecorrespondstothesizeoftheblockswithintheimagethatarebeingprocessed.Themaingoalofprefetchingistoreducecachemissesandthereforeavoidcostlystallcyclesandimprovesystemperformance.

Pipeline

Figure3showsabasicblockdiagramoftheTM3270pipeline.Thepipelinehasaminimumdepthof7stagesforsinglecyclelatencyoperationsandamaximumof12stagesformorecomplexinstructions.Thefrontendofthepipelineconsistsofinstructionfetch(I1–I3)andinstructionpre‐decode(P).Theinstructionfetchcontrolunitatthefrontendofthepipelinesupports5delayslotsforthejumpoperation,thenumberofdelayslotscorrespondstothedistancebetweenthefirststageandthefunctionalunits.Thebackendofthepipelineconsistsofoperationdecode(D),execution(X1‐X6)andwrite(W).

DataforwardingfromtheCacheWriteBufferiscurrentlynotimplementedintheTriMediapipelinebecauseitsinclusioncausespathdelaysbeyondthedesireddesignparameters.Predicationissupportedbyusingguardregisters;eachoperationexecutionispredicatedwiththeleastsignificantbitofitsguardregister.[6]

3.1.3 Compilersupport

TheTriMediaprocessorfamilycomeswithcompilersupportfromthemanufacturer.Binarycompatibilityisnotguaranteedbetweendifferentmodelswithinthefamilyrequiringadifferentcompilerforeachmodel.ThesupportedprogramminglanguageisC/C++.

ProgramswritteninC/C++arecompatiblebetweenmodelsprovidedtheyarerecompiledforeachmodel.Someperformanceislostwhenportingprogramswrittenforonemodeltoanewermodelwithoutincludingmodelspecificoptimizations.ForexamplewhenportingaprogramfromtheTM3260totheTM3270,aperformancegainof20%canbeobtainedwhenapplyingnewdataprefetchingtechniques.

7

Figure3­TriMediaTM3270Pipeline[3]

8

3.2 TIC6000

Remark:unlessotherwisespecified,theinformationinthissectionistakenfromanumberofTexasInstrumentsreferencedocuments.[7][8][9][10]

3.2.1 Introduction

TheTMS320C6000digitalsignalprocessor(DSP)platformispartoftheTMS320DSPfamilydevelopedbyTexasInstruments.

TheTMS320familyconsistsof16‐bitand32‐bitfixed‐andfloating‐pointDSPs.Therearethreemainplatforms,includingtheTMS320C2000(controlapplications),theTMS320C5000(power‐efficientapplications),andtheTMS320C6000(high‐performancesapplications).Theseprocessorsareusedincellphones,digitalcameras,modemsetc.

TheC6000platformisaverylonginstructionword(VLIW)architecturetargetedathigh‐performanceDSPtasks.Itcomprisesthefollowingthreemaingenerations(i.e.versions):

• TMS320C62x(‘C62x)offeringfixed­pointarithmeticupto300MHz,

• TMS320C64x(‘C64x)offeringfixed­pointarithmeticupto1GHz,

• TMS32067x(‘C67x)offeringfloating­pointarithmeticupto300MHz.

3.2.2 Architectureoverview

Overview

The‘C6000processorconsistsofthreemainparts:aCPU,peripherals(notdiscussedfurther)andmemory.Theprocessorsoperateatvariousfrequencies,rangingfrom150MHzupto1GHz.

Theprocessorexecutesuptoeight32‐bitinstructionseverycycle.ThecoreCPU,asshowninFigure4,consistsofeightfunctionalunits,tworegisterfilesandtwodatapaths.

Figure4–The‘C6000blockdiagram

9

CPU

TheCPUhastwoclusters(labeleddatapath1and2inFigure4)inwhichprocessingoccurs.Eachdatapathhasfourfunctionalunits(.L,.S,.M,and.D)andaregisterfilecontaining1632‐bitregisters(32forthe‘C64xgeneration).

Thefunctionalunitsexecutelogic,shifting,multiply,anddataaddressoperations.Allinstructionsexceptloadsandstoresoperateontheregisters.Thetwodata‐addressingunits(.D1and.D2)areexclusivelyresponsibleforalldatatransfersbetweentheregisterfilesandmemory.Thecrosspathsbusconnectingbothdatapaths,supportsonereadandwriteoperationpercycle.

VLIWprocessingflowbeginsbyfetchinga256‐bitwidepacket(eight32‐bitwords)fromprogrammemory,whichisthendynamicallydispatchedoverthedifferentfunctionalunits(FUs).DynamicdispatchingreferstothefactthattheorderofinstructionsinaVLIWshouldnotmatchtheorderofthefunctionalunits;insteadinstructionsintheVLIWaredispatchedtotheappropriateFU.

Thepipelinestagesaregroupedinto3phases:instructionfetch(4stages),instructiondecode(2stages)andinstructionexecute(maximum5stages).Thepipelinealsosupportsdelayslotsformultiply(1delayslot),load(4delayslots),andbranch(5delayslots)operations.

Memory

Dataandprogrammemory—TheC6000DSPhasa32‐bit,byte‐addressableaddressspace.Internal(on‐chip)memoryisorganizedinseparatedataandprogramspaces,whichmaybeconfiguredascacheonsomedevices.Whenoff‐chipmemoryisused,thesespacesareunifiedonmostdevicestoasinglememoryspaceviatheexternalmemoryinterface(EMIF).

Memoryports—TheC6000DSPhastwo32‐bitinternalportstoaccessinternaldatamemory.TheC6000DSPhasasingleinternalporttoaccessinternalprogrammemory,withaninstruction‐fetchwidthof256bits.

Memoryalignment—TheC6400,unliketheC6200andC6700,supportsunalignedmemoryaccessforloadandstoreoperations.Wordanddouble‐worddatadoesnotalwaysneedalignmentto32‐bitor64‐bitboundariesasintheothermodels.

‘C64xextensions

Thekeyextensionsbetweenthe‘C62x/’C67xarchitectureandthe‘C64xarchitecturethatallowmoreworkeachclockcycleinclude(seealsoFigure5):

• widerdatapaths(dual64‐bitload/storepathsinsteadof32‐bit),

• alargerregisterfile(32registersinsteadof16),

• moreorthogonality,i.e.moregeneralityinthearchitecture(forexample,the.DFUcanperform32‐bitlogicalinstructionsinadditiontothe.Sand.Lunits).

• unalignedmemoryaccessforloadandstoreoperations.

• MoreSIMDsupport.Forexample,thefunctionalunitsontheC64xcannowexecutetwo32‐bitmultipliesandfour16‐bitmultiplies.OnFigure5,theyellowboxesineachfunctionalunitontheC64xrepresenttheincreaseinSIMDsupport.ThisincreaseinSIMDparallelismispossibleduetotheincreasedregisterfilesizeanddatapathwidth.

10

Figure5–Differences‘C62/67xand‘C64xgeneration

3.2.3 Compilersupport

TheC6000platformcanbeprogrammedusingC,C++orassemblylanguage.

TexasInstrumentsprovidesaproprietarytoolchainandIDE,calledCodeComposerStudio,todoso.NoothercompilersareavailablefortheC6000platform;however,adepartmentattheChemnitzUniversityofTechnologydevelopedpreliminarysupportfortheTMS320C6xseriesintheGNUCompilerCollection(GCC).[12]

ThefactthatonlyonepropertoolchainisavailablefortheC6000platform,andthatTIprovideshand‐optimizedlibrariesforvariousalgorithms(forexample,FFT),seemstoindicatethatitcanbeacomplextaskfortheprogrammertomanuallydevelopsoftwarethatefficientlyusesthearchitecturalresourcesandcapabilitiesoffered.

ThisassumptionisenforcedbyanInternetforumpostcitingaTICcompilerdeveloper:“Hisanswer[tothequestionwhytherearenoothercompilerssupportingC6000]wasthatwhenTIdevelopedtheC6000core,itwasaside­by­sidedevelopmenteffortbyboththehardwareandsoftwaredevelopers,andthattheCcompilerbynecessitywouldberathercomplexandhenceveryhardforathirdpartytodevelop.”[11]

3.3 ComparingTriMediaandTIC6000

Remark:unlessotherwisespecified,theinformationinthissectionistakenfromanumberofTexasInstrumentsreferencedocuments[7][8][9][10]andthemainTriMediapaperbyVandeWaerdt,etal.[3]

TriMediaandTIC6000achievehighperformance,lowpowerconsumptionandlowunitcostbyimplementingthearchitecturalfeaturesdescribedin§2.1.However,thedetailsofhowtheyaccomplisheachfeaturemaydifferbetweenthem.Eacharchitecturalfeatureimpliesdesigncompromisesthatweresometimessolveddifferentlyineachprocessor.

Mostofthearchitecturaldesignchoicescanbedividedbythegoaltheytrytoreach:highperformanceorlowpowerconsumption.Althoughthesetwogoalscertainlyoverlapattimes,forclaritythemaincomparisonbetweenthetwoprocessorfamiliesisdividedassuch.

3.3.1 Highperformance

ParallelismonTriMedia—ThealgorithmsrequiredtoimplementvideoprocessingcodecsaresuitableforparallelizationusingVLIWandSIMD.Theytypicallyrequiresimilarindependentoperationsonmultiplesmalldatawords.TriMediaprocessorsprovideinstructionsandsuperinstructionsthataccomplishtypicaltasksinthesecodecsinfewcyclesandwithminimummemorybandwidthconsumption.TheVLIWinTriMediaisimplementedwith5issueslotsandcompressedwitha10‐bittemplatefield.

ParallelismonTIC6000—VLIWandSIMDprovideinstructionanddatalevelparallelismintheTIC6000.Inaddition,compressionthroughstopbitsissupported:parallelismofinstructionsin

11

thefetched256‐bitVLIWcanbecontrolledbyusingastopbit,whensettingtheleastsignificantbit(LSB)ofeachoftheeightcontainedindividualinstructionstoeither0or1.Whensetto1,theindividualinstructionwillbeexecutedinparallelwiththesubsequentindividualinstruction.Thisallowseightinstructionstobeexecutedfullyserial,partiallyserialorfullyparallel.

LargeregisterfileonTrimedia—Videoprocessingrequiresworkingwithalargedataset;alargeregisterfilepreventsoveruseofexpensiveloadandstoreinstructions.TriMediaprocessorTM3270hasaunifiedregisterfilewith12832‐bitregisters.

SpecificdomainISAinstructions—Notalltasksfoundinvideoprocessingcodecsareparallelizable,forexampleinH.264Context‐BasedAdaptiveBinaryArithmeticCoding(CABAC)intrinsicsequentialbehaviorcannotbeproperlyoptimizedwithSIMD.TheTriMediaISAisequippedwithseveralnativeinstructionstosimplifyCABACprogrammingtocompensateforthisdisadvantageandminimizesequentialexecution.TIC6000providesalsospecificdomaininstructionsdesignedtooptimizetheexecutionofimageprocessingkernels.

Fixed­pointarithmetic—OntheTIC6000processorsfixed‐pointarithmeticislesscomputationallydemandingthanfloating‐pointarithmetic,andthusasuitabledesignchoicetoincreasearithmeticprocessingspeed.

Predication—Bothfamiliesofprocessorsallowinstructionstobeexecutedconditionally(predication),thusreducingcostlybranching.

Delayslots—OntheTIC6000,forfixed‐pointinstructions,anumberofdelayslotsareavailable.Thenumberofdelayslotsisequivalenttothenumberofadditionalcyclesrequiredafterthesourceoperandsarereadfortheresulttobeavailableforreading:foramultiplyinstructionthisis1,foraloadthisis4,andforabranchthisis5.OnTriMedia,fivedelayslotsareprovidedonlyforbranchinstructions.

3.3.2 Lowpowerconsumption

ClockgatingandfrequencyscalingonTriMedia—TriMediaappliestwomainpowersavingtechniques:clockgatingandvoltage‐frequencyscaling.ThelatestTrimediaimplements70clockdomains,forexampleallstagesofallfunctionalunitsaregated.Thenormalsupplyvoltageis1.2VbuttheTriMediaguaranteesnormaloperationsat0.8Vatalowerfrequency.ThemaximumoperatingfrequencyfortheTM3270is350Mhz,ampleroomforscalingdownitsfrequencywhenprocessingtypicaltaskssuchasMP3decoding(only8Mhzneeded).

ClusteringonTIC6000—Ontheotherhand,theTIC600family,toavoidaslowandpower‐hungryregisterfile(read/writeportsfromeachregistertoeachofthe8FUs)andalargeforwardingnetwork,twodatapaths(i.e.clusters)areavailable.Thisallowshigherfrequenciesandlowerpowerconsumption.TriMedia’sregisterfileistwicethesizeoftheTIC6000andisunified.InthiswayTriMediadesignerschoosetosavememorybandwidthwithalargeregisterfileandTIC6000designerschoosetosaveonpowerconsumptionandtoreachhigherfrequencieswithasmallerdividedregisterfile.

MemoryAccess—OnTIC6000,4interleavedsingle‐portedmemorybanksassurelowerpowerconsumption.ThishowevercanleadtoreducedperformanceinaVLIWarchitecture:onlyoneaccesstoeachbankisallowedpercycle;iftwoparallelloadinstructionsarebothtryingtoaccessthesamebank,oneloadmustwait,resultinginamemorystall.OnTriMedia,lowpowerconsumptionisalsoaccomplishedwitha4‐waysetassociativecache.

12

4 SDRprocessors:SODAandEVP

4.1 SODA

Remark:unlessotherwisespecified,theinformationinthissectionistakenfromthemainSODApaperbyYuanLinetal.[13]

4.1.1 Introduction

SODA(Signal‐processingOn‐DemandArchitecture)isa(proposed)SDRprocessorarchitecturedevelopedattheUniversityofMichigan.Itsmaindesigngoalsare:highperformance,energyefficiency,andprogrammabilitythroughacombinationoffeaturesthatincludesingle‐instructionmultiple‐data(SIMD)parallelism,andhardwareoptimizedfor16‐bitcomputations.

Theproposedarchitecturehasbeenimplementedon180nmprocesstechnologies,anditisprojectedtomeetthethroughputandpowerrequirementsofcurrentwirelessprotocols(seeFigure1)whenimplementedon65nm.

4.1.2 Architectureoverview

Overview

TheSODAsystemiscomposedof(seeFigure6):fourprocessingelements(PEs),ascalarARMcontrolprocessorandaglobalscratchpadmemory,allconnectedthroughasharedbus.

Figure6–OverviewofSODAmulti­coreDSParchitecture

ProcessingElement

EachPEconsistsof3pipelinesthatruninlockstep:awideSIMDpipeline,ascalarpipelineandanAGU(addressgenerationunit)pipeline.

• ThewideSIMDunitrunsmostofthecomputeintensivealgorithms(FFT,Viterbi,Turbodecoder,etc.).TheSIMDpipelineconsistsof32clusterswitha16‐bitdatapath(seeFigure7).Eachclustercontains1ALUandasimpleregisterfilewith2readportsand1writeport.Ashuffleexchangenetworkinterconnectstheclusters.

• ThescalarunitisusedtosupportmanyoftheDSPalgorithmsthatarescalarinnatureandcannotbeparallelized.Thescalarpipelineconsistsofone16‐bitdatapath.

• AnAGU(addressgenerationunit)pipelinehandlesDMAtransfersandmemoryaddresscalculationsforboththescalarandSIMDpipelines.

13

TheSIMDShuffleNetwork(SSN)supportintra‐processordatamovements.Itcontainsashuffleexchangenetworkandaninverseshuffleexchangenetwork,whichallowstoefficientlyperformpermutationsonvectors,acommontaskinwirelesscommunicationalgorithms.

Figure7–ProcessingElementarchitecture

Controlprocessor

TheARMprocessorismainlyusedasasystemcontrollertohandleprotocols’controlcodeandstatetransitions.

Memory

Therearenocachestructures:boththelocalPEmemories(4KBscalar,8KBSIMD)andtheglobalscratchpadmemory(64KB)aremanagedbysoftwarethrougheachPE’sDMAcontrollertowhichalllocalandglobalmemoriesarevisible.

TheSIMDmemorycontainstwo512‐bitports(oneread,onewrite).Thescalarmemorycontainstwo16‐bitports(oneread,onewrite).

CommunicationsbetweenPEsareviascalarstreams,butintra‐PEcomputationsarevectoroperations.Therefore,supportforascalar‐vectorinterfacebetweenthescalarandSIMDpipelineisnecessary.ThisisshowninFigure6asWtoSandStoW.

4.1.3 Compilersupport

ItisclearthatheterogeneousmultiprocessorchipssuchasSODAprovidedifficultchallengesforprogrammersandcompilerstoefficientlymapapplicationsontothehardware.Especiallysincemanyoftheprojectedhigh‐performancefeaturesrelyonstaticscheduling.

Anewdataflowprogrammingmodel,calledSPIR(SignalProcessingIntermediateRepresentation),designedformodelingSDRapplicationshasbeenresearchedbyLinetal[14].Itactsasanintermediaterepresentationthatcanbeautomaticallygeneratedfromexistinghigh‐levellanguages(suchasC)throughacompilerfront‐end.ThegeneralideaisthatSPIRrepresentsataskgraphconsistingofasetofnodes(tasks,forexampleadescrambler)interconnectedwitharrows(dataflow).Eacharrowspecifiestheinputandoutputstreamratesforthesourceanddestinationnodes.Thispropertyallowsacompilertogeneratestaticexecutionschedules.

14

4.2 EVP

Remark:unlessotherwisespecified,theinformationinthissectionistakenfromthemainEVPpaperbyVanBerkeletal.[16]

4.2.1 Introduction

EmbeddedVectorProcessor(EVP)isanapplicationspecificprocessordevelopedbyNXP.TheEVPwasdevelopedalmostexclusivelyforSDRapplicationsandisaimedmainlyatconsumerhandhelddevicessuchasPDA,SmartPhones,MobilePhones,etc.

EVPhasbeendesignedtoexecutethetypicalalgorithmsfoundinthemodemstage(see§2.2)inrealtimeandinapowerefficientmanner.EVPaccomplishesthiswithdataandinstructionlevelparallelism(vectorprocessingandVLIW)andbyprovidingapplicationspecificinstructions.

4.2.2 ArchitectureOverview

Figure8showsablockdiagramoftheEVParchitecture.Theprocessorconsistsofageneralscratchpadprogrammemory,aVLIWcontroller,anAddressCalculationUnit(ACU),avectorregisterfile,ascalarregisterfile,avectormemory,asetofvectorfunctionalunits(FUs),andasetofscalarfunctionalunits.Themainwordis16bits.

TheVLIWcontrollerdispatchesaVLIWinstructionoverthevariousvectorandscalarfunctionalunits.ThemaximumVLIW‐parallelismavailableequalsfivevectoroperationsplusfourscalaroperationsplusthreeaddressupdatesplusloop‐control.

ThevectorFUssupportSIMD,allowingfordataparallelism.TheSIMDwidthoftheEVPis16(256bits).ThisimpliesthattheEVPcanexecutethesameinstructionon16wordsof16bits,32wordsof8bitsor8wordsof32bits.

SomeofthevectorFUsareveryspecific:theMACunitprovidesforMultiply‐Accumulates,theshuffleunitcanrearrangetheelementsofsinglevector,theintravectorunitprovidesvectorspecificoperationssuchasdeterminingthemaximumelementofavector,thecodegenerationunitprovidessupportorCDMAspecificfunctionality(i.e.socalled“codechip”generation).

Figure8­EVPArchitecture

4.2.3 CompilerSupport

TheEVPcomeswithitsownlanguageandowncompiler.EVPprogramsarewritteninEVP‐CasupersetofANSI‐C.CprogramsaremappedexclusivelytothescalarunitsoftheEVP.ToexploittheVLIW,SIMD,andvectorparallelismitismandatorytousetheEVPlanguageextensionsandsomelibraries.ItisstillataskoftheprogrammertomanuallymaptheDSPbasebandalgorithmstotheir“vectorized”versioninEVP‐C.

15

Furthermore,manualoptimizationisrequiredtoachieveefficientvectorizedcode.Asanexample,thecodeproducedbyaprototypeEVP‐CcompilerforaspecificFFTimplementationrequired25%morecyclesthanhand‐scheduledassemblycode.[16]

4.3 ComparingSODAandEVP

ThissectionwilldiscusswhycertaindesigndecisionswerechosenforbothSODAandEVP.

4.3.1 Designgoals

Asstatedin§2.1,themaindesigngoalsforanSDRprocessorarehighperformance,energyefficiency,andprogrammability.Thetablebelowsummarizesthetechniquesusedtoachievethis.Moredetailonthedifferenttechniquesisprovidedinthesubsequentparagraphs.

Unlessotherwisementioned,alltechniquesanddiscussionsapplytobothSODAandEVP.

Highperformance Energyefficiency Programmability

• Multipleparallelism

• VLIW• SIMD• Multiplecores(SODA

only)

• Nobranchprediction • VLIW/SIMD

• Hardwareoptimizedfor16‐bitfixed‐pointoperations

• Fixed‐pointoperations

• Scratchpadmemory • Nocache

• SpecialDSPinstructions • SpecialDSPinstructions

• SeparatedSIMDmemory

4.3.2 Wirelessprotocolcharacteristics

Wirelessprotocolsarecharacterizedbyanumberofspecificproperties,whichhaveanimportantimpactonthedesignofaDSPsystem.

• HighData­LevelParallelism—MostofthecomputationallyintensiveDSPalgorithmshaveabundantdatalevelparallelism(forexamplethe“searcher”inaW‐CDMAprotocol,canberepresentedby320‐widevectors),muchmorethaninstructionlevelparalellism.[13]Parallelismiselaboratedonin§4.3.3.

• 8to16­bitdatawidth—Mostalgorithmsoperateonvariableswithsmallvalues.Analysisoftwotypicalwirelessprotocolsshowsthatthereshouldbestrongsupportfor8and16‐bitfixed–pointoperations.32‐bitfixed‐pointoperationsandfloating‐pointsupportisnotnecessary.[13]

• Real­timerequirements—Strictreal‐timerequirementinwirelessprotocolsrequiresdeterministicarchitecturalbehaviour.Thereforefeaturessuchascaching,multi­threadingandpredictionarenotwellsuited.[13]

• Vectoroperations—Intravectoroperations(vectorreductions)andshufflingofdatawithinavectoriskeytoanumberofcommonalgorithms(e.g.FFT).Thereforespecific,powerandperformanceoptimal,supportfortheseoperationsshouldbeprovided.[16]Thisiselaboratedonin§4.3.4.

4.3.3 Parallelism

ThekeytoachievethehighperformancerequiredforSDRistomaximallyexploittheparallelismavailable.

SODAprovidesthreelevelsofparallelism:multiplecores(PEs),VLIWandSIMD.[13]

• Multiplecores—Periodicreal‐timedeadlinesrequirealgorithmstocomputewithdifferentdataratesanddifferentlatencies.Realizingthisforcomplexprotocolsusingasinglethreadedsystemistooexpensive(forexample,complexcontextswitchingsoftware).

16

However,itisnecessarytosupportefficientconcurrentDSPalgorithmexecution.Theretomultiplecores,implementedasPEs,areused.(Eachprotocolpipelineisbrokenupintokernels,andeachkernelisassignedtoaPE.)Sincethetaskinthewirelessprotocolsanalysedcanbepartitionedinto4majortaskgroups,4PEshasbeenchosen.

• VLIW—Thethreepipelines–scalar,SIMDandAGU–canbeconsideredasanasymmetricVLIWpipeline(asymmetric,sincee.g.scalarinstructionscannotbescheduledontheSIMDpipelineandviceversa).Thescalarpipelineisnecessarybecausetherearemanysmall,importantscalaralgorithms.WideSIMDunitswouldbetooinefficientinsupportingthesescalarandnarrowSIMDoperations.

• SIMD—ThecomputationintensiveDSPalgorithmsinwirelessprotocolsusuallycontainoperationsonverywidevectors.Inaddition,vectorwidthsandstridesareknownatcompiletime.Althoughthismakesagoodfitforvectorarchitectures,theextrahardwaresupportfordynamicvectormanagementisunnecessary,asalldataoperationscanbestaticallyscheduled.ThisfavorsawideSIMD‐styleexecution.ThewidthoftheSIMDunitisnotchosentobe32clustersbyaccident.Itistheoptimalpowerconsumptionconfiguration,giventheestimatedcomputationrequirementof10GOPSperPE.Figure9mapsthepowerconsumptiontoSIMDwidth.Therequired10GOPScanbeachievedthroughanumberofSIMD/frequencycombinations,forexamplea2‐wideSIMDrunningat5GHz.Highfrequenciesrequireunrealisticdeeppipelines(indicatedby‘I’),andsufferfromhighpowerconsumption.

Figure9–AveragenormalizedpowerrequirementsofaPE(180nm)

EVPonlyprovidestwolevelsofparallelism:VLIWandSIMD.However,EVPdiffersinitsVLIWorganization:itcanexecutemultipletypesofwideSIMDoperationsinparallel.WhilethisprovidessupportforahigherdegreeofILP,italsorequiresamorecomplexregisterfile.BecausethereisverylimitedILPamongvectorcomputations(i.e.datalevelparallelismismuchmoreprevalentthaninstructionlevelparallelism),thisextralevelofparallelismdoesnotaddmuchtoperformance.[13]

TheEVPSIMDwidthisscalable,buthasbeensetto16forthefirstEVPprototype,labeledEVP16.[16]Theexactreasonofchoosing16isnotspecificaswasthecaseforSODA.Ofcourse,higherSIMDwidthwillprovideforhigherdataparallelism.

4.3.4 SpecialDSPinstructions

SODAcontainsspecialDSPinstructions,bothtooptimizeperformanceandpowerconsumption.ProvidedarespecialvectoroperationssupportedbytheSIMDshufflenetwork(forexample,thebutterflyoperationtoenhanceFFTperformance)andvectortoscalaroperations(forexample,determiningthemaximumvalueinavector).Multiply‐Accumulate(MAC)instructionsarenotprovidedbutinsteadhandledbyvectorlogicoperations(predicatednegation)sincemanyofthesemultiplicationoperationsarewith1bitvalues,consumingsignificantlylesspower.[13]

17

EVPprovidesanumberofspecializedFUs(seealsoFigure8).TheshuffleunitprovidesfunctionalitysimilartoSODA’sSSN(SIMDshufflenetwork).TheintravectorunitprovidesfunctionalitysimilartoSODA’sspecialinstructionsforvectortoscalaroperations(forexampleitcanalsodeterminethemaximumvalueinavector).ThecodegenerationunitsupportsCDMA‘codechip’generation(acodechipisafundamentalunitoftransmissioninCDMA,awirelesschannelaccessmethod).Inasingleclockcycle16successivecomplexcodechipsaregenerated.Theunitisconfigurablefordifferentstandard(UMTS,GPS,etc.).ContrarytoSODA,theMACunitprovidessupportforMACinstructions.[16]

4.3.5 Other

Nobranchprediction—Branchpredictionisnotimplemented,becauseformostDSPalgorithmsthecorekernelsconsistofshallownestedloops.Instead,forSODA,aloopcounter‐basedbranchinstructionisavailable.[13]

Scratchpadmemory/nocache—Tomeettheneedforhighmemorybandwidthandlowpowerconsumption,anon‐chipscratchpadmemoryisusedwhichissmallandhencefast.Withnodatatemporallocality(streamingdata),cachestructuresprovidelittleadditionalbenefits,intermsofpowerandperformance,oversoftwarecontrolledscratchpadmemories.[13]

SeparatedSIMDmemory—BothSODAandEVPhaveadedicatedSIMDmemory.Ingeneraltwomemoriesconsumelowerpowerthanoneunifiedmemory.Inaddition,thisseparationallowsoptimizationoftheread/writeportsforitsintendeduse(inthecaseofSODA:512‐bitfortheSIMDmemoryand16‐bitforthescalarmemory).[13]

Powerconsumption—At180nm,SODA’spowerconsumptionis3W.Thisistoohighforembeddedmobiledevices(200mWisatypicalcellularphone’spowerbudgetforthephysicallayer).Scalingto65nmpredictsthepowerconsumptiontoaround250mW.[13]NopracticalcomparabledataisgivenfortheEVP’spowerconsumption,althoughEVPliteraturesomewhatunclearlymentionsthatrunningcertainUMTSprotocolsontheEVPresultsinapowerconsumptioncloseto1W(at90nm);otherprotocolsperformbetter.[16]

BothEVPandSODAliteraturedoesnotmentionsupportunalignedloads,forwardingandcompression,whichseemstoindicatethatitisnotsupported.

4.4 FinalthoughtsonSODAandEVP

ItshouldbeclearfromthepreviousoverviewthatkeytoachievingtheperformancerequiredforSDRliesinexploitingasmuchaspossibletheavailableparallelism.How,then,canbothSODAandEVPreachthisperformance,giventhatSODAismuchbetterequippedforexploitingparallelism(runningatapproximatelythesamefrequency)?SODAinadditionnotonlyhasfourcores,itsSIMDpipelineis2timeswider.

AresponsefrombothaSODAandEVPdevelopertothissamequestionclarifiesanumberofthings:

• BothcitethedifferenceinVLIWorganization—theEVPcanexecutemultipletypesofwideSIMDoperationsinparallel.[15][18]

• TheEVP‘outsources’anumberofalgorithmstootherprocessors(suchastheViterbidecoding),whereasSODAperformsthesefunctionalitiesonchip(i.e.ondesignatedPEs).[15]

• ForEVPtocatchupwithnewandmoredemandingprotocols,multiplecoresareindeednecessary.[18]

• Finally,SDRisarelativelynewresearchtopic.Itislackingstandardizedbenchmarks,andeachwirelessprotocolhascountlessnumberofdifferentimplementationsandconfigurations.Thesedifferentconfigurationsandimplementationscanaffectperformance/powersignificantly.Hence,itisveryhardtocomparetwoseparateddevelopedandtestedprocessors.[15]

18

5 Literature

5.1 General[1] Wikipedia[2] JasonFritts,WayneWolfandBeddeLiu,Understandingmultimediaapplication

characteristicsfordesigningprogrammablemediaprocessors,1999

5.2 TriMedia

TheTriMediasectionwasbasedmainlyonVanDeWaerdt’spaper.Hedesignedthisprocessorforhisthesisthereforehewasabletoexplainthoroughlymanytechnicaldetailswithenoughclarityandtothepoint.Theremainingsourceswereusedmainlytoseethe“bigpicture”oftheTriMedia,theyarelesstechnicalandmoreconcernedwithconsumerapplicationsandbusinessaspects.[3] VandeWaerdt,etal.TheTM3270Media­Processor.Proceedingsofthe38thAnnual

IEEE/ACMInternationalSymposiumonMicroarchitecture(MICRO’05).2005[4] HoogerbruggeJanetal.InstructionSchedulingforTriMedia.PhilipsResearchLaboratories.[5] BoresSignalProcessing.TriMediaOverview.

http://www.bores.com/courses/tm_overview/index.htm.Lastupdated:March2007.[6] VandeWaerdt,TheTM3270Media­processor,October2006,TUDelft,ISBN90‐9021060‐1,

PhDThesis

5.3 TIC6000

TexasInstrumentsprovidesextensivedocumentationontheC6000platformandthedifferencesbetweenspecificgenerations.Thesedocumentsareingeneralverycomplete,however,informationonwhycertainfeaturesareimplementedarenotincluded.[7] TexasInstruments,TMS320C6000CPUandInstructionSetReferenceGuide,SPRU189d,July

2006[8] TexasInstruments,TMS320C6000TechnicalBrief,SPRU197d,February1999[9] TexasInstruments,TMS320C64xTechnicalOverview,SPRU395b,January2001[10] TexasInstruments,TMS320C62xDSPCPUandInstructionSetReferenceGuide,SPRU731,July

2006[11] Internetforumpost“CcompilersforTITMS320DSPs”,2005‐3‐02,

http://www.dsprelated.com/showmessage/30734/1.php[12] JanPartheyandRobertBaumgartl,PortingGCCtotheTMS320­C6000DSPArchitecture,

AppearedintheProceedingsofGSPx’04,SantaClara,September2004

5.4 SODA

Thepaper“SODA,ALow­powerArchitectureForSoftwareRadio”seemsverycompleteandthorough,discussingtheSODAarchitectureaswellastherationalebehindthedifferentdesigndecisions,whilealsosettingthemofagainstthealternatives.Itwaslackinghoweverindiscussinghowsuchacomplexarchitecturecanbereasonablyprogrammed.[13] YuanLinetal.SODA,ALow­powerArchitectureForSoftwareRadio.InProceedingsofthe

33rdAnnualInternationalSymposiumonComputerArchitecture,2006.[14] YuanLinetal,HierarchicalCoarse­grainedStreamCompilationforSoftwareDefinedRadio,In

CASES’07,2007[15] PersonalemailcommunicationwithKeesMoerman,2008‐12‐16.

19

5.5 EVP

TherewasaverylimitedamountofinformationavailableintheWebontheEVP.ThissectionwasbasedmainlyonVanBerkel’spaper.ThispaperdoesnotfocusontheEVPbutonSDRapplicationsonvectorprocessorsandofferstheEVPasanexample;itdoesnotexploreindepthitsarchitecturalfeatures.Moerman’sarticleoffersabriefbusinessorientedviewoftheEVPcapabilitiesandapplications.[16]VanBerkeletal.VectorProcessingasanEnablerforSoftware­DefinedRadioinHandheld

Devices.EURASIPJournalonAppliedSignalProcessing2005.[17]Moerman,Kees.Embeddedvectorprocessorisonewaytotunesoftware­definedradios.

WirelessNetDesignLine.http://www.wirelessnetdesignline.com/202403292;jsessionid=5UPHUZ4YXPVRYQSNDLRSKHSCJUNN2JVN?pgno=1.Lastupdated:October2007.

[18] PersonalemailcommunicationwithKeesMoerman,2008‐12‐16.