Upload
others
View
21
Download
0
Embed Size (px)
Citation preview
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
DesignSpaceExplora>oninthecontextofSKA
Jean-FrançoisNEZANCompu>ngforSKAFebruary2016
WiththecourtesyofKalray,
D.Ménard,M.PelcatandK.Desnos
1
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
Introduc>on
• Manychallengingapplica>ons– MPEGCodecs
• MPEG4Part2,AVC,SVC,HEVC,SHVC
– ComputerVisionand3DProcessing• StereoVision,SLAM
– Cryptography• Chao>c-basedCryptography
– Telecommunica>ons• 3GPPLTEeNodeB
– SquareKilometerArrayproject(SKA)
• DesignSpaceExplora>on:Whatisthebesthardware?
2
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
Introduc>on
3
Texas Instruments Keystone II 8x DSP cores @ 1.2 GHz – 4x ARM Cortex A15 @ 1.4 GHz – 6 MB memory + DDR – 650$ - 28nm – 16 Watts
Odroid with Samsung Exynos 5 4x Cortex A15 @ 1.8 GHz– 4x Cortex A7 @ 1.2 GHz + PowerVR GPU – 50$ - 28 nm – 1Watt to 6 Watts
• Poten>alCandidates– Mul>core
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
Introduc>on
4
Zboard with Xilinx Zynq 2x ARM Cortex A9 @ 1 GHz – 444 Logic Cells
512 MB memory + DDR – 400$ - 28nm – 3 or 4 Watts ?
Nvidia GeForce GTX Titan X 3072 cores – 12 Go memory – 1200€ – 1 GHz – 28nm – 313 Watts
• Poten>alCandidates
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
Introduc>on
5
Kalray MPPA
MPPA®-1024
§ 16computeclusters§ 2I/Oclusterseachwithquad-coreCPUs,DDR3,4Ethernet10Gand8PCIeGen3§ Dataandcontrolnetworks-on-chip§ Distributedmemoryarchitecture§ 634GFLOPSSPfor25W@600Mhz
§ 16usercores+1systemcore§ NoCTxandRxinterfaces§ Debug&SupportUnit(DSU)§ 2MBmul>-bankedsharedmemory§ 77GB/sSharedMemoryBW§ 16coresSMPSystem
§ 32-bitor64-bitaddresses§ 5-issueVLIWarchitecture§ MMU+I&Dcache(8KB+8KB)§ 32-bit/64-bitIEEE754-2008FMAFPU§ Tightlycoupledcryptoco-processor§ 2.4GFLOPSSPpercore@600Mhz
• Poten>alCandidates– Manycore(NUMAarchitecture)
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
Outline
• DesignSpaceExplora>on:Whatisthebesthardware?
1. Introduc>on2. Casestudy:SKACSPChannelizeronMPPA3. ChallengesforDSE4. PiSDF:aparallelModelofComputa>onforDSE5. PREESM:atoolforDSE
6
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
SKACSPChannelizeronMPPA
• SquareKilometerArrayProjectinSouthAfrica
7
Channelizer64Kà256KFFT3-4bitsà8bits
Correlator
PulsarSearchbeamformer
PulsarSearch
CSP Central Processing Unit
MiddleArray Receiver
197 x 15m dishes
SDP Science Data Processing
Ingest Gridding InverseFFT Deconvolu>on
ReginonalScienceCenters
80 PFlops
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
SKACSPChannelizeronMPPA
• CSPChannelizerapplica>on– 219realFFTin524µs– Mainconstraints:throughputandenergy– Selec>onofbothalgorithm&architecture
8
Inpu
t
Outpu
t
Tran
spose
FFT
Tran
spose
TwiddleCo
r.
FFT
Tran
spose
Combine
219 x 8bit real
219 x 8bit Real Part
219 x 8bit Imaginary Part
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
SKACSPChannelizeronMPPA
• Profilingof219realFFT(requirementof524µs)includingdatatransfers
9
Implementa@onMPPAComputeCluster
Andey@400MHz(ms)
MPPA-256Andey@400MHz
(ms)
MPPA-256Bostan@400MHz
(ms)
MPPA-256Bostan@600MHz
(ms)
FullyFixed-point–6steps 11,64 1,07 0,975 0.65
MixFloatImplementa@onFloa@ng-pointinFFTstageswithfixed-pointstorageinthemiddleofthe6-step
13.70 1.17 1,05 0.70
FullyFixed-point–6stepsMPPA-256
Andey@400MHz
MPPA-256Bostan@400MHz
MPPA-256Bostan@373MHz
MPPA-256Bostan@600MHz
Powerconsump@onof1MPPA 8.7W 8.7W 8W 17W
NumberofMPPA 2+1cluster 1+14clusters 2 1+4clusters
Powerconsump@on 17,95W 16,3W 16W 22W
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
SKACSPChannelizeronMPPA
10
MPPA®-1024
AndeyKalray1stgenera@on
211GFLOPSSP70GFLOPSDP
32-bitaddressing400Mhz
Kalray2ndgenera@on634GFLOPSSP316GFLOPSDP64-bitaddressing
600Mhz
Kalray3rdgenera@on1056GFLOPSSP527GFLOPSDP64-bitaddressing
1000Mhz
2014 2015
Coolidge
MPPA®-256
MPPA®-64
2016 2017
TSMC28nm(HP) 28nm/16nm
Bostan
Volume Samples Volume Samples Volume
Produc'on
Development
UnderSpecifica'on
MPPA® MANYCORE Processor Roadmap
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
SKACSPChannelizeronMPPA
11
• Architecturedependentop>miza>ons– Throughput
– Latency
– Energy
– Memory
– ProgrammingTime
SIMD&singlecoreParallelism
Datarepresenta>on
ManycoreMapping
Energy-awareprocessing
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
SKACSPChannelizeronMPPA
• MatlabCodetransla>on– Timeconsuming– Errorprone
• Evolu>onofthealgorithmrequirements– Discussionsaroundaccuracy– Futuredeployments(1stin2018andseveralonesun>l2030)
• Evolu>onoftheHardware– NumberofCores– Corearchitecture– Technology
• SameworktobedoneonFPGA,GPU,Mul>core...
12
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
Outline
• DesignSpaceExplora>on:Whatisthebesthardware?
1. Introduc>on2. Casestudy:SKACSPChannelizeronMPPA3. ChallengesforDSE4. PiSDF:aparallelModelofComputa>onforDSE5. PREESM:atoolforDSE
13
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
Source:ITRSSystemDrivers2011
Hardware Complexity
2010 2015 2020 2025
5000
4000
3000
2000
1000
0
Nb of PE per SoC
ChallengesforDesignSpaceExplora>on
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
Lines of code/chip x2 every 10 months
Software productivity Lines of code/day x2 every 5 years
Hardware complexity Transistors/chip x2 every 18 months
Software Productivity Gap
Source:ITRS&Hardware-dependentSo9ware,Eckeretal.,Springer
1990 1995 2000 2005 2010 2015
log
ChallengesforDesignSpaceExplora>on
Multicore Era
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
ChallengesforDesignSpaceExplora>on
• Specifica>onofparallelalgorithms– Throughput/Latencyevalua>on– Predictablememoryfootprints– Legacycodereuse
• Workingprototypes– Seamlesspor>ngtoanexis>ngplaporm(CPU,GPU,NUMA…)– Guaranteeddeadlock-freeness– Inter-PEcommunica>ons
• Virtualprototypes– Numberofcores– Memorysizeandtechnology– Hierarchicaltopologies(memoriesandclustersofcores)
16
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
ChallengesforDesignSpaceExplora>on
• Fromsequen>altoparallelModelsofComputa>ons(MoCs)
– OpenACC,M-TAPI,OpenDataPlane(ODP)…
17
Legacycode
Shared/Dist.memories
GPU Mapping/scheduling
HeterogeneousMPSoC
Hardwaredependance
VHDL
VivadoHLS
OpenCL
OpenMP3
OpenMP4
PThread
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
ChallengesforDesignSpaceExplora>on
• Analogywithsequen>almodels,toolsandmethods
18
Hardware independence Efficiency
Assembly
Sequential Models
Compilers
OpenCL,OpenMP…
Tobedefined:PiSDF?
Tobedefined:PREESM?
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
Outline
• DesignSpaceExplora>on:Whatisthebesthardware?
1. Introduc>on2. Casestudy:SKACSPChannelizeronMPPA3. ChallengesforDSE4. PiSDF:aparallelModelofComputa>onforDSE5. PREESM:atoolforDSE
19
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
PiSDFMoC
Algorithmdescrip@onsusingDataflowGraphs• SynchronousDataflow(SDF)
– ActorsandDataports– FIFOqueues
20
2
2
1
1 1 2 1 1
B
A C D E
E.LeeandD.Messerschmis,“Synchronousdataflow”,ProceedingsoftheIEEE,1987.
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
PiSDFMoC
Algorithmdescrip@onsusingDataflowGraphs• Data-drivenexecu>on
– AnactorisfiredwhenitsinputFIFOscontainenoughdata-tokens.
21
E.LeeandD.Messerschmis,“Synchronousdataflow”,ProceedingsoftheIEEE,1987.
B
A C D E
Core1 A B C C D E
2
2
1
1 1 2 1 1
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
PiSDFMoC
Algorithmdescrip@onsusingDataflowGraphs• Expressionofparallelisms:///
22
2
2
1
1 1 2 1 1
B
A C D E
Core1
Core2
Core3
x2
A B C C E A C +1 +1
B C +1 +1
D E +1 +1
D
Task parallelism Data parallelism Pipeline parallelism Internal parallelism
Data Pipeline InternalTask
E.LeeandD.Messerschmis,“Synchronousdataflow”,ProceedingsoftheIEEE,1987.
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
PiSDFMoC
PiSDF(ParameterizedandInterfacedSynchronousDataflow)
N
Read Header Size =4
Size Size Read Image
Filter
/N Size
out
Size
in
Size
SetNb Slices =2 Size
Kernel /N Size Size
Size Size Send
=6
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
PiSDFMoC
PiSDF(ParameterizedandInterfacedSynchronousDataflow)• PiSDFis:
– Hierarchical&Composi>onal– Sta>callyparameterizable(forcompile>meanalysis)– Dynamicallyreconfigurable(atrun>me)
• PiSDFfosters:– Predictability– Parallelism– Lightweightrun>meoverhead– Developer-friendliness
K. Desnos, M. Pelcat, J.-F. Nezan, S. S. Bhattacharyya, S. Aridhi “PiMM: Parameterized and Interfaced Dataflow Meta-Model for MPSoCs Runtime Reconfiguration”, SAMOS XIII
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
Outline
• DesignSpaceExplora>on:Whatisthebesthardware?
1. Introduc>on2. Casestudy:SKACSPChannelizeronMPPA3. ChallengesforDSE4. PiSDF:aparallelModelofComputa>onforDSE5. PREESM:atoolforDSE
25
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
PREESM:atoolforDesignSpaceExplora>ons
WhatisPREESMIS:• OpenSourceTool
– AvailableonGitHub• Research-OrientedTool
– Newmodels,op>miza>ons,scheduling• Eclipse-basedIntegratedTool
– Severalplug-ins,metamodels• ExtendedWebTutorials
– hsp://preesm.sourceforge.net/website
26
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
27
PREESM:atoolforDesignSpaceExplora>ons
PREESM +C compiler
Simulator + Debugger
+ Profiler
Algorithm
Architecture PE
DSP DSP
DSP DSP
PE PE PE PE
Peripherals
Main Memory
Multicore Runtime
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
PREESM:atoolforDesignSpaceExplora>ons
• PREESMMainfeatures:– Automa>cmapping/scheduling
• Workingandvirtualprototypes– Memoryanalysis:boundingthememoryneeds– Memoryop>misa>on:Buffermergingtechniques– Powerop>miza>ontechniques– Codegenera>on
• Baremetal(sta>cparameters)• Specificrun>me(dynamicparameters)
• AskforaDemo!!
28
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
PREESMDeployment
Mapping/Scheduling• State-of-the-artalgorithms(FAST,List,…)• Latencyandloadbalancingop>miza>on• Customizableaccuracy(w.r.t.communica>ons)
core1 core2 core3 core4
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
PREESM:atoolforDesignSpaceExplora>ons
Memoryop@miza@ons• Boundingthememoryneedsofanapplica>ongraphto
– Evaluatethememoryrequirements– Adjustthesizeofarchitecturememory– Assesstheop>malityofamemoryalloca>on
0 Available Memory
Wasted memory
Possible allocated memory
Insufficient memory
Upper Bound Lower Bound
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
PREESM:atoolforDesignSpaceExplora>ons
Memoryop@miza@ons• BuffermergingtechniqueforSDFgraphs
– 48%lessmemorythanstate-of-the-arttechniques
B
D
C A 20
30 10
No buffer merging AB 30
BC 20
BD 10
memory
Buffer merging
AB 30
BC 20
BD 10
memory
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
PREESM:atoolforDesignSpaceExplora>ons
Energyop@miza@on:plaeormExynos5
OdroidExynos5Big.LITTLE
A7
A7
A7
A7
A15
A15
A15
A15
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
PREESM:atoolforDesignSpaceExplora>ons
Energyop@miza@onsetupcore1 core2 core3 core4
P=0 P=1 P=0.5 P=1 P=0.5 P=0
OdroidExynos5Big.LITTLE
A7
A7
A7
A7
A15
A15
A15
A15
Image Processing
Linux-basedRun>me
(AboAkademi)QoS DVFS
DPM
P-Value
20% energy savings on a parallel Sobel + sequential postprocessing wrt. Linux completely fair scheduler and on-demand governor
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
PREESM:atoolforDesignSpaceExplora>ons
Codegenera@onformul@pletargets• Mul>-C6XDSPs:
– TMS320c6678fromTexasInstruments– Supportstheac>va>onoftheDSPcaches.
• Mul>-x86andmul>-ARMCPUs:– LinuxandWindows,pthread
• OMAP4heterogeneousplaporm:– dual-coreARMCortex-A9,2Cortex-M3,andaC64xTDSP
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
PREESM:academicpartners
35
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
PREESM:industrialpartners
36
TITRE
INSTITUTD’ÉLECTRONIQUEETDETÉLÉCOMMUNICATIONSDERENNES
Questions ?
http://preesm.sf.net @PreesmProject