Upload
others
View
36
Download
1
Embed Size (px)
Citation preview
Programming Heterogeneous Embedded Systems for IoT
Jeronimo CastrillonChair for Compiler ConstructionTU [email protected]
Get-together toward a sustainable collaboration in IoTApril 18th, Tunis, Tunisia
Internet of things
2 © J. Castrillon. Get-together for IoT, Apr. 2016
Source: Open clipart
Real time
Distributed
Resource constrained
Energy efficient
Highly heterogeneous
Multiple domains
Internet of things
3 © J. Castrillon. Get-together for IoT, Apr. 2016
Source: Open clipart
Distributed Energy efficient
Highly heterogeneous
Real time Resource constrained
Multiple domains
IoT/Systems-of-Systems: Extremely hard to program and
reason about
Independent nodes: Trends
q Single node: HW complexityq Increasing heterogeneityq Complex interconnect
q Increasing number of cores
q SW “complexity”q Not anymore a simple control loopq Need expressive programming
models
© J. Castrillon. Get-together for IoT, Apr. 20164
0 2 4 6 8
10 12 14 16
2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
Num
ber o
f PEs
Year
PE Count in SoCs
OMAP Family
OMAP1 OMAP2OMAP3530
OMAP3640OMAP4430
OMAP4470
OMAP5430
Snapdragon Family
S1 QSD8650 S2 MSM8255S3 MSM8060
S4 MSM8960
S4 APQ8064
[Castrill14]
DMAs, sema-
phoresPMU
Peripherals
Communicationsupport
HW queues
Network Processor
Packet DMA
MEMsubsystem
NoC
A15L1
A15L1
L2A15L1
A15L1
VLIW DSP
L1,L2
Independent nodes: Trends
q Single node: HW complexityq Increasing heterogeneityq Complex interconnect
q Increasing number of cores
q SW “complexity”q Not anymore a simple control loopq Need expressive programming
models
© J. Castrillon. Get-together for IoT, Apr. 20165
0 2 4 6 8
10 12 14 16
2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
Num
ber o
f PEs
Year
PE Count in SoCs
OMAP Family
OMAP1 OMAP2OMAP3530
OMAP3640OMAP4430
OMAP4470
OMAP5430
Snapdragon Family
S1 QSD8650 S2 MSM8255S3 MSM8060
S4 MSM8960
S4 APQ8064
[Castrill14]
DMAs, sema-
phoresPMU
Peripherals
Communicationsupport
HW queues
Network Processor
Packet DMA
MEMsubsystem
NoC
A15L1
A15L1
L2A15L1
A15L1
VLIW DSP
L1,L2
Outline
q Introduction
q Models of computation
q Tool flows
q Outlook for IoT
q Summary
© J. Castrillon. Get-together for IoT, Apr. 20166
Outline
q Introduction
q Models of computation
q Tool flows
q Outlook for IoT
q Summary
© J. Castrillon. Get-together for IoT, Apr. 20167
What is a MoC?
8 © J. Castrillon. Get-together for IoT, Apr. 2016
A model of computation is the definition of the set of allowable operations used in computation and their respective costs. It is used for measuring the complexity of an algorithm in execution time and or memory space
Wikipedia: From computational theory
Model of computations are abstract specifications of how a computation can progress
http://c2.com/cgi/wiki?ModelsOfComputat ion
A Model of computation is a collection of rules that govern the interaction of components
Ptolemaeus, C. (Ed.). System Design, Modeling, and Simulation using Ptolemy II Ptolemy.org, 2014
(Parallel) Models of Computation (MoCs): Rules
q What are the componentsq Threads, processes, actors, tasks or procedures
q Execution and concurrencyq Order and determinism
q Communication mechanisms: How do components exchange dataq Asynchronous or rendezvous, perfect or lossy
9 © J. Castrillon. Get-together for IoT, Apr. 2016
A Model of computation is a collection of rules that govern the interaction of components
Ptolemaeus, C. (Ed.). System Design, Modeling, and Simulation using Ptolemy II Ptolemy.org, 2014
Why MoCs?
q Depending on the rules, MoCs feature different propertiesq Properties
q Relate to the power: what can be computed with the modelq Makes it easier for automatic tools to understand and synthesize code
§ E.g., compile-time optimization, support resource constraints, …
q Examplesq Can the application run on bounded memory?q Can we compute the maximal throughput?
q Is the execution deterministic?
10 © J. Castrillon. Get-together for IoT, Apr. 2016
MoCs: Overview
11 © J. Castrillon. Get-together for IoT, Apr. 2016
Ptolemaeus, C. (Ed.). System Design, Modeling, and Simulation using Ptolemy II Ptolemy.org, 2014
For IoT in general: Mixture of MoCs interacting (across different domains. In this talk: Software (concurrent, dataflow + PN + time)
Homogeneous Synchronous Dataflow (HSDF)
q Graph with simple execution semantics
q Functional, graph model can be extended to modelq Bounded buffersq Schedules
12 © J. Castrillon. Get-together for IoT, Apr. 2016
a1 a2 a3e1 e3
e4
e2
a1 a2 a3e1 e3
e4
e2
a1 a2 a3e1 e3
e4
e2
a1 a2
c
HSDF: schedules
q Overlapping schedule
13 © J. Castrillon. Get-together for IoT, Apr. 2016
a2
a1
a3
a4 Time
PE2
PE1 a4
a3
a1
a2
a4
a3
a1
a2
a4
a3
a1
a2
How to delay the execution of a3?, how to prevent a4 from executing earlier?
Outline
q Introduction
q Models of computation
q Tool flows
q Outlook for IoT
q Summary
© J. Castrillon. Get-together for IoT, Apr. 201614
© J. Castrillon. Get-together for IoT, Apr. 201615
Programming flow: Overview
MoC-based Application
Non-functional specification
PNargs_ifft_r.ID = 6U;PNargs_ifft_r.PNchannel_freq_coef = filtered_coef_rightPNargs_ifft_r.PNnum_freq_coef = 0U;PNargs_ifft_r.PNchannel_time_coef = sink_right;PNargs_ifft_r.channel = 1;sink_left = IPCllmrf_open(3, 1, 1);sink_right = IPCllmrf_open(7, 1, 1);PNargs_sink.ID = 7U;PNargs_sink.PNchannel_in_left = sink_left;PNargs_sink.PNnum_in_left = 0U;PNargs_sink.PNchannel_in_right = sink_right;PNargs_sink.PNnum_in_right = 0U;taskParams.arg0 = (xdc_UArg)&PNargs_src;taskParams.priority = 1;
ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtr&taskParams, &eb);
glob_proc_cnt++;hasProcess = 1;taskParams.arg0 = (xdc_UArg)&PNargs_fft_l;taskParams.priority = 1;
ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtrft_Templ, &taskParams, &eb);
glob_proc_cnt++;hasProcess = 1;taskParams.arg0 = (xdc_UArg)&PNargs_ifft_r;taskParams.priority = 1;
ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtrfft_Templ, &taskParams, &eb);
glob_proc_cnt++;hasProcess = 1;taskParams.arg0 = (xdc_UArg)&PNargs_sink;taskParams.priority = 1;
DMAs, sema-
phoresPMU
Peripherals
Communicationsupport
HW queues
Network Processor
Packet DMA
MEMsubsystem
NoC
A15L1
A15L1
L2A15L1
A15L1
VLIW DSP
L1,L2Architecture
model
Analysis
Synthesis
Code generation
[Castrill11a]
Application model: Beyond HSDF
q SDF: Popular model for static applicationsq Possible to derive equivalent HSDF
q Dynamic DF (DDF): For dynamic applicationsq Flexible firing rules (even non-determinism)
q Kahn Process Networks (KPN): Also dynamicq Deterministic q No firing semantics
16 © J. Castrillon. Get-together for IoT, Apr. 2016
a1 a2 a33 1 2 3
6 2
12
e1 e3
e4
e2
a1 a2 a3i1
i2
p1 p2 p3e1 e3
e4
e2
Non/extra-functional: Constraints
q Timing constraintsq Process throughputq Latencies along paths
q Time triggeringq Mapping constraints
q Processes to processors
q Channels to primitivesq Platform constraints
q Subset of resources (processors or memories)q Utilization
© J. Castrillon. Get-together for IoT, Apr. 201617
1 ms
3 ms
1 ms
Architecture model
q Key for retargetable tool flowsq Abstract for speedq Relevant components: processors (types),
interconnect and memories
q Why? Understand effect of executing an actor on the different resources for a schedule
18 © J. Castrillon. Get-together for IoT, Apr. 2016
DMAs, sema-
phoresPMU
Peripherals
Communicationsupport
HW queues
Network Processor
Packet DMA
MEMsubsystem
NoC
A15L1
A15L1
L2A15L1
A15L1
VLIW DSP
L1,L2
Time
PE2
PE1 a4
a3
a1
a2
a4
a3
a1
a2
a4
a3
a1
a2
Modeling processors
q Abstract models with functional units, latencies
19 © J. Castrillon. Get-together for IoT, Apr. 2016
Execution counts, branch stats and execution traces
[Eusse14]
Modeling communication
q High-level: Available APIs for implementing application channels
q Benchmarking for different platforms
20 © J. Castrillon. Get-together for IoT, Apr. 2016
[Oden13]
hs-serial
hs-serial
hs-serial
hs-serial
hs-serial
parallelRouter(1,0)
FPGA-Interface
Router(1,1)
Router(0,1)
Router(0,0)
Duo-PE0
Duo-PE1
FEC
Duo-PE2
SDDuo-PE6
Duo-PE7
Duo-PE5
Duo-PE3 CM
ADPLL, PM
GT
ADPLL, PM
GT
ADPLL, PM
GT
AVS Contr.
UART-GPIO
ADPLL, PM
GT
ADPLL, PM
GT
ADPLL, PM
GT
ADPLL, PM
GT
ADPLL, PM
GT
ADPLL, PMGT
ADPLL, PMGT
ADPLL, PMGT
ADPLL
ADPLL
DDR-SDRAM-Interface
Tomahawk2_core
APP
Duo-PE4
ADPLL, PMGT
VDSPRISC
VDSPRISC
VDSPRISC
VDSPRISC
VDSPRISC
VDSPRISC
VDSPRISC
ADPLL, PMGT
VDSPRISC
[Arnold13]
Modeling communication
q High-level: Available APIs for implementing application channels
q Benchmarking for different platforms
21 © J. Castrillon. Get-together for IoT, Apr. 2016
[Oden13]
hs-serial
hs-serial
hs-serial
hs-serial
hs-serial
parallelRouter(1,0)
FPGA-Interface
Router(1,1)
Router(0,1)
Router(0,0)
Duo-PE0
Duo-PE1
FEC
Duo-PE2
SDDuo-PE6
Duo-PE7
Duo-PE5
Duo-PE3 CM
ADPLL, PM
GT
ADPLL, PM
GT
ADPLL, PM
GT
AVS Contr.
UART-GPIO
ADPLL, PM
GT
ADPLL, PM
GT
ADPLL, PM
GT
ADPLL, PM
GT
ADPLL, PM
GT
ADPLL, PMGT
ADPLL, PMGT
ADPLL, PMGT
ADPLL
ADPLL
DDR-SDRAM-Interface
Tomahawk2_core
APP
Duo-PE4
ADPLL, PMGT
VDSPRISC
VDSPRISC
VDSPRISC
VDSPRISC
VDSPRISC
VDSPRISC
VDSPRISC
ADPLL, PMGT
VDSPRISC
[Arnold13]
0 1 024 2 048 3 072 4 096 5 120 6 144 7 168 8 1920
500
1 000
1 500
2 000
© J. Castrillon. Get-together for IoT, Apr. 201622
Analysis and synthesis: Overview
Non-functional specification
Applicationspecification
Architecture model
Analysis: Instrumentation, profiling, tracing
Sequential performance estimation
Mapping and scheduling
Mapping configuration
Time-annotated traces
Parallel perf. estimation
Increase resources
For dynamic models!
Tracing: Dealing with dynamic behavior
q KPNs do not have firing semanticsq White model of processes: source code analysis and tracingq Tracing: instrumentation, token logging and event recording
© J. Castrillon. Get-together for IoT, Apr. 201623
TracingSeq. perf. estimation
MappingPar. perf. estimation
Resources
Elapsed time from processor models
Trace-based synthesis
q Synthesis based on code and trace analysis (using simple heuristics)q Mapping of processes and channelsq Scheduling policies
q Buffer sizing24 © J. Castrillon. Get-together for IoT, Apr. 2016
[Castrill10, Castrill10b, Castrill13]
TracingSeq. perf. estimation
MappingPar. perf. estimation
Resources
t
t
t
Time-annotated traces
Mapping, scheduling,
buffer sizing
Non-functional specification
Mapping configuration
Architecture model
TracingSeq. perf. estimation
MappingPar. perf. estimation
Resources
Mapping, scheduling and buffer sizing
q Graph representation of event tracesq Reason about deadlocksq Find critical path
q Downside: For dynamic portions of the application, graph is input dependent
25 © J. Castrillon. Get-together for IoT, Apr. 2016
...
...[Castrill12]
Multiple traces (ongoing work)
q Different input à different behavior (traces)q Characterize behaviors and impact on mapping performance
© J. Castrillon. Get-together for IoT, Apr. 201626
[Goens15]
TracingSeq. perf. estimation
MappingPar. perf. estimation
Resources
…
...
...
...
Mapping configuration
Mapping configuration
Mapping configuration
Multiple traces (2)
q Different input à different behavior (traces)q Characterize behaviors and impact on mapping performance
27 © J. Castrillon. Get-together for IoT, Apr. 2016
...
...
...
TracingSeq. perf. estimation
MappingPar. perf. estimation
Resources
Behavior
Perf
orm
ance
[Goens15]
Mapping configuration
Mapping configuration
Mapping configuration
Parallel performance estimation
q Discrete event simulator to evaluate a solutionq Replay traces according to mappingq Extract costs from architecture file (NoC modeling, context switches, communication)
© J. Castrillon. Get-together for IoT, Apr. 201628
TracingSeq. perf. estimation
MappingPar. perf. estimation
Resources
Mapping configuration
t
t
t
Time-annotated traces
Trace Replay Module(TRM)
Architecture model
Grantt Chart, platform utilization, channel profiles, ...
t
Contextswitch
Blockedtime
...
Increasing resources
q Add resources to the synthesis until constraints are met
q Add processors and memoriesq Easy for homogeneous platforms
q Non trivial for heterogeneous platforms (ongoing work)
© J. Castrillon. Get-together for IoT, Apr. 201629
TracingSeq. perf. estimation
MappingPar. perf. estimation
ResourcesMapping and scheduling
Parallel perf. estimation
Increase resources
Multiple mapping configurations
q In case multiple applications may run on the system (or system of systems)q Need different configurationsq Adapt to resource and energy budgets
q Reason about mapping equivalences
30 © J. Castrillon. Get-together for IoT, Apr. 2016
=?
[Goens15]
Analysis for Multi-applications
q Use utilization profiles from the discrete event simulator (TRM)q Quickly separate good from bad implementations
31 © J. Castrillon. Get-together for IoT, Apr. 2016
[Castrill14]
Code generation
q Take mapping configuration and generate code accordinglyq From architecture model: APIs, configuration parameters, …q Sample targets: experimental heterogenous systems, TI (Keysonte,
TDA3x), Parallela/Ephiphany, ARM-based (Exynos, Snapdragon)
32 © J. Castrillon. Get-together for IoT, Apr. 2016
PNargs_ifft_r.ID = 6U;PNargs_ifft_r.PNchannel_freq_coef = filtered_coef_right;PNargs_ifft_r.PNnum_freq_coef = 0U;PNargs_ifft_r.PNchannel_time_coef = sink_right;PNargs_ifft_r.channel = 1;sink_left = IPCllmrf_open(3, 1, 1);sink_right = IPCllmrf_open(7, 1, 1);PNargs_sink.ID = 7U;PNargs_sink.PNchannel_in_left = sink_left;PNargs_sink.PNnum_in_left = 0U;PNargs_sink.PNchannel_in_right = sink_right;PNargs_sink.PNnum_in_right = 0U;taskParams.arg0 = (xdc_UArg)&PNargs_src;taskParams.priority = 1;...
Code generationCPN
application
Mapping configuration Architecture
model Debugging info[Castrill11b, Murillo14]
Examples
q Same application(s) automatically compiled to different platforms
q Virtual platforms: SystemC models of full systemsq Bare metalq Multi-RISC/VLIW platform
q Real platform: TI Keystone platformq Speedup on commercial platformsq Code generation against vendor stacks
33 © J. Castrillon. Get-together for IoT, Apr. 2016
Example: multi-media applications
q Iterative mapping
© J. Castrillon. Get-together for IoT, Apr. 201634
© J. Castrillon. Get-together for IoT, Apr. 201635
Manual vs. Automatic: TI Keystone
Image processing Audio filtering application
LTE digital receiver
[Aguilar14]
Outline
q Introduction
q Models of computation
q Tool flows
q Outlook for IoT
q Summary
© J. Castrillon. Get-together for IoT, Apr. 201636
© J. Castrillon. Get-together for IoT, Apr. 201637
So far…
MoC-based Application
Non-functional specification
PNargs_ifft_r.ID = 6U;PNargs_ifft_r.PNchannel_freq_coef = filtered_coef_rightPNargs_ifft_r.PNnum_freq_coef = 0U;PNargs_ifft_r.PNchannel_time_coef = sink_right;PNargs_ifft_r.channel = 1;sink_left = IPCllmrf_open(3, 1, 1);sink_right = IPCllmrf_open(7, 1, 1);PNargs_sink.ID = 7U;PNargs_sink.PNchannel_in_left = sink_left;PNargs_sink.PNnum_in_left = 0U;PNargs_sink.PNchannel_in_right = sink_right;PNargs_sink.PNnum_in_right = 0U;taskParams.arg0 = (xdc_UArg)&PNargs_src;taskParams.priority = 1;
ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtr&taskParams, &eb);
glob_proc_cnt++;hasProcess = 1;taskParams.arg0 = (xdc_UArg)&PNargs_fft_l;taskParams.priority = 1;
ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtrft_Templ, &taskParams, &eb);
glob_proc_cnt++;hasProcess = 1;taskParams.arg0 = (xdc_UArg)&PNargs_ifft_r;taskParams.priority = 1;
ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtrfft_Templ, &taskParams, &eb);
glob_proc_cnt++;hasProcess = 1;taskParams.arg0 = (xdc_UArg)&PNargs_sink;taskParams.priority = 1;
DMAs, sema-
phoresPMU
Peripherals
Communicationsupport
HW queues
Network Processor
Packet DMA
MEMsubsystem
NoC
A15L1
A15L1
L2A15L1
A15L1
VLIW DSP
L1,L2Architecture
model
Analysis
Synthesis
Code generation
What’s missing?
q Reactivityq Supported for periodic events (trigger)q Enough for many applications (automotive)
q Sporadic (I/O) events may break determinism
q Better execution guarantees
q Multi-SoCq Supported if inter-SoC communication is
predictable (same cost-function approach)
q Scalability: Add a layer of “geographic” coarse mapping
38 © J. Castrillon. Get-together for IoT, Apr. 2016
1 ms
3 ms
1 ms
DMAs, sema-
phoresPMU
Peripherals
Communicationsupport
HW queues
Network Processor
Packet DMA
MEMsubsystem
NoC
A15L1
A15L1
L2A15L1
A15L1
VLIW DSP
L1,L2
Outline
q Introduction
q Models of computation
q Tool flows
q Outlook for IoT
q Summary
© J. Castrillon. Get-together for IoT, Apr. 201640
Summary
q IoT complexity: heterogeneity, resource constraints, distributed, …q The need for formal MoC for programming and synthesisq Sample tool flow
q Heterogeneity: High-level architecture models
q Support dynamic applications: Analysis and synthesis based on tracesq Extensions for: multiple traces
q Outlookq Adaptivity at runtime
q Reactivity and multi-SoC case studies
© J. Castrillon. Get-together for IoT, Apr. 201641
Acknowledgements
q Silexica Software Solutions GmbH
q German Cluster of Excellence: Center for Advancing Electronics Dresden (www.cfaed.tu-dresden.de)
© J. Castrillon. Get-together for IoT, Apr. 201642
References[Castrill14] J. Castrillon , et al., “Programming Heterogeneous MPSoCs:Tool Flows to Close the Software Productivity Gap”, Springer, 2014
[Castrill11a] J. Castrillon, et al., “Trends in embedded softwaresynthesis”, International Conference on Embedded Computer Systems(SAMOS) pp. 347-354, 2011
[Oden13] M. Odendahl, et al., “Split-cost communication model forimproved MPSoC application mapping”, In International Symposium onSystem on Chip pp. 1-8, 2013
[Arnold13] O. Arnold, et al. “Tomahawk - Parallelism andHeterogeneity in Communications Signal Processing MPSoCs”. TECS,2013
[Eusse14] J.F. Eusse, , et al., "Pre-architectural performance estimationfor ASIP design based on abstract processor models," In SAMOS 2014pp.133-140, 2014
[Castrill10] J. Castrillon, , et al., “Component-based waveformdevelopment: The nucleus tool flow for efficient and portable SDR,”Wireless Innovation Conference and Product Exposition (SDR), 2010
[Castrill10b] J. Castrillon, et al., “Trace-based KPN composabilityanalysis for mapping simultaneous applications to MPSoC platforms”, InDATE 2010, pp. 753-758
[Castrill13] J. Castrillon, et al., “MAPS: Mapping concurrent dataflowapplications to heterogeneous MPSoCs,” IEEE Transactions on IndustrialInformatics, vol. 9, 2013
[Castrill12] J. Castrillon, et al. , “Communication-aware mapping ofKPN applications onto heterogeneous MPSoCs,” in DAC 2012
[Goens15] A. Goens, et al., “Analysis of Process Traces for MappingDynamic KPN Applications to MPSoCs”, In IFIP International EmbeddedSystems Symposium (IESS), 2015, Foz do Iguaçu, Brazil (Accepted forpublication), 2015
[Castrill11b] J. Castrillon, et al., “Backend for virtual platforms withhardware scheduler in the MAPS framework”, In LASCAS 2011 pp. 1-4,2011
[Murillo14] L. G. Murillo, et al., “Automatic detection of concurrencybugs through event ordering constraints”, In DATE 2014, pp. 1-6, 2014
[Aguilar14] M. Aguilar, et al., "Improving performance andproductivity for software development on TI Multicore DSP platforms,"in Education and Research Conference (EDERC), 2014 6th EuropeanEmbedded Design in , vol., no., pp.31-35, 11-12 Sept. 2014
© J. Castrillon. Get-together for IoT, Apr. 201643