1
Optimization-centric embedded systems design: addressing the multi-core challenge
L. Benini, DEIS, Università di Bologna
Luca.benini@unibo.it
Embedded Applications Pull
[Figure: roadmap of embedded applications vs. year of introduction, 2005–2015 — required energy efficiency grows from 5 GOPS/W through 100 GOPS/W to 1 TOPS/W (source: Philips/IMEC). Applications span A/V streaming, H.264 encoding/decoding, structured encoding/decoding, 802.11n, UWB, Gbit radio, mobile base-band, Si X-ray, 3D TV, 3D gaming, 3D projected display, sign/image/gesture/expression/language/emotion recognition, HMI by motion, adaptive route, collision avoidance, autonomous driving, ubiquitous navigation, dictation, auto-personalization, full recognition (security), 3D ambient interaction.]
2
Moore’s Law Revisited
[Figure: ITRS projection, 2005–2020 — number of processing engines (right axis) grows from 16 in 2005 to 878 in 2020 (16, 23, 32, 46, 63, 79, 101, 133, 161, 212, 268, 348, 424, 526, 669, 878); total logic size and total memory size (normalized to 2005, left axis) grow accordingly. [Martin06]; see also Amarasinghe '06 and the Rapport Kilocore.]
The Multicore Revolution
3
"Traditional" bus-based SoCs fit in one tile!
Extreme NRE cost: complexity, unwieldy technology. High risk and large investment → decreasing design starts.
Flexibility is key → programmability, configurability. Resource management (computation, communication, storage) is essential to achieve efficiency: allocation and scheduling of NUMEROUS, HETEROGENEOUS resources.
Market success hinges on optimal exploitation of available resources.
Heterogeneous Multicore
[Diagram: heterogeneous multicore platform — processing elements (PEs) with local memory hierarchies, SRAM blocks, 3D-stacked main memory (DRAM), CPU, I/O and peripherals. The software stack (applications, software optimization, middleware, RTOS, API, run-time controller) maps onto the hardware, controlling VDD, VT and fclk.]
Resource Management IS Optimization
Search space: the set of "all" possible allocations and schedules.
Constraints: solutions that we are not willing to accept.
Cost function: a property we are interested in (execution time, power, reliability, ...).
Many approaches to optimal RM: off-line vs. on-line; complete (exact) vs. heuristic (approximate).
4
When & Why Complete, Offline Optimization?
Plenty of design-time knowledge: applications pre-characterized at design time; dynamic transitions between different pre-characterized scenarios.
Aggressive exploitation of system resources: reduces overdesign (lowers cost); strong performance guarantees.
Applicable to many embedded applications.
Optimization-centric RM in Practice
[Diagram: the path from application and programming model, through modeling, to the MPSoC system — here a Cell-like platform: PPE (PPU, storage subsystem, bus interface), SPE 0…SPE 7 (each with SPU, MFC, MMU, local storage, DMA, synchronization and bus interface), Element Interconnect Bus (EIB), DRAM memory. Three gaps must be bridged: the Abstraction Gap between application model and platform; the Optimality Gap in allocation + scheduling (tasks T1–T8 placed on resources over time against a deadline); and the Execution Gap between optimized schedule and actual implementation.]
5
Bridging the Abstraction Gap
Target architecture — homogeneous computation tiles:
ARM cores (including instruction and data caches);
tightly coupled, software-controlled scratch-pad memories (SPM);
AMBA AHB interconnect; DMA engine; RTEMS OS; power models for 0.13 μm.
Variable voltage/frequency cores with discrete (Vdd, f) pairs: frequency dividers scale down the system clock; voltage switching overhead is modeled.
Cores use non-cacheable shared memory to communicate; semaphore and interrupt facilities are used for synchronization.
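A minimal sketch of this communication style — one core writing a message into non-cacheable shared memory and signaling a binary semaphore, the other polling and reading. The shared region and the semaphore are modeled as plain C variables; the structure and function names are illustrative, not the platform's actual API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of core-to-core messaging over non-cacheable
 * shared memory. On the real RTEMS-based platform the "full" flag
 * would map to a hardware semaphore and the buffer to the shared
 * memory segment; here both are ordinary variables. */

#define MSG_SIZE 64

typedef struct {
    volatile int full;          /* binary semaphore: 1 = message ready */
    uint8_t payload[MSG_SIZE];  /* message buffer in shared memory     */
} shared_mailbox_t;

/* Producer: copy a message into the shared buffer, then signal. */
static int mailbox_send(shared_mailbox_t *mb, const void *msg, size_t len) {
    if (mb->full || len > MSG_SIZE) return -1; /* would block / too big */
    memcpy(mb->payload, msg, len);
    mb->full = 1;                              /* release the semaphore */
    return 0;
}

/* Consumer: check the semaphore, then copy the message out. */
static int mailbox_recv(shared_mailbox_t *mb, void *msg, size_t len) {
    if (!mb->full || len > MSG_SIZE) return -1; /* nothing to read */
    memcpy(msg, mb->payload, len);
    mb->full = 0;                               /* acquire/clear */
    return 0;
}
```

On the real platform the consumer would block on the semaphore (with interrupt-based wake-up) instead of returning -1.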
[Diagram: SystemC virtual platform — computation tiles with synchronization modules and private memories, a shared memory, the AMBA AHB interconnect, and a programmable clock tree generator (CLOCK 1…CLOCK N, programming register, interrupt slave) driving per-tile clocks.]
Model accuracy is essential → handle abstraction with care!
6
Bus modelling
Bus congestion causes performance unpredictability.
Unary model: each bus transaction is modeled individually against the maximum bus bandwidth — but with millions of bus transactions the model blows up!
Coarse-grain additive model: per-task bandwidth demands (t1, t2) are simply added over time against an achievable bandwidth below the bus maximum.
Characterize the range of applicability of the model on a virtual platform; force the system to work under the additive regime.
Tuning of the additive model
Requesting more than 65% of the maximum bandwidth causes the additive model to fail. Benefits of working in the additive regime: task execution time becomes almost independent of bus utilization; performance predictability is greatly enhanced.
[Plot: task execution time vs. bus utilization.]
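The admission rule above can be sketched as a simple check: the additive model stays valid as long as the sum of the requested bandwidths remains below ~65% of the bus maximum. The function name and units are illustrative.

```c
#include <assert.h>

/* Sketch of the additive-regime admission test: the coarse-grained
 * additive bus model stays accurate as long as the total requested
 * bandwidth is below ~65% of the bus maximum. The 0.65 factor is the
 * empirical threshold quoted in the text. */

#define ADDITIVE_THRESHOLD 0.65

/* Returns 1 if the set of per-task bandwidth requests keeps the bus
 * inside the additive regime, 0 otherwise. */
static int in_additive_regime(const double *req_bw, int n_tasks, double max_bw) {
    double total = 0.0;
    for (int i = 0; i < n_tasks; i++)
        total += req_bw[i];
    return total <= ADDITIVE_THRESHOLD * max_bw;
}
```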
7
Application model
Task graph: task dependencies; execution times expressed in clock cycles: WCN(Ti); communication times (writes & reads) expressed as WCN(WTiTj) and WCN(RTiTj). These values can be back-annotated from functional simulation or computed using WCET analysis tools (e.g. AbsInt). Node types: Normal; Fork, And; Branch, Or.
[Figure: example task graph Task1–Task6, each node annotated with WCN(Ti) and each arc with WCN(WTiTj)/WCN(RTiTj); every task activation consists of a reading phase, an execution phase and a writing phase. Platform diagram: two ARM cores (#1, #2), each with interrupt controller, SPM and semaphores, plus private memories on the system bus.]
Task memory requirements
Communicating tasks might run: on the same processor → negligible communication cost; on different processors → costly message-exchange procedure.
Task storage can be allocated by the optimizer: on the local SPM or on the remote private memory.
Each task has three kinds of memory requirements: program data; internal state; communication queues.
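The annotated task-graph model can be sketched as a pair of C structures; the WCN fields hold worst-case clock-cycle counts back-annotated from simulation. Field and type names are illustrative, not the optimizer's actual data structures.

```c
#include <assert.h>

/* Minimal sketch of the annotated task graph: nodes carry WCN(Ti)
 * and memory requirements, arcs carry WCN(W TiTj)/WCN(R TiTj) and a
 * queue size. All names are hypothetical. */

typedef enum { NODE_NORMAL, NODE_FORK_AND, NODE_BRANCH_OR } node_type_t;

typedef struct {
    int src, dst;        /* arc Ti -> Tj                  */
    long wcn_write;      /* WCN(W TiTj): cycles to write  */
    long wcn_read;       /* WCN(R TiTj): cycles to read   */
    long queue_bytes;    /* communication queue size      */
} arc_t;

typedef struct {
    node_type_t type;
    long wcn_exec;       /* WCN(Ti): execution cycles     */
    long program_bytes;  /* program data                  */
    long state_bytes;    /* internal state                */
} task_t;

/* Worst-case cycles for one activation of task `id`: read all input
 * queues, execute, write all output queues (the three phases). */
static long task_wcn(const task_t *t, const arc_t *arcs, int n_arcs, int id) {
    long total = t->wcn_exec;
    for (int i = 0; i < n_arcs; i++) {
        if (arcs[i].dst == id) total += arcs[i].wcn_read;
        if (arcs[i].src == id) total += arcs[i].wcn_write;
    }
    return total;
}
```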
8
Application Development Flow
Calibration phase: the application's communication task graph (CTG) is characterized via simulation/WCET analysis, yielding application profiles.
Optimization phase: the optimizer computes allocation and scheduling.
The optimal SW application is then implemented with the application-development support and executed on the platform.
9
Validation of optimizer solutions — Throughput
Optimizer → optimal allocation & schedule → virtual platform validation (250 instances).
[Histogram: probability (%) vs. throughput difference (%), from −5% to +11%.]
MAX error lower than 10%; AVG error equal to 4.51%, with standard deviation of 1.94. All deadlines are met.
Validation of optimizer solutions — Power
Optimizer → optimal allocation & schedule → virtual platform validation (250 instances).
[Histogram: probability (%) vs. energy consumption difference (%), from −5% to +11%.]
MAX error lower than 10%; AVG error equal to 4.80%, with standard deviation of 1.71.
10
Bridging the Optimality Gap
The Optimization Challenge
The problem of allocating and scheduling task graphs on the multiple processors of a distributed real-time system is NP-hard.
[Figure: task graph T1–T8 mapped onto a platform of processors Proc. 1…Proc. N with private memories and an interconnect; the resulting allocation and schedule place the tasks on resources over time, meeting the deadline.]
11
Scheduling & Voltage Scaling
Energy/speed trade-offs: varying the CPU voltages (Vdd, Vbs) yields different frequencies (f1, f2, f3).
With mapping and scheduling given (at the fastest frequency), the slack before the deadline can be exploited by scaling voltage and frequency — but voltage and frequency scaling make the problem even harder! Current off-line approaches solve mapping, scheduling and voltage selection separately (sequentially).
Optimization framework: deterministic & stochastic task graphs.
Constraints — resources (computation, communication, storage); timing (task deadlines, makespan).
Objective functions — performance (e.g. makespan); power (energy); bus utilization.
A general modeling framework leads to highly unstructured optimization problems: no black-box/generic optimizer can solve them efficiently. We developed a flexible algorithmic framework which is tuned on specific problems.
12
Logic-Based Benders Decomposition
Objective function (e.g. system energy): in general, it depends on both MP and SP variables.
Master problem (MP) — allocation & frequency assignment, solved by INTEGER PROGRAMMING: resource constraints, plus an SP relaxation constraint (e.g. the sum of execution times on a processor does not exceed the deadline). The allocation yields a lower bound (LB) for the cost.
Subproblem (SP) — scheduling, solved by CONSTRAINT PROGRAMMING: timing constraint; objective function: minimize frequency-switching overheads. Allocation + schedule yields an upper bound (UB) for the cost.
Cuts returned to the MP: a no-good cut when the SP is infeasible; an optimality cut stating that the SP solution is optimal UNLESS a better one exists with a different allocation.
Iterations stop when the MP becomes infeasible!
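The master/subproblem/cut loop can be illustrated with a deliberately tiny toy: a brute-force "master" proposes task-to-processor allocations, a "subproblem" evaluates the schedule (here, a chain of tasks with a communication penalty on processor changes), and cuts forbid already-explored allocations until the master runs out of candidates. Real instances use IP for the master and CP for the subproblem; all problem data below are invented.

```c
#include <assert.h>
#include <limits.h>

/* Toy logic-based Benders loop. Bit i of `alloc` = processor of task i
 * (0 or 1). The chain T0->T1->T2->T3 serializes execution; each
 * cross-processor hop costs `comm_cost` extra cycles. */

#define N_TASKS 4
#define MAX_ALLOC (1 << N_TASKS)

static const int dur[N_TASKS] = {4, 3, 2, 5};
static const int comm_cost = 2;

/* Subproblem: makespan of the chain under a given allocation. */
static int schedule_makespan(unsigned alloc) {
    int span = 0;
    for (int i = 0; i < N_TASKS; i++) {
        span += dur[i];
        if (i > 0 && ((alloc >> i) & 1u) != ((alloc >> (i - 1)) & 1u))
            span += comm_cost;   /* communication penalty */
    }
    return span;
}

/* Benders loop: the master proposes any allocation not yet cut, the
 * subproblem evaluates it, and a no-good cut forbids it afterwards.
 * The loop ends when the master is infeasible (all allocations cut),
 * at which point the best schedule found is optimal. */
static int benders_min_makespan(void) {
    int cut[MAX_ALLOC] = {0};
    int best = INT_MAX;
    for (;;) {
        int found = -1;
        for (unsigned a = 0; a < MAX_ALLOC; a++)
            if (!cut[a]) { found = (int)a; break; }
        if (found < 0) break;              /* master infeasible: done */
        int span = schedule_makespan((unsigned)found);
        if (span < best) best = span;      /* optimality cut: demand better */
        cut[found] = 1;                    /* no-good: forbid this alloc   */
    }
    return best;
}
```

The toy enumerates everything, so it hides the real benefit — in the actual framework the MP relaxation and the cuts prune most of the search space before the CP scheduler ever runs.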
Computational scalability
Simplified CP and IP formulations; the hybrid approach clearly outperforms pure CP and IP techniques. With search time bounded to 1000 sec, CP and IP found a solution in only about 50% of the instances, while the hybrid approach always found a solution.
Deterministic task graphs, mapping & scheduling.
13
Computational Scalability
Hundreds of of decision variablesMuch beyond ILP solver or CP solver capability
Deterministic task graphs, mapping & scheduling & v,f selectionStochastic task graphs, mapping & scheduling & min bus usage
Does it Matter?
Task graph: 10 computational tasks; 15 communication tasks. Throughput required: 1 frame/10 ms, with 2 processors and 4 possible frequency & voltage settings.
Without optimizations: 50.9 μJ. With optimizations: 17.1 μJ (−66.4%).
14
Optimality gap
Comparison with a heuristic two-phase solution (GA): the gap is significant when constraints are tight — a "timing barrier" appears.
Bridging the Execution Gap
15
Programming environment
A software development toolkit helps programmers with software implementation: a generic, customizable application template (OFFLINE SUPPORT); a set of high-level APIs (ONLINE SUPPORT in the RT-OS).
The main goals are: predictable application execution after the optimization step; guarantees on high performance and constraint satisfaction.
Starting from a high-level task and data-flow graph, software developers can easily and quickly build their application infrastructure. Programmers can intuitively translate the high-level representation into C code using our facilities and library.
// Node Type: 0 NORMAL; 1 BRANCH; 2 STOCHASTIC
uint node_type[TASK_NUMBER] = {1,2,2,1,..};
// Node Behaviour: 0 AND; 1 OR; 2 FORK; 3 BRANCH
uint node_behaviour[TASK_NUMBER] = {2,3,3,..};
#define N_CPU 2
uint task_on_core[TASK_NUMBER] = {1,1,2,1};
int schedule_on_core[N_CPU][TASK_NUMBER] = {{1,2,4,8}..};
uint queue_consumer[..][..] = {{0,1,1,0,..},{0,0,0,1,1,..},{0,0,0,0,0,1,1,..},{0,0,0,0,..}..};
Example: number of nodes (e.g. 12); graph of activities; node type (Normal, Branch, Conditional, Terminator); node behaviour (Or, And, Fork, Branch); number of CPUs: 2; task allocation; task scheduling; arc priorities.
[Figure: example graph of 12 activities — N1 (fork), B2/B3 (branch), C4–C7 (conditional), N8–N11, T12 (terminator) — connected by arcs a1–a14 with or/and join behaviours, and the corresponding schedule of the activities on processors P1/P2 over time against the deadline.]
#define TASK_NUMBER 12
16
Queue ordering optimization
Communication ordering affects system performance.
[Figure: tasks T1–T6 on CPU1/CPU2 with communication queues C1–C5; with a poor queue ordering one consumer must wait while another runs, whereas a better ordering lets the ready task run immediately.]
17
Synchronization among tasks
[Figure: task graph T1–T4 with queues C1–C3, mapped on Proc. 1 (T1) and Proc. 2 (T2, T3, T4): T4 is suspended while its input queue is empty and re-activated when data arrive.]
Non-blocking semaphores.
Putting it all together — CELLFLOW: targeting the Cell BE
Project goal: develop a high-level programming environment with optimal allocation and scheduling on top of the CELL SDK.
18
Cell BE Processor Architecture
Heterogeneous, distributed-memory multiprocessor with explicit DMA over a ring NoC.
[Diagram: PPE with L1 (32 KB I/D) and L2 (512 KB) caches; SPEs, each with a 256 KB local store (LS) and DMA engine; Memory Interface Controller (MIC) with XIO; Flex-IO0/Flex-IO1 I/O interfaces — all connected by the ring interconnect.]
Focus on SPEs: 1. statically scheduled, non-preemptive; 2. explicit allocation and de-allocation of local store.
Task & Application Models
Each task is split into three phases: reading input queues; task execution; writing output queues.
Tasks communicate through queues: FIFO buffering; semaphore synchronization. No task preemption.
void Main_task(int id, void *input, void *output, void *state)
{
    // Initialization
    Init_task_structures();
    Init_queues();
    ...
    // Reading phase
    Read_input();
    ...
    // Task execution
    Exec();
    ...
    // Writing phase
    Write_output();
}
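The FIFO queues behind Read_input()/Write_output() can be sketched as a ring buffer whose element counter plays the role of the counting semaphore. The sizes and names are illustrative; on the Cell the copy would be a DMA transfer between local stores rather than a memcpy.

```c
#include <assert.h>
#include <string.h>

/* Sketch of a task-to-task FIFO queue with semaphore-style
 * synchronization. `count` acts as the counting semaphore: a full
 * queue blocks the writer, an empty queue blocks the reader. */

#define Q_DEPTH 4
#define Q_ITEM  32

typedef struct {
    unsigned char buf[Q_DEPTH][Q_ITEM];
    int head, tail, count;
} fifo_t;

static int fifo_write(fifo_t *q, const void *item) {   /* writing phase */
    if (q->count == Q_DEPTH) return -1;                /* queue full  */
    memcpy(q->buf[q->tail], item, Q_ITEM);
    q->tail = (q->tail + 1) % Q_DEPTH;
    q->count++;
    return 0;
}

static int fifo_read(fifo_t *q, void *item) {          /* reading phase */
    if (q->count == 0) return -1;                      /* queue empty */
    memcpy(item, q->buf[q->head], Q_ITEM);
    q->head = (q->head + 1) % Q_DEPTH;
    q->count--;
    return 0;
}
```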
19
Framework
From the application specification (number of nodes: 12, #define TASK_NUMBER 12; graph of activities; number of CPUs: 2), the optimizer produces the allocation and scheduling, which the run-time support enforces on the platform (CPUs, memories, interconnect).
Eclipse plugin — available resource-management engines:
• optimum static schedulers based on integrated IP and CP solvers (Benders decomposition);
• fast & suboptimal list-based scheduler for dynamic scheduling.
The run-time support covers: allocation; scheduling; communication; synchronization.
Components: resources optimizer; communication API; synchronization API; mapping API; GUI; application template; run-time support.
On-line Support
Automatic application configuration: mailbox system; task generation and configuration; buffer allocations.
[Diagram: main memory holds a root segment (segment 0, scheduler) and further segments with queue info, task 0…task N info, allocation info and physical addresses; the PPE configures the SPEs, which allocate their queues in local storage.]
No OS on the SPEs: local scheduling support. Local memory limitation: code overlays (task 0, task 1, task 2).
20
Multi-stage Benders Decomposition
Goal: minimize makespan. Stages: SPE allocation → memory (MEM) allocation → scheduling (SCHED).
When the SCHED problem is solved, one or more cuts (A) are generated to forbid the current memory-device allocation, and the process is restarted from the MEM stage; if the scheduling problem is feasible, an upper bound on the value of the next solution is also posted.
When the MEM & SCHED sub-problem ends, more cuts (B) are generated to forbid the current task-to-SPE assignment.
When the SPE stage becomes infeasible, the process is over, converging to the optimal solution for the overall problem.
21
SPE Allocation
Given a graph with n tasks and m arcs, and a platform with p processing elements: each task must be assigned to a single PE, and the makespan objective function depends only on scheduling decision variables. We therefore adopt a heuristic objective function that spreads tasks as much as possible over different SPEs, which often provides good makespan values pretty quickly: it forces the objective variable z to be greater than the total number of tasks allocated on any PE.

\min z \quad \text{s.t.}
z \ge \sum_{i=0}^{n-1} T_{ij} \qquad \forall j = 0,\dots,p-1 \quad \text{(needed to express the objective function)}
\sum_{j=0}^{p-1} T_{ij} = 1 \qquad \forall i = 0,\dots,n-1 \quad \text{(each task on a single PE)}
\sum_{i=0}^{n-1} DMIN(i)\, T_{ij} \le dline \qquad \forall j = 0,\dots,p-1 \quad \text{(total duration of tasks on a single SPE; discards trivially infeasible solutions)}
T_{ij} \in \{0,1\} \qquad \forall i = 0,\dots,n-1,\ j = 0,\dots,p-1

DMIN(i) is the minimum possible duration of task i and dline is the deadline.
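A small checker makes the SPE allocation constraints concrete: verify that each task sits on exactly one PE and that the per-PE sum of minimum durations respects the deadline, and compute the spreading objective z. Sizes and data are invented for illustration.

```c
#include <assert.h>

/* Checker for the SPE allocation constraints on this slide, over a
 * 0/1 assignment matrix T[i][j] (task i on PE j). Instance sizes are
 * illustrative. */

#define N 4  /* tasks               */
#define P 2  /* processing elements */

static int allocation_feasible(const int T[N][P], const int dmin[N], int dline) {
    for (int i = 0; i < N; i++) {               /* each task on one PE */
        int s = 0;
        for (int j = 0; j < P; j++) s += T[i][j];
        if (s != 1) return 0;
    }
    for (int j = 0; j < P; j++) {               /* relaxed deadline test */
        int load = 0;
        for (int i = 0; i < N; i++) load += dmin[i] * T[i][j];
        if (load > dline) return 0;
    }
    return 1;
}

/* Objective variable z of the spreading heuristic: the maximum
 * number of tasks allocated on any PE. */
static int spread_objective(const int T[N][P]) {
    int z = 0;
    for (int j = 0; j < P; j++) {
        int c = 0;
        for (int i = 0; i < N; i++) c += T[i][j];
        if (c > z) z = c;
    }
    return z;
}
```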
Schedulability test
SPE allocation choices are by themselves very relevant: a bad SPE assignment is sufficient to make the scheduling problem infeasible. If the given allocation is already infeasible for the scheduling component even with minimal task durations, it is useless to complete it with a memory assignment, which cannot lead to any feasible solution overall. A schedulability test is therefore inserted after the first stage: SPE Alloc → SCHED Test → MEM Alloc → SCHED.
22
Memory device allocation
Decision variables, for tasks i = 0,\dots,n-1 and arcs r = 0,\dots,m-1:

M_i \in \{0,1\}, \qquad W_r \in \{0,1\}, \qquad R_r \in \{0,1\}

M_i = 1 if task i allocates its computation data on the local memory of the SPE it is assigned to. For an arc r from producer task h to consumer task k, W_r = 1 if the communication buffer is on SPE pe(h) (that of the producer) and R_r = 1 if the buffer is on SPE pe(k) (that of the consumer):

W_r + R_r \le 1 \ \text{if } pe(h) \ne pe(k), \qquad W_r = R_r \ \text{if } pe(h) = pe(k)

Possible placements for producer Pr and consumer Cn on different SPEs: W_r = 1, R_r = 0 (buffer on the producer's SPE); W_r = 0, R_r = 1 (buffer on the consumer's SPE); W_r = 0, R_r = 0 (buffer in remote memory). When producer and consumer share the same SPE: W_r = R_r = 1 (local buffer) or W_r = R_r = 0 (remote buffer).

Capacity constraint on the local store C of each processor j = 0,\dots,p-1:

base\_usage(j) + \sum_{i:\, pe(i) = j} mem(i)\, M_i + \sum_{r:\, pe(h_r) = j} comm(r)\, W_r + \sum_{r:\, pe(k_r) = j} comm(r)\, R_r \le C

mem(i) is the amount of memory required to store the internal data of task i; comm(r) is the size of the communication buffer associated to arc r; the base_usage(j) expression is the amount of memory needed to store all data permanently allocated on the local device of processor j.
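The local-store usage on the left-hand side of the capacity constraint can be computed directly: sum, per SPE, the permanently allocated data, the internal data of tasks with M_i = 1, and the buffers of arcs with W_r = 1 (producer side) or R_r = 1 (consumer side). The data layout and names below are illustrative.

```c
#include <assert.h>

/* Evaluate local-store usage for one SPE under a given memory
 * allocation (M, W, R), to be compared against the 256 KB capacity.
 * Instance sizes are invented; units are KB for readability. */

#define NT 3          /* tasks */
#define NA 2          /* arcs  */
#define LS_SIZE 256   /* local store capacity (KB) */

typedef struct { int src, dst, size; } arc2_t;

static int ls_usage(int spe, int base_usage,
                    const int pe[NT], const int mem[NT], const int M[NT],
                    const arc2_t comm[NA], const int W[NA], const int R[NA]) {
    int u = base_usage;
    for (int i = 0; i < NT; i++)
        if (pe[i] == spe && M[i]) u += mem[i];       /* internal data */
    for (int r = 0; r < NA; r++) {
        if (pe[comm[r].src] == spe && W[r]) u += comm[r].size; /* producer-side buffer */
        if (pe[comm[r].dst] == spe && R[r]) u += comm[r].size; /* consumer-side buffer */
    }
    return u;
}
```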
23
Scheduling subproblem
One activity for each execution phase (exec) and each buffer reading/writing operation (rd, wr); tasks are not preemptive. For a task i with reading operations rd_0,\dots,rd_{h-1} and writing operations wr_0,\dots,wr_{k-1}:

end(rd_l) = start(rd_{l+1}) \qquad \forall l = 0,\dots,h-2
end(rd_{h-1}) = start(exec_i), \qquad end(exec_i) = start(wr_0)
end(wr_l) = start(wr_{l+1}) \qquad \forall l = 0,\dots,k-2
end(wr_r) \le start(rd_r) \qquad \forall r = 0,\dots,m-1

Each communication buffer must be written before it can be read.
Exact vs. Heuristic Scheduler
Heuristic: round-robin (RR) resource allocation + list scheduling — up to 40% makespan difference, 15% on average.
Example: exact vs. heuristic Gantt charts. Only a combination of allocation + scheduling achieves optimality.
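The heuristic baseline can be sketched in a few lines: round-robin allocation, then list scheduling of a precedence-ordered task list. To keep the sketch short, each task is assumed to have at most one predecessor — a simplification of the general task graph — and all data are invented.

```c
#include <assert.h>

/* Round-robin allocation + list scheduling, the fast heuristic used
 * as a comparison point. Tasks are assumed topologically ordered;
 * pred[i] is the single predecessor of task i (-1 if none). */

#define NTASK 6
#define NPROC 2

static int list_schedule_makespan(const int dur[NTASK], const int pred[NTASK]) {
    int proc_free[NPROC] = {0};   /* when each processor becomes idle */
    int finish[NTASK];
    for (int i = 0; i < NTASK; i++) {
        int p = i % NPROC;                        /* round-robin allocation */
        int ready = (pred[i] >= 0) ? finish[pred[i]] : 0;
        int start = ready > proc_free[p] ? ready : proc_free[p];
        finish[i] = start + dur[i];               /* list-schedule the task */
        proc_free[p] = finish[i];
    }
    int mk = 0;
    for (int i = 0; i < NTASK; i++)
        if (finish[i] > mk) mk = finish[i];
    return mk;                                    /* heuristic makespan */
}
```

Because the allocation is fixed blindly before scheduling, the result can be far from optimal — exactly the gap the integrated exact approach closes.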
24
TD vs. pure CP
One set of instances where task durations depend on allocation decisions; one set of instances where they do not.
TD vs. BD
Up to the 20–21 group, TD is much more efficient than BD. Starting from group 22–23, the high number of timed-out instances biases the average execution time; TD does considerably better until group 24–25. After that, most instances are not solved within the time limit by either approach.
TD has a lower execution time even though it generally performs more iterations than BD: TD works by solving many easy sub-problems, while BD performs fewer but slower iterations.
25
Cellflow: Benchmark Results
[Plot: speedup vs. number of SPEs (0–7) for Mat-mult, SW Radio and FFT against the theoretical limit.]
• The Mat-mult benchmark scales almost perfectly: efficiency of the runtime environment, almost negligible overhead.
• Good speed-ups also for the FFT.
• The software radio benchmark shows good speedup only up to three SPEs: a critical path limits the performance boost; functional pipelining can help.
Real-life applications: RT radar processing — a pipeline of FFT, FDT and iFFT blocks feeding Disp and Coll stages, run on 1 SPE and on N SPEs.
[Charts: average time (μs) and speedups, and single-block computation time (μs) broken down into DMA_PUT, IFFT, CVE-CVML, FFT, DMA_GET and wait time, for 1, 2, 4 and 6 SPUs.]
26
Ongoing & Future Work — Extensions
Scheduling parallel DMA activity; robust schedules with non-WCET execution; full dataflow (FIFO buffers) support; performance tuning of the middleware.
Interaction with high-level tools: parallelization tools; data distribution tools; model-based environments.
Dynamic resource management: hybrid approaches (offline + online).
Integration with advanced architectures: predictable (QoS) communication protocols; multi-hop (NoC) interconnects.
Computational Efficiency: distribution of the allocation/scheduling time ratio for the solvers (TD solved/unsolved, BD solved/unsolved).
The solution time for instances not solved within the limit appears to be strongly unbalanced, with most of the time absorbed by the scheduling component; for the BD solver, substantially all the process time is instead spent solving allocation subproblems.