1
Optimization-centric embedded systems design: addressing the multi-core challenge
L. Benini, DEIS, Università di Bologna
Luca.benini@unibo.it
Embedded Applications Pull
[Figure: roadmap of embedded applications vs. year of introduction, 2005–2015 — required energy efficiency grows from 5 GOPS/W through 100 GOPS/W to 1 TOPS/W (source: Philips/IMEC). Applications span A/V streaming, H.264 encoding/decoding, structured encoding/decoding, 802.11n, UWB, Gbit radio, mobile base-band, Si X-ray, 3D TV, 3D gaming, 3D projected display, sign/image/gesture/expression/language/emotion recognition, HMI by motion, adaptive route, collision avoidance, autonomous driving, ubiquitous navigation, dictation, auto-personalization, full recognition (security), 3D ambient interaction.]
2
Moore’s Law Revisited
[Figure: ITRS projection, 2005–2020 — number of processing engines (right axis) grows from 16 in 2005 to 878 in 2020 (16, 23, 32, 46, 63, 79, 101, 133, 161, 212, 268, 348, 424, 526, 669, 878); total logic size and total memory size (normalized to 2005, left axis) grow accordingly. [Martin06]; see also Amarasinghe '06 and the Rapport Kilocore.]
The Multicore Revolution
3
"Traditional" bus-based SoCs fit in one tile!
Extreme NRE cost: complexity, unwieldy technology. High risk and large investment → decreasing design starts.
Flexibility is key → programmability, configurability. Resource management (computation, communication, storage) is essential to achieve efficiency: allocation and scheduling of NUMEROUS, HETEROGENEOUS resources.
Market success hinges on optimal exploitation of available resources.
Heterogeneous Multicore
[Diagram: heterogeneous multicore platform — processing elements (PEs) with local memory hierarchies, SRAM blocks, 3D-stacked main memory (DRAM), CPU, I/O and peripherals. The software stack (applications, software optimization, middleware, RTOS, API, run-time controller) maps onto the hardware, controlling VDD, VT and fclk.]
Resource Management IS Optimization
Search space: the set of "all" possible allocations and schedules.
Constraints: solutions that we are not willing to accept.
Cost function: a property we are interested in (execution time, power, reliability, ...).
Many approaches to optimal RM: off-line vs. on-line; complete (exact) vs. heuristic (approximate).
4
When & Why Complete, Offline Optimization?
Plenty of design-time knowledge: applications pre-characterized at design time; dynamic transitions between different pre-characterized scenarios.
Aggressive exploitation of system resources: reduces overdesign (lowers cost); strong performance guarantees.
Applicable to many embedded applications.
Optimization-centric RM in Practice
[Diagram: the path from application and programming model, through modeling, to the MPSoC system — here a Cell-like platform: PPE (PPU, storage subsystem, bus interface), SPE 0…SPE 7 (each with SPU, MFC, MMU, local storage, DMA, synchronization and bus interface), Element Interconnect Bus (EIB), DRAM memory. Three gaps must be bridged: the Abstraction Gap between application model and platform; the Optimality Gap in allocation + scheduling (tasks T1–T8 placed on resources over time against a deadline); and the Execution Gap between optimized schedule and actual implementation.]
5
Bridging the Abstraction Gap
Target architecture — homogeneous computation tiles:
ARM cores (including instruction and data caches);
tightly coupled, software-controlled scratch-pad memories (SPM);
AMBA AHB interconnect; DMA engine; RTEMS OS; power models for 0.13 μm.
Variable voltage/frequency cores with discrete (Vdd, f) pairs: frequency dividers scale down the system clock; voltage switching overhead is modeled.
Cores use non-cacheable shared memory to communicate; semaphore and interrupt facilities are used for synchronization.
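A minimal sketch of this communication style — one core writing a message into non-cacheable shared memory and signaling a binary semaphore, the other polling and reading. The shared region and the semaphore are modeled as plain C variables; the structure and function names are illustrative, not the platform's actual API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of core-to-core messaging over non-cacheable
 * shared memory. On the real RTEMS-based platform the "full" flag
 * would map to a hardware semaphore and the buffer to the shared
 * memory segment; here both are ordinary variables. */

#define MSG_SIZE 64

typedef struct {
    volatile int full;          /* binary semaphore: 1 = message ready */
    uint8_t payload[MSG_SIZE];  /* message buffer in shared memory     */
} shared_mailbox_t;

/* Producer: copy a message into the shared buffer, then signal. */
static int mailbox_send(shared_mailbox_t *mb, const void *msg, size_t len) {
    if (mb->full || len > MSG_SIZE) return -1; /* would block / too big */
    memcpy(mb->payload, msg, len);
    mb->full = 1;                              /* release the semaphore */
    return 0;
}

/* Consumer: check the semaphore, then copy the message out. */
static int mailbox_recv(shared_mailbox_t *mb, void *msg, size_t len) {
    if (!mb->full || len > MSG_SIZE) return -1; /* nothing to read */
    memcpy(msg, mb->payload, len);
    mb->full = 0;                               /* acquire/clear */
    return 0;
}
```

On the real platform the consumer would block on the semaphore (with interrupt-based wake-up) instead of returning -1.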
[Diagram: SystemC virtual platform — computation tiles with synchronization modules and private memories, a shared memory, the AMBA AHB interconnect, and a programmable clock tree generator (CLOCK 1…CLOCK N, programming register, interrupt slave) driving per-tile clocks.]
Model accuracy is essential → handle abstraction with care!
6
Bus modelling
Bus congestion causes performance unpredictability.
Unary model: each bus transaction is modeled individually against the maximum bus bandwidth — but with millions of bus transactions the model blows up!
Coarse-grain additive model: per-task bandwidth demands (t1, t2) are simply added over time against an achievable bandwidth below the bus maximum.
Characterize the range of applicability of the model on a virtual platform; force the system to work under the additive regime.
Tuning of the additive model
Requesting more than 65% of the maximum bandwidth causes the additive model to fail. Benefits of working in the additive regime: task execution time becomes almost independent of bus utilization; performance predictability is greatly enhanced.
[Plot: task execution time vs. bus utilization.]
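The admission rule above can be sketched as a simple check: the additive model stays valid as long as the sum of the requested bandwidths remains below ~65% of the bus maximum. The function name and units are illustrative.

```c
#include <assert.h>

/* Sketch of the additive-regime admission test: the coarse-grained
 * additive bus model stays accurate as long as the total requested
 * bandwidth is below ~65% of the bus maximum. The 0.65 factor is the
 * empirical threshold quoted in the text. */

#define ADDITIVE_THRESHOLD 0.65

/* Returns 1 if the set of per-task bandwidth requests keeps the bus
 * inside the additive regime, 0 otherwise. */
static int in_additive_regime(const double *req_bw, int n_tasks, double max_bw) {
    double total = 0.0;
    for (int i = 0; i < n_tasks; i++)
        total += req_bw[i];
    return total <= ADDITIVE_THRESHOLD * max_bw;
}
```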
7
Application model
Task graph: task dependencies; execution times expressed in clock cycles: WCN(Ti); communication times (writes & reads) expressed as WCN(WTiTj) and WCN(RTiTj). These values can be back-annotated from functional simulation or computed using WCET analysis tools (e.g. AbsInt). Node types: Normal; Fork, And; Branch, Or.
[Figure: example task graph Task1–Task6, each node annotated with WCN(Ti) and each arc with WCN(WTiTj)/WCN(RTiTj); every task activation consists of a reading phase, an execution phase and a writing phase. Platform diagram: two ARM cores (#1, #2), each with interrupt controller, SPM and semaphores, plus private memories on the system bus.]
Task memory requirements
Communicating tasks might run: on the same processor → negligible communication cost; on different processors → costly message-exchange procedure.
Task storage can be allocated by the optimizer: on the local SPM or on the remote private memory.
Each task has three kinds of memory requirements: program data; internal state; communication queues.
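The annotated task-graph model can be sketched as a pair of C structures; the WCN fields hold worst-case clock-cycle counts back-annotated from simulation. Field and type names are illustrative, not the optimizer's actual data structures.

```c
#include <assert.h>

/* Minimal sketch of the annotated task graph: nodes carry WCN(Ti)
 * and memory requirements, arcs carry WCN(W TiTj)/WCN(R TiTj) and a
 * queue size. All names are hypothetical. */

typedef enum { NODE_NORMAL, NODE_FORK_AND, NODE_BRANCH_OR } node_type_t;

typedef struct {
    int src, dst;        /* arc Ti -> Tj                  */
    long wcn_write;      /* WCN(W TiTj): cycles to write  */
    long wcn_read;       /* WCN(R TiTj): cycles to read   */
    long queue_bytes;    /* communication queue size      */
} arc_t;

typedef struct {
    node_type_t type;
    long wcn_exec;       /* WCN(Ti): execution cycles     */
    long program_bytes;  /* program data                  */
    long state_bytes;    /* internal state                */
} task_t;

/* Worst-case cycles for one activation of task `id`: read all input
 * queues, execute, write all output queues (the three phases). */
static long task_wcn(const task_t *t, const arc_t *arcs, int n_arcs, int id) {
    long total = t->wcn_exec;
    for (int i = 0; i < n_arcs; i++) {
        if (arcs[i].dst == id) total += arcs[i].wcn_read;
        if (arcs[i].src == id) total += arcs[i].wcn_write;
    }
    return total;
}
```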
8
Application Development Flow
Calibration phase: the application's communication task graph (CTG) is characterized via simulation/WCET analysis, yielding application profiles.
Optimization phase: the optimizer computes allocation and scheduling.
The optimal SW application is then implemented with the application-development support and executed on the platform.
9
Validation of optimizer solutions — Throughput
Optimizer → optimal allocation & schedule → virtual platform validation (250 instances).
[Histogram: probability (%) vs. throughput difference (%), from −5% to +11%.]
MAX error lower than 10%; AVG error equal to 4.51%, with standard deviation of 1.94. All deadlines are met.
Validation of optimizer solutions — Power
Optimizer → optimal allocation & schedule → virtual platform validation (250 instances).
[Histogram: probability (%) vs. energy consumption difference (%), from −5% to +11%.]
MAX error lower than 10%; AVG error equal to 4.80%, with standard deviation of 1.71.
10
Bridging the Optimality Gap
The Optimization Challenge
The problem of allocating and scheduling task graphs on the multiple processors of a distributed real-time system is NP-hard.
[Figure: task graph T1–T8 mapped onto a platform of processors Proc. 1…Proc. N with private memories and an interconnect; the resulting allocation and schedule place the tasks on resources over time, meeting the deadline.]
11
Scheduling & Voltage Scaling
Energy/speed trade-offs: varying the CPU voltages (Vdd, Vbs) yields different frequencies (f1, f2, f3).
With mapping and scheduling given (at the fastest frequency), the slack before the deadline can be exploited by scaling voltage and frequency — but voltage and frequency scaling make the problem even harder! Current off-line approaches solve mapping, scheduling and voltage selection separately (sequentially).
Optimization framework: deterministic & stochastic task graphs.
Constraints — resources (computation, communication, storage); timing (task deadlines, makespan).
Objective functions — performance (e.g. makespan); power (energy); bus utilization.
A general modeling framework leads to highly unstructured optimization problems: no black-box/generic optimizer can solve them efficiently. We developed a flexible algorithmic framework which is tuned on specific problems.
12
Logic-Based Benders Decomposition
Objective function (e.g. system energy): in general, it depends on both MP and SP variables.
Master problem (MP) — allocation & frequency assignment, solved by INTEGER PROGRAMMING: resource constraints, plus an SP relaxation constraint (e.g. the sum of execution times on a processor does not exceed the deadline). The allocation yields a lower bound (LB) for the cost.
Subproblem (SP) — scheduling, solved by CONSTRAINT PROGRAMMING: timing constraint; objective function: minimize frequency-switching overheads. Allocation + schedule yields an upper bound (UB) for the cost.
Cuts returned to the MP: a no-good cut when the SP is infeasible; an optimality cut stating that the SP solution is optimal UNLESS a better one exists with a different allocation.
Iterations stop when the MP becomes infeasible!
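The master/subproblem/cut loop can be illustrated with a deliberately tiny toy: a brute-force "master" proposes task-to-processor allocations, a "subproblem" evaluates the schedule (here, a chain of tasks with a communication penalty on processor changes), and cuts forbid already-explored allocations until the master runs out of candidates. Real instances use IP for the master and CP for the subproblem; all problem data below are invented.

```c
#include <assert.h>
#include <limits.h>

/* Toy logic-based Benders loop. Bit i of `alloc` = processor of task i
 * (0 or 1). The chain T0->T1->T2->T3 serializes execution; each
 * cross-processor hop costs `comm_cost` extra cycles. */

#define N_TASKS 4
#define MAX_ALLOC (1 << N_TASKS)

static const int dur[N_TASKS] = {4, 3, 2, 5};
static const int comm_cost = 2;

/* Subproblem: makespan of the chain under a given allocation. */
static int schedule_makespan(unsigned alloc) {
    int span = 0;
    for (int i = 0; i < N_TASKS; i++) {
        span += dur[i];
        if (i > 0 && ((alloc >> i) & 1u) != ((alloc >> (i - 1)) & 1u))
            span += comm_cost;   /* communication penalty */
    }
    return span;
}

/* Benders loop: the master proposes any allocation not yet cut, the
 * subproblem evaluates it, and a no-good cut forbids it afterwards.
 * The loop ends when the master is infeasible (all allocations cut),
 * at which point the best schedule found is optimal. */
static int benders_min_makespan(void) {
    int cut[MAX_ALLOC] = {0};
    int best = INT_MAX;
    for (;;) {
        int found = -1;
        for (unsigned a = 0; a < MAX_ALLOC; a++)
            if (!cut[a]) { found = (int)a; break; }
        if (found < 0) break;              /* master infeasible: done */
        int span = schedule_makespan((unsigned)found);
        if (span < best) best = span;      /* optimality cut: demand better */
        cut[found] = 1;                    /* no-good: forbid this alloc   */
    }
    return best;
}
```

The toy enumerates everything, so it hides the real benefit — in the actual framework the MP relaxation and the cuts prune most of the search space before the CP scheduler ever runs.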
Computational scalability
Simplified CP and IP formulations; the hybrid approach clearly outperforms pure CP and IP techniques. With search time bounded to 1000 sec, CP and IP found a solution in only about 50% of the instances, while the hybrid approach always found a solution.
Deterministic task graphs, mapping & scheduling.
13
Computational Scalability
Hundreds of of decision variablesMuch beyond ILP solver or CP solver capability
Deterministic task graphs, mapping & scheduling & v,f selectionStochastic task graphs, mapping & scheduling & min bus usage
Does it Matter?
Task graph: 10 computational tasks; 15 communication tasks. Throughput required: 1 frame/10 ms, with 2 processors and 4 possible frequency & voltage settings.
Without optimizations: 50.9 μJ. With optimizations: 17.1 μJ (−66.4%).
14
Optimality gap
Comparison with a heuristic two-phase solution (GA): the gap is significant when constraints are tight — a "timing barrier" appears.
Bridging the Execution Gap
15
Programming environment
A software development toolkit helps programmers with software implementation: a generic, customizable application template (OFFLINE SUPPORT); a set of high-level APIs (ONLINE SUPPORT in the RT-OS).
The main goals are: predictable application execution after the optimization step; guarantees on high performance and constraint satisfaction.
Starting from a high-level task and data-flow graph, software developers can easily and quickly build their application infrastructure. Programmers can intuitively translate the high-level representation into C code using our facilities and library.
// Node Type: 0 NORMAL; 1 BRANCH; 2 STOCHASTIC
uint node_type[TASK_NUMBER] = {1,2,2,1,..};
// Node Behaviour: 0 AND; 1 OR; 2 FORK; 3 BRANCH
uint node_behaviour[TASK_NUMBER] = {2,3,3,..};
#define N_CPU 2
uint task_on_core[TASK_NUMBER] = {1,1,2,1};
int schedule_on_core[N_CPU][TASK_NUMBER] = {{1,2,4,8}..};
uint queue_consumer[..][..] = {{0,1,1,0,..},{0,0,0,1,1,..},{0,0,0,0,0,1,1,..},{0,0,0,0,..}..};
Example: number of nodes (e.g. 12); graph of activities; node type (Normal, Branch, Conditional, Terminator); node behaviour (Or, And, Fork, Branch); number of CPUs: 2; task allocation; task scheduling; arc priorities.
[Figure: example graph of 12 activities — N1 (fork), B2/B3 (branch), C4–C7 (conditional), N8–N11, T12 (terminator) — connected by arcs a1–a14 with or/and join behaviours, and the corresponding schedule of the activities on processors P1/P2 over time against the deadline.]
#define TASK_NUMBER 12
16
Queue ordering optimization
Communication ordering affects system performance.
[Figure: tasks T1–T6 on CPU1/CPU2 with communication queues C1–C5; with a poor queue ordering one consumer must wait while another runs, whereas a better ordering lets the ready task run immediately.]
17
Synchronization among tasks
[Figure: task graph T1–T4 with queues C1–C3, mapped on Proc. 1 (T1) and Proc. 2 (T2, T3, T4): T4 is suspended while its input queue is empty and re-activated when data arrive.]
Non-blocking semaphores.
Putting it all together — CELLFLOW: targeting the Cell BE
Project goal: develop a high-level programming environment with optimal allocation and scheduling on top of the CELL SDK.
18
Cell BE Processor Architecture
Heterogeneous, distributed-memory multiprocessor with explicit DMA over a ring NoC.
[Diagram: PPE with L1 (32 KB I/D) and L2 (512 KB) caches; SPEs, each with a 256 KB local store (LS) and DMA engine; Memory Interface Controller (MIC) with XIO; Flex-IO0/Flex-IO1 I/O interfaces — all connected by the ring interconnect.]
Focus on SPEs: 1. statically scheduled, non-preemptive; 2. explicit allocation and de-allocation of local store.
Task & Application Models
Each task is split into three phases: reading input queues; task execution; writing output queues.
Tasks communicate through queues: FIFO buffering; semaphore synchronization. No task preemption.
void Main_task(int id, void *input, void *output, void *state)
{
    // Initialization
    Init_task_structures();
    Init_queues();
    ...
    // Reading phase
    Read_input();
    ...
    // Task execution
    Exec();
    ...
    // Writing phase
    Write_output();
}
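The FIFO queues behind Read_input()/Write_output() can be sketched as a ring buffer whose element counter plays the role of the counting semaphore. The sizes and names are illustrative; on the Cell the copy would be a DMA transfer between local stores rather than a memcpy.

```c
#include <assert.h>
#include <string.h>

/* Sketch of a task-to-task FIFO queue with semaphore-style
 * synchronization. `count` acts as the counting semaphore: a full
 * queue blocks the writer, an empty queue blocks the reader. */

#define Q_DEPTH 4
#define Q_ITEM  32

typedef struct {
    unsigned char buf[Q_DEPTH][Q_ITEM];
    int head, tail, count;
} fifo_t;

static int fifo_write(fifo_t *q, const void *item) {   /* writing phase */
    if (q->count == Q_DEPTH) return -1;                /* queue full  */
    memcpy(q->buf[q->tail], item, Q_ITEM);
    q->tail = (q->tail + 1) % Q_DEPTH;
    q->count++;
    return 0;
}

static int fifo_read(fifo_t *q, void *item) {          /* reading phase */
    if (q->count == 0) return -1;                      /* queue empty */
    memcpy(item, q->buf[q->head], Q_ITEM);
    q->head = (q->head + 1) % Q_DEPTH;
    q->count--;
    return 0;
}
```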
19
Framework
From the application specification (number of nodes: 12, #define TASK_NUMBER 12; graph of activities; number of CPUs: 2), the optimizer produces the allocation and scheduling, which the run-time support enforces on the platform (CPUs, memories, interconnect).
Eclipse plugin — available resource-management engines:
• optimum static schedulers based on integrated IP and CP solvers (Benders decomposition);
• fast & suboptimal list-based scheduler for dynamic scheduling.
The run-time support covers: allocation; scheduling; communication; synchronization.
Components: resources optimizer; communication API; synchronization API; mapping API; GUI; application template; run-time support.
On-line Support
Automatic application configuration: mailbox system; task generation and configuration; buffer allocations.
[Diagram: main memory holds a root segment (segment 0, scheduler) and further segments with queue info, task 0…task N info, allocation info and physical addresses; the PPE configures the SPEs, which allocate their queues in local storage.]
No OS on the SPEs: local scheduling support. Local memory limitation: code overlays (task 0, task 1, task 2).
20
Multi-stage Benders Decomposition
Goal: minimize makespan. Stages: SPE allocation → memory (MEM) allocation → scheduling (SCHED).
When the SCHED problem is solved, one or more cuts (A) are generated to forbid the current memory-device allocation, and the process is restarted from the MEM stage; if the scheduling problem is feasible, an upper bound on the value of the next solution is also posted.
When the MEM & SCHED sub-problem ends, more cuts (B) are generated to forbid the current task-to-SPE assignment.
When the SPE stage becomes infeasible, the process is over, converging to the optimal solution for the overall problem.
21
SPE Allocation
Given a graph with n tasks and m arcs, and a platform with p processing elements: each task must be assigned to a single PE, and the makespan objective function depends only on scheduling decision variables. We therefore adopt a heuristic objective function that spreads tasks as much as possible over different SPEs, which often provides good makespan values pretty quickly: it forces the objective variable z to be greater than the total number of tasks allocated on any PE.

\min z \quad \text{s.t.}
z \ge \sum_{i=0}^{n-1} T_{ij} \qquad \forall j = 0,\dots,p-1 \quad \text{(needed to express the objective function)}
\sum_{j=0}^{p-1} T_{ij} = 1 \qquad \forall i = 0,\dots,n-1 \quad \text{(each task on a single PE)}
\sum_{i=0}^{n-1} DMIN(i)\, T_{ij} \le dline \qquad \forall j = 0,\dots,p-1 \quad \text{(total duration of tasks on a single SPE; discards trivially infeasible solutions)}
T_{ij} \in \{0,1\} \qquad \forall i = 0,\dots,n-1,\ j = 0,\dots,p-1

DMIN(i) is the minimum possible duration of task i and dline is the deadline.
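A small checker makes the SPE allocation constraints concrete: verify that each task sits on exactly one PE and that the per-PE sum of minimum durations respects the deadline, and compute the spreading objective z. Sizes and data are invented for illustration.

```c
#include <assert.h>

/* Checker for the SPE allocation constraints on this slide, over a
 * 0/1 assignment matrix T[i][j] (task i on PE j). Instance sizes are
 * illustrative. */

#define N 4  /* tasks               */
#define P 2  /* processing elements */

static int allocation_feasible(const int T[N][P], const int dmin[N], int dline) {
    for (int i = 0; i < N; i++) {               /* each task on one PE */
        int s = 0;
        for (int j = 0; j < P; j++) s += T[i][j];
        if (s != 1) return 0;
    }
    for (int j = 0; j < P; j++) {               /* relaxed deadline test */
        int load = 0;
        for (int i = 0; i < N; i++) load += dmin[i] * T[i][j];
        if (load > dline) return 0;
    }
    return 1;
}

/* Objective variable z of the spreading heuristic: the maximum
 * number of tasks allocated on any PE. */
static int spread_objective(const int T[N][P]) {
    int z = 0;
    for (int j = 0; j < P; j++) {
        int c = 0;
        for (int i = 0; i < N; i++) c += T[i][j];
        if (c > z) z = c;
    }
    return z;
}
```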
Schedulability test
SPE allocation choices are by themselves very relevant: a bad SPE assignment is sufficient to make the scheduling problem infeasible. If the given allocation is already infeasible for the scheduling component even with minimal task durations, it is useless to complete it with a memory assignment, which cannot lead to any feasible solution overall. A schedulability test is therefore inserted after the first stage: SPE Alloc → SCHED Test → MEM Alloc → SCHED.
22
Memory device allocation
Decision variables, for tasks i = 0,\dots,n-1 and arcs r = 0,\dots,m-1:

M_i \in \{0,1\}, \qquad W_r \in \{0,1\}, \qquad R_r \in \{0,1\}

M_i = 1 if task i allocates its computation data on the local memory of the SPE it is assigned to. For an arc r from producer task h to consumer task k, W_r = 1 if the communication buffer is on SPE pe(h) (that of the producer) and R_r = 1 if the buffer is on SPE pe(k) (that of the consumer):

W_r + R_r \le 1 \ \text{if } pe(h) \ne pe(k), \qquad W_r = R_r \ \text{if } pe(h) = pe(k)

Possible placements for producer Pr and consumer Cn on different SPEs: W_r = 1, R_r = 0 (buffer on the producer's SPE); W_r = 0, R_r = 1 (buffer on the consumer's SPE); W_r = 0, R_r = 0 (buffer in remote memory). When producer and consumer share the same SPE: W_r = R_r = 1 (local buffer) or W_r = R_r = 0 (remote buffer).

Capacity constraint on the local store C of each processor j = 0,\dots,p-1:

base\_usage(j) + \sum_{i:\, pe(i) = j} mem(i)\, M_i + \sum_{r:\, pe(h_r) = j} comm(r)\, W_r + \sum_{r:\, pe(k_r) = j} comm(r)\, R_r \le C

mem(i) is the amount of memory required to store the internal data of task i; comm(r) is the size of the communication buffer associated to arc r; the base_usage(j) expression is the amount of memory needed to store all data permanently allocated on the local device of processor j.
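The local-store usage on the left-hand side of the capacity constraint can be computed directly: sum, per SPE, the permanently allocated data, the internal data of tasks with M_i = 1, and the buffers of arcs with W_r = 1 (producer side) or R_r = 1 (consumer side). The data layout and names below are illustrative.

```c
#include <assert.h>

/* Evaluate local-store usage for one SPE under a given memory
 * allocation (M, W, R), to be compared against the 256 KB capacity.
 * Instance sizes are invented; units are KB for readability. */

#define NT 3          /* tasks */
#define NA 2          /* arcs  */
#define LS_SIZE 256   /* local store capacity (KB) */

typedef struct { int src, dst, size; } arc2_t;

static int ls_usage(int spe, int base_usage,
                    const int pe[NT], const int mem[NT], const int M[NT],
                    const arc2_t comm[NA], const int W[NA], const int R[NA]) {
    int u = base_usage;
    for (int i = 0; i < NT; i++)
        if (pe[i] == spe && M[i]) u += mem[i];       /* internal data */
    for (int r = 0; r < NA; r++) {
        if (pe[comm[r].src] == spe && W[r]) u += comm[r].size; /* producer-side buffer */
        if (pe[comm[r].dst] == spe && R[r]) u += comm[r].size; /* consumer-side buffer */
    }
    return u;
}
```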
23
Scheduling subproblem
One activity for each execution phase (exec) and each buffer reading/writing operation (rd, wr); tasks are not preemptive. For a task i with reading operations rd_0,\dots,rd_{h-1} and writing operations wr_0,\dots,wr_{k-1}:

end(rd_l) = start(rd_{l+1}) \qquad \forall l = 0,\dots,h-2
end(rd_{h-1}) = start(exec_i), \qquad end(exec_i) = start(wr_0)
end(wr_l) = start(wr_{l+1}) \qquad \forall l = 0,\dots,k-2
end(wr_r) \le start(rd_r) \qquad \forall r = 0,\dots,m-1

Each communication buffer must be written before it can be read.
Exact vs. Heuristic Scheduler
Heuristic: round-robin (RR) resource allocation + list scheduling — up to 40% makespan difference, 15% on average.
Example: exact vs. heuristic Gantt charts. Only a combination of allocation + scheduling achieves optimality.
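The heuristic baseline can be sketched in a few lines: round-robin allocation, then list scheduling of a precedence-ordered task list. To keep the sketch short, each task is assumed to have at most one predecessor — a simplification of the general task graph — and all data are invented.

```c
#include <assert.h>

/* Round-robin allocation + list scheduling, the fast heuristic used
 * as a comparison point. Tasks are assumed topologically ordered;
 * pred[i] is the single predecessor of task i (-1 if none). */

#define NTASK 6
#define NPROC 2

static int list_schedule_makespan(const int dur[NTASK], const int pred[NTASK]) {
    int proc_free[NPROC] = {0};   /* when each processor becomes idle */
    int finish[NTASK];
    for (int i = 0; i < NTASK; i++) {
        int p = i % NPROC;                        /* round-robin allocation */
        int ready = (pred[i] >= 0) ? finish[pred[i]] : 0;
        int start = ready > proc_free[p] ? ready : proc_free[p];
        finish[i] = start + dur[i];               /* list-schedule the task */
        proc_free[p] = finish[i];
    }
    int mk = 0;
    for (int i = 0; i < NTASK; i++)
        if (finish[i] > mk) mk = finish[i];
    return mk;                                    /* heuristic makespan */
}
```

Because the allocation is fixed blindly before scheduling, the result can be far from optimal — exactly the gap the integrated exact approach closes.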
24
TD vs. pure CP
One set of instances where task durations depend on allocation decisions; one set of instances where they do not.
TD vs. BD
Up to the 20–21 group, TD is much more efficient than BD. Starting from group 22–23, the high number of timed-out instances biases the average execution time; TD does considerably better until group 24–25. After that, most instances are not solved within the time limit by either approach.
TD has a lower execution time even though it generally performs more iterations than BD: TD works by solving many easy sub-problems, while BD performs fewer but slower iterations.
25
Cellflow: Benchmark Results
[Plot: speedup vs. number of SPEs (0–7) for Mat-mult, SW Radio and FFT against the theoretical limit.]
• The Mat-mult benchmark scales almost perfectly: efficiency of the runtime environment, almost negligible overhead.
• Good speed-ups also for the FFT.
• The software radio benchmark shows good speedup only up to three SPEs: a critical path limits the performance boost; functional pipelining can help.
Real-life applications: RT radar processing — a pipeline of FFT, FDT and iFFT blocks feeding Disp and Coll stages, run on 1 SPE and on N SPEs.
[Charts: average time (μs) and speedups, and single-block computation time (μs) broken down into DMA_PUT, IFFT, CVE-CVML, FFT, DMA_GET and wait time, for 1, 2, 4 and 6 SPUs.]
26
Ongoing & Future Work — Extensions
Scheduling parallel DMA activity; robust schedules with non-WCET execution; full dataflow (FIFO buffers) support; performance tuning of the middleware.
Interaction with high-level tools: parallelization tools; data distribution tools; model-based environments.
Dynamic resource management: hybrid approaches (offline + online).
Integration with advanced architectures: predictable (QoS) communication protocols; multi-hop (NoC) interconnects.
Computational Efficiency: distribution of the allocation/scheduling time ratio for the solvers (TD solved/unsolved, BD solved/unsolved).
The solution time for instances not solved within the limit appears to be strongly unbalanced, with most of the time absorbed by the scheduling component; for the BD solver, substantially all the process time is instead spent solving allocation subproblems.