
Time-Predictability on Multi-/Many-Core Platforms

Wolfgang Puffitsch

Department of Applied Mathematics and Computer Science
Technical University of Denmark

CESEC Summer School, Toulouse
June 30, 2015

partially based on slides by Jens Sparsø

Wolfgang Puffitsch, DTU 1/50

Overview

1. Multi-/Many-Core Platforms

2. Networks-on-Chip

3. Predictability on General-Purpose Platforms

4. Time-Predictable Computer Architectures

5. Conclusion

Wolfgang Puffitsch, DTU 2/50


Multi-/Many-Core Platforms

• Everyone wants higher performance
  – More functionality
  – Fewer computing nodes (costs, weight, ...)

• Moore's law gives us more transistors

• Many transistors needed for a small single-core performance increase

• More cores promise a better trade-off
  – Higher raw performance
  – Higher performance per Watt

Wolfgang Puffitsch, DTU 4/50

Converging Architectures

• General-purpose computing
  – Chip Multi-Processors (CMPs)
  – Speed is a convenience (best effort)
  – Identical processors (homogeneous)

• Embedded systems
  – Multi-Processor System-on-Chip (MPSoC)
  – Application-specific systems
  – Often heterogeneous (accelerators, DSPs, ...)
  – Real-time requirements (soft/hard)

• Converging architectures
  – "Cores" communicating via a packet-switched Network-on-Chip (NoC)

Wolfgang Puffitsch, DTU 5/50

How Many are "Many" Cores?

• No hard line between a "multi"-core and a "many"-core
  – 8? 16? 32?

• Burns' classification (scheduling)
  – One
  – A few (homogeneous)
  – Lots (and heterogeneous)
  – Too many

• Common characteristic of many-core platforms: NoC

Wolfgang Puffitsch, DTU 6/50


Example: Tilera TilePro64

www.tilera.com, 2008

Wolfgang Puffitsch, DTU 7/50

Example: Intel Xeon Phi

Many-core x86 architecture, 2012

Wolfgang Puffitsch, DTU 8/50

Example: Kalray MPPA

www.kalray.eu, 2012

Wolfgang Puffitsch, DTU 9/50

Example MPSoC I

F. Clermidy, C. Bernard, R. Lemaire, J. Martin, I. Miro-Panades, Y. Thonnart, P. Vivet, and N. Wehn.
A 477mW NoC-based digital baseband for MIMO 4G SDR.
Proc. IEEE International Solid-State Circuits Conference, pp. 278–279, 2012.

Wolfgang Puffitsch, DTU 10/50

Example MPSoC II

M. Balsby, Oticon A/S
Keynote at NOCS '12

Wolfgang Puffitsch, DTU 11/50

Overview

1. Multi-/Many-Core Platforms

2. Networks-on-Chip

3. Predictability on General-Purpose Platforms

4. Time-Predictable Computer Architectures

5. Conclusion

Wolfgang Puffitsch, DTU 12/50

Bus-Based Multiprocessors

[Figure: processor, memory, sensor, and radio blocks connected by a shared bus]

Wolfgang Puffitsch, DTU 13/50


Scaling to Large Systems?

[Figure: a shared bus with ever more blocks attached — processor, memory, sensor, radio, more memory, USB port, graphics processor, DSP, a 2nd processor, ...]

Hmmm...

Wolfgang Puffitsch, DTU 14/50

From Bus to NoC

[Figure: IP, DSP, CPU, Mem, I/O, and IP blocks, first attached to a shared bus, then connected through a packet-switched NoC]

– Bus: common bus interface, read/write transactions

– NoC: packet-switched network; each block attaches through a network adapter (NA) to a router (R), and routers are connected by links

– A read request travels as a packet through the network; the read response travels back the same way

Wolfgang Puffitsch, DTU 15/50

Advantages of NoCs

• Scalability

• Wires / technology

• Design methodology: "Lego bricks"

[Figure: tile containing a CPU, cache ($), memory (M), network adapter (NA), and router (R)]

Wolfgang Puffitsch, DTU 16/50

NoC Topology: Mesh

Wolfgang Puffitsch, DTU 17/50

NoC Topology: Bitorus

Wolfgang Puffitsch, DTU 18/50

NoC Topology: Bitorus

Wolfgang Puffitsch, DTU 19/50

NoC Topology: Tree

Wolfgang Puffitsch, DTU 20/50

Overview

1. Multi-/Many-Core Platforms

2. Networks-on-Chip

3. Predictability on General-Purpose Platforms

4. Time-Predictable Computer Architectures

5. Conclusion

Wolfgang Puffitsch, DTU 21/50

Arbitration in GP NoC

• Only one packet can use a link at a time
  – Similar to a bus

• Router decides which packet to transmit
  – Round-robin, priorities, ...

• Packets buffered in routers

• Flow control/back pressure to avoid buffer overflows

Wolfgang Puffitsch, DTU 22/50
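A minimal sketch of the round-robin policy mentioned above, as a router output port might apply it; the function name and interface are illustrative, not from any particular NoC implementation:

```python
def round_robin_arbiter(requests, last_granted):
    """Grant one of the requesting input ports, scanning from the
    port after the last grant (round-robin fairness).

    requests: list of bools, one per input port
    last_granted: index of the previously granted port
    Returns the granted port index, or None if nobody requests."""
    n = len(requests)
    for offset in range(1, n + 1):
        port = (last_granted + offset) % n
        if requests[port]:
            return port
    return None

# Ports 0 and 2 both want the output link; port 0 was served last,
# so port 2 wins this round.
print(round_robin_arbiter([True, False, True], last_granted=0))  # → 2
```

With round-robin, a requesting port waits at most one full scan of the other ports, which is what makes per-router arbitration delay bounded (though the bounds compose multiplicatively along a route, as the contention slides below discuss).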

Contention in Simple GP NoC

[Animation: several cores send packets towards a shared RAM end-point; packets queue up at routers along the way as they contend for links]

Wolfgang Puffitsch, DTU 23/50

Contention

• Example could be even worse

• Arbitration in every router
  – Effectively multiple levels of arbitration
  – May behave worse than a bus

• Bandwidth at end-points limited
  – Packets may queue up in front of the end-point

• Traffic to different end-points may affect each other

• On-chip communication may be affected by contention on external memory

Wolfgang Puffitsch, DTU 24/50

Understanding the Platform

• Example assumed a simplified model

• No formal models available
  – Implementation details not disclosed

• HW vendors may acknowledge flaws and provide work-arounds, but full disclosure is difficult/impossible

• Stress-testing to evaluate the platform
  – N cores generate noise
  – Accessing local or external memories
  – Measure round-trip times (reads from on-chip memory)

Wolfgang Puffitsch, DTU 25/50
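The stress-test method can be sketched as follows. This is a Python stand-in for illustration only: the actual experiments run C code on pinned cores and read from on-chip memory, and all names here are hypothetical:

```python
import threading
import time

def noise(stop, buf_len=4096):
    # Stand-in for a noise core: hammer memory until told to stop.
    junk = bytearray(buf_len)
    while not stop.is_set():
        junk[:] = bytes(buf_len)  # repeated writes

def measure(n_noise, reads=1000, buf_len=4096):
    """Average time per read while n_noise background workers
    generate memory traffic."""
    stop = threading.Event()
    workers = [threading.Thread(target=noise, args=(stop,))
               for _ in range(n_noise)]
    for w in workers:
        w.start()
    data = bytearray(buf_len)
    t0 = time.perf_counter()
    for _ in range(reads):
        _ = bytes(data)          # the measured "read"
    per_read = (time.perf_counter() - t0) / reads
    stop.set()
    for w in workers:
        w.join()
    return per_read

# Round-trip time typically grows with the number of noise generators.
for n in (0, 2, 4):
    print(n, measure(n))
```

Sweeping the number of noise generators and plotting the measured read times produces curves like the SCC and TILEmpower measurements on the next slides.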

Measurements: Intel SCC

[Plot: time for reading (ms, log scale from 0.001 to 10000) versus number of cores producing noise (0–50); series: MPA read, MPA write, RAM read, RAM write]

Wolfgang Puffitsch, DTU 26/50

Measurements: TILEmpower-Gx36

[Plot: time for reading (us, log scale from 0.1 to 100000) versus number of cores producing noise (0–35); series: MPA read, MPA write, RAM read, RAM write]

Wolfgang Puffitsch, DTU 27/50

Controlling Contention

• Limiting contention is key to achieving predictable behavior

⇒ Must understand application behavior
  – Which tasks are communicating?
  – How is memory accessed?

• Avoid accesses to external memory
  – Timing difficult to predict
  – On-chip memory less susceptible to contention

Wolfgang Puffitsch, DTU 28/50

Scheduling

• Task migration causes network traffic
  – Overhead may be very high
  – May cause contention
  – May interfere with predictability

• Many cores available
  – Use more cores instead of loading one core
  – In extreme cases, one task per core

⇒ Partitioned scheduling is a good option

Wolfgang Puffitsch, DTU 29/50

Time Synchronization

• Surprisingly difficult to establish a global notion of time on COTS platforms

• Core-local clocks driven by the same oscillator

• ... but may have arbitrary offsets

⇒ Initial time synchronization needed for a global notion of time
  – Synchronization via NoC
  – Perfect synchronization impossible

Wolfgang Puffitsch, DTU 30/50
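A round-trip exchange over the NoC can estimate the offset between two core-local clocks; this is a generic Cristian-style sketch (the function and its interface are illustrative assumptions, not the slides' specific protocol):

```python
def estimate_offset(local_send, remote_time, local_recv):
    """Estimate the offset of a remote core-local clock from one
    round trip: a request leaves at local_send, the remote core
    timestamps it (remote_time), and the reply arrives at
    local_recv (all in clock ticks).

    Assumes symmetric one-way latency; the residual error is
    bounded by half the round-trip time, which is why perfect
    synchronization is impossible."""
    rtt = local_recv - local_send
    # remote_time was taken (approximately) at local_send + rtt/2
    offset = remote_time - (local_send + rtt / 2)
    return offset, rtt / 2  # estimate and its error bound

off, err = estimate_offset(local_send=1000, remote_time=1507, local_recv=1020)
print(off, err)  # → 497.0 497 ticks offset, ±10 ticks uncertainty
```

Since the clocks are driven by the same oscillator, they do not drift apart, so this initial offset estimation suffices for a global notion of time (up to the ± rtt/2 uncertainty).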

Heuristic Mapping [PNP13]

• Application written in the multi-rate synchronous language Prelude
  – Periods, deadlines, dependencies, communication patterns
  – Communication via on-chip memory

• Heuristic to map the application to the Intel SCC
  – Bounded contention
  – Schedulability

• Partitioned scheduling using global ticks
  – Leave gaps to accommodate clock imprecision and communication latencies

Wolfgang Puffitsch, DTU 31/50

Mapping Intuition

[Figures: 8 cores access the same end-point; 6 cores access the same end-point; 3 cores access the same end-point]

Wolfgang Puffitsch, DTU 32/50

Constraint Solving [PNP15]

• Evaluation of Intel SCC, TI TMS320C6678, Tilera TILEmpower-Gx36

• Execution model that fits these platforms
  – Partitioned off-line scheduling
  – Data and code in local memories
  – Communication via message-passing
  – Pre-computed communication delays

• Constraint solving for mapping
  – Bounded contention
  – Off-line schedule
  – Scales to hundreds of tasks

Wolfgang Puffitsch, DTU 33/50
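The core constraint — bound how many cores access the same end-point — can be illustrated with a toy exhaustive search. This brute-force stand-in conveys the idea only; [PNP15] uses a real constraint solver, which is what makes the approach scale to hundreds of tasks:

```python
from itertools import product

def find_mapping(tasks, cores, uses, max_contention):
    """Search for a task-to-core mapping in which no memory
    end-point is accessed from more than max_contention cores.

    tasks: list of task names
    cores: list of core ids
    uses[task]: set of end-points the task accesses
    Returns a dict task -> core, or None if no mapping satisfies
    the contention bound."""
    for assignment in product(cores, repeat=len(tasks)):
        accessors = {}  # end-point -> set of cores touching it
        for task, core in zip(tasks, assignment):
            for ep in uses[task]:
                accessors.setdefault(ep, set()).add(core)
        if all(len(s) <= max_contention for s in accessors.values()):
            return dict(zip(tasks, assignment))
    return None

# Two tasks sharing one end-point with contention bound 1 must land
# on the same core.
m = find_mapping(["t1", "t2"], [0, 1],
                 {"t1": {"ram"}, "t2": {"ram"}}, max_contention=1)
print(m)
```

In the real formulation, the solver additionally places the off-line schedule (release times, communication slots) under the same bounded-contention constraints.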

Time-Triggered Execution

• Split tasks into communication and execution phases

• Schedule communication to avoid/reduce conflicts on shared resources

• Eliminate/limit contention

• Strict model advocated by Kopetz

• Requires tight timing synchronization

• Analysis of an extended (relaxed) model: [SPC+11]

Wolfgang Puffitsch, DTU 34/50
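The contention-elimination idea above can be sketched in a few lines: serialize all communication phases on the shared resource so that no two ever overlap (a simplified illustration, not the scheduler of [SPC+11]):

```python
def schedule_communication(comm_requests):
    """Serialize communication phases on a shared interconnect:
    each (task, duration) pair gets a start time such that no two
    communication phases overlap, eliminating contention by
    construction. Execution phases run concurrently in between.

    Returns a dict task -> (start, end)."""
    schedule = {}
    t = 0
    for task, duration in comm_requests:
        schedule[task] = (t, t + duration)
        t += duration
    return schedule

print(schedule_communication([("read_A", 3), ("read_B", 2), ("write_A", 1)]))
# → {'read_A': (0, 3), 'read_B': (3, 5), 'write_A': (5, 6)}
```

Since each phase's start time is fixed off-line, its resource-access timing is known exactly — but only if all cores agree on the time, hence the tight synchronization requirement.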

Memguard [YYP+13]

• Split available memory bandwidth between tasks

• Every task assigned a memory budget
  – Use bandwidth that can be guaranteed

• Task stalled when exceeding the budget

• Replenish budget periodically

• No more requests than bandwidth available ⇒ limit contention

• Best-effort use of remaining bandwidth

Wolfgang Puffitsch, DTU 35/50
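One regulation period of the budget mechanism can be sketched as follows (a simplified model of the idea in [YYP+13], not its kernel implementation; all names are illustrative):

```python
def memguard_period(requests, budgets, guaranteed_bw):
    """Simulate one regulation period of Memguard-style bandwidth
    reservation: each core issues memory requests up to its budget;
    a core exceeding its budget is stalled until the next period,
    when budgets are replenished. Budgets must sum to at most the
    bandwidth the memory system can guarantee.

    Returns (requests served per core, stalled flag per core)."""
    assert sum(budgets) <= guaranteed_bw, "over-committed budgets"
    served = [min(req, budget) for req, budget in zip(requests, budgets)]
    stalled = [req > budget for req, budget in zip(requests, budgets)]
    return served, stalled

served, stalled = memguard_period([100, 40, 10], [50, 50, 50],
                                  guaranteed_bw=160)
print(served, stalled)  # → [50, 40, 10] [True, False, False]
```

The best-effort reclamation of unused bandwidth (here, the 50 requests core 0 could not issue against the slack left by cores 1 and 2) is handled separately in the real system.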

Optimistic Execution [KRF+14]

• Mixed-criticality system
  – Critical tasks and non-critical tasks

• Determine WCET in isolation and under contention

• Monitor progress of critical tasks
  – Stall non-critical tasks when in danger of missing the deadline
  – Switch to a mode with limited contention

Wolfgang Puffitsch, DTU 36/50
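The mode-switch decision can be sketched as a simple budget check — a loose illustration of the monitoring idea, not the controller of [KRF+14]:

```python
def must_limit_contention(progress, elapsed, wcet_isolation, deadline):
    """Decide whether to switch to the contention-limited mode:
    if the remaining work of a critical task, accounted at its
    isolation WCET, no longer fits before the deadline, the
    non-critical tasks must be stalled.

    progress: fraction of the task completed so far (0.0 .. 1.0)
    Returns True when the mode switch is required."""
    remaining_wcet = (1.0 - progress) * wcet_isolation
    return elapsed + remaining_wcet > deadline

# 25% done after 6 time units, 8 units isolation WCET, deadline 10:
# 6 + 0.75*8 = 12 > 10, so contention must be limited now.
print(must_limit_contention(0.25, elapsed=6.0,
                            wcet_isolation=8.0, deadline=10.0))  # → True
```

As long as the check stays False, non-critical tasks run freely and the system exploits the (optimistic) average case.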

Overview

1. Multi-/Many-Core Platforms

2. Networks-on-Chip

3. Predictability on General-Purpose Platforms

4. Time-Predictable Computer Architectures

5. Conclusion

Wolfgang Puffitsch, DTU 37/50

Time-Predictable Architectures

• In hard real-time systems, the worst case is what counts

• General-purpose architectures designed for the average case
  – Speed is a convenience
  – Caches, branch prediction
  – Poor worst-case behavior

• Time-predictable architectures
  – Optimize worst-case behavior
  – Guaranteed performance
  – May sacrifice average-case performance
  – Enable worst-case execution time analysis

Wolfgang Puffitsch, DTU 38/50

WCET Bounds

[Figure: execution-time distributions for a COTS processor and two time-predictable processors (TP Proc 1, TP Proc 2); each axis is annotated with BCET, ACET, WCET, and the analyzed WCET bound — the gap between the WCET and the bound is the pessimism]

Wolfgang Puffitsch, DTU 39/50

Kalray MPPA

• www.kalray.eu

• MPPA-256: 16 clusters with 16 cores each

• Designed to be timing composable
  – No weird timing effects, can combine sub-analyses

• Trade-off between predictability and average-case performance

• Limit data rates of network interfaces
  – Network calculus to compute latency/rate

Wolfgang Puffitsch, DTU 40/50

Network Calculus

• Initially developed for off-chip networks

• Arriving traffic described by an arrival curve

• Outgoing traffic described by a departure/service curve

• Math...

• Backlog in routers (buffer sizes)

• Delay in routers

Wolfgang Puffitsch, DTU 41/50
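The "math" hides two classic closed-form bounds. For a token-bucket arrival curve α(t) = b + r·t and a rate-latency service curve β(t) = R·(t − T) for t ≥ T, the delay bound is the horizontal deviation T + b/R and the backlog bound is the vertical deviation b + r·T (assuming stability, r ≤ R). A worked sketch:

```python
def noc_bounds(b, r, R, T):
    """Network-calculus bounds for a token-bucket arrival curve
    a(t) = b + r*t (burst b, sustained rate r) served by a
    rate-latency curve beta(t) = R*(t - T) for t >= T:

      delay   <= T + b/R   (horizontal deviation of the curves)
      backlog <= b + r*T   (vertical deviation of the curves)

    Valid only when r <= R, i.e. the flow is stable."""
    assert r <= R, "unstable: arrival rate exceeds service rate"
    return T + b / R, b + r * T

# Burst of 4 flits, rate 1 flit/cycle, served at 2 flits/cycle
# after a 3-cycle router latency:
delay, backlog = noc_bounds(b=4.0, r=1.0, R=2.0, T=3.0)
print(delay, backlog)  # → 5.0 7.0
```

The backlog bound sizes the router buffers, and the delay bound feeds directly into the end-to-end latency guarantee of the Kalray NoC analysis.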

Backlog and Delay

[Figure: cumulative data amount over time for an arrival curve A and a departure curve D; the vertical deviation b(A,D,t) is the backlog at time t, the horizontal deviation d(A,D,t) the delay; b(A,D) and d(A,D) are their maxima — figure by Marc Boyer]

Wolfgang Puffitsch, DTU 42/50

T-CREST Platform [SAA+15]

• Many-core platform with time-predictability as design goal

• Time-predictable processor cores

• Separate NoCs for memory and on-chip communication
  – Memory: many-to-one ⇒ tree
  – On-chip: many-to-many ⇒ bitorus, mesh, ...
  – Time-division multiplexing

• Predictable memory controller

• Open source: http://patmos.compute.dtu.dk

Wolfgang Puffitsch, DTU 43/50

T-CREST Overview

[Figure: T-CREST chip with several Patmos tiles, each containing Dec, M$ + SPM, and S/D$, and an NI attached to a router (R); the routers form the on-chip NoC, and the tiles connect through a memory tree to a memory controller and external SDRAM]

Wolfgang Puffitsch, DTU 44/50

Time-Division Multiplexing

• Static network schedule
  – Time at which a packet may use a link is predetermined
  – Slots always reserved, may be unused
  – Guaranteed bandwidth, but does not adapt dynamically

• No buffering, no flow control
  – Significantly cheaper routers

• Only wait for slot, free flow in NoC

Wolfgang Puffitsch, DTU 45/50
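The "only wait for slot" behavior follows from the static schedule; a minimal sketch of the slot-table lookup at a network interface (the slot-table format and function are illustrative, not Argo's actual hardware interface):

```python
def next_slot(slot_table, dest, now):
    """Given a static TDM slot table (slot index -> destination,
    repeating every len(slot_table) cycles), return the next cycle
    at or after `now` in which a packet to `dest` may be injected.

    Because slots are reserved statically, the wait is bounded by
    one TDM period regardless of what other cores are doing — this
    is where the guaranteed bandwidth and latency come from."""
    period = len(slot_table)
    for wait in range(period):
        if slot_table[(now + wait) % period] == dest:
            return now + wait
    raise ValueError(f"no slot reserved for {dest}")

# 4-slot schedule: this core may send to A in slots 0 and 2,
# to B in slot 1; slot 3 is reserved but unused.
table = ["A", "B", "A", None]
print(next_slot(table, "B", now=2))  # → 5
```

Once a packet enters the NoC in its slot, it flows freely to its destination: the schedule guarantees that no other packet can ever contend for the links it traverses, so routers need neither buffers nor flow control.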

Argo NoC

• Message passing instead of a shared address space

• DMA transfers instead of a bus interface
  – Communication is explicit

• Communication via SPMs
  – Reads and writes to local SPM
  – Only writes to remote SPM ("push") via DMA

• Integration of DMA controller and TDM schedule
  – Small, efficient hardware

• Is explicit communication the future?

Wolfgang Puffitsch, DTU 46/50


Overview

1. Multi-/Many-Core Platforms

2. Networks-on-Chip

3. Predictability on General-Purpose Platforms

4. Time-Predictable Computer Architectures

5. Conclusion

Wolfgang Puffitsch, DTU 47/50

Conclusions

• Many cores are the way to go

• NoCs provide scalability

• Contention may severely affect timing
  – Limiting contention is key to achieving predictability
  – Mapping, time-triggered execution, budgets

• Time-predictable architectures
  – Kalray MPPA, T-CREST platform

Wolfgang Puffitsch, DTU 48/50

References I

Angeliki Kritikakou, Christine Rochange, Madeleine Faugère, Claire Pagetti, Matthieu Roy, Sylvain Girbal, and Daniel Gracia Pérez.
Distributed run-time WCET controller for concurrent critical tasks in mixed-critical systems.
In Proceedings of the 22nd International Conference on Real-Time Networks and Systems, page 139. ACM, 2014.

Wolfgang Puffitsch, Eric Noulard, and Claire Pagetti.
Mapping a multi-rate synchronous language to a many-core processor.
In Proceedings of the 19th Real-Time and Embedded Technology and Applications Symposium, RTAS '13, pages 293–302. IEEE, 2013.

Wolfgang Puffitsch, Eric Noulard, and Claire Pagetti.
Off-line mapping of multi-rate dependent task sets to many-core platforms.
Real-Time Systems, 2015. Accepted for publication.

Martin Schoeberl, Sahar Abbaspour, Benny Akesson, Neil Audsley, Raffaele Capasso, Jamie Garside, Kees Goossens, Sven Goossens, Scott Hansen, Reinhold Heckmann, Stefan Hepp, Benedikt Huber, Alexander Jordan, Evangelia Kasapaki, Jens Knoop, Yonghui Li, Daniel Prokesch, Wolfgang Puffitsch, Peter Puschner, André Rocha, Cláudio Silva, Jens Sparsø, and Alessandro Tocchi.
T-CREST: Time-predictable multi-core architecture for embedded systems.
Journal of Systems Architecture, 2015. Published online, to appear in print.

Wolfgang Puffitsch, DTU 49/50

References II

Andreas Schranzhofer, Rodolfo Pellizzoni, Jian-Jia Chen, Lothar Thiele, and Marco Caccamo.
Timing analysis for resource access interference on adaptive resource arbiters.
In Proceedings of the 17th Real-Time and Embedded Technology and Applications Symposium, RTAS '11, pages 213–222. IEEE, 2011.

Heechul Yun, Gang Yao, Rodolfo Pellizzoni, Marco Caccamo, and Lui Sha.
Memguard: Memory bandwidth reservation system for efficient performance isolation in multi-core platforms.
In Proceedings of the 19th Real-Time and Embedded Technology and Applications Symposium, RTAS '13, pages 55–64. IEEE, 2013.

Wolfgang Puffitsch, DTU 50/50