Programming Safety-Critical Embedded Systems


Work mainly by Sidharta Andalam and Eugene Yip

Main supervisor: Dr. Partha Roop (UoA)
Advisor: Dr. Alain Girault (INRIA)

2

Outline

• Introduction
• Synchronous Languages
• PRET-C
• ForeC

4

Introduction

• Safety-critical systems:
– Perform specific real-time tasks.
– Comply with strict safety standards [IEC 61508, DO 178].
– Time-predictability useful in real-time designs.

[Paolieri et al 2011] Towards Functional-Safe Timing-Dependable Real-Time Architectures.

Embedded Systems

Safety-critical concerns

Timing/Functionality requirements

Timing analysis

5

Introduction

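[Figure: Programming languages and frameworks arranged by domain of application (embedded to desktop) and processor type (single-core, multicore, manycore). Entries: C, RTOS (VxWorks), UPC, X10, Intel Cilk Plus, SharC, Grace, SHIM, Sigma C, ForkLight, Esterel, SCADE, Simulink, Protothreads, OpenMP, OpenCL, Pthreads, ParC, PRET-C, ForeC.]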

7

Synchronous Languages

• Deterministic concurrency (formal semantics).
– Concurrent control behaviours.
– Typically compiled away.

• Execution model similar to digital circuits.
– Threads execute in lock-step to a global clock.
– Threads communicate via instantaneous signals.

[Benveniste et al 2003] The Synchronous Languages 12 Years Later.

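[Figure: Inputs are sampled and outputs are emitted at each global tick (1, 2, 3, 4). The global ticks are mapped onto physical time (1 s, 2 s, 3 s, 4 s).]

• Time for a tick: specified by the system’s timing requirements.
• Reaction time: must fit within the time for a tick.
• Must validate: max(Reaction time) < min(Time for each tick)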
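The validation condition can be sketched in C as follows. This is only an illustration: the function name `timing_is_valid` and the use of clock cycles as the unit are assumptions, not part of the original slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when max(reaction time) < min(time for each tick).
 * Both arrays hold durations in the same unit (assumed: clock cycles). */
bool timing_is_valid(const unsigned long *reaction, size_t n_reaction,
                     const unsigned long *tick, size_t n_tick)
{
    assert(n_reaction > 0 && n_tick > 0);

    unsigned long max_reaction = reaction[0];
    for (size_t i = 1; i < n_reaction; ++i) {
        if (reaction[i] > max_reaction) {
            max_reaction = reaction[i];
        }
    }

    unsigned long min_tick = tick[0];
    for (size_t i = 1; i < n_tick; ++i) {
        if (tick[i] < min_tick) {
            min_tick = tick[i];
        }
    }

    /* Strict inequality, as on the slide. */
    return max_reaction < min_tick;
}
```

In practice the reaction times are not measured but bounded statically, so a single WCRT value would be compared against the shortest tick period.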


9


• Esterel, Lustre, Signal
• Synchronous extensions to C:
– PRET-C
– Reactive Shared Variables
– Synchronous C
– Esterel C Language

Retain the essence of C and add deterministic concurrency and thread communication.

[Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs.
[Boussinot 1993] Reactive Shared Variables Based Systems.
[Hanxleden et al 2009] SyncCharts in C - A Proposal for Light-Weight, Deterministic Concurrency.
[Lavagno et al 1999] ECL: A Specification Environment for System-Level Design.

11

PRET-C Stages

1. PRET-C: Simple synchronous extension to C (using macros).
2. TCCFG: Intermediate format.
3. TCCFG’: Updated after cache analysis.
4. Model Checking: Binary search for the WCRT.

PRET-C example:

void main() {
  while (1) {
    abort
      PAR(sampler, display);
    when (reset);
    EOT;
  }
}

[Figure: Tool flow from the PRET-C program to the TCCFG, through cache analysis and the model checker, to the WCRT (final output).]

12

PRET-C

• Simple set of synchronous extensions to C:
– Light-weight multi-threading.
– Macro-based implementation.
– Thread-safe shared memory accesses.
– Amenable to timing analysis for ensuring time-predictability.

PRET-C

Statement | Description
ReactiveInput I | Declares I as a reactive input coming from the environment.
ReactiveOutput O | Declares O as a reactive output emitted to the environment.
PAR(T1, ..., Tn) | Synchronously executes threads T1 to Tn in parallel. Thread Ti has higher execution priority than Ti+1.
EOT | Marks the end of a tick.
[weak] abort P when C | Terminates P when C is true.

The semantics of PRET-C is presented in structural operational style, along with proofs of reactivity and determinism [IEEE TC, March 2013].

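PRET-C Code

...
PAR(T1, T2)
...

T1: A; EOT; C; EOT
T2: B; EOT; D; EOT

[Figure: Timeline over two global ticks. In the first global tick T1 executes A and then T2 executes B; in the second, T1 executes C and then T2 executes D. Each thread's share of a global tick is its local tick.]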

16

Introduction

• Safety-critical systems:
– Shift from single-core to multicore processors.
– Cheaper, better power vs. execution performance.

[Figure: Core 0 to Core n connected by a system bus to shared resources.]

[Blake et al 2009] A Survey of Multicore Processors.
[Cullmann et al 2010] Predictability Considerations in the Design of Multi-Core Embedded Systems.

17

Introduction

• Parallel programming:
– From supercomputers to mainstream computers.
– Frameworks designed for systems without resource constraints or safety concerns.
  • Optimised for average-case performance (FLOPS), not time-predictability.
– Threaded programming model.
  • Pthreads, OpenMP, Intel Cilk Plus, ParC, ...
  • Non-deterministic thread interleaving makes understanding and debugging hard.

[Lee 2006] The Problem with Threads.

18

Introduction

• Parallel programming:
– Programmer responsible for shared resources.
– Concurrency errors:
  • Deadlock, race condition, atomicity violation, order violation.

[McDowell et al 1989] Debugging Concurrent Programs.
[Lu et al 2008] Learning from Mistakes: A Comprehensive Study on Real World Concurrency Bug Characteristics.

19

Introduction

• Synchronous languages
– Esterel, Lustre, Signal
– Synchronous extensions to C:
  • PRET-C
  • Reactive Shared Variables
  • Synchronous C
  • Esterel C Language


Sequential execution semantics. Unsuitable for parallel execution.

Compilation produces sequential programs. Unsuitable for parallel execution.

21

ForeC

ForeC (pronounced “Foresee”)

• C-based, multi-threaded, synchronous language. Inspired by PRET-C and Esterel.
• Deterministic parallel execution on embedded multicores.
• Fork/join parallelism and shared memory thread communication.
• Program behaviour independent of chosen thread scheduling.

22

ForeC

[Figure: ForeC tool flow in three phases. Programming: ForeC source code and thread distribution. Compilation: CCFG, static scheduling, compiled program. Timing analysis: CCFG with assembly and an architecture model feed a reachability analysis that yields the computed WCRT.]

23

ForeC

• Additional constructs to C:
– pause: Synchronisation barrier. Pauses the thread’s execution until all threads have paused.
– par( st1, ..., stn ): Forks each statement to execute as a parallel thread. Each statement is implicitly scoped.
– [weak] abort st when [immediate] exp: Preempts the statement st when exp evaluates to a non-zero value. exp is evaluated in each global tick before st is executed.

24

ForeC

• Additional variable type-qualifiers to C:
– input and output: Declares a variable whose value is updated from or emitted to the environment at each global tick.

25

ForeC

• Additional variable type-qualifiers to C:
– shared: Declares a shared variable that can be accessed by multiple threads.
  1. Threads make local copies of shared variables that they may use at the start of their local ticks.
  2. Threads only modify their local copies during execution.
  3. a. If a par statement terminates: modified copies from the child threads are combined (using a commutative and associative function) and assigned to the parent.
     b. If the global tick ends: the modified copies are combined and assigned to the actual shared variables.

27

Execution Example

shared int sum = 1 combine with plus;   // Shared variable

// Commutative and associative combine function
int plus(int copy1, int copy2) {
  return (copy1 + copy2);
}

void main(void) {
  par(f(1), f(2));                      // Fork-join
}

void f(int i) {
  sum = sum + i;
  pause;                                // Synchronisation
  ...
}

28

Execution Example 1

shared int sum = 1 combine with plus;

Trace (f1 and f2 are the threads executing f(1) and f(2); sum1 and sum2 are their local copies of sum):

1. Initially: global sum = 1.
2. Global tick start: each thread takes a local copy, sum1 = 1 and sum2 = 1.
3. Each thread modifies only its own copy: sum1 = 2 and sum2 = 3.
4. Global tick end: the modified copies are combined with plus and assigned to the shared variable, sum = 5.
5. Next global tick start: the threads take fresh copies, sum1 = 5 and sum2 = 5, and continue.
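The copy-and-combine behaviour of this example can be imitated in plain C. The sketch below is not ForeC and not the code the ForeC compiler generates; the `local_copy` type and the helper names are hypothetical and exist only to replay the trace for one global tick.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef int (*combine_fn)(int, int);

/* A thread's private copy of a shared variable (hypothetical type). */
typedef struct {
    int value;
    bool modified;
} local_copy;

int plus(int copy1, int copy2) { return copy1 + copy2; }

/* Start of a tick: every thread snapshots the shared value. */
void begin_tick(int shared, local_copy *copies, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        copies[i].value = shared;
        copies[i].modified = false;
    }
}

/* A thread writes only to its own copy. */
void write_copy(local_copy *copy, int value)
{
    copy->value = value;
    copy->modified = true;
}

/* End of a tick: combine the modified copies. If no copy was
 * modified, the shared variable keeps its previous value. */
int end_tick(int shared, const local_copy *copies, size_t n, combine_fn combine)
{
    assert(combine != NULL);
    bool have_result = false;
    int result = shared;
    for (size_t i = 0; i < n; ++i) {
        if (!copies[i].modified) {
            continue;
        }
        result = have_result ? combine(result, copies[i].value) : copies[i].value;
        have_result = true;
    }
    return result;
}

/* Replays the first global tick of the example: par(f(1), f(2)). */
int run_sum_example(void)
{
    int sum = 1;
    local_copy copies[2];
    begin_tick(sum, copies, 2);
    write_copy(&copies[0], copies[0].value + 1); /* f(1): sum = sum + 1 */
    write_copy(&copies[1], copies[1].value + 2); /* f(2): sum = sum + 2 */
    return end_tick(sum, copies, 2, plus);       /* 2 + 3 */
}
```

Because each thread touches only its own copy, the order in which the two writes happen does not change the combined result.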

Execution Example 2

Sum a set of data:

shared int v=0 combine with plus;
int[4] data={1,2,3,4};

void main(void) {
  f(data);
}
void f(int *data) {
  par(add(0,data), add(2,data));
}
void add(int x, int *data) {
  v=data[x] + data[x+1];
}

Sum sets of data together in parallel (a second array is added and main forks f once per array):

int[4] data1={5,6,7,8};

void main(void) {
  par(f(data), f(data1));
}

[Figure: main forks two f threads, and each f forks two add threads. All four add threads write to a single v.]

[Figure: The same thread tree with one v per f thread.]

Version with one shared variable per f thread:

int[4] data={1,2,3,4}; int[4] data1={5,6,7,8};

void main(void) {
  par(f(data), f(data1));
}
void f(int *data) {
  shared int v=0 combine with plus;
  par(add(0,data,&v), add(2,data,&v));
}
void add(int x, int *data, shared int *const v combine with +) {
  *v=data[x] + data[x+1];
}

41

Execution Example

Shared variables:
– Threads modify local copies of shared variables.
  • Isolation of thread execution allows threads to truly execute in parallel.
  • Thread interleaving does not affect the program’s behaviour.
– Prevents most concurrency errors.
  • Deadlock, race condition: No locks.
  • Atomicity and order violation: Local copies.
– Copies for a shared variable can be split into groups and combined in parallel.

42

Execution Example

Shared variables:
– Programmer has to define a suitable combine function for each shared variable.
  • Must ensure the combine function is indeed commutative and associative.
– Notion of “combine functions” is not entirely new:
  • Intel Cilk Plus, OpenMP, MPI, UPC, X10
  • Esterel, Reactive Shared Variables

[Intel Cilk Plus] http://software.intel.com/en-us/intel-cilk-plus
[OpenMP] http://openmp.org
[MPI] http://www.mcs.anl.gov/research/projects/mpi/
[Unified Parallel C] http://upc.lbl.gov/
[X10] http://x10-lang.org/
[Berry et al 1992] The Esterel Synchronous Programming Language: Design, Semantics and Implementation.
[Boussinot 1993] Reactive Shared Variables Based Systems.
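Since the programmer carries the burden of supplying a commutative and associative combine function, a quick sanity check over sample values can catch obvious mistakes. The helper below is a hypothetical testing aid in plain C, not part of ForeC; passing it is evidence, not a proof, because it only examines the samples it is given.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef int (*combine_fn)(int, int);

/* Checks f(a,b) == f(b,a) and f(f(a,b),c) == f(a,f(b,c)) for every
 * combination of the given samples. Keep samples small to avoid
 * signed overflow inside f. */
bool looks_commutative_associative(combine_fn f, const int *samples, size_t n)
{
    assert(f != NULL);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            int a = samples[i];
            int b = samples[j];
            if (f(a, b) != f(b, a)) {
                return false; /* not commutative */
            }
            for (size_t k = 0; k < n; ++k) {
                int c = samples[k];
                if (f(f(a, b), c) != f(a, f(b, c))) {
                    return false; /* not associative */
                }
            }
        }
    }
    return true;
}

/* Candidate combine functions used as examples. */
int plus(int a, int b) { return a + b; }
int minus(int a, int b) { return a - b; }
```

With samples such as {0, 1, 2, -3, 7}, plus passes the check while minus fails it, which is why subtraction is not a suitable combine function.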

43


Execution Example

Shared variables:
– Programmer has to define a suitable combine function for each shared variable.

Related constructs in other frameworks:

Framework | Related constructs
Intel Cilk Plus | cilk::reducer_op, cilk::holder_op
OpenMP | shared var, reduction(operator: var)
MPI | MPI_Reduce, MPI_Gather
UPC | shared var, collectives
X10 | Aggregates
Esterel | Valued signals, combine operator
Reactive Shared Variables | shared var, combine operator

45

Shared Variable Design Patterns

• Point-to-point
• Broadcast
• Software pipelining
• Divide and conquer
– Scatter/Gather
– Map/Reduce

46

Point-to-point

shared int sum = 0 combine with plus;

void main(void) {
  par( f(), g() );
}

void f(void) {
  while (1) {
    sum = comp1();
    pause;
  }
}

void g(void) {
  while (1) {
    comp2(sum);
    pause;
  }
}

• New value of sum is received in the next global tick.
• Combine operation is not required.

47

Broadcast

shared int sum = 0 combine with plus;

void main(void) {
  par( f(), g(), g() );
}

• Multiple receivers.



48

Software Pipelining

shared int s1 = 0, s2 = 0 combine with plus;

void main(void) {
  par( stage1(), stage2(), stage3() );
}

void stage1(void) {
  while (1) {
    s1 = comp1();
    pause;
  }
}

void stage2(void) {
  pause;
  while (1) {
    s2 = comp2(s1);
    pause;
  }
}

void stage3(void) {
  pause; pause;
  while (1) {
    comp3(s2);
    pause;
  }
}

• Outputs from each stage are buffered.
• Use the delayed behaviour of shared variables to buffer each stage.
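The buffering effect of delayed shared variables can be replayed in plain C. This is a simulation sketch, not ForeC: the tick loop, the stand-in computations (stage 1 produces t + 1, stage 2 multiplies by 10) and the -1 marker for "stage 3 still pausing" are all assumptions made for illustration.

```c
#include <assert.h>

#define TICKS 5

/* Simulates the three-stage pipeline for TICKS global ticks.
 * out[t] receives the value stage 3 consumes in tick t, or -1
 * while stage 3 is still executing its initial pauses. */
void run_pipeline(int out[TICKS])
{
    assert(out != 0);
    int s1 = 0, s2 = 0; /* values visible to readers in this tick */

    for (int t = 0; t < TICKS; ++t) {
        int s1_next = s1, s2_next = s2; /* writers' local copies */

        s1_next = t + 1;                /* stage1: s1 = comp1() */
        if (t >= 1) {
            s2_next = s1 * 10;          /* stage2: s2 = comp2(s1), last tick's s1 */
        }
        out[t] = (t >= 2) ? s2 : -1;    /* stage3: comp3(s2), last tick's s2 */

        /* End of the global tick: written values become visible. */
        s1 = s1_next;
        s2 = s2_next;
    }
}
```

Each value produced by stage 1 reaches stage 3 two ticks later, so the shared variables act as one-tick buffers between stages without any explicit queue.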

49

Divide and Conquer

Count the number of edges in an image.

input int[1024] image;
shared int edges = 0 combine with plus;

void main(void) {
  par( analyse(0, 511), analyse(512, 1023) );
}

void analyse(int start, int end) {
  int i;
  while (1) {
    edges = 0;
    for (i = start; i <= end; ++i) {   // end is inclusive
      ... image[i] ... ;
      edges++;
    }
    pause;
  }
}

50

Scheduling

• Light-Weight Static Scheduling:
– Take advantage of multicore performance while delivering time-predictability.
– Generate code to execute directly on hardware (bare metal/no OS).
– Thread allocation and scheduling order on each core decided at compile time by the programmer.
  • Develop a WCRT-aware scheduling heuristic.
  • Thread isolation allows for scheduling flexibility.
– Cooperative (non-preemptive) scheduling.

51

Scheduling

• Cores synchronise to fork/join threads and end each global tick.

• One core performs housekeeping tasks at the end of the global tick:
– Combining shared variables.
– Emitting outputs.
– Sampling inputs and triggering the next global tick.

52

Results

Multicore simulator (Xilinx MicroBlaze):
– Based on http://www.jwhitham.org/c/smmu.html and extended to be cycle-accurate and support multiple cores and a TDMA bus.

[Figure: Core 0 to Core n, each with its own data memory and instruction memory, connected by a TDMA shared bus to global memory.]

• Local data and instruction memories: 16KB each, 1 cycle access.
• Global memory: 32KB, 5 cycles access.
• TDMA bus: 5 cycles/core (bus schedule round = 5 * no. cores).
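To see how the TDMA schedule turns into a variable bus delay, the sketch below computes how long a core waits for the start of its next slot. It assumes, as a simplification not stated on the slide, that core c owns the cycles starting at c * slot_len within each round and that a global memory access must begin at the start of a slot. The function names are hypothetical.

```c
#include <assert.h>

/* Cycles a core must wait, starting at absolute cycle `now`, until the
 * start of its next TDMA slot. Round length = slot_len * n_cores. */
unsigned tdma_wait_cycles(unsigned now, unsigned core,
                          unsigned n_cores, unsigned slot_len)
{
    assert(n_cores > 0 && slot_len > 0 && core < n_cores);
    unsigned round = slot_len * n_cores;
    unsigned slot_start = core * slot_len; /* offset of the slot in a round */
    unsigned position = now % round;       /* current offset in the round */
    return (slot_start + round - position) % round;
}

/* Worst case under the same assumption: the request arrives one cycle
 * after the core's slot has started. */
unsigned tdma_worst_case_wait(unsigned n_cores, unsigned slot_len)
{
    assert(n_cores > 0 && slot_len > 0);
    return slot_len * n_cores - 1;
}
```

With 2 cores and 5-cycle slots the round is 10 cycles, so the wait ranges from 0 to 9 cycles. The worst case grows linearly with the number of cores, which matches the observation that global memory becomes increasingly expensive as cores are added.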


53

WCRT Execution Results

Able to achieve speedups for all programs. The benefit of multicore execution diminishes with an increasing number of cores due to overheads (bus, memory accesses, scheduling routines).

[Figure: WCRT versus number of cores for five benchmarks: FmRadio (1 to 4 cores), Fly by Wire (1 to 7), Life (1 to 8), Matrix (1 to 8) and 802.11a (1 to 10).]

54

Programming PTARM using ForeC

shared int sum = 1 combine with plus;

Execution of ForeC

int main(void) {
  SET_THREAD_LOCATION(0, _pt_hwt0);
  SET_THREAD_LOCATION(1, _pt_hwt1);
  SET_THREAD_LOCATION(2, _pt_idle);
  SET_THREAD_LOCATION(3, _pt_idle);

_pt_hwt0:
  initialize code;
  goto main;

_pt_hwt1:
  wait for par;
  goto f_2;

_pt_idle:
  goto _pt_idle;

main:
  fork f_1 and f_2;
_par_resume:
  return 0;

f_1:
  sum = 1;
  synchronization code;
  thread termination code;

f_2:
  sum = 2;
  synchronization code;
  thread termination code;
}

58

Non-Realtime Threads in ForeC

• A non-realtime thread (NRT) has:
– No strict timing requirements.
– Possibly unbounded execution time.
– Asynchronous computation.
– E.g., file archiving, compression, data analysis.

59

Non-Realtime Threads in ForeC

Splitting the execution time of NRTs into periods.

• Guarantee f() to execute for at least min_t and at most max_t in each global tick.
– When the period elapses, the execution pauses.
– Execution resumes in the next global tick.

// Non-realtime thread.
void nrt(void) {
  do {
    f();
  } until (min_t, max_t);
}

60

Non-Realtime Threads in ForeC

The do-until form:

// Non-realtime thread.
void nrt(void) {
  do {
    f();
  } until (min_t, max_t);
}

corresponds to:

// Non-realtime thread.
void nrt(void) {
  // Set deadline equal to the current time + min_t.
  setDeadline(min_t);

  // Enable timing exception and register a handler.
  enableException(max_t, handler);

  // Execute the body.
  f();

  // The body is finished executing.
  // Disable the timing exception.
  disableException();
  goto end;

  // Timing exception handler.
  handler: {
    // Save the execution context.
    pause;
    setDeadline(min_t);
    // Restore the execution context.
  }
  end:;
}

PTARM modifications

• Boot-up
– Modified to allow loading of multiple hardware threads.
• Exceptions
– Added the exception handler in the boot loader.
• Context saving
– Modified VHDL to save PC to LR.
– Saves registers onto the stack in the exception routine.

Tick Precise Allocation Device

Matthew Kuo
Main supervisor: Partha Roop

Introduction

Cache

[Figure axes: Performance, Timing Precision]

• Traditionally, caches:
– Bridge the memory gap.
– Small, fast piece of memory.
  • Temporal locality
  • Spatial locality
– Hardware controlled.
• Hard real-time systems:
– Need to compute the WCRT.
  • Needs a model of the architecture.
  • Cache models are complex and not tight.

Introduction

Scratchpad

• Small piece of memory.
• Software controlled.
• Requires an allocation algorithm:
– ILP
– Greedy
• Hard real-time systems:
– Easy to compute a tight WCRT.
– Reduces the average-case performance.
  • May also be worse than a cache for worst-case performance.
  • Not as efficient as caches.

Introduction

[Figure: Cache, Scratchpad and TickPAD compared on performance and timing precision.]

Tick Precise Allocation Device

• TickPAD: Tick Precise Allocation Device
• Memory controller
– Hybrid between caches and scratchpads.
  • Software-controlled memory, like a scratchpad.
  • Hardware-controlled features.
• For hard real-time synchronous programs.

TickPAD System Specifications

• 1 cache line = 4 instructions (addresses 0x00, 0x04, 0x08, 0x0C).
• A cache line takes 1 burst transfer from main memory.
• Buffers are 1 cache line in size (4 x 32 bits).

TickPAD – scratchpad memory for synchronous programs

Component | Purpose
Spatial Memory Pipeline | To accelerate linear code.
Associative Loop Memory | For predictable temporal locality. Statically allocated, dynamically loaded.
Tick Address Queue | Stores the resumption addresses of active threads.
Tick Instruction Buffer | Stores the instructions at the resumption of the next active thread, to reduce context switching overhead at state/tick boundaries.
Command Table | Stores a set of commands to be executed by the TickPAD controller. Command: the type of operation. Address: the PC value at which the command is activated. Operand: stores data needed for the command.
Command Buffer | A buffer to store operands fetched from main memory, for commands requiring 2+ operands.

Spatial Memory Pipeline

• Exploit spatial locality
– Predictably prefetch the next line of instructions.

[Figure (animation): Spatial Memory Pipeline datapath. Main Memory feeds SMP Buffer 1 and SMP Buffer 2, the Tick FIFO, the Associative Loop Memory and the Command Buffer through demultiplexers. Control logic uses the address tag and block offset (Address[32], TAG, WriteEn), an instruction check (hasBranch) and a toggle on branch to select which buffer supplies Instruction[32].]

[Figure (timeline): Processor execution with the Execute Buffer and Fetch Buffer alternating over lines 310 (instructions 310, 314, 318, 31C), 320, 330 and 3B0. For linear code the next line is fetched while the current line executes; prefetching is disabled at a branch, and the processor stalls while the branch target 3B0 is fetched.]

Command Table

• A look-up table to dynamically load:
– Tick Instruction Buffer
– Tick Queue
– Associative Loop Memory
• Statically allocated.
• Commands are executed when the PC matches the address stored in the command.
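The look-up behaviour can be sketched in C as a linear search over a statically allocated table. The real Command Table is a hardware structure; the `command` struct, the enumerator names and `commands_for_pc` below are hypothetical and only model "execute the commands whose address equals the PC".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Kinds of operation a command can request (illustrative names). */
typedef enum {
    CMD_STORE_TICK_QUEUE,
    CMD_LOAD_TICK_BUFFER,
    CMD_DISCARD_LOOP_MEMORY,
    CMD_STORE_LOOP_MEMORY
} command_type;

/* One table entry: operation, activating PC value, operand. */
typedef struct {
    uint32_t address;
    command_type type;
    uint32_t operand;
} command;

/* Collects pointers to all commands activated at `pc`, in table order.
 * Returns how many were stored in `out` (at most max_out). */
size_t commands_for_pc(const command *table, size_t n, uint32_t pc,
                       const command **out, size_t max_out)
{
    assert(table != NULL && out != NULL);
    size_t found = 0;
    for (size_t i = 0; i < n && found < max_out; ++i) {
        if (table[i].address == pc) {
            out[found++] = &table[i];
        }
    }
    return found;
}
```

Because the table is fixed at compile time, the set of commands triggered at any PC value is known statically, which is what keeps the timing model simple.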

TickPAD Design flow

[Figure: Design flow in three numbered steps. (1) Graph construction turns the PRET-C program into a TCCFG. (2) TickPAD allocation analysis, using reachability analysis, produces the TickPAD configuration file and an updated TCCFG. (3) TickPAD timing analysis computes the worst case reaction time.]

Command Table Allocation

Node | Command | Address
FORK | Store Tick Address Queue x N | Address of FORK
EOT | Store Tick Address Queue; Load Tick Instruction Buffer | Address of EOT
KILL | Load Tick Instruction Buffer | Address of KILL
Loops | Discard Loop Associative Memory; Store Loop Associative Memory | Address at start of loop

[Figure: Example TCCFG annotated with node addresses (2A0 to 9A0), with the addresses of the EOT nodes highlighted.]




Tick Address Queue and Tick Instruction Buffer

• Reduce the cost of context switching.
• Make context switching points appear as linear code.
– Paired using the Spatial Memory Pipeline.

• Tick Queue: stores an ordered list of the resumption addresses of each thread.
• Tick Buffer: stores the instructions of the next active thread (for example the line 2B0, 2B4, 2B8, 2BC).

Commands:
1. Discard and Store Associative Loop Memory
2. Fetch Tick Address Queue and Fill Tick Instruction Buffer
3. Load Tick Address Queue

[Figure (animation): The PC steps through 2B0, 2C0, 2F0, 300, 310 and 4F0 on the example TCCFG. The Tick Queue first holds 980, 4F0, 2F0. The entry 2F0 is moved to the Tick Buffer before the PC reaches 2F0. At PC 310 the address 310 is added to the queue, and 4F0 is moved to the Tick Buffer before the PC reaches 4F0.]



• Statically allocated
– Greedy
– ILP
• Fetches the loop before executing
– Predictable: easy to model, and the model is tight.
– Exploits temporal locality.

Results

8.5% compared to locked scratchpad memory.
12.3% compared to thread interleaved of scratchpad.

Results - Synthesis

Conclusions

• C-based synchronous languages for writing deterministic, time-predictable software.
– PRET-C: Single-cores
– ForeC: Multicores
• Can achieve WCRT speedup while providing time-predictability.
• Very precise and fast timing analysis for PRET-C and ForeC programs using reachability.

Conclusions

• A new time-precise memory architecture: TickPAD.
• Showed that the use of TickPAD is comparable to using cache and scratchpad memories.
• Future direction
– The use of TickPAD for data caches.
– Implement TickPAD on the Precision Timed Architecture.

116

Questions?

117

Outline

• Introduction
• ForeC Language
• Timing Analysis
• Results
• Conclusions

118

Timing Analysis

Compute the program’s worst-case reaction time (WCRT).

[Figure: Global ticks mapped onto physical time (1 s, 2 s, 3 s, 4 s).]

• Time for a tick: specified by the system’s timing requirements.
• Reaction time: must fit within the time for a tick.
• Must validate: max(Reaction time) < min(Time for each tick)


119

Timing Analysis

Existing approaches for synchronous programs:
• Integer Linear Programming (ILP)
• “Coarse-grained” Reachability (Max-Plus)
• Model Checking

One existing approach for analysing the WCRT of synchronous programs on multicores:
• [Ju et al 2010] Timing Analysis of Esterel Programs on General-Purpose Multiprocessors.
• Uses ILP, no tightness result, all experiments performed on a 4-core processor.

120

Timing Analysis

Existing approaches for synchronous programs.
• Integer Linear Programming (ILP)
– Execution time of the program described as a set of integer equations.
– Solving ILP is NP-complete.

[Ju et al 2010] Timing Analysis of Esterel Programs on General-Purpose Multiprocessors.

121

Timing Analysis

Existing approaches for synchronous programs.
• “Coarse-grained” Reachability (Max-Plus)
– Compute the WCRT of each thread.
– Using the thread WCRTs, the WCRT of the program is computed.
– Assumes there is a global tick where all threads execute their worst-case.

[M. Boldt et al 2008] Worst Case Reaction Time Analysis of Concurrent Reactive Programs.

122

Timing Analysis

Existing approaches for synchronous programs.
• Model Checking
– Computes the execution time along all possible execution paths.
– State-space explosion problem.
– Binary search: Check the WCRT is less than “x”.
– Trades off analysis time for precision.
– Counter example: Execution trace for the WCRT.

[P. S. Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs.
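The binary search over candidate bounds can be sketched as follows. The model checker is represented by a callback that answers whether the WCRT is at most a given bound; that callback type, `toy_oracle` and the search limits are assumptions for illustration and do not correspond to any particular tool's interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Answers the query "is the WCRT <= bound?". In a real flow this
 * would invoke a model checker; here it is an abstract callback. */
typedef bool (*wcrt_at_most_fn)(unsigned long bound, void *ctx);

/* Finds the smallest bound in [lo, hi] for which the query holds.
 * Precondition: the query holds for hi. */
unsigned long find_wcrt(wcrt_at_most_fn holds, void *ctx,
                        unsigned long lo, unsigned long hi)
{
    assert(holds != NULL && lo <= hi);
    while (lo < hi) {
        unsigned long mid = lo + (hi - lo) / 2;
        if (holds(mid, ctx)) {
            hi = mid;      /* WCRT <= mid: search the lower half */
        } else {
            lo = mid + 1;  /* counter-example found: WCRT > mid */
        }
    }
    return lo;
}

/* Stand-in oracle for testing: ctx points to the "true" WCRT. */
bool toy_oracle(unsigned long bound, void *ctx)
{
    return *(const unsigned long *)ctx <= bound;
}
```

Each query is a full model-checking run, so the number of iterations (logarithmic in the search range) is what trades analysis time against precision; stopping early leaves a safe but looser bound in hi.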

123

Timing Analysis

Proposed “fine-grained” Reachability approach:
• Only consider local ticks that can execute together in the same global tick.
• Timed execution trace for the WCRT.
• To handle the state-space explosion:
– Reduce the program’s CCFG before analysis.

[Figure: Analysis flow. Program binary (annotated) → reconstruct the program’s CCFG → find all global ticks (reachability) → WCRT.]

124

Timing Analysis

Programs executed on the following multicore architecture:

[Figure: Core 0 to Core n, each with its own data memory and instruction memory, connected by a TDMA shared bus to global memory.]

125

Timing Analysis

Computing the execution time:
1. Overlapping of thread execution time from parallelism and inter-core synchronisations.
2. Scheduling overheads.
3. Variable delay in accessing the shared bus.

126

Timing Analysis

1. Overlapping of thread execution time from parallelism and inter-core synchronisations.
• An integer counter to track each core’s execution time.
• Synchronisation occurs when forking/joining, and ending the global tick.
• Advance the execution time of participating cores.

[Figure: main forks f1 and f2. Core 1 executes main and f1; Core 2 executes f2.]
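The counter-based bookkeeping can be sketched in C. This is an assumption-laden illustration rather than the analysis tool's code: the `core_clock` type is hypothetical, cores are indexed from 0, synchronisation itself is treated as free, and the cycle counts in the example are invented.

```c
#include <assert.h>

#define MAX_CORES 8

/* One integer counter per core (hypothetical type). */
typedef struct {
    unsigned long time[MAX_CORES];
    unsigned n; /* number of cores in use */
} core_clock;

/* Account for `cycles` of thread execution on one core. */
void run_on_core(core_clock *c, unsigned core, unsigned long cycles)
{
    assert(core < c->n);
    c->time[core] += cycles;
}

/* Fork, join or end of global tick: every core advances to the time
 * of the slowest core. Returns that time. */
unsigned long synchronise(core_clock *c)
{
    assert(c->n > 0 && c->n <= MAX_CORES);
    unsigned long latest = c->time[0];
    for (unsigned i = 1; i < c->n; ++i) {
        if (c->time[i] > latest) {
            latest = c->time[i];
        }
    }
    for (unsigned i = 0; i < c->n; ++i) {
        c->time[i] = latest;
    }
    return latest;
}

/* main (10 cycles) on core 0, then f1 (30) on core 0 in parallel
 * with f2 (50) on core 1, then join. */
unsigned long example_reaction_time(void)
{
    core_clock c = { {0}, 2 };
    run_on_core(&c, 0, 10); /* main up to the fork */
    synchronise(&c);        /* fork */
    run_on_core(&c, 0, 30); /* f1 */
    run_on_core(&c, 1, 50); /* f2 */
    return synchronise(&c); /* join */
}
```

The result is 60 cycles rather than the sequential 90, because f1 and f2 overlap and only the longer of the two determines when the join completes.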

127

Timing Analysis

2. Scheduling overheads.
– Synchronisation: Fork/join and global tick.
  • Via global memory.
– Thread context-switching.
  • Copying of shared variables at the start of the thread’s local tick via global memory.

[Figure: Execution of main, f1 and f2 on Core 1 and Core 2 within a global tick, with synchronisation and thread context-switch overheads marked.]

128

Timing Analysis

2. Scheduling overheads.
– Required scheduling routines statically known.
– Analyse the scheduling control-flow.
– Compute the execution time for each scheduling overhead.

[Figure: Two alternative schedules of main, f1 and f2 on Core 1 and Core 2.]

129

Timing Analysis

3. Variable delay in accessing the shared bus.
– Global memory accessed by scheduling routines.
– TDMA bus delay has to be considered.

[Figure: Execution of main, f1 and f2 on Core 1 and Core 2.]

130

Timing Analysis

[Figure: TDMA bus slots alternate between Core 1 and Core 2 (1, 2, 1, 2, ...), shown against the execution of main, f1 and f2 on the two cores.]

132

Timing Analysis

CCFG optimisations:
– merge: Reduces the number of CFG nodes that need to be traversed.
– merge-b: Reduces the number of alternate paths in the CFG. (Reduces the number of global ticks.)
– Precision of the analysis is unaffected because we are not performing value analysis to prune infeasible paths.

133

Timing Analysis

CCFG optimisations:
– merge: Reduces the number of CFG nodes that need to be traversed.
– merge-b: Reduces the number of alternate paths in the CFG. (Reduces the number of global ticks.)

[Figure: A CFG with node costs 1, 4, 3 and 1. merge collapses each path into one node: cost = 1 + 3 = 4 and cost = 1 + 4 + 1 = 6. merge-b then collapses the two alternate paths into a single node with cost = 6.]
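The two reductions in the figure amount to a sum along a path and a maximum across alternate paths. The helper names below are hypothetical, and the sketch ignores the graph structure entirely, working on plain arrays of costs.

```c
#include <assert.h>
#include <stddef.h>

/* merge: a chain of nodes becomes one node whose cost is the sum. */
unsigned long merge_sequence(const unsigned long *node_costs, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; ++i) {
        total += node_costs[i];
    }
    return total;
}

/* merge-b: alternate paths become one node whose cost is the largest
 * path cost, which is safe when only the worst case matters. */
unsigned long merge_branches(const unsigned long *path_costs, size_t n)
{
    assert(n > 0);
    unsigned long worst = path_costs[0];
    for (size_t i = 1; i < n; ++i) {
        if (path_costs[i] > worst) {
            worst = path_costs[i];
        }
    }
    return worst;
}
```

Applied to the figure, the paths {1, 3} and {1, 4, 1} merge to 4 and 6, and merge-b keeps 6. Taking the maximum loses no precision here because no value analysis would have ruled out the longer path anyway.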

135

Results

For the proposed reachability-based timing analysis, we demonstrate:
– the precision of the computed WCRT.
– the efficiency of the analysis, in terms of analysis time.

136

Results

Timing analysis tool:

[Figure: The annotated program binary is turned into the program CCFG (with optimisations) and analysed by either fine-grained reachability (proposed) or coarse-grained reachability (Max-Plus), both taking into account the 3 factors, to produce the WCRT.]

137




138

Results

Benchmark programs.

• Mix of control/data computations, thread structure and computation load.

* [Pop et al 2011] A Stream-Computing Extension to OpenMP.
# [Nemer et al 2006] A Free Real-Time Benchmark.

139

Results

• Each benchmark program was distributed over a varying number of cores.
– Up to the maximum number of parallel threads.
• Observed the WCRT:
– Test vectors to elicit different execution paths.
• Computed the WCRT:
– Proposed
– Max-Plus

140

802.11a Results

Observed:
• WCRT decreases until 5 cores.
• Global memory increasingly expensive.
• Scheduling overheads.

[Figure: WCRT (clock cycles, 0 to 200,000) versus cores (1 to 10) for Observed, Proposed and MaxPlus.]

141

802.11a Results

Proposed:
• ~2% over-estimation.
• Benefit of fine-grained reachability.

[Figure: WCRT (clock cycles, 0 to 200,000) versus cores (1 to 10) for Observed, Proposed and MaxPlus.]

142

802.11a Results

Max-Plus:
• Loss of execution context: Uses only the thread WCRTs.
• Assumes one global tick where all threads execute their worst-case.
• Max execution time of the scheduling routines.

[Chart: WCRT (clock cycles, 0 to 200,000) vs. Cores (1 to 10); series: Observed, Proposed, MaxPlus]
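A rough C sketch of a coarse-grained bound of this kind follows. It is an illustration, not the tool's algorithm, and it rests on assumptions the slides do not state: threads allocated to the same core run back to back, every thread takes its own WCRT in the same global tick, and a fixed maximum scheduling cost is charged per thread. The type and function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical description of one thread: its own WCRT in clock cycles
 * and the core it is allocated to. */
typedef struct {
    unsigned wcrt;
    unsigned core;
} Thread;

enum { MAX_CORES = 16 }; /* arbitrary limit for this sketch */

/* Coarse-grained bound for one global tick:
 * "plus" = sum the worst cases of the threads on each core,
 * "max"  = take the slowest core. */
unsigned coarse_tick_bound(const Thread *threads, size_t n, unsigned sched_max) {
    unsigned per_core[MAX_CORES] = {0};
    unsigned worst = 0;

    for (size_t i = 0; i < n; ++i) {
        if (threads[i].core >= MAX_CORES) {
            continue; /* ignore threads outside the modelled cores */
        }
        per_core[threads[i].core] += threads[i].wcrt + sched_max;
    }
    for (unsigned c = 0; c < MAX_CORES; ++c) {
        if (per_core[c] > worst) {
            worst = per_core[c];
        }
    }
    return worst;
}
```

Because the bound never asks whether the threads can actually reach their worst cases in the same tick, it cannot be tighter than an analysis that tracks which combinations of thread states are reachable.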

143

802.11a Results

Both approaches:
• Estimation of synchronisation cost is conservative. Assumed that the receive only starts after the last sender.

[Chart: WCRT (clock cycles, 0 to 200,000) vs. Cores (1 to 10); series: Observed, Proposed, MaxPlus]

144

802.11a Results

[Chart: Analysis time (seconds, 0 to 2,500) vs. Cores (1 to 10); series: Proposed]

Max-Plus takes less than 2 seconds.

145

802.11a Results

[Chart: Analysis time (seconds, 0 to 2,500) vs. Cores (1 to 10); series: Proposed, Proposed (merge)]

Max-Plus takes less than 2 seconds.

merge:
• Reduction of ~9.34x

146

802.11a Results

[Chart: Analysis time (seconds, 0 to 2,500) vs. Cores (1 to 10); series: Proposed, Proposed (merge), Proposed (merge-b)]

Max-Plus takes less than 2 seconds.

merge:
• Reduction of ~9.34x

merge-b:
• Reduction of ~342x
• Less than 7 sec.

147

Results

Reduction in states ⇒ reduction in analysis time

Number of global ticks explored.

148

Results

Proposed:
• ~1 to 8% over-estimation.
• Loss in precision mainly from over-estimating the synchronisation costs.

[Charts, one per benchmark: FmRadio (1 to 4 cores), Fly by Wire (1 to 7 cores), Life (1 to 8 cores), Matrix (1 to 8 cores); series: Observed, Proposed, MaxPlus]

149

Results

Max-Plus:
• Over-estimation very dependent on program structure.
• FmRadio and Life very imprecise. Loops iterating over par statement(s) multiple times. Over-estimations accumulate.
• Matrix quite precise. Executes in one global tick. Thus, thread WCRT assumption is valid.

[Charts, one per benchmark: FmRadio (1 to 4 cores), Fly by Wire (1 to 7 cores), Life (1 to 8 cores), Matrix (1 to 8 cores); series: Observed, Reachability, MaxPlus]

150

Results

• Our tool generates a timed execution trace for the computed WCRT:
– For each core: Thread start/end time, context-switching, fork/join, ...
– Can be used to tune the thread distribution.

• Was used to manually find good thread distributions for each benchmark program.

Outline


Conclusions

• ForeC language for deterministic parallel programming of embedded multicores.

• Based on the synchronous framework, but amenable to parallel execution.

• Can achieve WCRT speedup while providing time-predictability.

• Very precise and fast timing analysis for parallel programs using reachability.

Future work

• Complete the formal semantics of ForeC.
• Automatic WCRT-aware scheduling.
• Cache hierarchy.
• Prune additional infeasible paths using value analysis.

[Figure: tool flow across Programming, Compilation and Timing Analysis: thread distribution, ForeC source code, CCFG, static scheduling, compiled program, CCFG with assembly, architecture model, reachability, computed WCRT]

154

Questions?

155

Design Patterns

• Point-to-point
• Broadcast
• Software pipelining
• Divide and conquer
– Scatter/Gather
– Map/Reduce

156

Point-to-point

shared int sum = 0 combine with plus;

void main(void) {
  par( f(), g() );
}





157

Broadcast

shared int sum = 0 combine with plus;

void main(void) {
  par( f(), g(), g() );
}

Multiple receivers.



158

Software Pipelining

shared int s1 = 0, s2 = 0 combine with plus;

void main(void) {
  par( stage1(), stage2(), stage3() );
}

void stage1(void) {
  while (1) {
    s1 = comp1();
    pause;
  }
}

void stage2(void) {
  pause;
  while (1) {
    s2 = comp2(s1);
    pause;
  }
}

void stage3(void) {
  pause; pause;
  while (1) {
    comp3(s2);
    pause;
  }
}

Outputs from each stage are buffered.

Use the delayed behaviour of shared variables to buffer each stage.
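The buffering effect can be simulated in plain C. The sketch below is not ForeC and not generated code: it assumes that a value written to a shared variable during one tick only becomes visible to other threads in the next tick, and the bodies of comp1, comp2 and comp3 are made up so that the flow of data is visible.

```c
/* Hypothetical stage computations, chosen only for illustration. */
static int comp1(int tick) { return tick; }  /* item produced in this tick */
static int comp2(int x)    { return x * 10; }
static int comp3(int x)    { return x + 1; }

/* Simulates `ticks` global ticks and returns the last value computed by
 * stage 3 (0 if stage 3 has not run yet). */
int run_pipeline(int ticks) {
    int s1 = 0, s2 = 0; /* values visible during the current tick */
    int out = 0;

    for (int t = 1; t <= ticks; ++t) {
        int next_s1 = comp1(t);   /* stage1 runs in every tick */
        int next_s2 = s2;

        if (t >= 2) {
            next_s2 = comp2(s1);  /* stage2 starts after one pause */
        }
        if (t >= 3) {
            out = comp3(s2);      /* stage3 starts after two pauses */
        }

        /* Tick boundary: writes become visible for the next tick. */
        s1 = next_s1;
        s2 = next_s2;
    }
    return out;
}
```

In tick 3, stage 3 consumes the value that stage 2 computed in tick 2 from the item stage 1 produced in tick 1, so each shared variable acts as a one-tick buffer between neighbouring stages.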

159

Divide and Conquer

input int[1024] image;
int edges = 0;

void main(void) {
  analyse(0, 1023);
}

Count the number of edges in an image.

Sequential 1

160




Parallel 1

161

Divide and Conquer

input int[1024] image;
int edges = 0;

void main(void) {
  analyse(0, 1023);
}

Keep a running total of the number of edges in an image.

For the parallel version, it is not as easy as this.

Sequential 2

162



void analyse(int start, int end) {
  while (1) {
    edges = 0;
    for (i = start; i < end; ++i) {
      ... image[i] ... ;
      edges++;
    }
    pause;
  }
}

edges = (1+2) + (1+2) = 6

Parallel 2

163




Global: edges = 0
Local, analyse(0,511): edges = 0, then edges = 1
Local, analyse(512,1023): edges = 0, then edges = 2

edges = (1+2) + (1+2) = 6

Parallel 2

164




Global: edges = 0, then edges = 3
Local, analyse(0,511): edges = 0, then edges = 1
Local, analyse(512,1023): edges = 0, then edges = 2

edges = (1+2) + (1+2) = 6

Parallel 2

165




Global: edges = 0, then edges = 3
Local, analyse(0,511): edges = 0, then edges = 1; edges = 3, then edges = 4
Local, analyse(512,1023): edges = 0, then edges = 2; edges = 3, then edges = 5

edges = (1+2) + (1+2) = 6

Parallel 2

166




Global: edges = 0, then edges = 3, then edges = 9
Local, analyse(0,511): edges = 0, then edges = 1; edges = 3, then edges = 4
Local, analyse(512,1023): edges = 0, then edges = 2; edges = 3, then edges = 5

edges = (1+2) + (1+2) = 6

Parallel 2

167




Global: edges = 0, then edges = 3, then edges = 9
Local, analyse(0,511): edges = 0, then edges = 1; edges = 3, then edges = 4
Local, analyse(512,1023): edges = 0, then edges = 2; edges = 3, then edges = 5

edges = (1+2) + (1+2) = 6

We should track the running total separately from the number of new edges.

Parallel 2

168

Divide and Conquer

input int[1024] image;
typedef struct { int total; int new; } Edges;
shared Edges edges = { .total = 0, .new = 0 } combine with accum;

Edges accum(Edges copy1, Edges copy2) {
  copy1.total = copy1.total + copy1.new + copy2.new;
  copy1.new = 0;
  return copy1;
}

void analyse(int start, int end) {
  while (1) {
    edges.new = 0;
    for (i = start; i < end; ++i) {
      ... image[i] ... ;
      edges.new++;
    }
    pause;
  }
}
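The difference between combining with plus and combining with accum can be checked with a small C simulation. This is a sketch of the copy-and-combine behaviour as the slides present it, not ForeC itself: each thread is assumed to start a tick with a copy of the global value, and the copies are combined at the end of the tick. The field is called `fresh` here instead of `new`, and `found1` and `found2` (edges found per tick by each thread) are hypothetical parameters.

```c
typedef struct {
    int total;
    int fresh; /* plays the role of `new` in the slide */
} Edges;

/* Combine function from the slide, with `new` renamed to `fresh`. */
Edges accum(Edges copy1, Edges copy2) {
    copy1.total = copy1.total + copy1.fresh + copy2.fresh;
    copy1.fresh = 0;
    return copy1;
}

/* Running total combined with plus: each copy already contains the
 * previous total, so the total is counted once per thread. */
int run_plus(int ticks, int found1, int found2) {
    int global = 0;
    for (int t = 0; t < ticks; ++t) {
        int copy1 = global + found1; /* thread analysing the first half */
        int copy2 = global + found2; /* thread analysing the second half */
        global = copy1 + copy2;      /* combine with plus at end of tick */
    }
    return global;
}

/* Running total combined with accum: only the new edges are added. */
int run_accum(int ticks, int found1, int found2) {
    Edges global = { 0, 0 };
    for (int t = 0; t < ticks; ++t) {
        Edges copy1 = global;
        Edges copy2 = global;
        copy1.fresh = found1;
        copy2.fresh = found2;
        global = accum(copy1, copy2); /* combine at end of tick */
    }
    return global.total;
}
```

With one edge per tick in the first half and two in the second, two ticks give 9 under plus but 6 under accum, which matches edges = (1+2) + (1+2) = 6.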

edges = (1+2) + (1+2) = 6

Parallel 3

169





edges = (1+2) + (1+2) = 6

Global Local

analyse(0,511)

analyse(512,1023)

Global: edges = { .total=0, .new=0 }
Local: edges = { .total=0, .new=0 }, then edges = { .total=0, .new=1 }


Parallel 3

170





edges = (1+2) + (1+2) = 6

Global Local

analyse(0,511)

analyse(512,1023)





Parallel 3

171




void analyse(int start, int end) {
  while (1) {
    edges.new = 0;
    for (i = start; i < end; ++i) {
      ... image[i] ... ;
      edges.new++;
    }
    pause;
  }
}

edges = (1+2) + (1+2) = 6

Global Local

analyse(0,511)

analyse(512,1023)







Parallel 3

172




void analyse(int start, int end) {
  while (1) {
    edges.new = 0;
    for (i = start; i < end; ++i) {
      ... image[i] ... ;
      edges.new++;
    }
    pause;
  }
}

edges = (1+2) + (1+2) = 6

Global Local

analyse(0,511)

analyse(512,1023)








Parallel 3

Introduction

• Existing parallel programming solutions.
– Shared memory model.
  • OpenMP, Pthreads
  • Intel Cilk Plus, Threading Building Blocks
  • Unified Parallel C, ParC, X10
– Message passing model.
  • MPI, SHIM
– Provide ways to manage shared resources but do not prevent concurrency errors.

[OpenMP] http://openmp.org
[Pthreads] https://computing.llnl.gov/tutorials/pthreads/
[X10] http://x10-lang.org/
[Intel Cilk Plus] http://software.intel.com/en-us/intel-cilk-plus
[Intel Threading Building Blocks] http://threadingbuildingblocks.org/
[Unified Parallel C] http://upc.lbl.gov/
[Ben-Asher et al] ParC – An Extension of C for Shared Memory Parallel Processing.
[MPI] http://www.mcs.anl.gov/research/projects/mpi/
[SHIM] SHIM: A Language for Hardware/Software Integration.

Introduction

• Deterministic runtime support.
– Pthreads
  • dOS, Grace, Kendo, CoreDet, Dthreads.
– OpenMP
  • Deterministic OMP
– Concept of logical time.
– Each logical time step broken into an execution and communication phase.

[Bergan et al 2010] Deterministic Process Groups in dOS.
[Olszewski et al 2009] Kendo: Efficient Deterministic Multithreading in Software.
[Bergan et al 2010] CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution.
[Liu et al 2011] Dthreads: Efficient Deterministic Multithreading.
[Aviram 2012] Deterministic OpenMP.

ForeC Language

• Behaviour of shared variables is similar to:
– Intel Cilk+ (Reducers)
– Unified Parallel C (Collectives)
– DOMP (Workspace consistency)
– Grace (Copy-on-write)
– Dthreads (Copy-on-write)
