
DRAFTS1

DRAFTS: Distributed Real-time Applications

Fault Tolerant Scheduling

Claudio Pinello ([email protected])

DRAFTS2

Motivation

• Drive-by-Wire applications

DRAFTS3

Motivation

• No mechanical rods: increased passive safety

• Interior design freedom

BMW, Daimler, Citroën, Chrysler, Bertone, SKF, etc.

DRAFTS4

Problem Overview

• Fault tolerance: redundancy is key

• Safety: system failure must be as unlikely as in traditional systems

DRAFTS5

Faults

• SW faults: bugs
  – can be reduced by disciplined coding
  – even better: by code generation

• HW faults
  – harsh environment
  – many units (>50 microprocessors in a car; subsystems with 10-15 µPs)

DRAFTS6

Fault Model

• Silent faults
  – faults result in omission errors
• Detectable faults
  – faults result in detectably corrupted data (e.g. CRC-protected channels)
• Non-silent faults
  – faults result in value errors
• Byzantine faults
  – malicious attacks, non-silent faults, unbounded delays, etc.

DRAFTS7

Software Redundancy

• Space redundancy
  – execute replicas on different HW
  – send results on different/multiple channels

DRAFTS8

N-copies Solution

• Pros:
  – reduced cost
• Cons:
  – degradation, 1x speed
  – multiple designs

[Figure: four replicas of the full controller graph (AbstractInput, FineCTRL, CoarseCTRL, Iterator, ArbiterBest, AbstractOut) around the Plant, contrasted with two replicas of a simplified graph (AbstractInput, Iterator, AbstractOut)]

• Pros:
  – design once
• Cons:
  – N-x costs, 1x speed

DRAFTS9

Redundancy Management

• Managing a distributed system with multiple results requires careful programming:
  – keep the N copies synchronized
  – exchange and apply results
  – detect and isolate faults
  – recover
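The run-time steps above can be sketched as one iteration of a hypothetical replica manager; the names and structure are illustrative, not taken from the actual DRAFTS implementation.

```python
def manage_replicas(replica_results, expected_replicas):
    """One iteration of redundancy management under the silent-fault model.

    replica_results: dict replica_id -> result, with None for an omission.
    Returns (chosen_value, faulty_ids).
    """
    # detect and isolate: a missing result marks its replica as faulty
    faulty = [r for r in expected_replicas if replica_results.get(r) is None]
    valid = {r: v for r, v in replica_results.items() if v is not None}
    if not valid:
        raise RuntimeError("all replicas failed this iteration")
    # exchange and apply: under silent faults any surviving copy is correct;
    # picking the lowest replica id keeps all nodes synchronized on one value
    return valid[min(valid)], faulty
```

Recovery (restarting or re-integrating a faulty replica) would happen outside this per-iteration step.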

DRAFTS10

Possible solutions

Off-the-shelf solutions:
• TTP-based architectures
• FT-CORBA middleware

Synthesis:
• debugged and portable libraries
• development tools

DRAFTS11

Automotive Domain

• Production costs dominate NRE costs
  – multi-vendor supply chain
  – interest in full utilization of architectures
• Validation and certification are critical
  – validate process
  – validate product

DRAFTS12

Shortcomings of OTS solutions

• TTP
  – proprietary communication network
  – network redundancy defaults to 2-way
  – active replication: potential underutilization of resources
• FT-CORBA
  – fairly large middleware overhead

DRAFTS13

Synthesis-based Solution

• Synthesize only the needed glue code
  – at the extreme: get rid of the OS
• Customizable replication mechanisms
  – use passive replicas
• Treat the architecture as a distributed execution machine
  – exploit parallelism to speed up execution

DRAFTS14

Schedule Synthesis

[Figure: Mapping. The application graph (AbstractInput, FineCTRL, CoarseCTRL, Iterator, ArbiterBest, AbstractOut) with the Plant on one side, a six-CPU architecture on the other; replicated actors (Sens, Input, CoarseCTRL, FineCTRL, ArbiterBest, Output, Act, Iterator) are assigned to the CPUs by the synthesized schedule]

DRAFTS15

Synthesis-based Solution

• Enables fast architecture exploration

DRAFTS16

Contributions

• Programming Model

• Metropolis platform

• Schedule synthesis tool and optimization strategy

• Verification Tools

DRAFTS17

Programming Model

• Definition of a programming model that
  – is amenable to specifying feedback controllers
  – is convenient for analysis, simulation and synthesis
  – supports degraded functionality/accuracy
  – supports redundancy
  – is deterministic

DRAFTS18

Static Data-flow Model

• Pros:
  – Deterministic behavior
    • actors perform deterministic computations (no internal state)
    • all inputs are required to fire an actor
  – Explicit parallelism
  – Good for periodic algorithms
• Shortcomings:
  – Requires all inputs to fire an actor, but source actors may fail!

[Figure: a small dataflow graph: actors A and B feed actor C]
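The all-inputs firing rule above can be sketched in a few lines (a toy model, not the Metropolis implementation):

```python
def can_fire(input_channels):
    """SDF firing rule: an actor is enabled iff every input queue has a token."""
    return all(len(ch) > 0 for ch in input_channels)

def fire(actor_fn, input_channels):
    """Consume one token per input and return the actor's output,
    or None when the actor is blocked (e.g. because a source failed)."""
    if not can_fire(input_channels):
        return None
    tokens = [ch.pop(0) for ch in input_channels]
    return actor_fn(*tokens)
```

The `None` branch is exactly the shortcoming noted above: a single failed source blocks every downstream actor forever.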

DRAFTS19

Pendulum Example

[Figure: pendulum controller dataflow: AbstractInput feeds FineCTRL (Linear) and CoarseCTRL (Bang-Bang); ArbiterBest picks the best result for AbstractOut, which drives the Plant; the Iterator closes the periodic loop]

DRAFTS20

Model Extensions

• Node criticality
• Node typing (sensor, input, arbiter, etc.)
• Some types (input and arbiter) can fire with missing inputs
• Tokens have “Epoch” and “Valid” fields
• Specialized single-place buffer links
  – manage redundant sources (and destinations)

DRAFTS21

Data Tokens: Epoch

• Epoch is the iteration index of the periodic algorithm

• Actors ask for “current” inputs

• Using >= we can account for missing results (self-synchronization)

[Token layout: Data | Epoch | Valid]

DRAFTS22

Data Tokens: Valid

• Valid models the effect of fault detection:
  – True: data was received/produced correctly
  – False: data was not received on time or was corrupted

• Firing rules (and actors) may use it to change their behavior

[Token layout: Data | Epoch | Valid]
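A minimal sketch combining the Epoch and Valid fields from the two slides above into a token structure and an arbiter-style firing rule (all names are illustrative, not the actual FTDF API):

```python
from dataclasses import dataclass

@dataclass
class Token:
    data: object
    epoch: int   # iteration index of the periodic algorithm
    valid: bool  # False: not received on time, or detectably corrupted

def current_inputs(tokens, epoch):
    """Self-synchronization: accept tokens whose epoch is >= the requested
    one, so a consumer that missed an iteration skips forward."""
    return [t for t in tokens if t.epoch >= epoch and t.valid]

def arbiter_fire(tokens, epoch):
    """An arbiter-type actor fires with any non-empty subset of current
    inputs, here simply taking the first available result."""
    inputs = current_inputs(tokens, epoch)
    return inputs[0].data if inputs else None
```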

DRAFTS23

FTDataFlow modeling

• Metropolis is used as the framework to develop the set of tools

• FTDF is a platform library in Metropolis
  – modeling, simulation, fault injection
  – supports semi-automatic replication
  – results visualization

DRAFTS24

Actor Classes

• DF_SENactor: sensor actor
• DF_INactor: input actor
• DF_AINactor: abstract input actor
• DF_FUNactor: data-flow actor
• DF_ARBactor: arbiter actor
• DF_AOUTactor: abstract output actor
• DF_OUTactor: output actor
• DF_ACTactor: actuator actor

• DF_MEM: state memory
• DF_Injector: fault injection

DRAFTS25

Pendulum Example

[Figure: the pendulum dataflow graph with an Inject actor added for fault injection]

DRAFTS26

Simulation output

[Figure: simulation trace; the fault instant is marked]

DRAFTS27

Summary on FTDF

• Extended SDF to deal with
  – missing/redundant inputs
  – different criticality
  – functionality types

• Developed a Metropolis platform
  – modeling, simulation, fault injection, visualization of results
  – support for adding redundancy

DRAFTS28

Architecture Model

• Architecture
  – Connectivity: bipartite graph
  – Computation and communication times: actor/CPU and data/channel matrices of execution and transmission times

• Same as the SynDEx model

[Figure: six CPUs connected by communication channels]
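The SynDEx-style model above can be sketched as plain data structures; all component names and times below are made-up examples, not from the DRAFTS case study.

```python
cpus = ["CPU0", "CPU1", "CPU2"]
channels = ["bus0"]
# bipartite connectivity: which CPUs sit on which channel
connectivity = {"bus0": ["CPU0", "CPU1", "CPU2"]}
# actor/CPU matrix: worst-case execution time of each actor on each CPU (ms)
exec_time = {
    ("FineCTRL", "CPU0"): 2.0, ("FineCTRL", "CPU1"): 2.5,
    ("CoarseCTRL", "CPU0"): 0.8, ("CoarseCTRL", "CPU2"): 0.9,
}
# data/channel matrix: worst-case transmission time of each item (ms)
trans_time = {("ctrl_out", "bus0"): 0.3}

def wcet(actor, cpu):
    """An infinite cost encodes 'this actor cannot run on this CPU'."""
    return exec_time.get((actor, cpu), float("inf"))
```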

DRAFTS29

Fault Behavior

• Failure patterns
  – subsets of the architecture graph that may fail simultaneously

• For each failure pattern, specify a criticality level
  – i.e. which functionalities must be guaranteed
  – typically, for the empty failure pattern all functionality must be guaranteed
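A fault-behavior specification of this kind could be written down as a simple table; the component and functionality names are illustrative only.

```python
# Each failure pattern (a subset of architecture components that may fail
# together) maps to the functionalities that must still be guaranteed.
fault_behavior = {
    frozenset(): {"FineCTRL", "CoarseCTRL"},   # empty pattern: everything
    frozenset({"CPU1"}): {"CoarseCTRL"},       # one CPU down: degrade
    frozenset({"bus0"}): {"CoarseCTRL"},       # one channel down: degrade
}

def required(failed_components):
    """Functionalities the schedule must guarantee under a failure pattern."""
    return fault_behavior.get(frozenset(failed_components), set())
```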

DRAFTS30

Synthesis Problem

• Given
  – Application
  – Architecture
  – Fault behavior

• Derive
  – Redundancy
  – Schedule

[Figure: the application graph (AbstractInput, FineCTRL, CoarseCTRL, Iterator, ArbiterBest, AbstractOut) and the Plant are mapped onto the six-CPU architecture; replicated actors (Sens, Input, CoarseCTRL, FineCTRL, ArbiterBest, Output, Act, Iterator) are distributed across the CPUs by the synthesized schedule]

DRAFTS31

Pendulum Example

• Actuator/sensor location

• Tolerate any single fault
  – {empty}: all functionality
  – {one CPU}: may drop FineController, and the sensor/actuator on that CPU
  – {one channel}: may drop FineController

[Figure: three CPUs on a shared network; two host a Sens/Act pair, the third hosts only a Sens]

DRAFTS32

Refined I/O

[Figure: refined I/O graph: three Sens actors feed the Input; FineCTRL and CoarseCTRL run in parallel into ArbiterBest; the Output drives two Act actors on the Plant; the Iterator closes the loop]

DRAFTS33

Full Replication

[Figure: fully replicated graph: three Sens actors, two Input replicas, two CoarseCTRL replicas, one FineCTRL, two ArbiterBest replicas, two Output replicas, two Act actors, and two Iterator replicas]

DRAFTS34

Simulation output

[Figure: simulation trace of the fully replicated system]

DRAFTS35

Schedule Synthesis Strategy

• Leverage existing dataflow scheduling tools (e.g. SynDEx) to achieve a distributed static schedule that is also fault-tolerant

• At design time (off-line)
  – devise a redundant schedule

• At run time
  – trivial reconfiguration: skip actors that cannot fire

DRAFTS36

Generating Schedules

1. Full architecture
   a. Schedule all functionalities (maximum performance)
2. For each failure pattern
   a. Mark the faulty architecture components (critical functionalities cannot run there)
   b. Schedule all functionalities (add redundancy)
3. Merge the schedules

DRAFTS37

Generating Schedules

1. Full architecture
   a. Schedule all functionalities
2. For each failure pattern
   a. Mark the faulty architecture components
   b. Schedule the critical functionalities
3. Merge the schedules
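The procedure above can be sketched as follows, with `schedule` standing in for an external scheduling tool such as SynDEx. This is a deliberately simplified model: it maps each actor to one CPU per run and ignores routing.

```python
def generate_ft_schedule(actors, critical, cpus, failure_patterns, schedule):
    """schedule(actors, cpus) -> dict actor -> cpu (an external tool)."""
    merged = {}  # actor -> set of CPUs it ends up replicated on
    # 1. full architecture: schedule everything for maximum performance
    for actor, cpu in schedule(actors, cpus).items():
        merged.setdefault(actor, set()).add(cpu)
    # 2. each failure pattern: re-schedule the critical functionalities
    #    on the surviving components only
    for pattern in failure_patterns:
        alive = [c for c in cpus if c not in pattern]
        for actor, cpu in schedule(critical, alive).items():
            merged.setdefault(actor, set()).add(cpu)  # 3. merge: adds redundancy
    return merged
```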

DRAFTS38

Merge into FTS

• Care must be taken to deal with multiple routings; the merged schedule is clearly non-optimal

[Figure: merged fault-tolerant schedule on two ECUs. Each ECU runs an Input receiver (requires 1), Function1 (required), an Arbiter, and an Output driver (requires 1); ECU0 also runs Function2 (optional). Sensor1 and Actuator1 sit on ECU0; Sensor2 and Actuator2 sit on ECU1]

DRAFTS39

Heuristic 1: Limit CPU Load

1. Full architecture
   a. Schedule all functionalities
2. For each failure pattern
   a. Mark the faulty architecture components (critical functionalities cannot run there)
   b. Re-schedule only the critical functionalities (constrain the non-critical ones as in the full architecture)
3. Merge the schedules (redundancy for critical functionalities only)

DRAFTS40

Heuristic 2: Limit Bus Load

• Prune redundant communication

[Figure: the same two-ECU schedule with redundant inter-ECU messages pruned]

Heuristic 3: passive replicas (limit CPU load)

DRAFTS41

Total Orders

• For each processor and for each channel, find a total order that is compatible with the partial order of the FTS

• Prototype: “any compatible total order”
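One way to obtain “any compatible total order” is Kahn's topological sort over the precedence edges of the FTS. A sketch (illustrative, not the prototype's code):

```python
from collections import deque

def any_compatible_total_order(nodes, edges):
    """edges: (before, after) precedence pairs from the FTS partial order.
    Returns a total order on `nodes` compatible with every pair."""
    indeg = {n: 0 for n in nodes}
    succ = {n: [] for n in nodes}
    for a, b in edges:
        succ[a].append(b)
        indeg[b] += 1
    ready = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    return order
```

Run once per processor and once per channel, restricted to the actors (or messages) mapped onto that resource.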

DRAFTS42

Schedule optimization

• Exploit architectural redundancy as a performance boost (in the absence of faults)
  – replica overloading and deallocation
  – passive replicas
  – graceful degradation: reduced functionality (and resource demands) under faults

DRAFTS43

Active Replicas

[Figure: Behavior: actor A feeds B and C, which feed D. Architecture: two CPUs, P1 and P2. Active replication: each CPU executes a full copy of A, B, C, D]

DRAFTS44

Deallocation & Degradation

[Figure: Behavior: actor A feeds B and C, which feed D, on the same two-CPU architecture. Deallocation: the replicated load is split across the CPUs (e.g. B on one, C on the other), with cross-messages B->D and C->D over channels C1 and C2; under faults the deallocated replicas are dropped and functionality degrades]

DRAFTS45

Aggressive Heuristics

• Some heuristics can be certified not to break the fault-tolerance/fault behavior

• Others may need verification of the results
  – e.g. human inspection and modification

DRAFTS46

(Off-line) Verification

• Functional verification
  – for each failure pattern, the corresponding functionality is correctly executed

• Timing verification/analysis
  – worst-case iteration time under each fault

DRAFTS47

Functional Verification

• Apply equivalence checking methods to FT Schedule, under all fault scenarios (failure patterns)

• Based on application DAGs & Architecture graph
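A much-simplified sketch of this style of check: under each failure pattern, some critical output must remain reachable from the surviving components. The real tool performs equivalence checking against the application DAGs; this reachability model is only an illustration.

```python
def reachable(graph, sources, failed):
    """graph: node -> list of successors. BFS/DFS from surviving sources,
    skipping any node placed on a failed component."""
    seen, stack = set(), [s for s in sources if s not in failed]
    while stack:
        n = stack.pop()
        if n in seen or n in failed:
            continue
        seen.add(n)
        stack.extend(graph.get(n, []))
    return seen

def verify(graph, sources, critical_outputs, failure_patterns):
    """True iff every failure pattern leaves a critical output reachable."""
    return all(
        any(o in reachable(graph, sources, fp) for o in critical_outputs)
        for fp in failure_patterns
    )
```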

DRAFTS49

Functional Verification (example - continued)

[Figure: two application task graphs, one per actuator. In each, Sensor1 and Sensor2 feed an Input receiver (requires 1), then Function1 (required) and Function2 (optional) feed an Arbiter and an Output driver (requires 1) driving Actuator1 or Actuator2. These graphs are checked for equivalence against the merged two-ECU schedule graph]

• For the full-functionality case, the arbiter must include both functions.
• The output function only requires that one of the actuators be visible.
• In the other graphs (which include failures), the arbiter only needs the single required input (Function1).

Source: Sam Williams

DRAFTS50

Functional Verification: comments

• Takes milliseconds to run small cases; a few minutes for large schedules

• The tool was written in Perl (performance was sufficient)

• Schedule verification is performed offline (not time-critical)

• Credits: Sam Williams

DRAFTS51

Conclusions

• Contributions
  – Programming model: FTDF
  – Metropolis platform
  – Schedule synthesis tool (in collaboration with INRIA)
  – Schedule optimization strategy
  – Functional verification (in collaboration with Sam Williams)
  – Replica determinism analysis (not shown here)

DRAFTS52

Future Work

• Experiments on the DBW example
• Timing verification (real-time calculus)
• Interface/migrate the synthesis and verification tools with/to Metropolis
• Integrate optimization into synthesis
• Code generation (in collaboration with Mark McKelvin)

DRAFTS53

DBW example

DRAFTS54

Now…

• Interested in helping out?