DRAFTS1
Distributed Real-time Applications Fault-Tolerant Scheduling (DRAFTS)
Claudio Pinello ([email protected])
DRAFTS3
Motivation
• No mechanical rods: increased passive safety
• Interior design freedom
BMW, Daimler, Citroën, Chrysler, Bertone, SKF, etc.
DRAFTS4
Problem Overview
• Fault tolerance: redundancy is key
• Safety: system failure must be as unlikely as in traditional systems
DRAFTS5
Faults
• SW faults: bugs
  – can be reduced by disciplined coding
  – even better, by code generation
• HW faults
  – harsh environment
  – many units (>50 µprocessors in a car; subsystems with 10-15 µPs)
DRAFTS6
Fault Model
• Silent faults
  – faults result in omission errors
• Detectable faults
  – faults result in detectably corrupted data (e.g. CRC-protected channels)
• Non-silent faults
  – faults result in value errors
• Byzantine faults
  – malicious attacks, non-silent faults, unbounded delays, etc.
DRAFTS7
Software Redundancy
• Space redundancy
  – execute replicas on different HW
  – send results on different/multiple channels
DRAFTS8
N-copies Solution
• Pros:
  – reduced cost
• Cons:
  – degradation, 1x speed
  – multiple designs

[Figure: N independent copies of the controller data flow (Abstract Input → FineCTRL / CoarseCTRL → ArbiterBest → Abstract Output → Iterator → Plant), shown alongside simplified copies (Abstract Input → Abstract Output → Iterator → Plant)]

• Pros:
  – design once
• Cons:
  – N-x cost, 1x speed
DRAFTS9
Redundancy Management
• Managing a distributed system with multiple results requires careful programming:
  – keep the N copies synchronized
  – exchange and apply results
  – detect and isolate faults
  – recover
DRAFTS10
Possible solutions
• Off-the-shelf solutions:
  – TTP-based architectures
  – FT-CORBA middleware
• Synthesis:
  – debugged and portable libraries
  – development tools
DRAFTS11
Automotive Domain
• Production costs dominate NRE costs
  – multi-vendor supply chain
  – interest in full utilization of architectures
• Validation and certification are critical
  – validate the process
  – validate the product
DRAFTS12
Shortcomings of OTS solutions
• TTP
  – proprietary communication network
  – network redundancy defaults to 2-way
  – active replication: potential underutilization of resources
• FT-CORBA
  – fairly large middleware overhead
DRAFTS13
Synthesis-based Solution
• Synthesize only the needed glue code
  – at the extreme: get rid of the OS
• Customizable replication mechanisms
  – use passive replicas
• Treat the architecture as a distributed execution machine
  – exploit parallelism to speed up execution
DRAFTS14
Schedule Synthesis
[Figure: mapping of the application data flow (Sensors, Inputs, CoarseCTRL, FineCTRL, ArbiterBest, Outputs, Actuators, Iterator, Plant) onto a six-CPU architecture; replicated actors are placed on different CPUs]
DRAFTS16
Contributions
• Programming Model
• Metropolis platform
• Schedule synthesis tool and optimization strategy
• Verification Tools
DRAFTS17
Programming Model
• Definition of a programming model that
  – is amenable to specifying feedback controllers
  – is convenient for analysis, simulation and synthesis
  – supports degraded functionality/accuracy
  – supports redundancy
  – is deterministic
DRAFTS18
Static Data-flow Model
• Pros:
  – deterministic behavior
    • actors perform deterministic computation (no internal state)
    • all inputs are required to fire an actor
  – explicit parallelism
  – good for periodic algorithms
• Shortcomings:
  – requires all inputs to fire an actor, but source actors may fail!

[Figure: small data-flow graph with actors A and B feeding actor C]
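The all-inputs firing rule above, and the way a failed source actor stalls its consumers, can be sketched in Python (a hypothetical illustration; `Actor`, its fields and the tiny A/B/C graph are all made up for this sketch, not part of FTDF):

```python
from collections import deque

class Actor:
    """A static data-flow actor: fires only when every input queue holds a token."""
    def __init__(self, name, func, n_inputs):
        self.name, self.func = name, func
        self.inputs = [deque() for _ in range(n_inputs)]

    def can_fire(self):
        # The SDF firing rule: ALL inputs must hold a token.
        return all(q for q in self.inputs)

    def fire(self):
        args = [q.popleft() for q in self.inputs]
        return self.func(*args)

# A and B feed C, as in the small graph on the slide.
c = Actor("C", lambda a, b: a + b, n_inputs=2)
c.inputs[0].append(1)        # token from A arrives
assert not c.can_fire()      # B's token is missing: a failed B blocks C forever
c.inputs[1].append(2)        # token from B arrives
assert c.can_fire()
assert c.fire() == 3
```

This is exactly the shortcoming the slide points out: `can_fire` never becomes true if one source stops producing, which motivates the FTDF extensions that follow.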
DRAFTS19
Pendulum Example
[Figure: inverted-pendulum controller data flow: Abstract Input feeds CoarseCTRL (bang-bang) and FineCTRL (linear), which feed ArbiterBest → Abstract Output → Iterator → Plant]
DRAFTS20
Model Extensions
• Node criticality
• Node typing (sensor, input, arbiter, etc.)
• Some types (input and arbiter) can fire with missing inputs
• Tokens have "Epoch" and "Valid" fields
• Specialized single-place buffer links
  – manage redundant sources (and destinations)
DRAFTS21
Data Tokens: Epoch
• Epoch: the iteration index of the periodic algorithm
• Actors ask for "current" inputs
• Using a >= comparison on epochs we can account for missing results (self-synchronization)

[Token fields: Data | Epoch | Valid]
DRAFTS22
Data Tokens: Valid
• Valid models the effect of fault detection:
  – true: data was received/produced correctly
  – false: data was not received on time or was corrupted
• Firing rules (and actors) may use it to change their behavior

[Token fields: Data | Epoch | Valid]
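To make the two token fields concrete, here is a hypothetical Python sketch (the names `Token` and `arbiter_fire` are illustrative, not the FTDF API) of a firing rule that accepts any token whose epoch is at least the current iteration and lets an arbiter fire on the valid subset of its inputs:

```python
from dataclasses import dataclass

@dataclass
class Token:
    data: object
    epoch: int    # iteration index of the periodic algorithm
    valid: bool   # result of fault detection (e.g. CRC check, timeout)

def arbiter_fire(tokens, current_epoch):
    """Fire with whatever current, valid inputs arrived; missing, stale or
    corrupted inputs are skipped (self-synchronization via epoch >= current)."""
    usable = [t.data for t in tokens
              if t.valid and t.epoch >= current_epoch]
    if not usable:
        return None            # nothing usable this iteration
    return usable[0]           # e.g. pick the best-ranked available result

tokens = [Token("fine", epoch=7, valid=False),   # corrupted FineCTRL result
          Token("coarse", epoch=7, valid=True),  # CoarseCTRL result is fine
          Token("old", epoch=6, valid=True)]     # stale token from a past epoch
assert arbiter_fire(tokens, current_epoch=7) == "coarse"
```

Ordinary actors would still demand all inputs; only the special types (input, arbiter) use a relaxed rule like this one.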
DRAFTS23
FTDataFlow modeling
• Metropolis is used as the framework to develop the set of tools
• FTDF is a platform library in Metropolis
  – modeling, simulation, fault injection
  – supports semi-automatic replication
  – results visualization
DRAFTS24
Actor Classes
• DF_SENactor – sensor actor
• DF_INactor – input actor
• DF_AINactor – abstract input actor
• DF_FUNactor – data-flow actor
• DF_ARBactor – arbiter actor
• DF_AOUTactor – abstract output actor
• DF_OUTactor – output actor
• DF_ACTactor – actuator actor
• DF_MEM – state memory
• DF_Injector – fault injection
DRAFTS25
Pendulum Example
[Figure: the pendulum data flow with a fault-injection (Inject) actor added: Abstract Input → FineCTRL / CoarseCTRL → ArbiterBest → Abstract Output → Iterator → Plant]
DRAFTS27
Summary on FTDF
• Extended SDF to deal with
  – missing/redundant inputs
  – different criticality
  – functionality types
• Developed a Metropolis platform
  – modeling, simulation, fault injection, visualization of results
  – support for adding redundancy
DRAFTS28
Architecture Model
• Architecture
  – connectivity: bipartite graph
  – computation and communication times: actor/CPU and data/channel matrices of execution and transmission times
• Same as the SynDEx model

[Figure: architecture graph with six CPUs connected through channels]
DRAFTS29
Fault Behavior
• Failure patterns
  – subsets of the architecture graph that may fail simultaneously
• For each failure pattern, specify the criticality level
  – i.e. which functionalities must be guaranteed
  – typically, for the empty failure pattern all functionality must be guaranteed
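A fault behavior specification of this kind can be pictured as a small table (a hypothetical Python sketch; the component and functionality names come from the pendulum example, the encoding itself is made up):

```python
# Failure patterns: subsets of the architecture graph that may fail together,
# each mapped to the set of functionalities that must still be guaranteed.
fault_behavior = {
    frozenset():          {"FineCTRL", "CoarseCTRL"},  # no fault: full functionality
    frozenset({"CPU0"}):  {"CoarseCTRL"},              # one CPU down: may drop FineCTRL
    frozenset({"CH0"}):   {"CoarseCTRL"},              # one channel down: may drop FineCTRL
}

def required(failed_components):
    """Functionalities that must survive a given failure pattern; unknown
    patterns fall back to the empty-pattern (full functionality) entry."""
    key = frozenset(failed_components)
    return fault_behavior.get(key, fault_behavior[frozenset()])

assert required([]) == {"FineCTRL", "CoarseCTRL"}
assert required(["CPU0"]) == {"CoarseCTRL"}
```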
DRAFTS30
Synthesis Problem
• Given
  – application
  – architecture
  – fault behavior
• Derive
  – redundancy
  – schedule

[Figure: the application data flow (Sensors, Inputs, CoarseCTRL, FineCTRL, ArbiterBest, Outputs, Actuators, Iterator, Plant) mapped onto the six-CPU architecture, with replicated actors distributed across CPUs]
DRAFTS31
Pendulum Example
• Actuator/sensor location
• Tolerate any single fault
  – {empty}: all functionality
  – {one CPU}: may drop FineController, and the sensor/actuator on that CPU
  – {one channel}: may drop FineController

[Figure: three CPUs; two host a sensor and an actuator, one hosts a sensor only]
DRAFTS32
Refined I/O
[Figure: refined I/O data flow: three Sensors feed a single Input actor; CoarseCTRL, FineCTRL and Iterator feed ArbiterBest; a single Output drives two Actuators on the Plant]
DRAFTS33
Full Replication
[Figure: fully replicated data flow: Inputs, CoarseCTRL, Iterator, ArbiterBest and Outputs are all duplicated; FineCTRL remains single; three Sensors and two Actuators connect to the Plant]
DRAFTS35
Schedule Synthesis Strategy
• Leverage existing dataflow scheduling tools (e.g. SynDEx) to achieve a distributed static schedule that is also fault-tolerant
• At design time (off-line)
  – devise a redundant schedule
• At run time
  – trivial reconfiguration: skip actors that cannot fire
DRAFTS36
Generating Schedules
1. Full architecture
   1. Schedule all functionalities (maximum performance)
2. For each failure pattern
   1. Mark the faulty architecture components (critical functionalities cannot run there)
   2. Schedule all functionalities (this adds the redundancy)
3. Merge the schedules
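The three steps above can be sketched as follows (hypothetical Python; `schedule()` is a naive stand-in for a real dataflow scheduler such as SynDEx, which would minimize the makespan rather than place actors round-robin):

```python
def schedule(functionalities, faulty=frozenset()):
    """Stand-in scheduler: place each actor on some non-faulty CPU."""
    cpus = [c for c in ("CPU0", "CPU1", "CPU2") if c not in faulty]
    return {f: cpus[i % len(cpus)] for i, f in enumerate(sorted(functionalities))}

def generate_ft_schedule(functionalities, failure_patterns):
    merged = {f: set() for f in functionalities}
    # 1. Full architecture: schedule everything for maximum performance.
    for f, cpu in schedule(functionalities).items():
        merged[f].add(cpu)
    # 2. For each failure pattern, mark the faulty components and re-schedule;
    #    this is the step that adds redundancy.
    for pattern in failure_patterns:
        for f, cpu in schedule(functionalities, faulty=pattern).items():
            merged[f].add(cpu)
    # 3. Merge: each functionality keeps every placement it received.
    return merged

ft = generate_ft_schedule({"CoarseCTRL", "FineCTRL"},
                          [frozenset({"CPU0"}), frozenset({"CPU1"})])
# Every functionality survives the loss of any one CPU it was scheduled on.
for placements in ft.values():
    assert len(placements) >= 2
```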
DRAFTS37
Generating Schedules
1. Full architecture
   1. Schedule all functionalities
2. For each failure pattern
   1. Mark the faulty architecture components
   2. Schedule the critical functionalities
3. Merge the schedules
DRAFTS38
Merge into FTS
• Care must be taken to deal with multiple routings; the merged schedule is clearly non-optimal

[Figure: merged fault-tolerant schedule on ECU0 and ECU1: each ECU runs a Sensor, an Input receiver (requires 1), Function1 (required), Function2 (optional, on ECU0), an Arbiter, an Output driver (requires 1) and an Actuator]
DRAFTS39
Heuristic 1: Limit CPU Load
1. Full architecture
   1. Schedule all functionalities
2. For each failure pattern
   1. Mark the faulty architecture components (critical functionalities cannot run there)
   2. Re-schedule only the critical functionalities (constrain non-critical ones as in the full architecture)
3. Merge the schedules (redundancy for critical functionalities only)
DRAFTS40
Heuristic 2: Limit Bus Load
• Prune redundant communication

[Figure: the merged schedule of the previous slide with redundant inter-ECU messages removed]

Heuristic 3: passive replicas (limit CPU load)
DRAFTS41
Total Orders
• For each processor and for each channel, find a total order that is compatible with the partial order of the FT schedule
• Prototype: "any compatible total order"
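A total order compatible with the partial order is just a topological sort of the precedence constraints on each resource; Python's standard library provides one (a sketch with a made-up precedence graph for one CPU, not taken from the talk):

```python
from graphlib import TopologicalSorter

# Partial order of actors assigned to one CPU: each key's predecessors
# must run before it.
precedence = {
    "Input":      {"Sensor"},
    "CoarseCTRL": {"Input"},
    "Arbiter":    {"CoarseCTRL"},
    "Output":     {"Arbiter"},
}
# "Any compatible total order": static_order() yields one of them.
total_order = list(TopologicalSorter(precedence).static_order())

# Every precedence constraint is preserved in the chosen total order.
for later, earlier_set in precedence.items():
    for earlier in earlier_set:
        assert total_order.index(earlier) < total_order.index(later)
```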
DRAFTS42
Schedule optimization
• Exploit architectural redundancy as a performance boost (in the absence of faults)
  – replica overloading and deallocation
  – passive replicas
  – graceful degradation: reduced functionality (and resource demands) under faults
DRAFTS43
Active Replicas
[Figure: a behavior graph of four actors (A feeding B and C, which feed D), a two-CPU architecture (P1, P2), and the active-replication mapping where all four actors are replicated on both CPUs]
DRAFTS44
Deallocation & Degradation
[Figure: deallocation and degradation on the same four-actor behavior and two-CPU architecture: once the results B→D and C→D are exchanged over channels C1 and C2, the redundant replicas of B and C are deallocated from the other CPU]
DRAFTS45
Aggressive Heuristics
• Some heuristics can be certified not to break the fault-tolerance/fault behavior
• Others may require verification of the results
  – e.g. after human inspection and modification
DRAFTS46
(Off-line) Verification
• Functional verification
  – for each failure pattern, the corresponding functionality is correctly executed
• Timing verification/analysis
  – worst-case iteration time under each failure pattern
DRAFTS47
Functional Verification
• Apply equivalence-checking methods to the FT schedule, under all fault scenarios (failure patterns)
• Based on the application DAGs and the architecture graph
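A much-simplified sketch of the check (hypothetical Python, not Sam Williams's Perl tool; it ignores routing over channels and only checks actor coverage): for each failure pattern, delete the replicas mapped to failed components and verify that every required task still has a surviving replica.

```python
def surviving(ft_schedule, failed):
    """Actors that still have a replica after removing failed components."""
    return {actor for (actor, cpu) in ft_schedule if cpu not in failed}

def covered(required_tasks, alive):
    """Every required actor of the task graph has a surviving replica."""
    return all(actor in alive for actor in required_tasks)

# FT schedule encoded as (actor, cpu) replica pairs, as in the slide's example.
ft_schedule = {("Sensor", "ECU0"), ("Sensor", "ECU1"),
               ("Function1", "ECU0"), ("Function1", "ECU1"),
               ("Arbiter", "ECU0"), ("Arbiter", "ECU1"),
               ("Output", "ECU0"), ("Output", "ECU1")}
required_tasks = ["Sensor", "Function1", "Arbiter", "Output"]

# Verify under the empty pattern and under every single-ECU failure pattern.
for failed in [set(), {"ECU0"}, {"ECU1"}]:
    assert covered(required_tasks, surviving(ft_schedule, failed))
```

The real check is an equivalence check between the task graphs and the scheduled DAG, which also validates the data routing, not just actor placement.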
DRAFTS49
Functional Verification (example - continued)
[Figure: task graphs for Actuator1 and Actuator2 (Sensor1/Sensor2 → Input receiver (requires 1) → Function1 (required) / Function2 (optional) → Arbiter → Output driver (requires 1) → Actuator), matched against the merged FT schedule on ECU0 and ECU1]
• For the full-functionality case, the arbiter must include both functions.
• The output function only requires that one of the actuators be visible.
• In the other graphs (which include failures), the arbiter only needs the single required input (Function1).
Source: Sam Williams
DRAFTS50
Functional Verification: Comments
• Takes milliseconds for small cases, a few minutes for large schedules
• The tool was written in Perl (performance was sufficient)
• Schedule verification is performed off-line (not time-critical)
• Credits: Sam Williams
DRAFTS51
Conclusions
• Contributions
  – programming model FTDF
  – Metropolis platform
  – schedule synthesis tool (in collaboration with INRIA)
  – schedule optimization strategy
  – functional verification (in collaboration with Sam Williams)
  – replica determinism analysis (not shown here)
DRAFTS52
Future Work
• Experiments on the DBW example
• Timing verification (real-time calculus)
• Interface/migrate the synthesis and verification tools with/to Metropolis
• Integrate optimization into synthesis
• Code generation (in collaboration with Mark McKelvin)