Upload
neorah
View
21
Download
1
Tags:
Embed Size (px)
DESCRIPTION
ORTEGA: An Efficient and Flexible Online Fault Tolerance Architecture for Real-Time Control Systems. Xue Liu, Qixin Wang, Sathish Gopalakrishnan, Wenbo He, Lui Sha, Hui Ding, Kihwal Lee. Outline. Motivation and related work ORTEGA goals ORTEGA architecture Details of ORTEGA designs - PowerPoint PPT Presentation
Citation preview
1
ORTEGA: An Efficient and Flexible Online Fault Tolerance Architecture for Real-Time Control Systems
Xue Liu, Qixin Wang, Sathish Gopalakrishnan, Wenbo He, Lui Sha,
Hui Ding, Kihwal Lee
2
Outline
Motivation and related work ORTEGA goals ORTEGA architecture Details of ORTEGA designs Implementation and evaluation Demo
3
Motivations
Cyber-Physical Systems Real-world systems involves not only computer
science, but knowledge related to various disciplines.
Not only the computer system becomes more complex, the complexity of integrated system (i.e. the cyber-physical system) grows even faster.
Major challenge: how to let engineers of drastically different backgrounds collaborate with each other?
4
Motivations
Control Systems Conventional analog control systems Digital control systems
Computer Systems Real-time scheduling Fault tolerance Reliable/online software upgrade
We need to design a framework so that computer engineers and control engineers can easily collaborate and integrate their knowledge
Kxu
BuAxx
)()(
)()()(
khKxkhu
khBudsekhxehkhxh
0
AsAh
5
Motivations
Control Systems Conventional analog control systems Digital control systems
Computer Systems Real-time scheduling Fault tolerance Reliable/online software upgrade
We need to design a framework so that computer engineers and control engineers can easily collaborate and integrate their knowledge
Kxu
BuAxx
)()(
)()()1(
kKxku
kGukFxkx
6
Related work: Simplex architecture
Demand: Low cost development of upgraded control
systems for mission critical control applications instead of multi-versioning, just develop one version Focus on the control theories
Runtime upgrade/testing of the single version buggy new system.
Applications: Aircraft control (F-16, Seto et. al, 2000) Submarine control (NSSN, new attack
submarine program at US navy)
7
Simplex for real-time control
Simple high assurancecontrol subsystem (HAC)
Complex high performancecontrol subsystem (HPC)
Plant
Decision
Simplex Architecture
8
Simplex for real-time control
AxBKxxA
BuxAx
The above LTI control system is stable iff there exists a P>0, such that the Lyapunov function
0)( xPAPAx TT
Given LTI control system:
9
Simplex for real-time control
Maximum Stability Region (Recovery
Region)
Stability Region
Lyapunov Functions
State Constraints
We can choose smaller solution ellipsoid (i.e. xTPx < xTPmaxx) to leave margins to guard against model/actuator/measurement errors.
10
Drawbacks of Simplex
P1: Lack of Efficiency Analytically redundant high assurance controller
(HAC) runs in parallel with complex controller (HPC)
Lowers system performance, increase operating costs Limits the application of Simplex in only safety-critical
domains P2: Lack of Flexibility
Enforces the same execution period on HAC and HPC
In practice, different controllers may use different periods for different performance considerations
For example: fast HAC recovery
11
Design goals of ORTEGA
On-demand Real-TimE GuArd (ORTEGA) A new efficient fault tolerance software
architecture designed for real-time control systems
More efficient resource usage (P1) Through on-demand real-time recovery
Flexible design (P2) Allows HAC and HPC to run at different rates Through new design and schedulability analysis
Applicable to a wider range of real-time control systems
12
ORTEGA Architecture
13
On-demand execution of HAC
At any time, only one of the HAC or HPC is running to control the plant
Decision module (DM) uses a mutex semaphore to control which of the HAC and HPC is running When the HPC is running well, the HAC blocks on the
semaphore; Only when a fault is detected in the HPC, the DM
releases the semaphore to allow HAC to take over Decision logic is based on stability regions
Determined through Linear Matrix Inequality theory Details later
14
CPU savings of ORTEGA
HPC’s timing parameters: {Cp, Tp}; HAC’s timing parameters: {Ca, Ta};
Pr: the percentage of time for recovery (HAC) during a total time of T
• Total CPU resource usage under Simplex
• Total CPU resource usage under ORTEGA
• CPU resource usage savings:
15
No Free Lunch: An extra period of delay
up to Ta incurred due to the on-demand execution of HAC
k
Fault Detected
p p a a
p a
pa
p
k
Simplex
pa papa
p
p
a
pp
1st time of HAC ‘s control output takes action
1st time of HAC ‘s control output takes action
t
t
t2
t1
ORTEGA
Simplex
ORTEGA
16
Handle the extra delay by state projections
Stability Regiont: decision time
a
projected state x(t+Ta)
tt+Ta
current state x(t)
(1)Extra delay causes disturbances when fault occurs (infrequent)
(2)But the gain in resource usage is large.
Resource usage reduction v.s. extra delay :
17
Recovery region design
Maximum Stability Region (Recovery
Region)
Stability Region
Lyapunov Functions
State Constraints
• The decision module uses recovery region to determine when to switch to HAC
• Recovery region is defined as the maximum region in which the HAC can make the plant stable
18
Determine recovery region (1)
( ) ( )u k Kx k
( 1) ( ) (*)x k Fx k ( )F F GK
1 1 , . (1)Tmx m q State constraints:
Digital controllers:
Stability region: The discrete LTI control system is stable iff there exists a P>0, such that 0 PFPF T
19
Determine recovery region (1)
( ) ( )u k Kx k
( 1) ( ) (*)x k Fx k ( )F F GK
1 1 , . (1)Tmx m q State constraints:
Digital controllers:
Stability region: Stability region of the system with respect to P is defined as { | 1}Tx x Px
20
Determine recovery region (2)
Area of recovery region1logdetMaximize P
0s t P
1 1 1Tm mP m q
Theorem: Determine the maximum stability region of digital implemented closed loop system with constraints (1) can be
transformed to the following MAXDET (LMI) problem.
0TPF PF Stability
State constraints
Maximum Stability Region (Recovery
Region)
Stability Region
Lyapunov Functions
State Constraints
Maximum Stability Region (Recovery
Region)
Stability Region
Lyapunov Functions
State Constraints
21
Recovery region v.s. control loop period
Stability Index A(T): Area of the maximum stability region
• It is a function of the control loop period T. The smaller the controller loop period, the larger the maximum stability region.
Example: an inverted pendulum
0 0 1 0 0
0 0 0 1 0
0 2 7528 10 9526 0 0043 1 9432
0 28 5812 24 9179 0 0441 4 4385
x x u
( ) [5 7807, 42 2087,14 0953, 8 6016] ( )u k x k
Controller
0 0.005 0.01 0.015 0.02 0.025 0.03 0.0350.5
1
1.5
2
2.5
3
3.5
4x 10
-7
Control Loop PeriodS
tab
ility
In
de
x
The smaller the period, the larger the recovery region.
System model
ORTEGA allows larger recovery region (more flexible)
22
Implementation and evaluation
• Inverted pendulum from Quanser
• CPU: Pentium II 350MHz
• OS: Linux kernel 2.4.18-3 with RMS
• HAC: field tested state feedback controller
Evaluation of CPU savings
• If HAC and HPC both run at 50Hz, ORTEGA’s CPU saving is 29.29%
• If HAC runs at 50Hz, HPC runs at 20Hz, ORTEGA’s CPU saving is 50.87%
23
Evaluation of fault tolerance
Infinite loop bug Non-performing bug Maximum control output bug Divided by zero bug Bang-Bang type bug Positive feedback bug Tricky design bug …
24
Evaluation of fault tolerance
25
Evaluation of fault tolerance
26
Thank You Thank You
Q&A Q&A
27
Backup Slides
Simplex: software engineering economics: the more effort, the more reliable
)/exp()(
)/exp()exp()(
EkCtER
EkCtttR
Reliability:0 failure
happened during [0,
t] Failure Rate
Complexity
Effort
Simplex: comparison with N-version
23 )3/(3)3/()( ERERER
Simplex: roots in recover-block: only one version must be correct
3))3/(1(1)( ERER
Simplex: recovery-blcok: dividing into more alternatives doesn’t always gain.
Simplex: recovery-blcok: reducing complexity gains
One alternative has complexity 1, ½, and 1/10
Simplex: a two-alternatives recovery block with reduced complexity wins if a reliable acceptance test is possible.
Simplex: recovery-blcok: reducing complexity gains
RB2: 2 alter, same complexity C=1, perfect acceptance test
*RB2L5: 2 alter, C1=1, C2=1/5, imperfect acceptance test whose reliability = alter. 2
35
Schedulability analysis of ORTEGA
36
Mode-Change Problem Incurred by Recovery
Example: Suppose one plant 1p : (C1
p,T1p) = (3,5); 1
a : (C1a,T1
a) = (4,10) ;
with another real time task 2 : (C2,T2) = (6,15).
Unschedulable of tasks due to the recovery
• Before the recovery at t=10, {1p , 2 } = {(3,5), {6,15}} is schedulable;
• After the recovery transition, {1a , 2 } = {(4,10), {6,15}} is also schedulable;
• However, during the transition of recovery, 2 misses its deadline at t=15!
Mode-change in fixed priority scheduling is a well-recognized difficult problem by the real-time community
(3,5) ->(4,10)
5 10 15
(6,15)
5 10 15
20
Miss deadline!
Mode changeincurred by recovery
0
0
1p 1
a
2
37
Schedulability Analysis
Schedulability Analysis: We adopt the work by Real and Crespo (2004)
i
k
RRk
p kp
x
wi(x)
ka k
a
kp k
a
Idea: Analyze the transitional scheduling overhead incurred by the recovery.
(I) Schedulability analysis of steady state task set
(II) Schedulability analysis of old-mode tasks with transitional scheduling overhead (due to the mode change)
(III) Schedulability analysis of new-mode tasks with transitional scheduling overhead (due to the mode change)
38
Fault Tolerance and Scheduling Co-design
-- one FT-enabled task case
Maximize the recovery region subject to
schedulability constraint
Find the smallest (optimal) control loop period Tk*a, s.t. the task set is schedulable under random recoveries
Given the schedulability test, we can use binary
search algorithm to find Tk*a
39
Sampling time h,
Zero-order hold( ) ,AhF h e
0( )
h AsG h e dsB
P2: Recovery Region for Digital Controllers
Controller ( ) ( )u k Kx k
( 1) ( )x k Fx k ( )F F GK
Theorem (Lyapunov): A discrete time LTI system shown above is stable iff there exists a matrix P>0, such that
0.TPF PF
40
{ | 1}Tx x Px
Stability Region (Continued)
Stability region of the system ( 1) ( )k Fx kx
with respect to P is defined as:
Stability Region with Constraints
1 1Tia x i l State constraints
Control input constraints 1 1Tjb u j r
1 1 . (1)Tmx m … q Can be combined in the
closed loop system as
Lemma: The stability region defined above satisfy constraints (1) iff 1 1,T
m mP 1 .m … q