Application-Level Fault Tolerance for
Embedded Real-Time Systems
Israel Koren
Department of Electrical & Computer Engineering
University of Massachusetts Amherst, MA, 01003
Supported in part by DARPA, NASA/JPL and NSF
Introduction
Fault tolerance can be incorporated at two levels:
System level: encompasses all types of redundancy of system HW and SW components and recovery actions taken by the system (application independent)
Application level: encompasses redundancy and recovery actions within the application software itself
For general-purpose systems the first is preferable
For large-scale real-time applications, system-level fault tolerance alone is too expensive and may be insufficient
Massive hardware and/or software redundancy is usually too expensive for embedded systems
Recovery overhead associated with movement of large process checkpoints increases the chances of missing a deadline
UMass - Architecture and Real-Time Systems Lab
Application-Level Fault Tolerance (ALFT)
Key Idea: Exploit application semantics to implement low overhead fault tolerance
Redundancy can be tuned to the extent of fault-tolerance required - scalable fault-tolerance
Allowing more overhead for ALFT produces higher quality results
Trade off fault tolerance against computation overhead
Application-Level Fault Tolerance (ALFT) can complement existing system- or algorithm-level fault-tolerance by leveraging information available only at the application level
We have integrated our ALFT techniques with four large-scale real-time applications from Honeywell and NASA
ALFT - General Approach
• Each processor performs, in addition to its own work (P, primary), a scaled-down copy of its neighbor's work (S, secondary)
• Upon detecting a faulty neighbor, the node provides its secondary results as a substitute
[Figure: Node 1 runs P1 and S4; Node 2 runs P2 and S1; Node 3 runs P3 and S2; Node 4 runs P4 and S3]
• When recovered, the interrupted process begins calculations with data which its secondary has computed on its behalf
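The primary/secondary placement in the figure can be sketched as a ring, under the assumption that each node hosts the secondary of the node just before it. The function names below are illustrative and do not come from the slides.

```python
def secondary_assignment(num_nodes):
    """Map each node to the (primary, secondary) task indices it runs."""
    # Node i runs its own primary P_i plus the secondary of its
    # predecessor on the ring, so Node 1 runs S4 when there are 4 nodes.
    return {
        node: (node, (node - 2) % num_nodes + 1)
        for node in range(1, num_nodes + 1)
    }


def substitute_for(faulty_node, num_nodes):
    """Return the node whose secondary covers the faulty node's work."""
    # S_k runs on the successor of node k on the ring.
    return faulty_node % num_nodes + 1


assignment = secondary_assignment(4)
for node, (primary, secondary) in assignment.items():
    print(f"Node {node}: P{primary} S{secondary}")
print("If node 3 fails, results come from node", substitute_for(3, 4))
```

With four nodes this reproduces the pairing shown in the figure, and a failure of node 3 is covered by node 4, which runs S3.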
Issues to be resolved
How to scale down the secondary? Precision vs. overhead
Should we always run the secondaries?
The answers are application dependent
Benchmark Applications
Real-Time applications used for benchmarking:
Applications from Honeywell
RTHT (real-time hypothesis tracking)
ABF (adaptive beam forming)
Applications from NASA’s REE suite
OTIS (orbital thermal imaging spectrometer)
NGST (next generation space telescope)
The RTHT Application
Real-Time Hypothesis Tracking: tracks objects moving about in a 2-D coordinate plane (using data from radar), to distinguish between real targets and noise clutter
RTHT Processes
Each process tracks targets through the creation and extension of hypotheses which include a figure of likelihood
When a target object makes it through more and more consecutive frames, its hypothesized track becomes more likely to be real
RTHT with ALFT
Without the secondary a Cold-Start would be required if the node recovers but does not take part in the compilation
Secondary extends the top p% of hypotheses
RTHT Results
30 real targets, 80 false alarms and two application processes
A single fault, lasting one frame, occurs at Frame No. 15
With a redundancy of just 15%, we can track all the real targets, despite the faulty node
[Figure: Number of Targets Tracked]
Why only 15%?
Hypotheses are sorted in order of likelihood
The hypotheses extended by the secondary are the ones most likely to be real targets
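A minimal sketch of how a secondary might pick the top p% of hypotheses once they are sorted by likelihood. The dictionary layout, the `likelihood` field and the function name are assumptions for illustration, not the RTHT data structures.

```python
import math


def select_for_secondary(hypotheses, percent):
    """Return the top `percent` of hypotheses, ranked by likelihood."""
    ranked = sorted(hypotheses, key=lambda h: h["likelihood"], reverse=True)
    # Round up so that a non-empty input always yields at least one entry.
    keep = math.ceil(len(ranked) * percent / 100)
    return ranked[:keep]


# Hypothetical likelihood values for seven track hypotheses.
likelihoods = [0.10, 0.95, 0.40, 0.80, 0.05, 0.60, 0.20]
hypotheses = [
    {"track_id": i, "likelihood": value} for i, value in enumerate(likelihoods)
]

chosen = select_for_secondary(hypotheses, percent=15)
print([h["track_id"] for h in chosen])
```

Because the list is ranked first, a small percentage still captures the hypotheses most likely to correspond to real targets.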
Secondary time overhead
An even smaller computational load is imposed by the secondary
The extension of hypotheses that are most likely to be real takes less time
[Figure: Ratio of Secondary Execution Time to Primary versus Percentage of Secondary Overlap]
The ABF Application
The Adaptive Beam Forming Application detects sound as it impinges on a linear array of sonar sensors
[Figure: Plane wave arriving at a linear array of sonar sensors]
ABF Processes
Each process works on a distinct subset of the frequency range, and dynamically updates a set of weights every frame
A beam that emphasizes the sound coming from each direction is formed using these weights
[Figure: Magnitude (dB) versus direction (angle) of arrival (degrees)]
ABF with ALFT
Two methods of secondary reduction:
Limited Field of View: search only in certain directions (windows)
Reduced Granularity: search full field at lower granularity
A blend of the two methods
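The two reduction methods and their blend can be sketched as filters over the set of candidate directions. The one-degree grid, the window limits and the stride below are hypothetical values chosen for illustration, not settings from the ABF benchmark.

```python
def limited_fov(directions, windows):
    """Keep only directions that fall inside one of the (low, high) windows."""
    return [d for d in directions if any(lo <= d <= hi for lo, hi in windows)]


def reduced_granularity(directions, stride):
    """Keep every `stride`-th direction across the whole field."""
    return directions[::stride]


def combined(directions, windows, stride):
    """Blend both methods: restrict to the windows, then coarsen."""
    return reduced_granularity(limited_fov(directions, windows), stride)


# Hypothetical search grid: one candidate direction per degree.
full_field = list(range(-90, 91))
windows = [(-30, -10), (20, 40)]

print(len(full_field))                          # directions searched by the primary
print(len(limited_fov(full_field, windows)))    # limited field of view
print(len(reduced_granularity(full_field, 2)))  # half granularity
print(len(combined(full_field, windows, 2)))    # blend of the two
```

The secondary's workload shrinks with the number of directions it searches, which is how the blend can end up cheaper than either method alone.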
[Figure: Example Output: Combined Techniques; Magnitude (dB) versus Direction of Arrival (Angle) - Degrees]
ABF Results
Four beams of sound at 32 frequency ranges
Two application processes
A single node failure in Frame 20
Table shows minimum redundancy required to not lose track of any beam
Combining the two techniques reduces the computational overhead, while maintaining similar results
Redundancy Technique                      Secondary Overlap   Computational Overhead
Reduced Granularity                       33%                 35%
Limited FOV                               30%                 30%
Combined - 30% FOV and 50% Granularity    15%                 17%
ABF - Secondary Overhead
The computational load curves are linear (unlike RTHT) due to uniform dataset priority
Still, a reasonably small amount of extra computation is necessary to mask the fault
[Figure: Ratio of Sec. Execution Time to Primary versus Percentage of Secondary Overlap]
Adding Fault Detection
Faults do not always completely disable a node
Malformed and corrupted data are more likely
Hardware-disabling faults are easy to detect with watchdog hardware and “I am alive” messages
Faulty data is difficult to detect without application syntax
Fault detection is a necessary condition for ALFT to schedule which secondary tasks to run
Adding fault detection: employ acceptance filters to validate the primary’s output
Secondary tasks can provide verification for ambiguous (possibly faulty) data
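As a rough illustration of the "I am alive" check mentioned above, the sketch below flags nodes whose last heartbeat is older than a timeout. The node names, timestamps and timeout are hypothetical, and a real system would rely on watchdog hardware rather than this simplified bookkeeping.

```python
def silent_nodes(last_heartbeat, now, timeout):
    """Return the nodes whose last "I am alive" message is older than timeout."""
    return sorted(
        node for node, seen in last_heartbeat.items() if now - seen > timeout
    )


# Hypothetical timestamps (seconds) of the most recent heartbeat per node.
last_heartbeat = {"node1": 10.0, "node2": 9.2, "node3": 4.5}
print(silent_nodes(last_heartbeat, now=10.5, timeout=2.0))
```

This only covers faults that silence a node; corrupted output from a node that is still running needs the acceptance filters described on the following slides.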
Validation Through Secondaries
The “better” data is chosen according to the following logic grid:
Rows: Secondary result; columns: Primary result (figure label: Run Secondary)
Secondary \ Primary   Faultless   Ambiguous   Faulty
Faultless             Primary     Secondary   Secondary
Ambiguous             Primary     Primary     Secondary
Faulty                Primary     Primary     Primary*
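One reading of the logic grid is that the secondary's result is used only when it is strictly better than the primary's, with the rows of the grid taken as the secondary's status and the columns as the primary's. The sketch below encodes that reading; the status names and the function are assumptions for illustration.

```python
FAULTLESS, AMBIGUOUS, FAULTY = "faultless", "ambiguous", "faulty"

# Lower rank means more trustworthy data.
RANK = {FAULTLESS: 0, AMBIGUOUS: 1, FAULTY: 2}


def choose_output(primary_status, secondary_status=None):
    """Pick which result to use, following one reading of the logic grid."""
    # A faultless primary is always used; no secondary result is needed.
    if primary_status == FAULTLESS or secondary_status is None:
        return "primary"
    # Use the secondary only when it is strictly better than the primary.
    if RANK[secondary_status] < RANK[primary_status]:
        return "secondary"
    # This includes the case where both are faulty (marked Primary* in the
    # grid; the slides do not say what the asterisk denotes).
    return "primary"


for primary in (FAULTLESS, AMBIGUOUS, FAULTY):
    for secondary in (FAULTLESS, AMBIGUOUS, FAULTY):
        print(f"{primary:10} {secondary:10} -> {choose_output(primary, secondary)}")
```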
Acceptance Filters
Faults are detected by passing results through one or more acceptance filters
Filters are unique to applications with certain data characteristics
Value bound tests are applicable to most applications
Sanity check tests require knowledge of the expected output behavior and format
[Figure: Results from Primary pass through Filter 1 and Filter 2; the outcomes are labeled Pass (Data is OK) and Fail (Secondary Task Queue)]
OTIS Characteristics
ALFTD was applied to OTIS (Orbital Thermal Imaging Spectrometer) - part of the REE suite
OTIS reads radiation values from various bands and calculates temperature data
Useful characteristics of OTIS’ output (temperature)
Local Correlation: Data changes gradually over an area
Absolute Bounds: Data falls within some expected realistic range
ALFTD Filters for OTIS
Local Correlation and Absolute Bounds on the data led to the creation of two filters:
Spatial Locality Filter: If the difference between pixel (x,y) and (x-1,y) is greater than some threshold, the pixel may be the result of faulty data
Absolute Bounds Filter: Any pixel whose value does not fall between a lower bound and an upper bound may be the result of faulty data
The filter thresholds (the spatial difference threshold and the lower and upper bounds) are set based on sample datasets
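The two filters can be sketched on a small temperature grid as follows. The image values, the threshold `max_step` and the bounds `low` and `high` are hypothetical, and the row-major `image[y][x]` layout is an assumption; this is not the ALFTD implementation.

```python
def spatial_locality_flags(image, max_step):
    """Flag pixel (x, y) if it differs from (x-1, y) by more than max_step."""
    flags = [[False] * len(row) for row in image]
    for y, row in enumerate(image):
        for x in range(1, len(row)):
            if abs(row[x] - row[x - 1]) > max_step:
                flags[y][x] = True
    return flags


def absolute_bounds_flags(image, low, high):
    """Flag every pixel whose value is not strictly between low and high."""
    return [[not (low < value < high) for value in row] for row in image]


def suspect_pixels(image, max_step, low, high):
    """Return (x, y) coordinates flagged by at least one filter."""
    spatial = spatial_locality_flags(image, max_step)
    bounds = absolute_bounds_flags(image, low, high)
    return [
        (x, y)
        for y, row in enumerate(image)
        for x in range(len(row))
        if spatial[y][x] or bounds[y][x]
    ]


# Hypothetical temperature grid with one offset pixel and one zeroed pixel.
image = [
    [300.0, 301.0, 302.0, 301.5],
    [300.5, 340.0, 301.0, 302.0],
    [299.0, 300.0, 0.0, 301.0],
]
print(suspect_pixels(image, max_step=10.0, low=250.0, high=350.0))
```

Note that the spatial filter also flags the good pixel to the right of a bad one, since it only compares each pixel with its left neighbor; this is one source of false alarms in such a simple test.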
OTIS Datasets
[Figure: Fault-free and faulty outputs for the “Blob”, “Stripe” and “Spots” datasets]
Filter Calibration
ALFTD filters require calibration
Higher detection probability with low rate of false alarms can be achieved with well-tuned filters
Calibration should be based on characteristics of the most frequent data
Frequency Plots (Bounds Filter)
Frequency of temperature values
Frequency Plots (Spatial Locality Filter)
Frequency of differences between adjacent pixels
Fault Injection
To test the detection capability we compared the fault-free output to an erroneous output generated using fault injection
Faults produce different kinds and intensities of errors
Intensely faulty data (set-to-zero errors, memory gibberish) is easily detected, as it seldom falls inside the prescribed filters
“Lightly” faulty data will not be detected but is negligible
Our experiments include moderately faulty data: offsets in value of up to 30%
These faults tend to blend in with non-faulty data, making them especially hard to detect
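A minimal sketch of injecting moderate faults of this kind, assuming the offset is applied relative to each value and that faults hit values independently with a fixed probability. The function name, the `fault_rate` parameter and the seed are illustrative assumptions, not the injection tool used in the experiments.

```python
import random


def inject_offset_faults(values, fault_rate, max_offset=0.30, seed=0):
    """Return a corrupted copy of values and the indices that were changed."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    faulty = []
    corrupted = []
    for index, value in enumerate(values):
        if rng.random() < fault_rate:
            # Shift the value by a relative offset of at most max_offset.
            factor = 1.0 + rng.uniform(-max_offset, max_offset)
            faulty.append(value * factor)
            corrupted.append(index)
        else:
            faulty.append(value)
    return faulty, corrupted


clean = [300.0] * 10
faulty, corrupted = inject_offset_faults(clean, fault_rate=0.3)
print(corrupted)
print([round(v, 1) for v in faulty])
```

Comparing `faulty` against `clean` gives the ground truth needed to count detections and false alarms for a given filter setting.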
Filter Adjustment
Filters can be adjusted in steps
A single filter has a high (“right”) and a low (“left”) cutoff
The “left” and “right” bounds of data are usually exclusive, therefore their detections act cumulatively
For each filter there is a tradeoff between the desired fault detection rate and the number of false alarms
Multiple filters are independently calibrated
Multiple filters may detect more faults than a single filter and have a lower false alarm rate
However, the subsets of faults detected will not necessarily be disjoint
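The effect of overlapping detections can be written out with inclusion-exclusion. Here $D_1$ and $D_2$ are assumed symbols for the sets of faults detected by two filters, and a fault is assumed to count as detected when either filter flags it.

```latex
\[
  |D_1 \cup D_2| = |D_1| + |D_2| - |D_1 \cap D_2|
\]
\[
  \max\bigl(|D_1|, |D_2|\bigr) \;\le\; |D_1 \cup D_2| \;\le\; |D_1| + |D_2|
\]
```

The upper bound is reached only when the two detected subsets are disjoint, so the gain from adding a second filter depends on how many faults it catches that the first one misses.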
Detection Plots (Single Side)
Fault detections and false alarms for the left cutoff (“Blob”)
Detection Plots (Both Sides)
Overlaying the left and right filter cutoff plots shows that the impacts of the right and left cutoff values are asymmetric (“Blob”)
Fault Detections, Numerically
Columns = left cutoff, Rows = right cutoff
This table is used to find the possible configurations that satisfy a minimum required fault detection rate (80%)
Right \ Left   300     304     306     310     314     318
315            98.9%   99.1%   99.2%   99.2%   99.4%   -
317            96.6%   96.8%   96.8%   96.9%   97.1%   -
319            93.8%   93.9%   94.0%   94.0%   94.3%   98.5%
321            91.0%   91.1%   91.2%   91.3%   91.5%   95.7%
323            88.2%   88.3%   88.4%   88.4%   88.7%   92.9%
325            83.6%   83.7%   83.8%   83.9%   84.1%   88.3%
327            78.5%   78.7%   78.8%   78.8%   79.0%   83.3%
329            71.2%   71.4%   71.5%   71.5%   71.7%   76.0%
331            64.0%   64.2%   64.3%   64.3%   64.5%   68.8%
333            61.4%   61.5%   61.6%   61.7%   61.9%   66.1%
335            60.9%   61.0%   61.1%   61.2%   61.4%   65.6%
337            60.2%   60.4%   60.4%   60.5%   60.7%   64.9%
339            59.2%   59.4%   59.5%   59.5%   59.7%   64.0%
Bounds Filter: Fault Detections
False Alarms, Numerically
Columns = left cutoff, Rows = right cutoff
Of the possible combinations chosen from the previous table, choose the one with the minimum number of false alarms
Right \ Left   300     304     306     310     314     318
315            92.4%   92.4%   92.4%   92.4%   96.3%   -
317            84.7%   84.7%   84.7%   84.7%   88.7%   -
319            78.6%   78.6%   78.6%   78.6%   82.5%   97.1%
321            72.5%   72.5%   72.5%   72.5%   76.5%   91.0%
323            64.8%   64.8%   64.8%   64.8%   68.7%   83.2%
325            54.1%   54.1%   54.1%   54.1%   58.1%   72.6%
327            41.2%   41.2%   41.2%   41.2%   45.2%   59.7%
329            23.9%   23.9%   23.9%   23.9%   27.8%   42.3%
331            5.0%    5.0%    5.0%    5.0%    9.0%    23.5%
333            0.0%    0.0%    0.0%    0.0%    4.0%    18.5%
335            0.0%    0.0%    0.0%    0.0%    3.9%    18.4%
337            0.0%    0.0%    0.0%    0.0%    3.9%    18.4%
339            0.0%    0.0%    0.0%    0.0%    3.9%    18.4%
Bounds Filter: False Alarms
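The two-step selection (keep the configurations that meet the minimum detection rate, then take the one with the fewest false alarms) can be sketched directly on the two tables. The values are copied from the slides; the tie-break in favor of the higher detection rate is an added assumption, as are all names in the code.

```python
LEFT_CUTOFFS = [300, 304, 306, 310, 314, 318]

# Percentages per right cutoff; None marks the "-" cells.
DETECTION = {
    315: [98.9, 99.1, 99.2, 99.2, 99.4, None],
    317: [96.6, 96.8, 96.8, 96.9, 97.1, None],
    319: [93.8, 93.9, 94.0, 94.0, 94.3, 98.5],
    321: [91.0, 91.1, 91.2, 91.3, 91.5, 95.7],
    323: [88.2, 88.3, 88.4, 88.4, 88.7, 92.9],
    325: [83.6, 83.7, 83.8, 83.9, 84.1, 88.3],
    327: [78.5, 78.7, 78.8, 78.8, 79.0, 83.3],
    329: [71.2, 71.4, 71.5, 71.5, 71.7, 76.0],
    331: [64.0, 64.2, 64.3, 64.3, 64.5, 68.8],
    333: [61.4, 61.5, 61.6, 61.7, 61.9, 66.1],
    335: [60.9, 61.0, 61.1, 61.2, 61.4, 65.6],
    337: [60.2, 60.4, 60.4, 60.5, 60.7, 64.9],
    339: [59.2, 59.4, 59.5, 59.5, 59.7, 64.0],
}
FALSE_ALARMS = {
    315: [92.4, 92.4, 92.4, 92.4, 96.3, None],
    317: [84.7, 84.7, 84.7, 84.7, 88.7, None],
    319: [78.6, 78.6, 78.6, 78.6, 82.5, 97.1],
    321: [72.5, 72.5, 72.5, 72.5, 76.5, 91.0],
    323: [64.8, 64.8, 64.8, 64.8, 68.7, 83.2],
    325: [54.1, 54.1, 54.1, 54.1, 58.1, 72.6],
    327: [41.2, 41.2, 41.2, 41.2, 45.2, 59.7],
    329: [23.9, 23.9, 23.9, 23.9, 27.8, 42.3],
    331: [5.0, 5.0, 5.0, 5.0, 9.0, 23.5],
    333: [0.0, 0.0, 0.0, 0.0, 4.0, 18.5],
    335: [0.0, 0.0, 0.0, 0.0, 3.9, 18.4],
    337: [0.0, 0.0, 0.0, 0.0, 3.9, 18.4],
    339: [0.0, 0.0, 0.0, 0.0, 3.9, 18.4],
}


def best_configuration(min_detection):
    """Pick the cutoffs with the fewest false alarms among valid ones."""
    candidates = []
    for right, detections in DETECTION.items():
        cells = zip(LEFT_CUTOFFS, detections, FALSE_ALARMS[right])
        for left, detected, alarms in cells:
            if detected is None or alarms is None:
                continue
            if detected >= min_detection:
                # Sort key: fewest false alarms first, then highest detection.
                candidates.append((alarms, -detected, left, right))
    if not candidates:
        return None
    alarms, negated_detection, left, right = min(candidates)
    return {"left": left, "right": right,
            "detection": -negated_detection, "false_alarms": alarms}


print(best_configuration(min_detection=80.0))
```

For the 80% requirement this picks a right cutoff of 325, where four left cutoffs tie at 54.1% false alarms and the tie-break selects 310 (83.9% detection).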
Multiple Filters
By combining multiple filters, fault detection is improved
Fault Detection (rows: Spatial Locality filter; columns: Bounds filter)
Spatial \ Bounds   60.0%   70.0%   90.0%
40.0%              63.7%   71.9%   89.6%
50.0%              64.0%   72.1%   89.7%
60.0%              67.5%   72.7%   90.2%
70.0%              76.3%   80.1%   94.2%
80.0%              84.1%   87.4%   96.8%
90.0%              93.0%   94.3%   98.7%
False Alarms (rows: Spatial Locality filter; columns: Bounds filter)
Spatial \ Bounds   60.0%   70.0%   90.0%
40.0%              15.7%   22.6%   76.0%
50.0%              15.7%   22.6%   76.0%
60.0%              15.7%   22.6%   76.0%
70.0%              36.3%   42.2%   84.2%
80.0%              59.9%   64.6%   90.5%
90.0%              77.1%   79.0%   94.5%
A false alarm causes the secondary to run unnecessarily
ALFTD-corrected output (“Blob”)
[Figure: Fault-Free Output, Faulty Output, and ALFTD-corrected Output at 25%, 33% and 50% Overhead]
Difference Plots (“Blob”)
[Figure: difference plots for the Faulty, 25% Overhead, 33% Overhead and 50% Overhead outputs; color scale from No Error to Max Error]
Faulty output versus fault-free output
Conclusions
A high degree of fault tolerance at a minimal investment of system resources
Particularly useful in applications exhibiting data parallelism and some level of data redundancy or correlation
Scalable fault-tolerance
Attractive alternative to more expensive schemes such as hardware and/or software redundancy
Can complement system-level fault tolerance schemes
References
J. Haines, V.R. Lakamraju, I. Koren and C.M. Krishna, “Development of Application-Level Fault Tolerance in a Real-Time Benchmark,” Proc. of EFTS'98, IEEE Workshop on Embedded Fault-Tolerant Systems, May 1998.
J. Haines, V.R. Lakamraju, I. Koren and C.M. Krishna, “Application-Level Fault Tolerance as a Complement to System-Level Fault Tolerance,” The Journal of Supercomputing, Special Issue on “Embedded Fault-Tolerant Computing Systems,” Vol. 16, pp. 53-68, Kluwer Academic Publishers, MA, 2000.
E. Ciocca, I. Koren and C.M. Krishna, “Determining Acceptance Tests for Application-Level Fault Detection,” Proc. of the 2nd ASPLOS Workshop on Evaluating and Architecting System Dependability, pp. 47-53, Oct. 2002.
Thank You!
C.M. Krishna
Vijay Lakamraju
Josh Haines
Eric Ciocca
Further Extension (Input Errors)
Real-time applications exposed to extreme environments can be affected by charged particles like alpha/cosmic rays
High likelihood of input data faults manifesting as bit flips
Re-running the process or its secondary is useless as the input remains the same
Input data should be preprocessed to detect input errors and attempt to correct them
We have integrated preprocessing of input data in two NASA applications - OTIS and NGST
Next Generation Space Telescope
Multiple readouts during each period
Use this redundancy to identify and recover from input data bit errors
Algorithms like optimal median smoothing and sliding-window bit majority smoothing can be used
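To make the idea concrete, the sketch below shows a plain sliding-window median and a sliding-window bit-majority vote over repeated integer readouts. These are simplified stand-ins under stated assumptions (integer readouts of a slowly changing quantity, a window of three, 16-bit words); they are not the optimal median smoothing or the exact algorithms applied to NGST.

```python
from statistics import median_low


def median_smooth(readouts, window=3):
    """Replace each readout with the (low) median of its neighborhood."""
    half = window // 2
    smoothed = []
    for i in range(len(readouts)):
        neighbors = readouts[max(0, i - half): i + half + 1]
        smoothed.append(median_low(neighbors))
    return smoothed


def bit_majority_smooth(readouts, window=3, bits=16):
    """Vote on every bit position across neighboring readouts."""
    half = window // 2
    smoothed = []
    for i in range(len(readouts)):
        neighbors = readouts[max(0, i - half): i + half + 1]
        value = 0
        for bit in range(bits):
            ones = sum((r >> bit) & 1 for r in neighbors)
            own_bit = (readouts[i] >> bit) & 1
            # Majority wins; on a tie (even-sized edge window) keep own bit.
            if 2 * ones > len(neighbors) or (2 * ones == len(neighbors) and own_bit):
                value |= 1 << bit
        smoothed.append(value)
    return smoothed


# Five readouts of the same quantity; bit 12 of the third one is flipped.
readouts = [1000, 1000, 1000 ^ (1 << 12), 1000, 1000]
print(median_smooth(readouts))
print(bit_majority_smooth(readouts))
```

Both variants recover the original value here because only one readout in each window is corrupted; they would fail if a majority of the readouts in a window had the same bit flipped.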
[Figure: Ground Station and Space Station]
NGST - Results
[Figure: Relative Error (entire data set) versus Probability of a data bit flip]
Results for OTIS
Data redundancy in OTIS: multiple radiation mappings – one for each wavelength out of 128
Thermal data exhibits strong spatial locality and tight natural bounds, which can also be exploited by the preprocessing
[Figure: Relative Error versus Probability of a data bit flip]