Application-Level Fault Tolerance for Embedded Real-Time Systems

Israel Koren

Department of Electrical & Computer Engineering

University of Massachusetts Amherst, MA, 01003

Supported in part by DARPA, NASA/JPL and NSF

Introduction

Fault tolerance can be incorporated at two levels:

System level: encompasses all types of redundancy of system HW and SW components and recovery actions taken by the system (application independent)

Application level: encompasses redundancy and recovery actions within the application software itself

For general-purpose systems the first is preferable

For large-scale real-time applications, system-level fault tolerance alone is too expensive and may be insufficient

Massive hardware and/or software redundancy is usually too expensive for embedded systems

Recovery overhead associated with movement of large process checkpoints increases the chances of missing a deadline

UMass - Architecture and Real-Time Systems Lab


Application-Level Fault Tolerance (ALFT)

Key Idea: Exploit application semantics to implement low overhead fault tolerance

Redundancy can be tuned to the extent of fault-tolerance required - scalable fault-tolerance

Allowing more overhead for ALFT produces higher quality results

Trade off fault tolerance against computation overhead

Application-Level Fault Tolerance (ALFT) can complement existing system- or algorithm-level fault tolerance by leveraging information available only at the application level

We have integrated our ALFT techniques with four large-scale real-time applications from Honeywell and NASA



ALFT - General Approach


• Each processor performs, in addition to its own work (P, primary), a scaled-down copy of its neighbor's work (S, secondary)

• Upon detecting a faulty neighbor, the node provides its secondary results as substitution

[Figure: four nodes, with a fault marked. Node 1 runs P1 and S4, Node 2 runs P2 and S1, Node 3 runs P3 and S2, Node 4 runs P4 and S3]

• When recovered, the interrupted process begins calculations with data which its secondary has computed on its behalf
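The primary/secondary arrangement can be sketched in a few lines. This is a minimal illustration, assuming nodes are indexed 0..n-1 and that each node's secondary shadows its predecessor in the ring (mirroring P2/S1, P3/S2, P4/S3, P1/S4); the function names are hypothetical and not taken from the original work.

```python
def secondary_target(node, n):
    """Index of the neighbor whose work `node` shadows (its ring predecessor)."""
    return (node - 1) % n


def backup_for(node, n):
    """Index of the node that runs the secondary copy of `node`'s work."""
    return (node + 1) % n


def collect_results(primary, secondary, faulty):
    """Merge per-node outputs, substituting secondary results for faulty nodes.

    primary[i]   -- full-quality result of node i's own work
    secondary[i] -- scaled-down result node i computed for its predecessor
    faulty       -- set of node indices detected as faulty
    """
    n = len(primary)
    merged = []
    for i in range(n):
        if i not in faulty:
            merged.append(primary[i])
        elif backup_for(i, n) not in faulty:
            merged.append(secondary[backup_for(i, n)])
        else:
            merged.append(None)  # both copies lost; this scheme cannot mask it
    return merged
```

With four nodes and a fault on the second one, the merged output uses the scaled-down result held by the third node in place of the missing primary.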

Issues to be resolved

How to scale down the secondary?

Precision vs. overhead

Should we always run the secondaries?

The answers are application dependent


Benchmark Applications

Real-Time applications used for benchmarking:

Applications from Honeywell:

RTHT (real-time hypothesis tracking)

ABF (adaptive beam forming)

Applications from NASA’s REE suite:

OTIS (orbital thermal imaging spectrometer)

NGST (next generation space telescope)



The RTHT Application

Real-Time Hypothesis Tracking: tracks objects moving about in a 2-D coordinate plane (using data from radar), to distinguish between real targets and noise clutter


RTHT Processes

Each process tracks targets through the creation and extension of hypotheses which include a figure of likelihood

When a target object makes it through more and more consecutive frames, its hypothesized track becomes more likely to be real


RTHT with ALFT


Without the secondary, a cold start would be required if the node recovers but does not take part in the compilation

Secondary extends the top p% of hypotheses
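Selecting the hypotheses that the secondary extends can be sketched as follows. This is an illustrative sketch only, assuming each hypothesis is a `(track_id, likelihood)` pair; the function name and the data layout are hypothetical.

```python
import math


def select_for_secondary(hypotheses, p):
    """Return the top p% of hypotheses ranked by likelihood (highest first).

    hypotheses -- list of (track_id, likelihood) pairs
    p          -- percentage of hypotheses the secondary extends, 0..100
    """
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    count = math.ceil(len(ranked) * p / 100)
    return ranked[:count]
```

Rounding up means any non-zero p keeps at least one hypothesis; that is a design choice of this sketch, not something stated in the slides.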

RTHT Results

30 real targets, 80 false alarms and two application processes

A single fault, lasting one frame, occurs at Frame No. 15

With a redundancy of just 15%, we can track all the real targets, despite the faulty node

[Figure: plot of the number of targets tracked]

Why only 15%?

Hypotheses are sorted in order of likelihood

The hypotheses extended by the secondary are the ones most likely to be real targets


Secondary time overhead

An even smaller computational load is imposed by the secondary

The extension of hypotheses that are most likely to be real takes less time

[Figure: ratio of secondary execution time to primary execution time versus percentage of secondary overlap]

The ABF Application

The Adaptive Beam Forming Application detects sound as it impinges on a linear array of sonar sensors


[Figure: plane wave arriving at a linear array of sonar sensors]

ABF Processes

Each process works on a distinct subset of the frequency range, and dynamically updates a set of weights every frame

A beam that emphasizes the sound coming from each direction is formed using these weights


[Figure: magnitude (dB) versus direction (angle) of arrival (degrees)]

ABF with ALFT

Two methods of secondary reduction:

Limited Field of View: search only in certain directions (windows)

Reduced Granularity: search full field at lower granularity

A blend of the two methods

[Figure: Example Output: Combined Techniques. Magnitude (dB) versus direction of arrival (angle, degrees)]
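The two reduction methods and their blend can be illustrated on a list of candidate directions. This is a sketch under stated assumptions: directions are integer angles in degrees, a window is a `(low, high)` pair, and all function names are hypothetical rather than taken from the ABF code.

```python
def limited_fov(angles, windows):
    """Keep only directions that fall inside one of the (low, high) windows."""
    return [a for a in angles if any(lo <= a <= hi for lo, hi in windows)]


def reduced_granularity(angles, step):
    """Keep every `step`-th direction across the full field."""
    return angles[::step]


def combined(angles, windows, step):
    """Search only inside the windows, and at coarser granularity there."""
    return reduced_granularity(limited_fov(angles, windows), step)
```

For example, with 180 one-degree directions, windows covering 54 of them (30%) and `step=2` (50% granularity), the combined selection searches 27 directions, that is 0.30 × 0.50 = 15% of the full field.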

ABF Results

Four beams of sound at 32 frequency ranges

Two application processes

A single node failure in Frame 20

The table shows the minimum redundancy required to not lose track of any beam

Combining the two techniques reduces the computational overhead, while maintaining similar results

Redundancy Technique                      Secondary Overlap   Computational Overhead
Reduced Granularity                       33%                 35%
Limited FOV                               30%                 30%
Combined - 30% FOV and 50% Granularity    15%                 17%


ABF - Secondary Overhead

The computational load curves are linear (unlike RTHT) due to uniform dataset priority

Still, a reasonably small amount of extra computation is necessary to mask the fault


[Figure: ratio of secondary execution time to primary execution time versus percentage of secondary overlap]

Adding Fault Detection

Faults do not always completely disable a node

Malformed and corrupted data are more likely

Hardware-disabling faults are easy to detect with watchdog hardware and “I am alive” messages

Faulty data is difficult to detect without application syntax

Fault detection is a necessary condition for ALFT to schedule which secondary tasks to run

Adding fault detection: employ acceptance filters to validate the primary’s output

Secondary tasks can provide verification for ambiguous (possibly faulty) data



Validation Through Secondaries

The “better” data is chosen according to the following logic grid:

Run Secondary

                        Primary: Faultless   Primary: Ambiguous   Primary: Faulty
Secondary: Faultless    Primary              Secondary            Secondary
Secondary: Ambiguous    Primary              Primary              Secondary
Secondary: Faulty       Primary              Primary              Primary*
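The logic grid reduces to a single rule: use the secondary's result only when its status is strictly better than the primary's. The sketch below encodes that reading; the status constants and the function name are hypothetical, and the case marked with an asterisk in the grid (both results faulty) is simply returned as "primary" because the slides do not say what the asterisk denotes.

```python
FAULTLESS, AMBIGUOUS, FAULTY = 0, 1, 2  # ordered from best to worst


def choose_output(primary_status, secondary_status):
    """Return which result to use: the secondary only when strictly better."""
    if secondary_status < primary_status:
        return "secondary"
    return "primary"
```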



Acceptance Filters

Faults are detected by passing results through one or more acceptance filters

Filters are unique to applications with certain data characteristics

Value bound tests are applicable to most applications

Sanity check tests require knowledge of the expected output behavior and format

[Figure: flowchart with Results from Primary, Filter 1, Filter 2, a Pass path leading to "Data is OK", and a Fail path leading to the Secondary Task Queue]
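A chain of acceptance filters feeding a secondary task queue might look like the sketch below. It assumes each filter is a function returning True when a result is acceptable and that results arrive as `(task_id, value)` pairs; these names and the queue representation are illustrative assumptions, not the authors' implementation.

```python
from collections import deque


def run_filters(result, filters):
    """Return True if `result` passes every acceptance filter, in order."""
    return all(f(result) for f in filters)


def triage(results, filters):
    """Split primary results into accepted data and a queue of secondary tasks."""
    accepted = []
    secondary_queue = deque()
    for task_id, result in results:
        if run_filters(result, filters):
            accepted.append((task_id, result))
        else:
            secondary_queue.append(task_id)  # schedule the secondary for this task
    return accepted, secondary_queue
```

Because `all` short-circuits, a result that fails the first filter is never passed to the second one.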



OTIS Characteristics

ALFTD was applied to OTIS (Orbital Thermal Imaging Spectrometer) - part of the REE suite

OTIS reads radiation values from various bands and calculates temperature data

Useful characteristics of OTIS’ output (temperature):

Local Correlation: data changes gradually over an area

Absolute Bounds: data falls within some expected realistic range



ALFTD Filters for OTIS

Local Correlation and Absolute Bounds on the data led to the creation of two filters:

Spatial Locality Filter: If the difference between pixel (x,y) and pixel (x-1,y) is greater than some threshold, the pixel may be the result of faulty data

Absolute Bounds Filter: Any pixel whose value does not fall between a lower bound and an upper bound may be the result of faulty data

The filter thresholds (the locality threshold and the lower and upper bounds) are set based on sample datasets
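Both filters can be sketched directly from their definitions. The code below assumes the image is a list of rows indexed as `image[y][x]`; the parameter names `low`, `high` and `threshold` stand in for the calibrated cutoffs and are hypothetical, as are any numeric values used with them.

```python
def bounds_filter(image, low, high):
    """Flag pixels whose value is not strictly between `low` and `high`."""
    return [[not (low < v < high) for v in row] for row in image]


def spatial_locality_filter(image, threshold):
    """Flag pixel (x, y) when it differs from pixel (x-1, y) by more than `threshold`.

    The first pixel of each row has no left neighbor and is never flagged.
    """
    return [
        [x > 0 and abs(row[x] - row[x - 1]) > threshold for x in range(len(row))]
        for row in image
    ]
```

Note that a single corrupted pixel usually trips the spatial locality filter twice in this sketch: once at the corrupted pixel and once at its right-hand neighbor, since both differences are large.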


OTIS Datasets

[Figure: the “Blob”, “Stripe” and “Spots” datasets, each shown as faulty and fault-free output]


Filter Calibration

ALFTD filters require calibration

Higher detection probability with a low rate of false alarms can be achieved with well-tuned filters

Calibration should be based on characteristics of the most frequent data



Frequency Plots (Bounds Filter)

Frequency of temperature values


Frequency Plots (Spatial Locality Filter)

Frequency of differences between adjacent pixels

Fault Injection

To test the detection capability we compared the fault-free output to an erroneous output generated using fault injection

Faults produce different kinds and intensities of errors

Intensely faulty data (set-to-zero errors, memory gibberish) is easily detected, as it seldom falls inside the prescribed filters

“Lightly” faulty data will not be detected but is negligible

Our experiments include moderately faulty data: offsets in value of up to 30%

These faults tend to blend in with non-faulty data, making them especially hard to detect
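Moderate faults of this kind can be emulated with a simple injector. The sketch below is an assumption-laden illustration: it applies a uniformly random relative offset of up to ±30% directly to output values, with a hypothetical per-pixel `fault_rate`; the slides do not describe the injection mechanism actually used.

```python
import random


def inject_offsets(image, fault_rate, max_offset=0.30, seed=None):
    """Return a corrupted copy of `image` and the set of corrupted positions.

    fault_rate -- probability that any given pixel is corrupted
    max_offset -- largest relative change applied to a corrupted pixel
    """
    rng = random.Random(seed)
    faulty = [row[:] for row in image]  # copy rows so the input stays untouched
    hit = set()
    for y, row in enumerate(faulty):
        for x, value in enumerate(row):
            if rng.random() < fault_rate:
                row[x] = value * (1 + rng.uniform(-max_offset, max_offset))
                hit.add((x, y))
    return faulty, hit
```

Returning the set of corrupted positions makes it possible to count detections and false alarms when the filters are later run on the corrupted copy.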


Filter Adjustment

Filters can be adjusted in steps

A single filter has a high (“right”) and a low (“left”) cutoff

The “left” and “right” bounds of data are usually exclusive, therefore their detections act cumulatively

For each filter, there is a tradeoff between the desired fault detection rate and the number of false alarms

Multiple filters are independently calibrated

Multiple filters may detect more faults than a single filter and have a lower false alarm rate

However, the subsets of faults detected will not necessarily be disjoint



Detection Plots (Single Side)

Fault detections and false alarms for the left cutoff (“Blob”)

Detection Plots (Both Sides)

Overlaying the left and right filter cutoff plots shows that the impacts of the right and left cutoff values are asymmetric (“Blob”)


Fault Detections, Numerically

Columns = left cutoff, Rows = right cutoff

This table is used to find the possible configurations that satisfy a minimum required fault detection rate (80%)

Right \ Left   300     304     306     310     314     318
315            98.9%   99.1%   99.2%   99.2%   99.4%   -
317            96.6%   96.8%   96.8%   96.9%   97.1%   -
319            93.8%   93.9%   94.0%   94.0%   94.3%   98.5%
321            91.0%   91.1%   91.2%   91.3%   91.5%   95.7%
323            88.2%   88.3%   88.4%   88.4%   88.7%   92.9%
325            83.6%   83.7%   83.8%   83.9%   84.1%   88.3%
327            78.5%   78.7%   78.8%   78.8%   79.0%   83.3%
329            71.2%   71.4%   71.5%   71.5%   71.7%   76.0%
331            64.0%   64.2%   64.3%   64.3%   64.5%   68.8%
333            61.4%   61.5%   61.6%   61.7%   61.9%   66.1%
335            60.9%   61.0%   61.1%   61.2%   61.4%   65.6%
337            60.2%   60.4%   60.4%   60.5%   60.7%   64.9%
339            59.2%   59.4%   59.5%   59.5%   59.7%   64.0%

Bounds Filter: Fault Detections



False Alarms, Numerically

Columns = left cutoff, Rows = right cutoff

Of the possible combinations chosen from the previous table, choose the one with the minimum number of false alarms

Right \ Left   300     304     306     310     314     318
315            92.4%   92.4%   92.4%   92.4%   96.3%   -
317            84.7%   84.7%   84.7%   84.7%   88.7%   -
319            78.6%   78.6%   78.6%   78.6%   82.5%   97.1%
321            72.5%   72.5%   72.5%   72.5%   76.5%   91.0%
323            64.8%   64.8%   64.8%   64.8%   68.7%   83.2%
325            54.1%   54.1%   54.1%   54.1%   58.1%   72.6%
327            41.2%   41.2%   41.2%   41.2%   45.2%   59.7%
329            23.9%   23.9%   23.9%   23.9%   27.8%   42.3%
331            5.0%    5.0%    5.0%    5.0%    9.0%    23.5%
333            0.0%    0.0%    0.0%    0.0%    4.0%    18.5%
335            0.0%    0.0%    0.0%    0.0%    3.9%    18.4%
337            0.0%    0.0%    0.0%    0.0%    3.9%    18.4%
339            0.0%    0.0%    0.0%    0.0%    3.9%    18.4%

Bounds Filter: False Alarms
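The two-step selection (keep configurations that meet the required detection rate, then take the one with the fewest false alarms) can be sketched using the values from the two bounds-filter tables. The function name and the tie-breaking rule (prefer the higher detection rate among equal false-alarm rates) are assumptions of this sketch.

```python
LEFT_CUTOFFS = [300, 304, 306, 310, 314, 318]

# right cutoff -> percentages per left cutoff; None marks the "-" cells
DETECTION = {
    315: [98.9, 99.1, 99.2, 99.2, 99.4, None],
    317: [96.6, 96.8, 96.8, 96.9, 97.1, None],
    319: [93.8, 93.9, 94.0, 94.0, 94.3, 98.5],
    321: [91.0, 91.1, 91.2, 91.3, 91.5, 95.7],
    323: [88.2, 88.3, 88.4, 88.4, 88.7, 92.9],
    325: [83.6, 83.7, 83.8, 83.9, 84.1, 88.3],
    327: [78.5, 78.7, 78.8, 78.8, 79.0, 83.3],
    329: [71.2, 71.4, 71.5, 71.5, 71.7, 76.0],
    331: [64.0, 64.2, 64.3, 64.3, 64.5, 68.8],
    333: [61.4, 61.5, 61.6, 61.7, 61.9, 66.1],
    335: [60.9, 61.0, 61.1, 61.2, 61.4, 65.6],
    337: [60.2, 60.4, 60.4, 60.5, 60.7, 64.9],
    339: [59.2, 59.4, 59.5, 59.5, 59.7, 64.0],
}
FALSE_ALARMS = {
    315: [92.4, 92.4, 92.4, 92.4, 96.3, None],
    317: [84.7, 84.7, 84.7, 84.7, 88.7, None],
    319: [78.6, 78.6, 78.6, 78.6, 82.5, 97.1],
    321: [72.5, 72.5, 72.5, 72.5, 76.5, 91.0],
    323: [64.8, 64.8, 64.8, 64.8, 68.7, 83.2],
    325: [54.1, 54.1, 54.1, 54.1, 58.1, 72.6],
    327: [41.2, 41.2, 41.2, 41.2, 45.2, 59.7],
    329: [23.9, 23.9, 23.9, 23.9, 27.8, 42.3],
    331: [5.0, 5.0, 5.0, 5.0, 9.0, 23.5],
    333: [0.0, 0.0, 0.0, 0.0, 4.0, 18.5],
    335: [0.0, 0.0, 0.0, 0.0, 3.9, 18.4],
    337: [0.0, 0.0, 0.0, 0.0, 3.9, 18.4],
    339: [0.0, 0.0, 0.0, 0.0, 3.9, 18.4],
}


def pick_cutoffs(min_detection):
    """Return (left, right, detection, false_alarms) or None if nothing qualifies."""
    best = None
    for right, det_row in DETECTION.items():
        for i, det in enumerate(det_row):
            fa = FALSE_ALARMS[right][i]
            if det is None or fa is None or det < min_detection:
                continue
            key = (fa, -det)  # fewest false alarms first, then highest detection
            if best is None or key < best[0]:
                best = (key, (LEFT_CUTOFFS[i], right, det, fa))
    return None if best is None else best[1]
```

For the 80% requirement, `pick_cutoffs(80.0)` selects a right cutoff of 325, where the false alarm rate is 54.1% for left cutoffs 300 through 310; the tie-break then picks left cutoff 310 with 83.9% detection.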


Multiple Filters

By combining multiple filters, fault detection is improved

Rows = spatial locality filter, Columns = bounds filter

Fault Detection
          60.0%   70.0%   90.0%
40.0%     63.7%   71.9%   89.6%
50.0%     64.0%   72.1%   89.7%
60.0%     67.5%   72.7%   90.2%
70.0%     76.3%   80.1%   94.2%
80.0%     84.1%   87.4%   96.8%
90.0%     93.0%   94.3%   98.7%

False Alarms
          60.0%   70.0%   90.0%
40.0%     15.7%   22.6%   76.0%
50.0%     15.7%   22.6%   76.0%
60.0%     15.7%   22.6%   76.0%
70.0%     36.3%   42.2%   84.2%
80.0%     59.9%   64.6%   90.5%
90.0%     77.1%   79.0%   94.5%

A false alarm causes the secondary to run unnecessarily


ALFTD-corrected output (“Blob”)

[Figure: fault-free output, faulty output, and ALFTD-corrected output at 25%, 33% and 50% overhead]

Difference Plots (“Blob”)

[Figure: panels Faulty, 25% Overhead, 33% Overhead and 50% Overhead; color scale from No Error to Max Error]

Faulty output versus fault-free output



Conclusions

A high degree of fault tolerance at a minimal investment of system resources

Particularly useful in applications exhibiting data parallelism and some level of data redundancy or correlation

Scalable fault-tolerance

Attractive alternative to more expensive schemes such as hardware and/or software redundancy

Can complement system-level fault tolerance schemes

References

J. Haines, V.R. Lakamraju, I. Koren and C.M. Krishna, “Development of Application-Level Fault Tolerance in a Real-Time Benchmark,” Proc. of EFTS'98, IEEE Workshop On Embedded Fault-Tolerant Systems, May 1998.

J. Haines, V.R. Lakamraju, I. Koren and C.M. Krishna, “Application-Level Fault Tolerance as a Complement to System-Level Fault Tolerance,” The Journal of Supercomputing, Special Issue on “Embedded Fault-Tolerant Computing Systems,” Vol. 16, pp. 53-68, Kluwer Academic Publishers, MA, 2000.

E. Ciocca, I. Koren, C.M. Krishna, “Determining Acceptance Tests for Application-Level Fault Detection,” Proc. of the 2nd ASPLOS Workshop on Evaluating and Architecting System Dependability, pp. 47-53, Oct. 2002.



Thank You!

C.M. Krishna

Vijay Lakamraju

Josh Haines

Eric Ciocca


Further Extension (Input Errors)

Real-time applications exposed to extreme environments can be affected by charged particles like alpha/cosmic rays

High likelihood of input data faults manifesting as bit flips

Re-running the process or its secondary is useless as the input remains the same

Input data should be preprocessed to detect input errors and attempt to correct them

We have integrated preprocessing of input data in two NASA applications - OTIS and NGST


Next Generation Space Telescope

Multiple readouts during each period

Use this redundancy to identify and recover from input data bit errors

Algorithms like optimal median smoothing and sliding-window bit majority smoothing can be used

[Figure: Ground Station and Space Station]
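The idea of using repeated readouts to mask bit flips can be illustrated with two simple smoothers. These are generic sketches, not the algorithms used in the NGST work: the median version is a plain sliding median rather than "optimal" median smoothing, the readouts are assumed to be unsigned integers of a hypothetical fixed width, and all names are illustrative.

```python
from statistics import median_low


def bit_majority(values, bits=16):
    """Bitwise majority vote over a list of unsigned integer readouts."""
    result = 0
    for b in range(bits):
        ones = sum((v >> b) & 1 for v in values)
        if 2 * ones > len(values):  # a tie leaves the bit cleared
            result |= 1 << b
    return result


def _windows(readouts, window):
    """Yield the slice of readouts centered on each index, clipped at the edges."""
    half = window // 2
    for i in range(len(readouts)):
        yield readouts[max(0, i - half):i + half + 1]


def sliding_bit_majority(readouts, window=3, bits=16):
    """Replace each readout by the bitwise majority of its neighborhood."""
    return [bit_majority(w, bits) for w in _windows(readouts, window)]


def sliding_median(readouts, window=3):
    """Replace each readout by the (low) median of its neighborhood."""
    return [median_low(w) for w in _windows(readouts, window)]
```

Both smoothers assume the true value changes little between consecutive readouts, so that a single flipped bit is outvoted by its neighbors.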


NGST - Results

[Figure: relative error (entire data set) versus probability of a data bit flip]

Results for OTIS

Data redundancy in OTIS: multiple radiation mappings – one for each wavelength out of 128

The strong spatial locality and tight natural bounds of thermal data can also be exploited by the preprocessing

[Figure: relative error versus probability of a data bit flip]

