1 Application-Level Fault Tolerance for Embedded Real-Time Systems Israel Koren Department of Electrical & Computer Engineering University of Massachusetts Amherst, MA, 01003 Supported in part by DARPA, NASA/JPL and NS


Application-Level Fault Tolerance for Embedded Real-Time Systems

Israel Koren

Department of Electrical & Computer Engineering

University of Massachusetts Amherst, MA, 01003

Supported in part by DARPA, NASA/JPL and NSF


Introduction

Fault tolerance can be incorporated at two levels:

System level: encompasses all types of redundancy of system HW and SW components and recovery actions taken by the system (application independent)

Application level: encompasses redundancy and recovery actions within the application software itself

For general-purpose systems the first is preferable

For large-scale real-time applications system-level fault tolerance alone is too expensive and may be insufficient

Massive hardware and/or software redundancy is usually too expensive for embedded systems

Recovery overhead associated with movement of large process checkpoints increases the chances of missing a deadline

UMass - Architecture and Real-Time Systems Lab


Application-Level Fault Tolerance (ALFT)

Key Idea: Exploit application semantics to implement low overhead fault tolerance

Redundancy can be tuned to the extent of fault-tolerance required - scalable fault-tolerance

Allowing more overhead for ALFT produces higher quality results

Trade off fault tolerance against computation overhead

Application-Level Fault Tolerance (ALFT) can complement existing system- or algorithm-level fault-tolerance by leveraging information available only at the application level

We have integrated our ALFT techniques with four large-scale real-time applications from Honeywell and NASA


ALFT - General Approach

• Each processor performs, in addition to its own work (P, primary), a scaled-down copy of its neighbor's work (S, secondary)
• Upon detecting a faulty neighbor, the node provides its secondary results as a substitute
• When recovered, the interrupted process begins calculations with data which its secondary has computed on its behalf

[Figure: four nodes in a ring. Node 1 runs P1 and S4, Node 2 runs P2 and S1, Node 3 runs P3 and S2, Node 4 runs P4 and S3. One node is marked "Fault".]
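The ring arrangement on this slide can be sketched in a few lines. This is a minimal illustration, not the original implementation: it assumes nodes are numbered from 0, that each node's secondary covers its predecessor in the ring (as in the figure, where Node 2 runs S1), and all function names are hypothetical. It does not handle the case where a node and its backup fail together.

```python
def secondary_of(node: int, n_nodes: int) -> int:
    """Return the neighbor whose scaled-down work `node` runs as its secondary."""
    return (node - 1) % n_nodes


def backup_for(node: int, n_nodes: int) -> int:
    """Return the node that holds the secondary copy of `node`'s work."""
    return (node + 1) % n_nodes


def collect_results(primary, secondary, faulty):
    """Pick, per node, the primary result or the neighbor's secondary result.

    primary[i]   : result of node i's own work (P_i)
    secondary[i] : reduced result that node i computed for secondary_of(i)
    faulty       : set of nodes that were detected as faulty
    """
    n = len(primary)
    results = []
    for node in range(n):
        if node not in faulty:
            results.append(primary[node])
        else:
            # Substitute the scaled-down result held by the ring neighbor.
            results.append(secondary[backup_for(node, n)])
    return results
```

For example, with four nodes and node 1 faulty, the result for node 1 comes from the secondary held by node 2, while every other node contributes its primary result.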


Issues to be resolved

How to scale down the secondary? Precision vs. overhead

Should we always run the secondaries?

The answers are application dependent


Benchmark Applications

Real-Time applications used for benchmarking:

Applications from Honeywell: RTHT (real-time hypothesis tracking), ABF (adaptive beam forming)

Applications from NASA’s REE suite: OTIS (orbital thermal imaging spectrometer), NGST (next generation space telescope)


The RTHT Application

Real-Time Hypothesis Tracking: tracks objects moving about in a 2-D coordinate plane (using data from radar), to distinguish between real targets and noise clutter


RTHT Processes

Each process tracks targets through the creation and extension of hypotheses which include a figure of likelihood

When a target object makes it through more and more consecutive frames, its hypothesized track becomes more likely to be real


RTHT with ALFT


Without the secondary a Cold-Start would be required if the node recovers but does not take part in the compilation

Secondary extends the top p% of hypotheses
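A minimal sketch of how the secondary might pick the hypotheses it extends. The data layout (pairs of track id and likelihood), the function name, and rounding the count up are assumptions made for illustration; the slides only state that the top p% are extended.

```python
import math


def select_for_secondary(hypotheses, p):
    """Return the top p percent of hypotheses, ranked by likelihood.

    hypotheses : list of (track_id, likelihood) pairs
    p          : percentage of hypotheses the secondary extends (0..100)
    """
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    # Rounding up is an assumption; it keeps at least one hypothesis for p > 0.
    count = math.ceil(len(ranked) * p / 100)
    return ranked[:count]
```

Because the list is sorted by likelihood first, a small p still covers the hypotheses most likely to correspond to real targets.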


RTHT Results

30 real targets, 80 false alarms and two application processes

A single fault, lasting one frame, occurs at Frame No. 15

With a redundancy of just 15%, we can track all the real targets, despite the faulty node

[Figure: Number of Targets Tracked]


Why only 15%?

Hypotheses are sorted in order of likelihood

The hypotheses extended by the secondary are the ones most likely to be real targets


Secondary time overhead

An even smaller computational load is imposed by the secondary

The extension of hypotheses that are most likely to be real takes less time

[Figure: Ratio of Secondary Execution Time to Primary versus Percentage of Secondary Overlap]


The ABF Application

The Adaptive Beam Forming Application detects sound as it impinges on a linear array of sonar sensors

[Figure: plane wave arriving at a linear array of sonar sensors]


ABF Processes

Each process works on a distinct subset of frequency range, and dynamically updates a set of weights every frame

A beam that emphasizes the sound coming from each direction is formed using these weights

[Figure: Magnitude (dB) versus direction (angle) of arrival (degrees)]


ABF with ALFT

Two methods of secondary reduction:

Limited Field of View: search only in certain directions (windows)

Reduced Granularity: search full field at lower granularity

A blend of the two methods

[Figure: Example Output: Combined Techniques. Magnitude (dB) versus Direction of Arrival (Angle) - Degrees]
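The two reduction methods, and their blend, can be sketched as selections over the list of search directions. This is an illustrative sketch only: the function names, the window representation as (low, high) angle pairs, and the integer step are assumptions, not the ABF code itself.

```python
def limited_fov(directions, windows):
    """Keep only directions that fall inside one of the (low, high) windows."""
    return [d for d in directions
            if any(low <= d <= high for low, high in windows)]


def reduced_granularity(directions, step):
    """Keep every `step`-th direction across the full field."""
    return directions[::step]


def combined(directions, windows, step):
    """Search the windows only, and at a coarser granularity within them."""
    return reduced_granularity(limited_fov(directions, windows), step)
```

With 180 one-degree directions, windows covering 54 of them (30% of the field) and a step of 2 (50% granularity) leave 27 directions for the secondary, that is 0.30 × 0.50 = 15% of the primary's search.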


ABF Results

Four beams of sound at 32 frequency ranges

Two application processes

A single node failure in Frame 20

Table shows minimum redundancy required to not lose track of any beam

Combining the two techniques reduces the computational overhead, while maintaining similar results

Redundancy Technique                     Secondary Overlap   Computational Overhead
Reduced Granularity                      33%                 35%
Limited FOV                              30%                 30%
Combined - 30% FOV and 50% Granularity   15%                 17%


ABF - Secondary Overhead

The computational load curves are linear (unlike RTHT) due to uniform dataset priority

Still, a reasonably small amount of extra computation is necessary to mask the fault

[Figure: Ratio of Sec. Execution Time to Primary versus Percentage of Secondary Overlap]


Adding Fault Detection

Faults do not always completely disable a node

Malformed and corrupted data are more likely

Hardware-disabling faults are easy to detect with watchdog hardware and “I am alive” messages

Faulty data is difficult to detect without application syntax

Fault detection is a necessary condition for ALFT to schedule which secondary tasks to run

Adding fault detection: employ acceptance filters to validate the primary’s output

Secondary tasks can provide verification for ambiguous (possibly faulty) data


Validation Through Secondaries

The “better” data is chosen according to the following logic grid:

Run Secondary

                       Primary: Faulty   Primary: Ambiguous   Primary: Faultless
Secondary: Faulty      Primary*          Primary              Primary
Secondary: Ambiguous   Secondary         Primary              Primary
Secondary: Faultless   Secondary         Secondary            Primary
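The logic grid reduces to one rule: use the secondary's result only when its status is strictly better than the primary's. A small sketch under that reading follows; the status names and function name are hypothetical, and the asterisk on the faulty/faulty cell is not explained in the slides and is not modeled here.

```python
# Lower rank means more trustworthy output.
RANK = {"faultless": 0, "ambiguous": 1, "faulty": 2}


def choose_result(primary_status: str, secondary_status: str) -> str:
    """Return which result to keep, following the logic grid.

    The secondary is chosen only if its status is strictly better than the
    primary's; on a tie the (full-precision) primary result is kept.
    """
    if RANK[secondary_status] < RANK[primary_status]:
        return "secondary"
    return "primary"
```

Keeping the primary on ties reflects that the secondary is a scaled-down computation, so it is preferred only when there is evidence that the primary's output is worse.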


Acceptance Filters

Faults are detected by passing results through one or more acceptance filters

Filters are unique to applications with certain data characteristics

Value bound tests are applicable to most applications

Sanity check tests require knowledge of the expected output behavior and format

[Figure: flowchart with the boxes "Results from Primary", "Filter 1", "Filter 2", "Secondary Task Queue" and "Data is OK"; edges are labeled "Pass" and "Fail"]
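A minimal sketch of a filter chain of this kind. It assumes that a result failing any filter causes the corresponding secondary task to be queued, and that results passing every filter are accepted; the function names and the (task_id, value) layout are hypothetical.

```python
from collections import deque


def value_bound_test(low, high):
    """Build a filter that passes only values inside [low, high]."""
    return lambda value: low <= value <= high


def run_filters(results, filters, secondary_queue):
    """Pass each (task_id, value) result through every filter in order.

    Results that pass all filters are accepted; a result that fails any
    filter has its task queued so that the secondary can be run for it.
    """
    accepted = []
    for task_id, value in results:
        if all(f(value) for f in filters):
            accepted.append((task_id, value))
        else:
            secondary_queue.append(task_id)
    return accepted
```

A typical call would pass `deque()` as the queue and a list such as `[value_bound_test(0, 10)]` as the filters; application-specific sanity checks can be appended to the same list.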


OTIS Characteristics

ALFTD was applied to OTIS (Orbital Thermal Imaging Spectrometer) - part of the REE suite

OTIS reads radiation values from various bands and calculates temperature data

Useful characteristics of OTIS’ output (temperature):

Local Correlation: Data changes gradually over an area

Absolute Bounds: Data falls within some expected realistic range


ALFTD Filters for OTIS

Local Correlation and Absolute Bounds on the data led to the creation of two filters:

Spatial Locality Filter: If the difference between pixel (x,y) and (x-1,y) is greater than some threshold - the pixel may be the result of faulty data

Absolute Bounds Filter: Any pixel not falling in the value range (lower bound) < value < (upper bound) may be the result of faulty data

The three filter thresholds (the spatial locality threshold and the lower and upper bounds) are set based on sample datasets
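The two filters can be sketched over a small image stored as a list of rows. This is an illustration under stated assumptions: the image is indexed as image[y][x], the parameter names `threshold`, `lower` and `upper` stand in for the thresholds whose symbols are missing from the slide, and the sample values are made up.

```python
def spatial_locality_filter(image, threshold):
    """Flag pixel (x, y) when it differs from (x-1, y) by more than threshold."""
    flagged = set()
    for y, row in enumerate(image):
        for x in range(1, len(row)):
            if abs(row[x] - row[x - 1]) > threshold:
                flagged.add((x, y))
    return flagged


def absolute_bounds_filter(image, lower, upper):
    """Flag pixels whose value is not strictly between lower and upper."""
    return {(x, y)
            for y, row in enumerate(image)
            for x, value in enumerate(row)
            if not (lower < value < upper)}


# Made-up temperatures; 340 and 0 play the role of faulty pixels.
image = [[300, 301, 340, 302],
         [299, 300, 301, 0]]
suspect_locality = spatial_locality_filter(image, threshold=10)
suspect_bounds = absolute_bounds_filter(image, lower=280, upper=330)
```

In this example the locality filter also flags the good pixel at (3, 0), because it sits next to a faulty one. That is one way a false alarm can arise, and it is why the thresholds need calibration.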


OTIS Datasets

[Figure: the three OTIS datasets “Blob”, “Stripe” and “Spots”, each shown as faulty and fault-free output]


Filter Calibration

ALFTD filters require calibration

Higher detection probability with a low rate of false alarms can be achieved with well-tuned filters

Calibration should be based on characteristics of the most frequent data


Frequency Plots (Bounds Filter)

Frequency of temperature values


Frequency Plots (Spatial Locality Filter)

Frequency of differences between adjacent pixels


Fault Injection

To test the detection capability we compared the fault-free output to an erroneous output generated using fault injection

Faults produce different kinds and intensities of errors

Intensely faulty data (set-to-zero errors, memory gibberish) is easily detected, as it seldom falls inside the prescribed filters

“Lightly” faulty data will not be detected but is negligible

Our experiments include moderately faulty data: offsets in value of up to 30%

These faults tend to blend in with non-faulty data, making them especially hard to detect
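A hedged sketch of injecting moderate faults of this kind. It assumes the offset is relative and uniformly distributed in [-30%, +30%], and that each value is corrupted independently with a given probability; neither detail is stated in the slides, and the function name and parameters are hypothetical.

```python
import random


def inject_offset_faults(values, fault_rate, max_offset=0.30, seed=0):
    """Return a corrupted copy of `values` and the indices that were altered.

    Each value is corrupted with probability `fault_rate` by scaling it up or
    down by a random relative offset of at most `max_offset` (30% here).
    """
    rng = random.Random(seed)  # fixed seed so that experiments are repeatable
    corrupted = list(values)
    faulty_indices = []
    for i, value in enumerate(values):
        if rng.random() < fault_rate:
            offset = rng.uniform(-max_offset, max_offset)
            corrupted[i] = value * (1.0 + offset)
            faulty_indices.append(i)
    return corrupted, faulty_indices
```

Comparing the filter output against `faulty_indices` gives the detection count (flagged and actually corrupted) and the false alarm count (flagged but not corrupted).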


Filter Adjustment

Filters can be adjusted in steps

A single filter has high (“right”) and low (“left”) cutoffs

The “left” and “right” bounds of data are usually exclusive, therefore their detections act cumulatively

For each filter there is a tradeoff between the desired fault detection rate and the number of false alarms

Multiple filters are independently calibrated

Multiple filters may detect more faults than a single filter and have a lower false alarm rate

But - the subsets of faults detected will not necessarily be disjoint
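Because the detected subsets can overlap, the combined detection rate is that of the union, not the sum of the individual rates. A tiny sketch with made-up fault indices (the function name is hypothetical):

```python
def combined_detection_rate(detected_a, detected_b, n_faults):
    """Fraction of injected faults caught by at least one of two filters."""
    return len(detected_a | detected_b) / n_faults


# Ten injected faults; each filter alone catches six of them (60%).
filter_a = {0, 1, 2, 3, 4, 5}
filter_b = {3, 4, 5, 6, 7, 8}
rate = combined_detection_rate(filter_a, filter_b, n_faults=10)
```

Here three faults are caught by both filters, so the pair detects 9 of 10 faults (90%) rather than 120%.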


Detection Plots (Single Side)

Fault detections and false alarms for the left cutoff (“Blob”)


Detection Plots (Both Sides)

Overlaying the left and right filter cutoff plots - the impacts of the right and left cutoff values are asymmetric (“Blob”)


Fault Detections, Numerically

Columns = left cutoff, Rows = right cutoff

This table is used to find the possible configurations that satisfy a minimum required fault detection rate (80%)

Right \ Left   300     304     306     310     314     318
315            98.9%   99.1%   99.2%   99.2%   99.4%   -
317            96.6%   96.8%   96.8%   96.9%   97.1%   -
319            93.8%   93.9%   94.0%   94.0%   94.3%   98.5%
321            91.0%   91.1%   91.2%   91.3%   91.5%   95.7%
323            88.2%   88.3%   88.4%   88.4%   88.7%   92.9%
325            83.6%   83.7%   83.8%   83.9%   84.1%   88.3%
327            78.5%   78.7%   78.8%   78.8%   79.0%   83.3%
329            71.2%   71.4%   71.5%   71.5%   71.7%   76.0%
331            64.0%   64.2%   64.3%   64.3%   64.5%   68.8%
333            61.4%   61.5%   61.6%   61.7%   61.9%   66.1%
335            60.9%   61.0%   61.1%   61.2%   61.4%   65.6%
337            60.2%   60.4%   60.4%   60.5%   60.7%   64.9%
339            59.2%   59.4%   59.5%   59.5%   59.7%   64.0%

Bounds Filter: Fault Detections


False Alarms, Numerically

Columns = left cutoff, Rows = right cutoff

Of the possible combinations chosen from the previous table, choose the one with the minimum number of false alarms

Right \ Left   300     304     306     310     314     318
315            92.4%   92.4%   92.4%   92.4%   96.3%   -
317            84.7%   84.7%   84.7%   84.7%   88.7%   -
319            78.6%   78.6%   78.6%   78.6%   82.5%   97.1%
321            72.5%   72.5%   72.5%   72.5%   76.5%   91.0%
323            64.8%   64.8%   64.8%   64.8%   68.7%   83.2%
325            54.1%   54.1%   54.1%   54.1%   58.1%   72.6%
327            41.2%   41.2%   41.2%   41.2%   45.2%   59.7%
329            23.9%   23.9%   23.9%   23.9%   27.8%   42.3%
331            5.0%    5.0%    5.0%    5.0%    9.0%    23.5%
333            0.0%    0.0%    0.0%    0.0%    4.0%    18.5%
335            0.0%    0.0%    0.0%    0.0%    3.9%    18.4%
337            0.0%    0.0%    0.0%    0.0%    3.9%    18.4%
339            0.0%    0.0%    0.0%    0.0%    3.9%    18.4%

Bounds Filter: False Alarms
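The two-step selection (first keep the configurations that meet the detection requirement, then minimize false alarms) can be sketched as follows. The data is an excerpt of the two bounds-filter tables (right cutoffs 323 to 329 only); the function name and the tie-break toward higher detection are assumptions made for this illustration.

```python
LEFT_CUTOFFS = [300, 304, 306, 310, 314, 318]

# Excerpt of the two tables: right cutoff -> percentages per left cutoff.
DETECTIONS = {
    323: [88.2, 88.3, 88.4, 88.4, 88.7, 92.9],
    325: [83.6, 83.7, 83.8, 83.9, 84.1, 88.3],
    327: [78.5, 78.7, 78.8, 78.8, 79.0, 83.3],
    329: [71.2, 71.4, 71.5, 71.5, 71.7, 76.0],
}
FALSE_ALARMS = {
    323: [64.8, 64.8, 64.8, 64.8, 68.7, 83.2],
    325: [54.1, 54.1, 54.1, 54.1, 58.1, 72.6],
    327: [41.2, 41.2, 41.2, 41.2, 45.2, 59.7],
    329: [23.9, 23.9, 23.9, 23.9, 27.8, 42.3],
}


def best_configuration(min_detection):
    """Return (left, right, detection, false_alarm) or None if nothing fits."""
    candidates = []
    for right, rates in DETECTIONS.items():
        for col, detection in enumerate(rates):
            if detection >= min_detection:
                false_alarm = FALSE_ALARMS[right][col]
                # Sort key: fewest false alarms first, then highest detection.
                candidates.append(
                    (false_alarm, -detection, LEFT_CUTOFFS[col], right))
    if not candidates:
        return None
    false_alarm, neg_detection, left, right = min(candidates)
    return left, right, -neg_detection, false_alarm
```

With the 80% requirement, this excerpt yields a left cutoff of 310 and a right cutoff of 325 (83.9% detection, 54.1% false alarms). The rows omitted from the excerpt either miss the 80% requirement or have higher false alarm rates, so they would not change the choice.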


Multiple Filters

By combining multiple filters, fault detection is improved

Fault Detection (rows = Spatial Locality filter, columns = Bounds filter)

         60.0%   70.0%   90.0%
40.0%    63.7%   71.9%   89.6%
50.0%    64.0%   72.1%   89.7%
60.0%    67.5%   72.7%   90.2%
70.0%    76.3%   80.1%   94.2%
80.0%    84.1%   87.4%   96.8%
90.0%    93.0%   94.3%   98.7%

False Alarms (rows = Spatial Locality filter, columns = Bounds filter)

         60.0%   70.0%   90.0%
40.0%    15.7%   22.6%   76.0%
50.0%    15.7%   22.6%   76.0%
60.0%    15.7%   22.6%   76.0%
70.0%    36.3%   42.2%   84.2%
80.0%    59.9%   64.6%   90.5%
90.0%    77.1%   79.0%   94.5%

A false alarm runs the secondary unnecessarily


ALFTD-corrected output (“Blob”)

[Figure: Fault-Free Output, Faulty Output, and ALFTD-corrected Output at 25%, 33% and 50% Overhead]


Difference Plots (“Blob”)

[Figure: difference plots for the Faulty, 25% Overhead, 33% Overhead and 50% Overhead outputs, on a scale from No Error to Max Error]

Faulty output versus fault-free output


Conclusions

A high degree of fault tolerance at a minimal investment of system resources

Particularly useful in applications exhibiting data parallelism and some level of data redundancy or correlation

Scalable fault-tolerance

Attractive alternative to more expensive schemes such as hardware and/or software redundancy

Can complement system-level fault tolerance schemes




Thank You!

C.M. Krishna

Vijay Lakamraju

Josh Haines

Eric Ciocca


Further Extension (Input Errors)

Real-time applications exposed to extreme environments can be affected by charged particles like alpha/cosmic rays

High likelihood of input data faults manifesting as bit flips

Re-running the process or its secondary is useless as the input remains the same

Input data should be preprocessed to detect input errors and attempt to correct them

We have integrated preprocessing of input data in two NASA applications - OTIS and NGST


Next Generation Space Telescope

Multiple readouts during each period

Use this redundancy to identify and recover from input data bit errors

Algorithms like optimal median smoothing and sliding-window bit majority smoothing can be used

[Figure: Space Station and Ground Station]
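A simplified sketch of the two smoothing ideas named above, applied to a sequence of integer readouts. These are plain sliding-window versions written for illustration, not the "optimal" algorithms used in the NGST work; the window size, the 16-bit word width, the use of the lower median on even-sized windows, and all function names are assumptions.

```python
from statistics import median_low


def median_smooth(readouts, window=3):
    """Replace each readout by the median of a window centered on it."""
    half = window // 2
    smoothed = []
    for i in range(len(readouts)):
        lo = max(0, i - half)
        hi = min(len(readouts), i + half + 1)
        smoothed.append(median_low(readouts[lo:hi]))
    return smoothed


def bit_majority(words, n_bits=16):
    """Bitwise majority vote over a group of integer readouts."""
    result = 0
    for bit in range(n_bits):
        ones = sum((w >> bit) & 1 for w in words)
        if 2 * ones > len(words):  # strict majority of the words have a 1
            result |= 1 << bit
    return result


def sliding_bit_majority(readouts, window=3, n_bits=16):
    """Apply the bitwise majority vote over a sliding window of readouts."""
    half = window // 2
    return [bit_majority(readouts[max(0, i - half): i + half + 1], n_bits)
            for i in range(len(readouts))]
```

For example, if one readout in a slowly rising sequence has a single high-order bit flipped, both functions replace it with a value close to its neighbors, because the flipped bit is outvoted within the window.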


NGST - Results

[Figure: Relative Error (entire data set) versus Probability of a data bit flip]


Results for OTIS

Data redundancy in OTIS: multiple radiation mappings – one for each wavelength out of 128

Thermal data exhibits strong spatial locality and tight natural bounds, which can also be exploited by the preprocessing

[Figure: Relative Error versus Probability of a data bit flip]