1 Software Fault Tolerance (SWFT) SWFT for Wireless Sensor Networks (Lec 2) Dependable Embedded Systems & SW Group www.deeds.informatik.tu-darmstadt.de Prof. Neeraj Suri Abdelmajid Khelil Dept. of Computer Science TU Darmstadt, Germany

Software Fault Tolerance (SWFT)

SWFT for Wireless Sensor Networks (Lec 2)

Dependable Embedded Systems & SW Group www.deeds.informatik.tu-darmstadt.de

Prof. Neeraj Suri

Abdelmajid Khelil

Dept. of Computer Science, TU Darmstadt, Germany

Motivation

Last lecture: FT environmental monitoring (e.g. target detection), data fusion ..

But the WSN ages! The system is evolvable!

“So who watches the watchmen?”

Self-healing

• IDEALLY a self-healing network could
  • Identify the problem: a bird landed on a node
  • Identify a fix: the bird needs to be removed
  • Fix the problem: actuation to remove the bird

Goal: Enable the WSN to monitor its own system health and debug itself autonomously

Begin by enabling human debugging to learn which metrics and techniques are useful, then use them to enable autonomous system debugging

Debugging Is Hard in WSN

• Wide range of failures
  • Node crashes, sensor failures, code bugs, transient environmental changes to the network
  • Bugs are multi-causal, non-repeatable, timing-sensitive and have ephemeral triggers
• Transient problems are common
  • Not necessarily indicative of failures
  • Interactions between sensor hardware, protocols, and environmental characteristics are impossible to predict
• Limited visibility
  • Hard accessibility of the system
  • Minimal resources: RAM, communication

WSN application design is an iterative process between debugging and deployment.

Some Debugging Challenges

• Minimal resources
  • Cannot remotely log on to nodes
  • Bugs are hard to track down
• Evolvable system/conditions
  • Application behavior changes after deployment
  • Operating conditions (energy ..)
• Extracting debugging information
• Existing fault-tolerance techniques are limited
• Ensuring system health

Scenario

After Deploying a Sensor Network…

• Very little data arrives at the sink: the cause could be… anything!
• The sink is receiving fluctuating averages from a region, which could be caused by
  • Environmental fluctuations
  • Bad sensors
  • Channel drops the data
  • Calculation / algorithmic errors
  • Bad nodes

Existing Works

• Simulators / Emulators / Visualizers, e.g. EmTOS, EmView, moteview, Tossim ..
  • Provide real-time information
  • Do not capture historical context or aid in root-causing a failure
• SNMS
  • Interactive health monitoring; focuses on infrastructure to deliver metrics, and has high code size
  • Log files contain excessive data which can obfuscate important events
• [1] focuses on metric collection and not on metric content
• Memento [3], Sympathy [2]

(Figure: EmView)

Sympathy: A Debugging System for Sensor Network

Nithya Ramanathan, Kevin Chang, Rahul Kapur, Lewis Girod, Eddie Kohler, and Deborah Estrin.

SenSys '05

Overview

• System Model
• Approach
• Architecture
• Evaluation

System Model

• The network features some regular traffic: sensor nodes (SN) are expected to generate traffic of some kind (monitored traffic), e.g. routing updates, time synchronization beacons, periodic data ..
• Sympathy suspects a failure when a node generates less monitored traffic than expected.
• Sympathy generates additional metrics traffic.
• No malicious behavior is assumed.
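
The detection rule of the system model (less monitored traffic than expected means a suspected failure) can be sketched in a few lines of Python. This is only an illustration, not Sympathy's actual code: `TrafficExpectation`, `suspect_failure`, `suspected_nodes` and the `tolerance` parameter are hypothetical names, and the per-epoch packet counts are assumed to be available at the sink.

```python
from dataclasses import dataclass


@dataclass
class TrafficExpectation:
    """Hypothetical per-node expectation, defined by the application."""
    expected_pkts_per_epoch: int   # monitored packets the sink expects per epoch
    tolerance: float = 1.0         # fraction of the expectation that still counts as sufficient


def suspect_failure(observed_pkts: int, exp: TrafficExpectation) -> bool:
    """Return True when a node generated less monitored traffic than expected."""
    return observed_pkts < exp.expected_pkts_per_epoch * exp.tolerance


def suspected_nodes(observed: dict[int, int],
                    expectations: dict[int, TrafficExpectation]) -> list[int]:
    """List node IDs whose monitored traffic in this epoch is insufficient.

    A node that is missing from `observed` is treated as having sent nothing.
    """
    return sorted(
        node_id
        for node_id, exp in expectations.items()
        if suspect_failure(observed.get(node_id, 0), exp)
    )
```

For example, with an expectation of 10 packets per epoch, a node that delivered 10 packets is not suspected, while a node that delivered nothing is.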

Model For Correct Data Flow

Sink may not receive sufficient traffic from a node for multiple reasons

To determine where and why traffic is lost, Sympathy outlines high-level requirements for data to flow through the network

Node Not Connected

If destination node is not connected, then it may not receive the packet, and it will not respond

Node Does Not Receive Packet

If destination node does not receive certain packets (e.g. a query) from source, it may not transmit expected traffic

Node Does Not Transmit Traffic

Destination node may receive traffic, but due to software or hardware failure it may not transmit expected traffic

Sink Does Not Receive Traffic

Sink may not receive traffic due to collisions or other problems along the route

Tracking Traffic Through The Data Flow Model

(Figure: the four checks of the data flow model)
• Node Connected? If not: Node NOT Connected! (Node Crash, Asymmetric Communication)
• Should Node Transmit Traffic?
• Did Node Transmit Traffic?
• Did Sink Receive Traffic?

Design Requirements

• Tool for detecting and debugging failures in pre- and post-deployment phases.
• Debugging information should provide
  • Most precise and meaningful failure detection
    • Accuracy
    • Latency
  • Lowest overhead
• Transmitted debugging information must be minimized

3 Challenges in WSN Debugging

• Has a failure happened?
• Which failure happened?
• Is the failure important?

Sympathy aids users in
• finding (detecting and localizing) and
• fixing failures
by attempting to answer these questions

Sympathy Approach

• Sink collects stats passively & actively
• Monitors data flow from nodes / components
• Identifies and localizes failures
• Highlights failure dependencies and event correlations

Idea: “There is a direct relationship between amount of data collected at the sink and the existence of failures in the system”

Has a Failure Happened?

• Network’s purpose is to communicate
  • If nodes are communicating sufficiently, the network is working
  • Simplest solution is the best
• “Insufficient” Traffic => Failure
  • Application defines “sufficient”
• Sympathy detects many different failure types by tracking application end-to-end data flow

(Figure labels: Channel Contention, Node Crash, Asymmetric Links, Sensor Failure, No Sensor Data)

Which Failure Happened?

(Figure: decision tree over the data flow model. Recoverable labels, grouped by question:)
• Node Connected? Tests: Valid Neighbor Table, Valid Route Table, Anybody heard from node (Sink Timestamp / Neighbor Tables), Time Awake increases. Outcomes: No Neighbors, No Route, Node Crash, Node Reboot
• Should Node Transmit? Test: Sufficient #Pkts Received. Outcome: Bad Path to Node
• Did Node Transmit? Test: Sufficient #Pkts Transmitted. Outcome: Bad Node Transmit
• Did Sink Receive Traffic? Test: Sufficient #Pkts Received at Sink. Outcome: Bad Path to Sink

Is the Failure Important?

• Analyze failure dependencies to highlight primary failures
  • Based on reported node topology
  • “Can failure X be caused by failure Y?”
• Deemphasize secondary failures to focus the user’s attention
• Does NOT identify all failures or debug failures to a line of code

(Figure labels: Primary Failure, Secondary Failures)
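
The question “Can failure X be caused by failure Y?” can be illustrated with a small sketch over the reported topology. The slide does not give the actual dependency rules, so the rule here is an assumption: a node's failure is treated as secondary when another failed node lies on its last reported route to the sink. The names `classify_failures`, `routes` and `failed`, and the use of ID 0 for the sink, are hypothetical.

```python
def classify_failures(routes, failed):
    """Split failed nodes into primary and secondary failures.

    routes: dict node_id -> list of node IDs on the last reported route,
            starting at the node itself and ending at the sink.
    failed: set of node IDs for which a failure was detected.

    Assumption of this sketch: a failure is secondary when another failed
    node lies on the node's route to the sink.
    """
    primary = set()
    secondary = {}   # failed node -> nearest failed hop that may explain it
    for node in sorted(failed):
        upstream_failed = [hop for hop in routes.get(node, [])[1:] if hop in failed]
        if upstream_failed:
            secondary[node] = upstream_failed[0]
        else:
            primary.add(node)
    return primary, secondary


# Usage: node 25 routes through node 18, so its failure is deemphasized.
routes = {25: [25, 18, 15, 0], 18: [18, 15, 0], 15: [15, 0], 7: [7, 0]}
primary, secondary = classify_failures(routes, failed={25, 18, 7})
print(primary, secondary)
```

Only the primary set would be highlighted to the user; the secondary map records which upstream failure may explain each deemphasized one.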

Failure Localization

• Determining why data is missing
• Physically narrow down the cause, e.g. where is the data lost?
  • Was the data even sent by the component?
  OR
  • Where in the transmission path was the data lost?

Final Goal

• Detect failures
• Determine what the failure is (Root Cause)
• Determine if the failure is important (Primary Failures)

(Figure: example network with primary failures in red boxes. Labels: Bad Node Transmit (ADC failure), Node crashed, No Neighbors, Bad Path To Node (due to contention at sink), Bad Path To Sink (due to contention at sink), Contention at Sink)

Sympathy System: Architecture

(Figure labels: sensor nodes, each with Sympathy, Routing and Apps; SINK with User Processes, Collect Metrics and Sympathy)

Architecture Definitions

• Network: a sink and distributed nodes
• Component
  • Node components
  • Sink components
• Sympathy-sink
  • Communicates with sink components
  • Understands all packet formats sent to the sink
  • Non resource constrained node
• Sympathy-node
• Statistics period
• Epoch

(Figure labels: Sink with Sympathy sink and Sink Component; Sensor node with Sympathy node and Node Component)

Architecture: Overview

(Figure, step 1: Nodes run Sympathy, Routing and Comp 1. At the SINK, SYMPATHY works with the Sink Components: Collect Stats -> Perform Diagnostic -> if no/insufficient data, Run Tests -> Run Fault Localization Algorithm -> USER)

Sympathy Code on Sensor Node

• Each component is monitored independently
• Return generic or app-specific statistics

(Figure labels: Sympathy - Node with Stats Recorder & Event Processor, Ring Buffer, Retrieve Comp Statistics, Data Return; Comp 1 …; Routing Layer; MAC Layer)

Metrics

• Metrics are collected in 3 ways:
  • Sympathy code on each SN actively reports to the sink (periodically or on-demand)
  • Sink passively snoops its own transmission area
  • Sympathy code on the sink extracts sink metrics from the sink application
• 3 metric categories:
  • Connectivity: ROUTING TABLE and NEIGHBOUR LIST from each node, collected either passively or actively.
  • Flow: PACKETS SENT, PACKETS RECEIVED and #SINK PACKETS TRANSMITTED from each SN, and #SINK PACKETS RECEIVED and SINK LAST TIMESTAMP from the sink.
  • Node: SNs actively report UPTIME, BAD PACKETS RECEIVED and GOOD PACKETS RECEIVED. The sink also maintains BAD and GOOD PACKETS RECEIVED.
• All metrics time out after EPOCH
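
The rule that all metrics time out after EPOCH could be kept at the sink roughly as follows. This is a minimal sketch under assumptions, not Sympathy's implementation: `MetricStore`, its method names and the injectable `clock` are hypothetical, and the epoch is expressed in seconds for simplicity.

```python
import time


class MetricStore:
    """Sink-side store that forgets metrics older than one epoch."""

    def __init__(self, epoch_seconds, clock=time.monotonic):
        self.epoch_seconds = epoch_seconds
        self.clock = clock      # injectable so the sketch can be tested without waiting
        self._metrics = {}      # (node_id, metric name) -> (value, time stored)

    def update(self, node_id, name, value):
        """Record the latest value of a metric reported by, or snooped from, a node."""
        self._metrics[(node_id, name)] = (value, self.clock())

    def get(self, node_id, name):
        """Return the metric value, or None if it is missing or has timed out."""
        entry = self._metrics.get((node_id, name))
        if entry is None:
            return None
        value, stored_at = entry
        if self.clock() - stored_at > self.epoch_seconds:
            del self._metrics[(node_id, name)]   # metric timed out after the epoch
            return None
        return value
```

A metric that is not refreshed within one epoch then reads as missing, which is the condition the diagnostic would react to.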

Node Statistics

Passive (in sink’s broadcast domain) and actively transmitted by nodes

Statistic Name     Description
ROUTING TABLE      (Sink, next hop, quality) tuples
NEIGHBOUR LIST     Neighbors and associated ingress/egress
UP TIME            Time node is awake
#Statistics tx     Number of statistics packets transmitted to sink
#Pkts routed       Number of packets routed by the node

Component Statistics

Actively transmitted by a node to the sink, for each instrumented component

Statistic Name        Description
#Reqs rx              Number of packets component received
#Pkts tx              Number of packets component transmitted
SINK LAST TIMESTAMP   Timestamp of last data stored by component

Sympathy System

(Figure, step 2: Collect Stats. Labels: nodes with Sympathy, Routing and Comp 1; SINK with SYMPATHY, Collect Stats and Sink Components (Comp 1))

Sink Interface

• Sympathy passes comp-specific statistics using a packet queue
• Components return ASCII translations for Sympathy to print to the log file

(Figure labels: Sympathy; Comp 1, Comp 2, Comp 3; comp-specific statistics; ASCII translation of statistics / data received)

Sympathy System

(Figure, step 3: Perform Diagnostic. Labels: nodes with Sympathy, Routing and Comp 1; SINK with SYMPATHY and Sink Components: Collect Stats -> Perform Diagnostic -> if no/insufficient data, Run Tests -> Run Failure Localization Algorithm)

Failure Localization: Decision Tree

Tx: transmit, Rx: receive

(Figure: decision tree run by the DIAGNOSTIC; outcomes fall under “No Data” or “Insufficient Data”.)
• Tests: Node Rebooted; Rx a Pkt from node; Some node has heard this node; Some node has route to sink; Some node has sink as neighbor; Rx Statistics; Rx all Comp’s Data; Comp Rx Reqs; Comp Tx Resps; Sink Rx Resps Comp Tx
• Outcomes: Node Rebooted; Node Crashed; No node has sink on their neighbor list; No node has a Route to sink; No Data; No stats; NO FAILURE (Comp has no Data to Tx); Node not Rx Reqs; Node not Tx Resps; Sink not Rx Resps
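
The decision tree on this slide can be sketched as a chain of tests. The exact branch order cannot be recovered from the flattened figure, so the ordering below is an assumption, and the “Rx all Comp’s Data” test is left out on the assumption that the tree only runs once data is found to be missing or insufficient. `NodeView`, its field names and `localize` are hypothetical; only the outcome strings are taken from the slide.

```python
from dataclasses import dataclass


@dataclass
class NodeView:
    """What the sink knows about one node in the current epoch (hypothetical fields)."""
    rebooted: bool = False                # node reported a reboot
    sink_rx_pkt: bool = False             # sink received any packet from the node
    heard_by_some_node: bool = False      # some node has heard this node
    some_node_has_route: bool = False     # some node has a route to the sink
    some_node_has_sink_nbr: bool = False  # some node has the sink as neighbor
    sink_rx_stats: bool = False           # sink received Sympathy statistics
    comp_rx_reqs: bool = False            # component received requests
    comp_tx_resps: bool = False           # component transmitted responses
    sink_rx_resps: bool = False           # sink received the transmitted responses


def localize(v: NodeView) -> str:
    """Walk the (assumed) branch order and return the first matching outcome."""
    if v.rebooted:
        return "Node Rebooted"
    if not v.sink_rx_pkt:                 # "No Data" branch
        if not v.heard_by_some_node:
            return "Node Crashed"
        if v.some_node_has_route:
            return "No Data"              # cannot be localized further
        if v.some_node_has_sink_nbr:
            return "No node has a route to sink"
        return "No node has sink on their neighbor list"
    if not v.sink_rx_stats:               # "Insufficient Data" branch
        return "No stats"
    if not v.comp_rx_reqs:
        return "Node not Rx Reqs"
    if not v.comp_tx_resps:
        return "Node not Tx Resps"
    if not v.sink_rx_resps:
        return "Sink not Rx Resps"
    return "NO FAILURE (Comp has no Data to Tx)"
```

Writing the tree as early returns keeps each outcome next to the single test that decides it, which makes it easy to reorder branches if the real tree differs.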

Functional “No Data” Failure Localization

Failure            Description
Node Crash         Node has crashed and not come back
No Route to Sink   No valid route exists to the sink from a node
No Data            No data received from a node, and Sympathy cannot localize the failure

Performance “Insufficient Data” Failure Localization

Failure            Description
Node Reboot        Node has rebooted
Congestion         Correlated failures on packet reception
No requests rx     Component is not receiving requests from sink
No response tx     Component is not transmitting data in response to requests
No response rx     Sink is not receiving data transmitted by a component
No statistics rx   Sink has not received Sympathy statistics on the component

Source Localization

Root Causes with Associated Metrics and Source:

Three localized sources for failures:

• Node self (crash, reboot, local bug, connectivity issue ..)
• Path between node and sink (relay failure, collisions ..)
• Sink

Sympathy System

(Figure, step 4: reporting to the USER. Labels: nodes with Sympathy, Routing and Comp 1; SINK with SYMPATHY and Sink Components: Collect Stats -> Perform Diagnostic -> if insufficient data, Run Tests -> Run Fault Localization Algorithm -> USER)

Informational Log File

Node 25
Time: Node awake: 78 (mins) Sink awake: 78 (mins)
Route: 25 -> 18 -> 15 -> 12 -> 10 -> 8 -> 6 -> 2
Num neighbors heard this node: 6

Pkt-type     #Rx     Mins-since-last   #Rx-errors   Mins-since-last
1:Beacon     15(2)   0 mins            1(0)         52 mins
3:Route      3(0)    37 mins           0(0)         INF
Symp-stats   12(2)   1 mins

Reported Stats from Components
------------------------------------
**Sympathy: #metrics tx/#stats tx/#metrics expected/#pkts routed: 13(2)/12(2)/13(1)/0(0)

Node-ID   Egress   Ingress
-----------------------------------------------
8         128      71
13        128      121
24        249      254

Failure Log File

Node 18
Node awake: 0 (mins)
Sink awake: 3 (mins)
Node Failure Category: Node Failed!

TESTS
Received stats from module [FAILED]
Received data this period [FAILED]
Node thinks it is transmitting data [FAILED]
Node has been claimed by other nodes as a neighbor [FAILED]
Sink has heard some packets from node [FAILED]

Received data this period: Num pkts rx: 0(0)
Received stats from module: Num pkts rx: 0(0)

Node’s next-hop has no failures

Spurious Failures

• An artifact of another failure
• Sympathy highlights failure dependencies in order to distinguish spurious failures

(Figure labels: Sympathy Sink; Node Crashed; Congestion; “Appears to be sending very little data”; “Appears to not be sending data”)

Testing Methodology

• Application
  • Run Sympathy with the ESS (Extensible Sensing System) application
  • In simulation, emulation and deployment
• Traffic conditions: no traffic, application traffic, congestion
• Node failures
  • Node reboot: only requires information from the node
  • Node crash: requires spatial information from neighboring nodes to diagnose
• Failure injected in one node per run, for each node
• 18-node network, with a maximum of 7 hops to the sink

Evaluation Metrics

• Accuracy of failure detection: number of primary failure notifications
• Latency of failure detection/notification: time from when the failure is injected to when Sympathy notifies the user about the failure
• There is a tradeoff between accuracy and latency

Notification Latency

• Does Sympathy always detect an injected failure?
• Detection =
  • Assign a root cause of node crash
  • Highlight the failure as primary

(Figure: notification latency plot; axis label: EPOCH)

Notification Accuracy

Memory Footprint

TinyOS, mica2

Binary             RAM      ROM
ESS w/o Sympathy   3089 B   96094 B
ESS w/ Sympathy    3160 B   104802 B
Difference         71 B     8708 B
Sympathy           47 B     1558 B
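
As a quick check of the "Difference" row, and to express it relative to the ESS binary without Sympathy, the arithmetic can be reproduced as below. The byte values are copied from the table; the helper name `overhead` is only illustrative.

```python
# Values copied from the memory footprint table (bytes).
ess_without = {"RAM": 3089, "ROM": 96094}
ess_with = {"RAM": 3160, "ROM": 104802}


def overhead(before, after):
    """Return (absolute growth in bytes, relative growth) per memory type."""
    return {
        kind: (after[kind] - before[kind],
               (after[kind] - before[kind]) / before[kind])
        for kind in before
    }


for kind, (delta, ratio) in overhead(ess_without, ess_with).items():
    print(f"{kind}: +{delta} B ({ratio:.1%})")
# prints:
# RAM: +71 B (2.3%)
# ROM: +8708 B (9.1%)
```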

Extensibility

• Adding new metrics requires ~5 lines of code on the nodes and ~10 lines of code on the sink
• Extensible to application classes with predictable data flow within bounds of an epoch
  • User specifies expected amount of data
• Extensible to different routing layers due to modular design
  • Multihop routing plug-in was 140 lines
  • Mintroute routing plug-in was 100 lines

Conclusion

A deployed system that aids in debugging by detecting and localizing failures

Small list of statistics that are effective in identifying and localizing failures

Behavioral model for a certain application class that provides a simple diagnostic to measure system health

Literature

[1] Zhao, J.; Govindan, R.; Estrin, D. “Computing aggregates for monitoring wireless sensor networks”, SNPA 2003.

[2] Ramanathan, N.; Chang, K.; Kapur, R.; Girod, L.; Kohler, E.; Estrin, D. “Sympathy for the Sensor Network Debugger”, SenSys 2005.

[3] Rost, S.; Balakrishnan, H. “Memento: A Health Monitoring System for Wireless Sensor Networks”, SECON 2006.