Slide 1/36
Dependability in a Connected World: From the Very Large to the Very Small
Saurabh Bagchi
School of Electrical and Computer Engineering
Department of Computer Science
Purdue University
Presentation available at: engineering.purdue.edu/dcsl
Slide 2/36
Greetings come to you from …
Slide 3/36
A Few Words about Me
• B. Tech. (1996): IIT Kharagpur, CSE
• PhD (2000): University of Illinois at Urbana-Champaign (CS)
• IBM Research, T. J. Watson (2001)
• Joined Purdue as tenure-track Assistant Professor (2002)
• Promoted to Associate Professor (2007)
• Promoted to Professor (2012)
• Visiting Scientist (2011-current): IBM Research
• Current research group: 12 graduate students, 1 post-doctoral Research Associate, 4 undergraduate students
• 12 graduated PhD students
Slide 4/36
A long time ago … in a galaxy far, far away
Slide 5/36
A few words about Purdue University
• One of the largest graduate schools in engineering
– 362 faculty, 10,000 students, US News rank: 8th
• 40,000 students at its main campus in West Lafayette
• Electrical and Computer Engineering @ Purdue
– 85 faculty, 650 graduate students, 900 undergraduate students
– One of the largest producers of Ph.D.s in Electrical and Computer Engineering (about 60 Ph.D.s a year)
– 3rd largest ECE department in the US
– US News rank: 9th (Computer Engineering), 10th (Electrical/Electronics Engineering)
• Computer Science @ Purdue
– 50 faculty, 245 graduate students
– US News rank: 20th
Slide 6/36
What is Dependable Computing?
• Dependability: Property that the computer system meets its specification despite the presence of faults
– Faults can be due to natural causes (bugs, defects in hardware), or
– Maliciously induced (attacks from external or internal sources)

Fault → Error → Failure
(Some faults are not activated; some errors are masked.)

• Terminology
– Failure: Deviation of the delivered service from compliance with the specification
– Error: Part of the system state that has been damaged by the fault and, if uncorrected, can lead to a failure
– Fault: The adjudged or hypothesized cause of an error
Slide 7/36
Some Domains for Dependable Computing
• Traditional needs
– Long-life applications (e.g., unmanned and manned space missions)
– Life-critical, short-term applications (e.g., aircraft engine control, fly-by-wire)
– Defense applications (e.g., missile defense, guidance & control)
– Nuclear industry
• Newer critical-computation applications
– Health industry
– Automotive industry
– Industrial control systems, production lines
– Banking, reservations, switching, commerce
Slide 8/36
Some Domains for Dependable Computing
• Networks
– Wired and wireless networked applications
– Data mining
– Information processing on the Internet
– Distributed, networked systems
• Scientific computing
– Very large numbers of cores
– Petaflop machines being used for reliable computation
Slide 9/36
Two Facets of Dependable Computing
• Mercedes version
– High hardware or software development costs
– High power consumption
– High space overhead
– Example: Boeing 777 fly-by-wire (FBW) system, which used triple modular redundancy for all hardware resources
– Example: AT&T's ESS telecommunication switch, which had a requirement of downtime < 2 minutes/year
• Commodity systems
– Cannot have very high development costs
– Cannot significantly impede system performance
– Cannot resort to high degrees of redundancy
– Example: ECC, parity, RAID (a simple ECC sketch follows below)
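To make the commodity example concrete: a minimal sketch of single-error-correcting ECC, in the form of a Hamming(7,4) code that locates and corrects any single flipped bit in a 7-bit codeword. This is illustrative only and not tied to any particular hardware implementation.

```python
# Hamming(7,4): encode 4 data bits so any single-bit error can be
# located and corrected. Bit positions (1-indexed): p1 p2 d1 p3 d2 d3 d4.

def hamming74_encode(d):
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 0 means no error detected
    if syndrome:
        c[syndrome - 1] ^= 1          # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]]   # extract d1..d4

codeword = hamming74_encode([1, 0, 1, 1])
codeword[5] ^= 1                       # inject a transient single-bit fault
assert hamming74_correct(codeword) == [1, 0, 1, 1]
```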
Slide 10/36
System Design for Dependable Computing
Hardware: system network, processing elements, memory, storage system
Operating system
Reliable communications
SIFT (software-implemented fault tolerance)
Application program interface (API)
Middleware
Applications
What can be provided in COTS hardware to ensure fail-silent behavior of system components (nodes, network)?
What can be provided in the communication layer?
What can be provided in software and application itself?
What is typically provided in the operating system?
How to combine hardware and software fault tolerance techniques: (1) fast error detection in hardware, (2) high-efficiency detection and recovery in software?

How to assess whether the achieved availability meets system requirements?
Slide 11/36
How do We Achieve the Objectives?
• Hardware (processing elements, memory, storage system, system network): error correcting codes, N_of_M and standby redundancy, voting, watchdog timers, reliable storage (RAID, mirrored disks)
• Reliable communications: CRC on messages, acknowledgments, watchdogs, heartbeats, consistency protocols (see the heartbeat sketch below)
• Operating system: memory management, detection of process failures, hooks to support software fault tolerance for applications
• Middleware and applications: checkpointing and rollback, application replication, software voting (fault masking), process pairs, robust data structures, recovery blocks, N-version programming
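Of the communication-layer techniques above, heartbeats are the simplest to sketch. A minimal, illustrative timeout-based heartbeat monitor; the node names, period, and timeout values are invented for this example:

```python
import time

# Timeout-based heartbeat failure detector: each monitored node is
# expected to report in every HEARTBEAT_PERIOD seconds; a node that
# stays silent past the timeout is suspected failed.
HEARTBEAT_PERIOD = 1.0                # illustrative values only
SUSPECT_TIMEOUT = 3 * HEARTBEAT_PERIOD

last_seen = {}                        # node id -> time of last heartbeat

def on_heartbeat(node_id):
    last_seen[node_id] = time.monotonic()

def suspected_nodes():
    now = time.monotonic()
    return [n for n, t in last_seen.items()
            if now - t > SUSPECT_TIMEOUT]

on_heartbeat("node-1")
on_heartbeat("node-2")
time.sleep(0.1)                       # a real monitor polls in a loop
print(suspected_nodes())              # [] while both nodes are fresh
```

In an asynchronous system a timeout yields only suspicion, never certainty, which is one reason consistency protocols appear alongside heartbeats in the list above.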
Slide 12/36
A Trip Back in History
• September 9, 1947
• Mark II computer at Harvard
• Popularly attributed to Grace Hopper
• Actually Bill Burke found the moth
• Diary with the taped moth can be seen at the Smithsonian National Museum of American History
Slide 13/36
Costly Software Bugs in History
• Facebook IPO: May 2012
– NASDAQ earmarked $62 million for reimbursing investors and says it expects to incur significant costs beyond that for system upgrades and legal battles. WSJ pegged the total loss to investors at $500M.
– Investors were unsure how much of Facebook they’d bought. There were 12 million postponed share orders suddenly filled between 1:49 p.m. and 1:51 p.m. without being properly marked ‘late sale’, which exaggerated the impression that people were trying to dump Facebook shares.
– Diagnosis: Computer systems used to establish the opening price were overwhelmed by order cancellations and updates during the "biggest IPO cross in the history of mankind," Nasdaq Chief Executive Officer Robert Greifeld said Sunday. Nasdaq's systems fell into a "loop" that prevented it from opening the shares on schedule
Slide 14/36
Costly Software Bugs in History
• US Power Grid Blackout: August 14, 2003
– 50 million people lost power for up to two days in the biggest blackout in North American history. The event contributed to at least 11 deaths and cost an estimated $6 billion.
– Diagnosis: The task force responsible for investigating the cause of the blackout concluded that a software failure at FirstEnergy Corp. "may have contributed significantly" to the outage.
– FirstEnergy's Alarm and Event Processing Routine (AEPR), a key software program that gives operators visual and audible indications of events occurring on their portion of the grid, began to malfunction. As a result, "key personnel may not have been aware of the need to take preventive measures at critical times".
– Internet links to Supervisory Control and Data Acquisition (SCADA) software weren't properly secure and some operators lacked a system to view the status of electric systems outside their immediate control.
Slide 15/36
Costly Software Bugs in History
• Medical Devices: Ongoing
– It has been shown possible to reprogram a heart defibrillator and pacemaker to shut down, and to make it deliver jolts of electricity that could be fatal. (Demonstrated on a device called Maximo from Medtronic, the #1 company in the industry.)
– It is also possible to glean personal patient data by eavesdropping on signals from the tiny wireless radio embedded in the implant, which is included as a way to let doctors monitor and adjust the device without surgery.
– 1983-1997: There were 2,792 quality problems that resulted in recalls of medical devices, 383 of which (14%) were related to computer software, according to a 2001 study analyzing FDA reports of medical devices voluntarily recalled by manufacturers.
– 2014: The FDA issued 86 recalls categorized as ones where there is "reasonable probability that use of these products will cause serious adverse health consequences or death." At least 12 of the recalls were likely caused by software defects.
Slide 16/36
Dependable Computing Terminology
DEPENDABILITY
• Attributes: availability, reliability, safety, confidentiality, integrity, maintainability
• Means: fault prevention, fault tolerance, fault removal, fault forecasting
• Impairments: faults, errors, failures

FAULT TOLERANCE: prediction, detection, containment, diagnosis, recovery
Slide 17/36
Fault Classes
Based on the temporal persistence
• Permanent faults, whose presence is continuous and stable.
• Intermittent faults, whose presence is only occasional due to unstable hardware or varying hardware and software states (e.g., as a function of load or activity).
• Transient faults, resulting from temporary environmental conditions.
Based on the origin
• Physical faults, stemming from physical phenomena internal to the system, such as threshold change, shorts, opens, etc., or from external changes, such as environmental, electromagnetic, vibration, etc.
• Human-made faults, which may be either design faults, introduced during system design, modification, or establishment of operating procedures, or interaction faults, which are violations of operating or maintenance procedures.
Slide 18/36
Fault Cycle & Dependability Measures
Reliability: a measure of the continuous delivery of service. R(t) is the probability that the system survives (does not fail) throughout [0, t]. Expected value: MTTF (Mean Time To Failure).

Availability: a measure of service delivery with respect to the alternation of delivery and interruptions. A(t) is the probability that the system delivers a proper (conforming to specification) service at a given time t. Expected (steady-state) value: EA = MTTF / (MTTF + MTTR).

Maintainability: a measure of the service interruption. M(t) is the probability that the system will be repaired within a time less than t. Expected value: MTTR (Mean Time To Repair).

Safety: a measure of the time to catastrophic failure. S(t) is the probability that no catastrophic failures occur during [0, t]. Expected value: MTTCF (Mean Time To Catastrophic Failure).
Slide 19/36
Ways to Increase System Availability
• To increase availability:
1. Increase MTTF
2. Decrease MTTR
3. Decrease DL (the error-detection latency)

Availability = MTTF / (MTTF + MTTR + DL)

With reference to the fault-cycle timeline (t0: system starts; t2: failure occurs; t3: repair begins; t4: repair completes; t6: next failure):
Mean Time To Failure (MTTF) = E[t2 − t0]
Mean Time To Repair (MTTR) = E[t4 − t3]
Mean Time Between Failures (MTBF) = E[t6 − t2]
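A worked example of the availability formula above, with hypothetical MTTF, MTTR, and DL values (in hours), showing how each of the three levers moves availability:

```python
def availability(mttf, mttr, dl):
    """Steady-state availability = MTTF / (MTTF + MTTR + DL),
    where DL is the error-detection latency before repair can begin."""
    return mttf / (mttf + mttr + dl)

# Hypothetical numbers (hours), purely for illustration.
base = availability(mttf=1000.0, mttr=4.0, dl=1.0)         # ~0.99502
better_mttf = availability(mttf=2000.0, mttr=4.0, dl=1.0)  # ~0.99751
better_mttr = availability(mttf=1000.0, mttr=1.0, dl=1.0)  # ~0.99800
print(f"{base:.5f} {better_mttf:.5f} {better_mttr:.5f}")
```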
Slide 20/36
Amazon Web Service (AWS): Case Study
• A set of services built with reliability and security in mind
• Amazon Web Services (AWS) provides on-demand computing resources and services in the cloud, with pay-as-you-go pricing
– It is a form of cloud computing, falling within Infrastructure as a Service (IaaS)
– You can run a server on AWS that you can log on to, configure, secure, and run just as you would a server sitting in front of you
• AWS products are available in different regions
– Each region contains multiple distinct locations called Availability Zones
– Each Availability Zone is engineered to be isolated from failures in other Availability Zones, and to provide low-latency network connectivity to other zones in the same region
Slide 21/36
Amazon Web Service: Case Study
• Amazon Machine Images (AMIs): Commonly used machine images from which the user can choose an execution platform; spare instances can be kept running
• Amazon Elastic Block Store (Amazon EBS): Block-level storage volumes for AMIs
– Durability of EBS is higher than a typical hard drive because data is stored redundantly; the annual failure rate for an EBS volume is 0.1 to 0.5%, compared to 4% for a regular hard drive
– EBS provides a snapshot feature: a backup of the volume taken at a specific instant in time. Snapshots are stored in Amazon S3 to ensure high durability (see the sketch below)
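For illustration, taking an EBS snapshot is a single API call. A sketch using the boto3 SDK; the region and volume ID are placeholders:

```python
import boto3

# Snapshot an EBS volume; AWS persists the snapshot in S3, which is
# what gives it higher durability than the volume itself.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",      # placeholder volume ID
    Description="nightly backup before patch",
)
print(response["SnapshotId"], response["State"])
```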
Slide 22/36
Amazon Web Service: Case Study
• Autoscaling and Elastic Load Balancing: Allow EC2 capacity to go up or down as needed by load
– Example: When the number of running server instances falls below a threshold, launch new server instances
– Example: Monitor resource utilization of server instances using the CloudWatch service; if utilization is too high, launch new server instances (see the sketch below)
– Example: Distribute incoming traffic across EC2 instances, for load balancing or to route around failed instances
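A sketch of the CloudWatch-driven example above using boto3; the Auto Scaling group name, threshold, and evaluation periods are invented for illustration:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Scaling policy: add one instance each time the alarm fires.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-tier-asg",   # placeholder group name
    PolicyName="scale-out-on-high-cpu",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=300,
)

# CloudWatch alarm: trigger the policy when average CPU > 70%
# over two consecutive 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="web-tier-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-tier-asg"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```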
Slide 23/36
Overview of Research in our Group: Dependable Computing Systems Lab
Slide 24/36
Thrust #1: Distributed Intrusion Tolerant System
• Distributed systems are subjected to malicious attacks on services, including hitherto unknown attacks
• Objective is to tolerate intrusions, not just detect them
• Different phases: detection, diagnosis, containment, response
• Solution approach:
– Attack-graph-based modeling of multi-stage attacks (a toy sketch follows below)
– Algorithms for containment and survivability computation, and semi-optimal response selection
– Semi-optimal selection and configuration of intrusion detection sensors for gaining visibility into the security state of the system
– Dealing with zero-day attacks
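A toy illustration of the attack-graph modeling, not the group's actual algorithms: services are nodes, an edge means a compromise of one service can be escalated into another, and containment amounts to cutting the attacker off from a critical service. The graph mirrors the S1-S4 diagram on the next slide.

```python
import networkx as nx

# Toy attack graph: a directed edge u -> v means a compromise of u
# can be escalated into a compromise of v.
G = nx.DiGraph()
G.add_edges_from([
    ("hacker", "S1"), ("S1", "S2"),
    ("S1", "S3"), ("S2", "S4"), ("S3", "S4"),
])

# Diagnosis: everything reachable from the detected entry point is at risk.
at_risk = nx.descendants(G, "hacker")
print("services at risk:", sorted(at_risk))

# Containment/response: a minimum set of edges (e.g., firewall rules or
# disabled interfaces) whose removal cuts the hacker off from S4.
cut = nx.minimum_edge_cut(G, "hacker", "S4")
print("edges to sever:", cut)
```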
Slide 25/36
Thrust #1: Distributed Intrusion Tolerant System
[Diagram: a hacker mounting a multi-stage attack across services S1, S2, S3, and S4]
Slide 26/36
Thrust #1: Distributed Intrusion Tolerant System
Slide 27/36
Thrust #2: Debugging Large-Scale Distributed Applications
• Goal: Provide highly available applications (e.g., web services) in a distributed environment
– Perform failure prediction
– Perform bug localization
• Challenges in today's distributed systems
– Large number of entities and large amounts of data
– Interactions between entities cause error propagation
– High-throughput or low-latency requirements for the application
Slide 28/36
Thrust #2: Predictive Reliability Engine for Cellular Networks
[Architecture diagram: eNodeB, MME, S/P-Gateway, and IMS core connected to the ISP backbone; an online predictor runs in a P-GW VM at an extreme cloud site; an ML classifier is trained offline on offline data (GPEH); online data and handoff thresholds feed the online predictor; call pause.]
Slide 29/36
Thrust #2: Metric-based Bug Localization
How can we use these metrics to localize the root cause of problems? Metrics are available at every layer of the stack (collected here via Tivoli Operations Manager):
• Application: request rates, transactions, DB reads/writes, etc.
• Middleware: virtual machine and container statistics
• Operating system: CPU, memory, I/O, network statistics
• Hardware: CPU performance counters
Slide 30/36
Thrust #2: Metric-based Bug Localization
• Look for abnormal time patterns
• Pinpoint code regions that are correlated with these abnormal patterns (a sketch follows below)
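A sketch of the correlation step, assuming an abnormal window of a metric has already been flagged and per-code-region activity counts are available over the same window; all data here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: latency spikes in the second half of the window.
latency = np.concatenate([rng.normal(10, 1, 50), rng.normal(25, 2, 50)])

# Per-code-region activity over the same 100 intervals; region_b's
# activity tracks the spike, so it should rank first.
regions = {
    "region_a": rng.poisson(5, 100).astype(float),
    "region_b": np.concatenate([rng.poisson(5, 50),
                                rng.poisson(20, 50)]).astype(float),
    "region_c": rng.poisson(5, 100).astype(float),
}

# Rank regions by |Pearson correlation| with the abnormal metric.
ranked = sorted(
    ((abs(np.corrcoef(latency, act)[0, 1]), name)
     for name, act in regions.items()),
    reverse=True,
)
for score, name in ranked:
    print(f"{name}: {score:.2f}")
```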
Slide 31/36
Thrust #2: Debugging at Extreme Scales
• Problem statement:
– Applications are run at large scales (large numbers of processes, as in Hadoop clusters) and on large amounts of data (as in computational genomics applications)
– Correctness or performance problems do not show up at small scales, but do show up at large scales
– Current debugging techniques cannot handle such problems
[Two plots of "# of times loop executes", each showing training runs and production runs. Plot 1 (x-axis: run #): Is there a bug in one of the production runs? Plot 2 (x-axis: scale): Accounting for scale makes trends clear, errors at large scales obvious.]
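A sketch of the scale-aware idea in the second plot: fit a simple model of the loop count as a function of scale from small training runs, then flag production runs that deviate from the extrapolation. The linear model and the 20% tolerance are illustrative assumptions, not the group's published technique:

```python
import numpy as np

# Training runs at small scales: (scale, observed loop count).
train_scale = np.array([2, 4, 8, 16, 32], dtype=float)
train_count = np.array([210, 410, 820, 1610, 3230], dtype=float)

# Fit loop_count ~ a * scale + b on the training runs.
a, b = np.polyfit(train_scale, train_count, deg=1)

def is_suspicious(scale, count, tolerance=0.2):
    """Flag a production run whose count strays >20% from the model."""
    predicted = a * scale + b
    return abs(count - predicted) > tolerance * predicted

print(is_suspicious(1024, 102500))   # close to the extrapolation: False
print(is_suspicious(1024, 410000))   # 4x the expected count: True
```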
Slide 32/36
Thrust #3: Dependability in the Very Small
Slide 33/36
Thrust #3: Dependability in the Very Small
• Embedded wireless networks have fundamental resource constraints and are often deployed in hostile or uncertain environments
• Constraints include: energy, bandwidth, untrusted nodes, disconnected networks
• Goal:
– Develop middleware that provides a robust platform, keeping environmental constraints in mind
– Provide detection, diagnosis, and isolation functionality
• Solution directions:
– Software for tracing events on the wireless node [startup company: SensorHound]
– Replaying traced events on a lab server (a toy sketch follows below)
– Debugging support from the replayed trace
– Fastest reprogramming protocol to upload a patch to the network while nodes are deployed in the field
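A toy sketch of the trace-and-replay idea, unrelated to SensorHound's actual implementation: the node records events into a compact trace, and a lab server re-delivers them in order to reproduce the execution for debugging:

```python
# Record phase (on the node): append each event to a compact trace.
def record_event(trace, name, payload):
    trace.append({"seq": len(trace), "name": name, "payload": payload})

# Replay phase (on a lab server): re-deliver events in recorded order
# to handlers, reproducing the node's execution deterministically.
def replay(trace, handlers):
    for ev in sorted(trace, key=lambda e: e["seq"]):
        handlers[ev["name"]](ev["payload"])

trace = []
record_event(trace, "radio_rx", {"src": 7, "rssi": -71})
record_event(trace, "timer_fire", {"id": 3})

log = []
replay(trace, {
    "radio_rx": lambda p: log.append(("rx", p["src"])),
    "timer_fire": lambda p: log.append(("timer", p["id"])),
})
print(log)   # [('rx', 7), ('timer', 3)]
```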
Slide 34/36
Thrust #3: Secure Communication with Smart Electric Meters
• Smart electric meters are connected as a wireless mesh network
• Push-pull functionality:
– They need to be reprogrammed from time to time to push out new patches or functionality
– They need to be queried for meter readings
– Both need to be done with high reliability and low use of wireless bandwidth
– Cryptography is used to secure communication to and from the meters (see the sketch below)
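A sketch of the last bullet using authenticated encryption (AES-GCM via the Python cryptography package), so that a reading can neither be read nor tampered with in transit; the key handling is deliberately simplified, and the meter ID and payload are made up:

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Pre-shared 128-bit key between the utility head-end and one meter.
# (Real deployments need per-meter keys and a rotation scheme.)
key = AESGCM.generate_key(bit_length=128)
aead = AESGCM(key)

# Meter -> head-end: encrypt-and-authenticate a reading. The meter ID
# travels in the clear as "associated data" but is tamper-evident.
nonce = os.urandom(12)                  # never reuse a nonce per key
meter_id = b"meter-0042"                # placeholder ID
reading = b"kwh=1532.7,ts=1700000000"
ciphertext = aead.encrypt(nonce, reading, meter_id)

# Head-end: decryption fails loudly if anything was modified.
assert aead.decrypt(nonce, ciphertext, meter_id) == reading
```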
Slide 35/36
How Can You Collaborate with Us?
• Take part in an existing project
– Present design ideas
– Present new application areas
– Participate in system implementation
– Participate in evaluation
– Example: Our Middleware '14 paper had a co-author from IBM Research, India
• Suggest a new project idea
– Get involved from the ground up
– Remote collaboration is part of our DNA
– Scope for joint funding exists: NSF (US), IUSSTF (US-India)
– Example: Visiting scholar from the University of Naples, Italy
Slide 36/36
Take-Aways
• Dependable computing includes reliability and security
– Hardware, software, and co-design
– Variety of fault models, system requirements, and constraints
• Achieving dependability involves five steps: prediction, detection, diagnosis, containment, recovery
• Research in our group, DCSL, involves
– Building real systems
– Typically at the middleware and application levels
– Deploying solutions in real-world settings
– Academic impact through continuous publication at top conferences
• We are always looking out for collaboration opportunities with the bright and the driven
Slide 37/36
Presentation available on the research group web page: engineering.purdue.edu/dcsl
Slide 38/36
Backup Slides
Slide 39/36
Effect of major network outages on large business customers (October 2000 data)
• A survey by Meta Group Inc. of 21 industrial sectors in 2000 found the mean loss of revenue due to computer system downtime was $1.01M/hour