Slide 1/36
Dependability in a Connected World: From the Very Large to the Very Small
Saurabh Bagchi
School of Electrical and Computer Engineering
Department of Computer Science
Purdue University
Presentation available at: engineering.purdue.edu/dcsl
Slide 2/36
Greetings come to you from …
Slide 3/36
A Few Words about Me
• B. Tech. (1996): IIT Kharagpur, CSE
• PhD (2000): University of Illinois at Urbana-Champaign (CS)
• IBM Research, T. J. Watson (2001)
• Joined Purdue as tenure-track Assistant Professor (2002)
• Promoted to Associate Professor (2007)
• Promoted to Professor (2012)
• Visiting Scientist (2011-current): IBM Research
• Current research group: 12 graduate students, 1 post-doctoral Research Associate, 4 undergraduate students
• 12 graduated PhD students
Slide 4/36
A long time ago … in a galaxy far, far away
Slide 5/36
A few words about Purdue University
• One of the largest graduate schools in engineering
– 362 faculty, 10,000 students, US News rank: 8th
• 40,000 students at its main campus in West Lafayette
• Electrical and Computer Engineering @ Purdue
– 85 faculty, 650 graduate students, 900 undergraduate students
– One of the largest producers of Ph.D.s in Electrical and Computer Engineering (about 60 Ph.D.s a year)
– 3rd largest ECE department in the US
– US News rank: 9th (Computer Engineering), 10th (Electrical/Electronics Engineering)
• Computer Science @ Purdue
– 50 faculty, 245 graduate students
– US News rank: 20th
Slide 6/36
What is Dependable Computing?
• Dependability: Property that the computer system meets its specification despite the presence of faults
– Faults can be due to natural causes (bugs, defects in hardware), or
– Maliciously induced (attacks from external or internal sources)

Fault → Error → Failure
(Some faults are not activated; some errors are masked.)

• Terminology
– Failure: Deviation of the delivered service from compliance with the specification
– Error: Part of the system state that has been damaged by the fault and, if uncorrected, can lead to a failure
– Fault: The adjudged or hypothesized cause of an error
Slide 7/36
Some Domains for Dependable Computing
• Traditional needs
– Long-life applications (e.g., unmanned and manned space missions)
– Life-critical, short-term applications (e.g., aircraft engine control, fly-by-wire)
– Defense applications (e.g., missile defense, guidance & control)
– Nuclear industry
• Newer critical-computation applications
– Health industry
– Automotive industry
– Industrial control systems, production lines
– Banking, reservations, switching, commerce
Slide 8/36
Some Domains for Dependable Computing
• Networks
– Wired and wireless networked applications
– Data mining
– Information processing on the Internet
– Distributed, networked systems
• Scientific computing
– Very large numbers of cores
– Petaflop machines being used for reliable computation
Slide 9/36
Two Facets of Dependable Computing
• Mercedes version
– High hardware or software development costs
– High power consumption
– High space overhead
– Example: Boeing 777 fly-by-wire (FBW) system, which used triple modular redundancy for all hardware resources
– Example: AT&T's ESS telecommunication switch, which had a requirement of downtime < 2 minutes/year
• Commodity systems
– Cannot have very high development costs
– Cannot significantly impede system performance
– Cannot resort to high degrees of redundancy
– Example: ECC, parity, RAID (a simple ECC sketch follows below)
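To make the commodity example concrete: a minimal sketch of single-error-correcting ECC, in the form of a Hamming(7,4) code that locates and corrects any single flipped bit in a 7-bit codeword. This is illustrative only and not tied to any particular hardware implementation.

```python
# Hamming(7,4): encode 4 data bits so any single-bit error can be
# located and corrected. Bit positions (1-indexed): p1 p2 d1 p3 d2 d3 d4.

def hamming74_encode(d):
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 0 means no error detected
    if syndrome:
        c[syndrome - 1] ^= 1          # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]]   # extract d1..d4

codeword = hamming74_encode([1, 0, 1, 1])
codeword[5] ^= 1                       # inject a transient single-bit fault
assert hamming74_correct(codeword) == [1, 0, 1, 1]
```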
Slide 10/36
System Design for Dependable Computing
Hardware: system network, processing elements, memory, storage system
Operating system
Reliable communications
SIFT (software-implemented fault tolerance)
Application program interface (API)
Middleware
Applications
What can be provided in COTS hardware to ensure fail-silent behavior of system components (nodes, network)?
What can be provided in the communication layer?
What can be provided in software and application itself?
What is typically provided in the operating system?
How to combine hardware and software fault tolerance techniques: (1) fast error detection in hardware, (2) high-efficiency detection and recovery in software?

How to assess whether the achieved availability meets system requirements?
Slide 11/36
How do We Achieve the Objectives?
• Hardware (processing elements, memory, storage system, system network): error correcting codes, N_of_M and standby redundancy, voting, watchdog timers, reliable storage (RAID, mirrored disks)
• Reliable communications: CRC on messages, acknowledgments, watchdogs, heartbeats, consistency protocols (see the heartbeat sketch below)
• Operating system: memory management, detection of process failures, hooks to support software fault tolerance for applications
• Middleware and applications: checkpointing and rollback, application replication, software voting (fault masking), process pairs, robust data structures, recovery blocks, N-version programming
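Of the communication-layer techniques above, heartbeats are the simplest to sketch. A minimal, illustrative timeout-based heartbeat monitor; the node names, period, and timeout values are invented for this example:

```python
import time

# Timeout-based heartbeat failure detector: each monitored node is
# expected to report in every HEARTBEAT_PERIOD seconds; a node that
# stays silent past the timeout is suspected failed.
HEARTBEAT_PERIOD = 1.0                # illustrative values only
SUSPECT_TIMEOUT = 3 * HEARTBEAT_PERIOD

last_seen = {}                        # node id -> time of last heartbeat

def on_heartbeat(node_id):
    last_seen[node_id] = time.monotonic()

def suspected_nodes():
    now = time.monotonic()
    return [n for n, t in last_seen.items()
            if now - t > SUSPECT_TIMEOUT]

on_heartbeat("node-1")
on_heartbeat("node-2")
time.sleep(0.1)                       # a real monitor polls in a loop
print(suspected_nodes())              # [] while both nodes are fresh
```

In an asynchronous system a timeout yields only suspicion, never certainty, which is one reason consistency protocols appear alongside heartbeats in the list above.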
Slide 12/36
A Trip Back in History
• September 9, 1947
• Mark II computer at Harvard
• Popularly attributed to Grace Hopper
• Actually Bill Burke found the moth
• Diary with the taped moth can be seen at the Smithsonian National Museum of American History
Slide 13/36
Costly Software Bugs in History
• Facebook IPO: May 2012
– NASDAQ earmarked $62 million for reimbursing investors and says it expects to incur significant costs beyond that for system upgrades and legal battles. WSJ pegged the total loss to investors at $500M.
– Investors were unsure how much of Facebook they’d bought. There were 12 million postponed share orders suddenly filled between 1:49 p.m. and 1:51 p.m. without being properly marked ‘late sale’, which exaggerated the impression that people were trying to dump Facebook shares.
– Diagnosis: Computer systems used to establish the opening price were overwhelmed by order cancellations and updates during the "biggest IPO cross in the history of mankind," Nasdaq Chief Executive Officer Robert Greifeld said Sunday. Nasdaq's systems fell into a "loop" that prevented it from opening the shares on schedule
Slide 14/36
Costly Software Bugs in History
• US Power Grid Blackout: August 14, 2003
– 50 million people lost power for up to two days in the biggest blackout in North American history. The event contributed to at least 11 deaths and cost an estimated $6 billion.
– Diagnosis: The task force responsible for investigating the cause of the blackout concluded that a software failure at FirstEnergy Corp. "may have contributed significantly" to the outage.
– FirstEnergy's Alarm and Event Processing Routine (AEPR), a key software program that gives operators visual and audible indications of events occurring on their portion of the grid, began to malfunction. As a result, "key personnel may not have been aware of the need to take preventive measures at critical times".
– Internet links to Supervisory Control and Data Acquisition (SCADA) software weren't properly secure and some operators lacked a system to view the status of electric systems outside their immediate control.
Slide 15/36
Costly Software Bugs in History
• Medical Devices: Ongoing
– It has been shown possible to reprogram a heart defibrillator and pacemaker to shut down, and to make it deliver jolts of electricity that could be fatal. (Demonstrated on a device called Maximo from Medtronic, the #1 company in the industry.)
– It is also possible to glean personal patient data by eavesdropping on signals from the tiny wireless radio embedded in the implant, which is included as a way to let doctors monitor and adjust the device without surgery.
– 1983-1997: There were 2,792 quality problems that resulted in recalls of medical devices, 383 of which (14%) were related to computer software, according to a 2001 study analyzing FDA reports of medical devices voluntarily recalled by manufacturers.
– 2014: The FDA issued 86 recalls categorized as ones where there is "reasonable probability that use of these products will cause serious adverse health consequences or death." At least 12 of the recalls were likely caused by software defects.
Slide 16/36
Dependable Computing Terminology
DEPENDABILITY
• Attributes: availability, reliability, safety, confidentiality, integrity, maintainability
• Means: fault prevention, fault tolerance, fault removal, fault forecasting
• Impairments: faults, errors, failures

FAULT TOLERANCE: prediction, detection, containment, diagnosis, recovery
Slide 17/36
Fault Classes
Based on the temporal persistence
• Permanent faults, whose presence is continuous and stable.
• Intermittent faults, whose presence is only occasional due to unstable hardware or varying hardware and software states (e.g., as a function of load or activity).
• Transient faults, resulting from temporary environmental conditions.
Based on the origin
• Physical faults, stemming from physical phenomena internal to the system, such as threshold change, shorts, opens, etc., or from external changes, such as environmental, electromagnetic, vibration, etc.
• Human-made faults, which may be either design faults, introduced during system design, modification, or establishment of operating procedures, or interaction faults, which are violations of operating or maintenance procedures.
Slide 18/36
Fault Cycle & Dependability Measures
Reliability: a measure of the continuous delivery of service. R(t) is the probability that the system survives (does not fail) throughout [0, t]. Expected value: MTTF (Mean Time To Failure).

Availability: a measure of service delivery with respect to the alternation of delivery and interruptions. A(t) is the probability that the system delivers a proper (conforming to specification) service at a given time t. Expected (steady-state) value: EA = MTTF / (MTTF + MTTR).

Maintainability: a measure of the service interruption. M(t) is the probability that the system will be repaired within a time less than t. Expected value: MTTR (Mean Time To Repair).

Safety: a measure of the time to catastrophic failure. S(t) is the probability that no catastrophic failures occur during [0, t]. Expected value: MTTCF (Mean Time To Catastrophic Failure).
Slide 19/36
Ways to Increase System Availability
• To increase availability:
1. Increase MTTF
2. Decrease MTTR
3. Decrease DL (the error-detection latency)

Availability = MTTF / (MTTF + MTTR + DL)

With reference to the fault-cycle timeline (t0: system starts; t2: failure occurs; t3: repair begins; t4: repair completes; t6: next failure):
Mean Time To Failure (MTTF) = E[t2 − t0]
Mean Time To Repair (MTTR) = E[t4 − t3]
Mean Time Between Failures (MTBF) = E[t6 − t2]
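A worked example of the availability formula above, with hypothetical MTTF, MTTR, and DL values (in hours), showing how each of the three levers moves availability:

```python
def availability(mttf, mttr, dl):
    """Steady-state availability = MTTF / (MTTF + MTTR + DL),
    where DL is the error-detection latency before repair can begin."""
    return mttf / (mttf + mttr + dl)

# Hypothetical numbers (hours), purely for illustration.
base = availability(mttf=1000.0, mttr=4.0, dl=1.0)         # ~0.99502
better_mttf = availability(mttf=2000.0, mttr=4.0, dl=1.0)  # ~0.99751
better_mttr = availability(mttf=1000.0, mttr=1.0, dl=1.0)  # ~0.99800
print(f"{base:.5f} {better_mttf:.5f} {better_mttr:.5f}")
```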
Slide 20/36
Amazon Web Service (AWS): Case Study
• A set of services built with reliability and security in mind
• Amazon Web Services (AWS) provides on-demand computing resources and services in the cloud, with pay-as-you-go pricing
– It is a form of cloud computing, falling within Infrastructure as a Service (IaaS)
– You can run a server on AWS that you can log on to, configure, secure, and run just as you would a server sitting in front of you
• AWS products are available in different regions
– Each region contains multiple distinct locations called Availability Zones
– Each Availability Zone is engineered to be isolated from failures in other Availability Zones, and to provide low-latency network connectivity to other zones in the same region
Slide 21/36
Amazon Web Service: Case Study
• Amazon Machine Images (AMIs): Commonly used machine images from which the user can choose an execution platform; spare instances can be kept running
• Amazon Elastic Block Store (Amazon EBS): Block-level storage volumes for AMIs
– Durability of EBS is higher than a typical hard drive because data is stored redundantly; the annual failure rate for an EBS volume is 0.1 to 0.5%, compared to 4% for a regular hard drive
– EBS provides a snapshot feature: a backup of the volume taken at a specific instant in time. Snapshots are stored in Amazon S3 to ensure high durability (see the sketch below)
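For illustration, taking an EBS snapshot is a single API call. A sketch using the boto3 SDK; the region and volume ID are placeholders:

```python
import boto3

# Snapshot an EBS volume; AWS persists the snapshot in S3, which is
# what gives it higher durability than the volume itself.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",      # placeholder volume ID
    Description="nightly backup before patch",
)
print(response["SnapshotId"], response["State"])
```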
Slide 22/36
Amazon Web Service: Case Study
• Autoscaling and Elastic Load Balancing: Allow EC2 capacity to go up or down as needed by load
– Example: When the number of running server instances falls below a threshold, launch new server instances
– Example: Monitor resource utilization of server instances using the CloudWatch service; if utilization is too high, launch new server instances (see the sketch below)
– Example: Distribute incoming traffic across EC2 instances, for load balancing or to route around failed instances
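A sketch of the CloudWatch-driven example above using boto3; the Auto Scaling group name, threshold, and evaluation periods are invented for illustration:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Scaling policy: add one instance each time the alarm fires.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-tier-asg",   # placeholder group name
    PolicyName="scale-out-on-high-cpu",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=300,
)

# CloudWatch alarm: trigger the policy when average CPU > 70%
# over two consecutive 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="web-tier-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-tier-asg"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```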
Slide 23/36
Overview of Research in our Group: Dependable Computing Systems Lab
Slide 24/36
Thrust #1: Distributed Intrusion Tolerant System
• Distributed systems are subjected to malicious attacks on services, including hitherto unknown attacks
• Objective is to tolerate intrusions, not just detect them
• Different phases: detection, diagnosis, containment, response
• Solution approach:
– Attack-graph-based modeling of multi-stage attacks (a toy sketch follows below)
– Algorithms for containment and survivability computation, and semi-optimal response selection
– Semi-optimal selection and configuration of intrusion detection sensors for gaining visibility into the security state of the system
– Dealing with zero-day attacks
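A toy illustration of the attack-graph modeling, not the group's actual algorithms: services are nodes, an edge means a compromise of one service can be escalated into another, and containment amounts to cutting the attacker off from a critical service. The graph mirrors the S1-S4 diagram on the next slide.

```python
import networkx as nx

# Toy attack graph: a directed edge u -> v means a compromise of u
# can be escalated into a compromise of v.
G = nx.DiGraph()
G.add_edges_from([
    ("hacker", "S1"), ("S1", "S2"),
    ("S1", "S3"), ("S2", "S4"), ("S3", "S4"),
])

# Diagnosis: everything reachable from the detected entry point is at risk.
at_risk = nx.descendants(G, "hacker")
print("services at risk:", sorted(at_risk))

# Containment/response: a minimum set of edges (e.g., firewall rules or
# disabled interfaces) whose removal cuts the hacker off from S4.
cut = nx.minimum_edge_cut(G, "hacker", "S4")
print("edges to sever:", cut)
```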
Slide 25/36
Thrust #1: Distributed Intrusion Tolerant System
[Diagram: a hacker mounting a multi-stage attack across services S1, S2, S3, and S4]
Slide 26/36
Thrust #1: Distributed Intrusion Tolerant System
Slide 27/36
Thrust #2: Debugging Large-Scale Distributed Applications
• Goal: Provide highly available applications (e.g., web services) in a distributed environment
– Perform failure prediction
– Perform bug localization
• Challenges in today's distributed systems
– Large number of entities and large amounts of data
– Interactions between entities cause error propagation
– High-throughput or low-latency requirements for the application
Slide 28/36
Thrust #2: Predictive Reliability Engine for Cellular Networks
[Architecture diagram: eNodeB, MME, S/P-Gateway, and IMS core connected to the ISP backbone; an online predictor runs in a P-GW VM at an extreme cloud site; an ML classifier is trained offline on offline data (GPEH); online data and handoff thresholds feed the online predictor; call pause.]
Slide 29/36
Thrust #2: Metric-based Bug Localization
How can we use these metrics to localize the root cause of problems? Metrics are available at every layer of the stack (collected here via Tivoli Operations Manager):
• Application: request rates, transactions, DB reads/writes, etc.
• Middleware: virtual machine and container statistics
• Operating system: CPU, memory, I/O, network statistics
• Hardware: CPU performance counters
Slide 30/36
Thrust #2: Metric-based Bug Localization
• Look for abnormal time patterns
• Pinpoint code regions that are correlated with these abnormal patterns (a sketch follows below)
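A sketch of the correlation step, assuming an abnormal window of a metric has already been flagged and per-code-region activity counts are available over the same window; all data here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: latency spikes in the second half of the window.
latency = np.concatenate([rng.normal(10, 1, 50), rng.normal(25, 2, 50)])

# Per-code-region activity over the same 100 intervals; region_b's
# activity tracks the spike, so it should rank first.
regions = {
    "region_a": rng.poisson(5, 100).astype(float),
    "region_b": np.concatenate([rng.poisson(5, 50),
                                rng.poisson(20, 50)]).astype(float),
    "region_c": rng.poisson(5, 100).astype(float),
}

# Rank regions by |Pearson correlation| with the abnormal metric.
ranked = sorted(
    ((abs(np.corrcoef(latency, act)[0, 1]), name)
     for name, act in regions.items()),
    reverse=True,
)
for score, name in ranked:
    print(f"{name}: {score:.2f}")
```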
Slide 31/36
Thrust #2: Debugging at Extreme Scales
• Problem statement:
– Applications are run at large scales (large numbers of processes, as in Hadoop clusters) and on large amounts of data (as in computational genomics applications)
– Correctness or performance problems do not show up at small scales, but do show up at large scales
– Current debugging techniques cannot handle such problems
[Two plots of "# of times loop executes", each showing training runs and production runs. Plot 1 (x-axis: run #): Is there a bug in one of the production runs? Plot 2 (x-axis: scale): Accounting for scale makes trends clear, errors at large scales obvious.]
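A sketch of the scale-aware idea in the second plot: fit a simple model of the loop count as a function of scale from small training runs, then flag production runs that deviate from the extrapolation. The linear model and the 20% tolerance are illustrative assumptions, not the group's published technique:

```python
import numpy as np

# Training runs at small scales: (scale, observed loop count).
train_scale = np.array([2, 4, 8, 16, 32], dtype=float)
train_count = np.array([210, 410, 820, 1610, 3230], dtype=float)

# Fit loop_count ~ a * scale + b on the training runs.
a, b = np.polyfit(train_scale, train_count, deg=1)

def is_suspicious(scale, count, tolerance=0.2):
    """Flag a production run whose count strays >20% from the model."""
    predicted = a * scale + b
    return abs(count - predicted) > tolerance * predicted

print(is_suspicious(1024, 102500))   # close to the extrapolation: False
print(is_suspicious(1024, 410000))   # 4x the expected count: True
```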
Slide 32/36
Thrust #3: Dependability in the Very Small
Slide 33/36
Thrust #3: Dependability in the Very Small
• Embedded wireless networks have fundamental resource constraints and are often deployed in hostile or uncertain environments
• Constraints include: energy, bandwidth, untrusted nodes, disconnected networks
• Goal:
– Develop middleware that provides a robust platform, keeping environmental constraints in mind
– Provide detection, diagnosis, and isolation functionality
• Solution directions:
– Software for tracing events on the wireless node [startup company: SensorHound]
– Replaying traced events on a lab server (a toy sketch follows below)
– Debugging support from the replayed trace
– Fastest reprogramming protocol to upload a patch to the network while nodes are deployed in the field
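A toy sketch of the trace-and-replay idea, unrelated to SensorHound's actual implementation: the node records events into a compact trace, and a lab server re-delivers them in order to reproduce the execution for debugging:

```python
# Record phase (on the node): append each event to a compact trace.
def record_event(trace, name, payload):
    trace.append({"seq": len(trace), "name": name, "payload": payload})

# Replay phase (on a lab server): re-deliver events in recorded order
# to handlers, reproducing the node's execution deterministically.
def replay(trace, handlers):
    for ev in sorted(trace, key=lambda e: e["seq"]):
        handlers[ev["name"]](ev["payload"])

trace = []
record_event(trace, "radio_rx", {"src": 7, "rssi": -71})
record_event(trace, "timer_fire", {"id": 3})

log = []
replay(trace, {
    "radio_rx": lambda p: log.append(("rx", p["src"])),
    "timer_fire": lambda p: log.append(("timer", p["id"])),
})
print(log)   # [('rx', 7), ('timer', 3)]
```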
Slide 34/36
Thrust #3: Secure Communication with Smart Electric Meters
• Smart electric meters are connected as a wireless mesh network
• Push-pull functionality:
– They need to be reprogrammed from time to time to push out new patches or functionality
– They need to be queried for meter readings
– Both need to be done with high reliability and low use of wireless bandwidth
– Cryptography is used to secure communication to and from the meters (see the sketch below)
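A sketch of the last bullet using authenticated encryption (AES-GCM via the Python cryptography package), so that a reading can neither be read nor tampered with in transit; the key handling is deliberately simplified, and the meter ID and payload are made up:

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Pre-shared 128-bit key between the utility head-end and one meter.
# (Real deployments need per-meter keys and a rotation scheme.)
key = AESGCM.generate_key(bit_length=128)
aead = AESGCM(key)

# Meter -> head-end: encrypt-and-authenticate a reading. The meter ID
# travels in the clear as "associated data" but is tamper-evident.
nonce = os.urandom(12)                  # never reuse a nonce per key
meter_id = b"meter-0042"                # placeholder ID
reading = b"kwh=1532.7,ts=1700000000"
ciphertext = aead.encrypt(nonce, reading, meter_id)

# Head-end: decryption fails loudly if anything was modified.
assert aead.decrypt(nonce, ciphertext, meter_id) == reading
```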
Slide 35/36
How Can You Collaborate with Us?
• Take part in an existing project
– Present design ideas
– Present new application areas
– Participate in system implementation
– Participate in evaluation
– Example: Our Middleware '14 paper had a co-author from IBM Research, India
• Suggest a new project idea
– Get involved from the ground up
– Remote collaboration is part of our DNA
– Scope for joint funding exists: NSF (US), IUSSTF (US-India)
– Example: Visiting scholar from the University of Naples, Italy
Slide 36/36
Take-Aways
• Dependable computing includes reliability and security
– Hardware, software, and co-design
– Variety of fault models, system requirements, and constraints
• Achieving dependability involves five steps: prediction, detection, diagnosis, containment, recovery
• Research in our group, DCSL, involves
– Building real systems
– Typically at the middleware and application levels
– Deploying solutions in real-world settings
– Academic impact through continuous publication at top conferences
• We are always looking out for collaboration opportunities with the bright and the driven
Slide 37/36
Presentation available on the research group web page: engineering.purdue.edu/dcsl
Slide 38/36
Backup Slides
Slide 39/36
Effect of major network outages on large business customers (October 2000 data)
• A survey by Meta Group Inc. of 21 industrial sectors in 2000 found the mean loss of revenue due to computer system downtime was $1.01M/hour