Operating 24x7
Amin Vahdat on behalf of John Jannotti, Jeff Mogul, Larry
Peterson, Joe Touch, Paulo Verissimo, Werner Vogels, Bill Weihl
24x7 Availability: Goals
Holistic approach
• Not just individual computers, but services
• Need to consider operators, etc.
Sustainability (24x7 for how long)
Need to handle a variety of failure models
• Understand what is and what is not correlated
• Real time, noisy, chaotic environment
24x7 Availability: Goals
Self-configuration
Evolvability
Managing the availability/consistency tradeoff
• We live in a probabilistic world
• Monitoring needs built in from the ground up
Predict and quantify cost of delivering certain levels of availability
• Including management, auditing, etc.
• With infinite cost, operating 24x7 is easy
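To make "certain levels of availability" concrete, the sketch below converts an availability target into a yearly downtime budget. The function name and the list of targets are illustrative assumptions, not taken from the slides; the arithmetic is simply (1 - availability) times the length of the period.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # non-leap year, for illustration


def downtime_budget_seconds(availability: float,
                            period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Allowed downtime (seconds) for a target availability over a period."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be in [0, 1]")
    return (1.0 - availability) * period_seconds


if __name__ == "__main__":
    # Hypothetical targets: "two nines" through "five nines".
    for target in (0.99, 0.999, 0.9999, 0.99999):
        minutes = downtime_budget_seconds(target) / 60
        print(f"{target:.5f} -> {minutes:9.1f} minutes of downtime per year")
```

Each extra nine cuts the budget by a factor of ten, which is one way to see why the cost of a given availability level needs to be predicted and quantified.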
New Models Fault Tolerant Software
BFT is insufficient because of assumption of independence
Multi-version programming is insufficient
• e.g., working from the same bad spec
100k nodes running more or less the same thing
• Extremely tolerant of hardware faults
• But if traffic causes software to fail: Bohr bug
• No spare capacity in current power grid
• Interference is another problem in power grid
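The point about independence can be illustrated with a small calculation. The sketch below compares the chance of losing more than f of n replicas under independent faults with the same system when a common-mode fault (for example, every replica running the same buggy code) can take all replicas down at once. The model and the parameter names n, f, p, and q are assumptions chosen for illustration, not something stated in the slides.

```python
from math import comb


def p_more_than_f_fail(n: int, f: int, p: float) -> float:
    """Probability that more than f of n replicas fail, assuming each
    replica fails independently with probability p (binomial tail)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(f + 1, n + 1))


def p_service_fails(n: int, f: int, p: float, q: float) -> float:
    """Same system, plus an assumed common-mode fault with probability q
    that takes down every replica at once."""
    return q + (1 - q) * p_more_than_f_fail(n, f, p)


if __name__ == "__main__":
    n, f, p = 4, 1, 0.01  # illustrative values only
    for q in (0.0, 0.001, 0.01):
        print(f"q={q}: P(service fails) = {p_service_fails(n, f, p, q):.6f}")
```

With these illustrative numbers, independent faults alone give a failure probability of about 0.0006, so even a common-mode probability of 0.001 dominates the result.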
Dealing with Attacks
Techniques to divert the traffic (/dev/null it)
Isolate the attack traffic toward sacrificial machines
Distinguish attack from non-attack
Legal and financial models are the primary technique for fighting attacks
Distinguishing humans versus bots
Contracts distinguish between internal failures and acts of God/war
Living with Failure
Services must behave within expectations even when individual components fail
Graceful degradation
Probabilistic reasoning, statistical models
• Statistical guarantees given failure models
Must express assumptions about system behavior
• Expressing assumptions can be very difficult
• Mapping high-level system behavior to failure scenarios
MTTR just as important as MTTF
Tail (99.9%) of response curve must be within bounds
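A short sketch of both points, using the standard steady-state formula availability = MTTF / (MTTF + MTTR) and a simple check that a target fraction of responses falls within a latency bound. The function names, the 100 ms bound, and the sample data are hypothetical.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time to failure and to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)


def fraction_within(latencies_ms, bound_ms: float) -> float:
    """Fraction of observed responses that completed within the bound."""
    if not latencies_ms:
        raise ValueError("need at least one sample")
    return sum(1 for x in latencies_ms if x <= bound_ms) / len(latencies_ms)


def tail_within_bound(latencies_ms, bound_ms: float,
                      target: float = 0.999) -> bool:
    """True if at least `target` of responses met the bound."""
    return fraction_within(latencies_ms, bound_ms) >= target


if __name__ == "__main__":
    # Halving MTTR buys the same availability as doubling MTTF.
    print(availability(1000, 0.5), availability(2000, 1))
    # Hypothetical samples: 999 fast responses and one very slow one.
    samples = [10.0] * 999 + [5000.0]
    print(tail_within_bound(samples, bound_ms=100.0))
```

The first print shows why repair time deserves as much attention as failure rate: both calls describe the same ratio of downtime to uptime.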
Evolvability
Easier for centralized services, much more difficult in distributed environments
Before deploying the new version, must have the old version ready to redeploy (quickly) in its place
• What if a database schema update was required?
Special case answers in some scenarios
• Tunneling in networks
Huge amount of resources dedicated to test & development
• Regimented versus ad hoc environments
• Do you value reliability or innovation?
Sustainability
Operating 24x7 for how many weeks? Sustainability
• Economic incentives
• Decentralized control can lead to longer term system reliability
Internet partially succeeded because of decentralization
Decentralization may help with evolvability though it can cut both ways
Infrastructure Support
Virtualization
Exporting appropriate failure models
Fault injection
• Dependent/independent failures
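A minimal fault-injection sketch, assuming a toy model in which each node can fail independently with probability p_independent, and a shared cause (probability p_shared) takes down every node together. The function names and the outage rule (more than `tolerated` nodes down) are hypothetical; a seeded random generator keeps runs repeatable.

```python
import random


def inject_faults(num_nodes: int, p_independent: float, p_shared: float,
                  rng: random.Random) -> set:
    """Return the set of node ids that fail in one trial."""
    if rng.random() < p_shared:
        # Dependent failure: one shared cause takes down every node.
        return set(range(num_nodes))
    # Independent failures: each node fails on its own coin flip.
    return {n for n in range(num_nodes) if rng.random() < p_independent}


def estimate_outage_rate(num_nodes: int, tolerated: int,
                         p_independent: float, p_shared: float,
                         trials: int, seed: int = 0) -> float:
    """Fraction of trials in which more than `tolerated` nodes are down."""
    rng = random.Random(seed)
    outages = sum(
        len(inject_faults(num_nodes, p_independent, p_shared, rng)) > tolerated
        for _ in range(trials)
    )
    return outages / trials


if __name__ == "__main__":
    for p_shared in (0.0, 0.01):
        rate = estimate_outage_rate(num_nodes=10, tolerated=3,
                                    p_independent=0.05, p_shared=p_shared,
                                    trials=100_000)
        print(f"p_shared={p_shared}: estimated outage rate {rate:.5f}")
```

Running the same harness with and without the shared cause is one way to check whether a design's fault tolerance depends on the independence assumption.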
What is the minimal set of nodes required to predict behavior of much larger scale system?
Evaluation techniques in general
• Simulated or emulated environments
• Including error models