Operating 24x7

Amin Vahdat, on behalf of John Jannotti, Jeff Mogul, Larry Peterson, Joe Touch, Paulo Verissimo, Werner Vogels, Bill Weihl

24x7 Availability: Goals

Holistic approach

• Not just individual computers, but services

• Need to consider operators, etc.

Sustainability (24x7 for how long?)

Need to handle a variety of failure models

• Understand what is and what is not correlated

• Real time, noisy, chaotic environment

24x7 Availability: Goals

Self-configuration

Evolvability

Managing the availability/consistency tradeoff

• We live in a probabilistic world

• Monitoring needs to be built in from the ground up

Predict and quantify cost of delivering certain levels of availability

• Including management, auditing, etc.

• With infinite cost, operating 24x7 is easy

New Models Fault Tolerant Software

BFT is insufficient because of its assumption of independence

Multi-version programming is insufficient

• e.g., working from the same bad spec

100k nodes running more or less the same thing

• Extremely tolerant of hardware faults

• But if traffic causes the software to fail, that is a Bohr bug (deterministic and repeatable)

• No spare capacity in current power grid

• Interference is another problem in power grid
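The independence assumption noted above for BFT can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names, the replica count, and the probabilities `p` and `q` are hypothetical and not taken from the slides. It compares the chance of exceeding the tolerated fault count `f` when replicas fail independently against the case where a shared defect (same code, same bad spec) can take down every replica at once.

```python
from math import comb


def p_more_than_f_fail(n, f, p):
    """Probability that more than f of n replicas fail, assuming each
    replica fails independently with probability p (binomial tail)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(f + 1, n + 1))


def p_with_common_mode(n, f, p, q):
    """Same quantity when, with probability q, a shared defect fails
    every replica together; otherwise failures are independent."""
    return q + (1 - q) * p_more_than_f_fail(n, f, p)


# Hypothetical numbers: 3f + 1 = 4 replicas tolerating f = 1 fault.
n, f = 4, 1
p, q = 0.01, 0.001

independent = p_more_than_f_fail(n, f, p)
correlated = p_with_common_mode(n, f, p, q)
print(f"independent only: {independent:.6f}")
print(f"with common mode: {correlated:.6f}")
```

Under these assumed numbers, even a small common-mode probability dominates the result, which is one way to read the slide's point about nodes that all run more or less the same thing.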

Dealing with Attacks

Techniques to divert the traffic (/dev/null it)

Isolate the attack traffic toward sacrificial machines

Distinguish attack from non-attack traffic

Legal and financial models are the primary technique for fighting attacks

Distinguishing humans versus bots

Contracts distinguish between internal failures and acts of God/war

Living with Failure

Services must behave within expectations even when individual components fail

Graceful degradation

Probabilistic reasoning, statistical models

• Statistical guarantees given failure models

Must express assumptions about system behavior

• Expressing assumptions can be very difficult

• Mapping high-level system behavior to failure scenarios

MTTR just as important as MTTF

Tail (99.9%) of response curve must be within bounds
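A minimal sketch of the two points above, assuming the standard steady-state formula availability = MTTF / (MTTF + MTTR) and a nearest-rank percentile. The function names, the sample latencies, and the 100 ms bound are hypothetical, chosen only for illustration.

```python
import math


def availability(mttf_hours, mttr_hours):
    """Steady-state availability: fraction of time the service is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty sample list."""
    ordered = sorted(samples)
    # round() guards against floating-point noise before taking the ceiling.
    rank = max(1, math.ceil(round(pct * len(ordered) / 100, 9)))
    return ordered[rank - 1]


def tail_within_bound(latencies_ms, bound_ms, pct=99.9):
    """True if the pct-th percentile latency is within the bound."""
    return percentile(latencies_ms, pct) <= bound_ms


# Halving MTTR buys the same availability as doubling MTTF.
doubled_mttf = availability(2000, 1.0)
halved_mttr = availability(1000, 0.5)

# Hypothetical trace: 990 fast requests and 10 slow ones.
latencies = [10.0] * 990 + [500.0] * 10
mean_ms = sum(latencies) / len(latencies)
tail_ok = tail_within_bound(latencies, bound_ms=100.0)
print(doubled_mttf, halved_mttr, mean_ms, tail_ok)
```

In this assumed trace the mean latency looks healthy while the 99.9% tail violates the bound, which is why the tail, not the average, is the quantity to check.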

Evolvability

Easier for centralized services, much more difficult in distributed environments

Before deploying the new version, must have the old version available to redeploy (quickly) in its place

• What if a database schema update was required?

Special case answers in some scenarios

• Tunneling in networks

Huge amount of resources dedicated to test & development

• Regimented versus ad hoc environments

• Do you value reliability or innovation?

Sustainability

Operating 24x7 for how many weeks? (sustainability)

• Economic incentives

• Decentralized control can lead to longer term system reliability

Internet partially succeeded because of decentralization

Decentralization may help with evolvability though it can cut both ways

Infrastructure Support

Virtualization

Exporting appropriate failure models

Fault injection

• Dependent/independent failures

What is the minimal set of nodes required to predict behavior of much larger scale system?

Evaluation techniques in general

• Simulated or emulated environments

• Including error models
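Fault injection with dependent and independent failures, as listed above, can be sketched as a small simulated environment. Everything here is a hypothetical illustration rather than a method from the slides: the rack-based failure model, the function names, and the probabilities are assumptions. A rack failure takes down all of its nodes together (dependent), while node failures are drawn independently.

```python
import random


def inject_faults(rng, racks, p_node, p_rack):
    """Return per-node liveness after one round of injected faults.

    racks is a list of rack sizes. A rack failure (probability p_rack)
    kills every node in that rack; otherwise each node fails on its own
    with probability p_node.
    """
    alive = []
    for size in racks:
        rack_down = rng.random() < p_rack
        for _ in range(size):
            alive.append(not rack_down and rng.random() >= p_node)
    return alive


def estimate_unavailability(racks, quorum, p_node, p_rack, trials=20000, seed=0):
    """Fraction of trials in which fewer than quorum nodes survive."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    down = 0
    for _ in range(trials):
        if sum(inject_faults(rng, racks, p_node, p_rack)) < quorum:
            down += 1
    return down / trials


# Three replicas needing a quorum of two: one shared rack versus three racks.
same_rack = estimate_unavailability([3], quorum=2, p_node=0.01, p_rack=0.05)
spread = estimate_unavailability([1, 1, 1], quorum=2, p_node=0.01, p_rack=0.05)
print(f"same rack: {same_rack:.4f}  spread out: {spread:.4f}")
```

Under these assumed parameters the co-located deployment is dominated by the dependent rack failure, so an error model that treats all failures as independent would understate its unavailability.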
