46
Availability, the Cloud and Everything Joe Williams Saturday, October 2, 2010

Availability, The Cloud and Everything (version 2, Surge2010)

Embed Size (px)

Citation preview

Page 1: Availability, The Cloud and Everything (version 2, Surge2010)

Availability,the Cloud and

Everything

Joe Williams

Saturday, October 2, 2010

Page 2: Availability, The Cloud and Everything (version 2, Surge2010)

Me

• Joe Williams• Infrastructure Engineer • Cloudant• @williamsjoe• joeandmotorboat.com

Saturday, October 2, 2010

Page 3: Availability, The Cloud and Everything (version 2, Surge2010)

• Distributed database built on CouchDB• Real-time Search and Analytics• Sign Up! (Free to 256MB)• cloudant.com• http://github.com/cloudant/bigcouch

Saturday, October 2, 2010

Page 4: Availability, The Cloud and Everything (version 2, Surge2010)

Bias

• Distributed Databases (CouchDB)• Amazon EC2• Chef• Erlang

Saturday, October 2, 2010

Page 5: Availability, The Cloud and Everything (version 2, Surge2010)

Availability

Saturday, October 2, 2010

Page 6: Availability, The Cloud and Everything (version 2, Surge2010)

Availability

• What is Availability?

Saturday, October 2, 2010

Page 7: Availability, The Cloud and Everything (version 2, Surge2010)

Availability

Saturday, October 2, 2010

Page 8: Availability, The Cloud and Everything (version 2, Surge2010)

Availability

“System availability refers to the accessibility of system services to users. A system is available if it is

operational for an overwhelming fraction of the time. Unlike reliability, availability is instantaneous.”

Saturday, October 2, 2010

Page 9: Availability, The Cloud and Everything (version 2, Surge2010)

Availability

“System reliability refers to the property of tolerating constituent component failures, for the longest time. A

system is perfectly reliable if it never fails.”

Saturday, October 2, 2010

Page 10: Availability, The Cloud and Everything (version 2, Surge2010)

Availability

• Reliability * Availability = Dependability

Saturday, October 2, 2010

Page 11: Availability, The Cloud and Everything (version 2, Surge2010)

Availability

• Availability & Reliability• Mean time to failures• Mean time to repair• Durability• Fault isolation• Fault tolerance

Saturday, October 2, 2010

Page 12: Availability, The Cloud and Everything (version 2, Surge2010)

Availability

• Uptime / Downtime• Perceived• Actual

Saturday, October 2, 2010

Page 13: Availability, The Cloud and Everything (version 2, Surge2010)

Availability

• Probabilistic Risk Assessment• Event Tree Analysis• Fault Tree Analysis

Apthorpe (http://www.usenix.org/events/lisa01/tech/apthorpe/apthorpe.ps)

Saturday, October 2, 2010

Page 14: Availability, The Cloud and Everything (version 2, Surge2010)

The Cloud

Saturday, October 2, 2010

Page 15: Availability, The Cloud and Everything (version 2, Surge2010)

The Cloud

“It never gets easier, you just go faster.”- Greg Lemond

Saturday, October 2, 2010

Page 16: Availability, The Cloud and Everything (version 2, Surge2010)

The Cloud

• Abstraction• Commoditization• Homogenous• Ephemeral

Saturday, October 2, 2010

Page 17: Availability, The Cloud and Everything (version 2, Surge2010)

The Cloud

• Costs• Loss of Control• Single Points of Failure• Network Partitions / Data Locality• Unreliable• Performance

Saturday, October 2, 2010

Page 18: Availability, The Cloud and Everything (version 2, Surge2010)

The Cloud

• Benefits• API to everything• Fast and Flexible Resource Mgmt• “Unlimited” Resources

Saturday, October 2, 2010

Page 19: Availability, The Cloud and Everything (version 2, Surge2010)

The Cloud

• Bootstrapping• Time and Effort

Adam Jacob and Ezra Zygmuntowicz (http://blip.tv/file/2285124/)

Saturday, October 2, 2010

Page 20: Availability, The Cloud and Everything (version 2, Surge2010)

The Cloud

• Nodes are stateless and disposable.

Saturday, October 2, 2010

Page 21: Availability, The Cloud and Everything (version 2, Surge2010)

The Cloud

"Clouds are systems ... and with systems, you have to think hard and know how to deal with issues in that environment. The scale is so much bigger, and you don't have the physical control. But we think people should

be optimistic about what we can do here. If we are clever about deploying cloud computing with a clear-eyed notion of what the risk models are, maybe we can actually save the economy through technology."

- Security in the Ether By David Talbot - MIT Technology Review Jan/Feb 2010

Saturday, October 2, 2010

Page 22: Availability, The Cloud and Everything (version 2, Surge2010)

What’s Next

• Distributed Systems• Automation• Data Driven Operations

Saturday, October 2, 2010

Page 23: Availability, The Cloud and Everything (version 2, Surge2010)

Distributed Systems

Baran (http://www.rand.org/pubs/research_memoranda/RM3420/)

Saturday, October 2, 2010

Page 24: Availability, The Cloud and Everything (version 2, Surge2010)

Distributed Systems

• RAID ain’t as redundant as it used to be.

Leventhal (http://queue.acm.org/detail.cfm?id=1670144)

Saturday, October 2, 2010

Page 25: Availability, The Cloud and Everything (version 2, Surge2010)

Distributed Systems

• Redundancy• Duplication• Distribution

Saturday, October 2, 2010

Page 26: Availability, The Cloud and Everything (version 2, Surge2010)

Distributed Systems

• Alphabet Soup• ACID, CAP, BASE, 2PC, MVCC• Vector Clocks, Eventual Consistency• Dynamo, Paxos, Chandra, Byzantine

Saturday, October 2, 2010

Page 27: Availability, The Cloud and Everything (version 2, Surge2010)

Distributed Systems

• CAP == Availability

Saturday, October 2, 2010

Page 28: Availability, The Cloud and Everything (version 2, Surge2010)

Distributed Systems

• Erlang• Distributed• Concurrent• Fault Tolerant

Saturday, October 2, 2010

Page 29: Availability, The Cloud and Everything (version 2, Surge2010)

Distributed Systems

• Erlang• Supervision Trees

Saturday, October 2, 2010

Page 30: Availability, The Cloud and Everything (version 2, Surge2010)

Distributed Systems

• Erlang• Hot Code Upgrades• Distributed Upgrades are HARD

Saturday, October 2, 2010

Page 31: Availability, The Cloud and Everything (version 2, Surge2010)

Distributed Systems

• Future Work

• Erlang Supervision Trees

• PRA / FTA / ETA

Apthorpe (http://www.usenix.org/events/lisa01/tech/apthorpe/apthorpe.ps)

Saturday, October 2, 2010

Page 32: Availability, The Cloud and Everything (version 2, Surge2010)

Automation

Saturday, October 2, 2010

Page 33: Availability, The Cloud and Everything (version 2, Surge2010)

Automation

• Optimal use of the cloud.

Saturday, October 2, 2010

Page 34: Availability, The Cloud and Everything (version 2, Surge2010)

Automation

• Frequent deployment.

Saturday, October 2, 2010

Page 35: Availability, The Cloud and Everything (version 2, Surge2010)

Automation

• Tools• Chef• Puppet• Cfengine• Bcfg2

Saturday, October 2, 2010

Page 36: Availability, The Cloud and Everything (version 2, Surge2010)

Automation

• Erlang + Chef (as of v0.8)• erl_call Provider

Saturday, October 2, 2010

Page 37: Availability, The Cloud and Everything (version 2, Surge2010)

Data Driven Operations

Saturday, October 2, 2010

Page 38: Availability, The Cloud and Everything (version 2, Surge2010)

Data Driven Operations

“What gets measured, gets managed.”-Peter Drucker

Saturday, October 2, 2010

Page 39: Availability, The Cloud and Everything (version 2, Surge2010)

Data Driven Operations

• Instrumentation

Saturday, October 2, 2010

Page 40: Availability, The Cloud and Everything (version 2, Surge2010)

Data Driven Operations

• Logging

Saturday, October 2, 2010

Page 41: Availability, The Cloud and Everything (version 2, Surge2010)

Data Driven Operations

• Visualization

Saturday, October 2, 2010

Page 42: Availability, The Cloud and Everything (version 2, Surge2010)

Data Driven Operations

• Demo!

Saturday, October 2, 2010

Page 43: Availability, The Cloud and Everything (version 2, Surge2010)

Data Driven Operations

• Modeling

• Analysis

• Universal Law of Computational Scalability

• Amdahl’s Law

Saturday, October 2, 2010

Page 44: Availability, The Cloud and Everything (version 2, Surge2010)

Data Driven Operations

• Modeling isn’t just for capacity planning.

Montagne (http://queue.acm.org/detail.cfm?id=1862187)

Saturday, October 2, 2010

Page 45: Availability, The Cloud and Everything (version 2, Surge2010)

The End

Saturday, October 2, 2010

Page 46: Availability, The Cloud and Everything (version 2, Surge2010)

Questions?

Joe Williams - @williamsjoe

Saturday, October 2, 2010