Building a cloud service on a cloud infrastructure at
Also, cloud.
Mikhail Panchenko, Surge 2011
Who Am I?
Pancakes
Infrastructure Engineer at SimpleGeo
Backend Engineer at Flickr before that
Backend and Frontend Engineer at Yahoo! Ops/Tools before that
Philosophy, Economics, and French major before that
Tools for mobile/geo developers
Primarily focused on services, some data-oriented APIs
PaaS, I guess? I've lost track a bit
Availability, redundancy part of brand
Our outage = your outage
No pressure
Agenda
Goals
A little bit of theory
Challenges in The Cloud
General Architecture
Implementation Details
Architectural Goals
High availability
Linear scalability
Elasticity/Flexibility
Redundancy/Fault Tolerance
"Complex interactions are those of unfamiliar sequences, or unplanned and unexpected sequences, and either not visible or not immediately comprehensible."
Charles Perrow. Normal Accidents: Living with High-Risk Technologies (p. 78). Kindle Edition.
"The notion of baffling interactions is increasingly familiar to all of us. [...] As systems grow in size and in the number of diverse functions they serve, and are built to function in ever more hostile environments, increasing their ties to other systems, they experience more and more incomprehensible or unexpected interactions. They become more vulnerable to unavoidable system accidents."
Charles Perrow. Normal Accidents: Living with High-Risk Technologies (p. 72). Kindle Edition.
"The beauty of this is its simplicity. Once a plan gets too complex, everything can go wrong."
Walter Sobchak, The Big Lebowski
Three Mile Island
"... they found that radioactive water was not traveling to the tank they intended, but because of complex flow and pressure interactions, was going to a different, wrong tank, which also overflowed, this time in the auxiliary building."
Charles Perrow. Normal Accidents: Living with High-Risk Technologies (pp. 22-23). Kindle Edition.
Amazon Web Services
"The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network."
"Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region"
http://aws.amazon.com/message/65648/
Common Theme
Previously independent systems become coupled as a result of unanticipated interactions, leading to fundamentally surprising results
When pumping radioactive water into the wrong tank, the behavior of the program is undefined
Tightly coupled to a complex system over which you have no control and into which you have no insight
"The notion of baffling interactions is increasingly familiar to all of us. [...] As systems grow in size and in the number of diverse functions they serve, and are built to function in ever more hostile environments, increasing their ties to other systems, they experience more and more incomprehensible or unexpected interactions. They become more vulnerable to unavoidable system accidents."
Charles Perrow. Normal Accidents: Living with High-Risk Technologies (p. 72). Kindle Edition.
Decouple Your Subsystems
Shared resources are the most common source of unexpected interaction
Resist temptation to double up on roles
Use queues, caches as buffers
  NOTE: those are complex subsystems of their own
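A minimal sketch of the "queue as buffer" idea, in Python. The class name `BufferedHandoff` and its methods are hypothetical, not taken from the talk. It assumes a bounded in-process queue that sheds load when full, so a slow consumer cannot stall the producer.

```python
import queue


class BufferedHandoff:
    """Bounded buffer between a producer and a slower consumer (illustrative only)."""

    def __init__(self, capacity):
        self._q = queue.Queue(maxsize=capacity)
        self.dropped = 0  # how many items were shed because the buffer was full

    def offer(self, item):
        """Enqueue without blocking; count and drop the item when the buffer is full."""
        try:
            self._q.put_nowait(item)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, limit):
        """Hand up to `limit` buffered items to the consumer, without blocking."""
        items = []
        for _ in range(limit):
            try:
                items.append(self._q.get_nowait())
            except queue.Empty:
                break
        return items
```

The bound is the point: an unbounded buffer only moves the coupling somewhere harder to see, and the `dropped` counter is the kind of number worth graphing.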
Decouple Your Subsystems
Explicit Decoupling
CPU Affinity
  Webserver on 1-7; SSH etc on 8
  Crude, but gets the job done
More robust solutions - containers
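The CPU split can be sketched in Python as below. The helper names `split_cpus` and `pin_current_process` are hypothetical. `os.sched_setaffinity` is only available on Linux, so the sketch reports failure elsewhere instead of raising. CPU ids are zero-based here, so the slide's "1-7 and 8" becomes 0-6 and 7.

```python
import os


def split_cpus(total, reserved=1):
    """Return (service_cpus, admin_cpus) as sets of zero-based CPU ids."""
    if not 0 < reserved < total:
        raise ValueError("need at least one CPU on each side of the split")
    cpus = list(range(total))
    return set(cpus[:-reserved]), set(cpus[-reserved:])


def pin_current_process(cpus):
    """Pin this process to `cpus`; return False where the call is unavailable."""
    if not hasattr(os, "sched_setaffinity"):
        return False  # e.g. macOS or Windows
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return True
```

On an eight-core box, the webserver would call `pin_current_process(split_cpus(8)[0])` at startup, leaving the last core for SSH and other admin work.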
Decouple Your Functionality
Service architecture
Each service does one thing well
Easier to measure, understand, and accommodate resource demands
Reduce potential for interactions, cross-functional failure
Decouple from Your Environment with Configuration Management
Decouple from your platform (OS/kernel)
  Easy to test/bench potential candidates
  Easy to migrate if you find a winner
  This is especially important when dealing with cloud
Automate as much of deploy/bootstrap process as possible
  Probably won't help much during a provider outage due to stampede
  BUT: DirectConnect
  You might not always be in the cloud..
Decouple Your Datacenters
Most robust redundancy mechanism
Hot-hot keeps you on your toes
Simplifies, not just for the cloud
  Yahoo! now forgoing datacenter features like HVAC
  "If it gets too hot in Washington, turn that DC off for a while"
  I'm sure they're not the only ones
Decouple Your Datacenters
"AZ" - Basic building block for EC2
This is the level they (theoretically) decouple at
They are probably thinking along the same lines we are - must be able to turn off one AZ without impact in the other
Every datacenter as an independent microcosm of your overall architecture
Really simple operational steps for stressful tasks & situations
Temporally decouple the problem from the resolution
ELB
Dynamic Load Balancing
Flexible virtual IP
Easy to add/remove AZs
Uses healthchecks to automatically evict nodes
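A rough Python sketch of healthcheck-driven eviction. This is not ELB's actual algorithm. The class `HealthCheckedPool` and the `unhealthy_threshold` parameter are assumptions for illustration, and the rule that a single passing check restores a node is a simplification.

```python
class HealthCheckedPool:
    """Tracks consecutive failed checks per node and evicts past a threshold."""

    def __init__(self, nodes, unhealthy_threshold=2):
        self._failures = {node: 0 for node in nodes}
        self._threshold = unhealthy_threshold

    def report(self, node, healthy):
        """Record one healthcheck result for `node`."""
        if healthy:
            self._failures[node] = 0  # assumption: one success restores the node
        else:
            self._failures[node] += 1

    def in_service(self):
        """Nodes that should currently receive traffic, in sorted order."""
        return sorted(n for n, f in self._failures.items() if f < self._threshold)
```

Requiring more than one failure before eviction keeps a single slow response from shrinking the pool.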
Gate - "Layer 8 Proxy"
Lightweight Node.js daemon
OAuth
Rate Limiting
Basic routing to actual services
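The slides do not say how Gate limits request rates. One common technique is a token bucket, sketched here in Python rather than Node.js. `TokenBucket` and its parameters are hypothetical. The caller passes in the clock so the behavior is deterministic; a real proxy would likely use a monotonic clock and keep one bucket per API key.

```python
class TokenBucket:
    """Token-bucket rate limiter: `rate` tokens per second, bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.updated = now

    def allow(self, now, cost=1.0):
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A request that gets `False` back would be rejected at the proxy, before it reaches any backend service.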
Services - Pick Your Own Adventure
Node.js and Python
  Some people just hate Node.js
Can be anything, as long as Gate can talk to it
( another reason to decouple )
Highly specialized
RabbitMQ
A grenade for our knife-fight
Very flexible - more than we need
  Simplification candidate
New persistor in >= 1.3 - degradation over failure
See talk at 1:30PM
Cassandra
A mostly-textbook DHT
Homogeneous distributed model
Random load distribution
Partition tolerance
A perfect foundation for our architecture
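To make "random load distribution" concrete, here is a toy token ring in Python, in the spirit of hash-based partitioning in a DHT. It is a sketch, not Cassandra's implementation. `TokenRing` and `replicas` are hypothetical names, and giving each node a single token derived from its name is an assumption made to keep the example short. MD5 is used only for placement, not for security.

```python
import bisect
import hashlib


class TokenRing:
    """Toy hash ring: keys and nodes share one token space."""

    def __init__(self, nodes):
        # Assumption: one token per node, derived by hashing the node name.
        self._ring = sorted((self._token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    @staticmethod
    def _token(value):
        """Map a string to a 128-bit integer token."""
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def replicas(self, key, n=1):
        """First `n` distinct nodes walking clockwise from the key's token."""
        start = bisect.bisect_left(self._tokens, self._token(key))
        count = min(n, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]
```

Because placement depends only on the hash, any node can compute where a key lives, which is what makes the model homogeneous: there is no special coordinator to lose.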