Cloud Computing Expo
2009
High Availability Clouds
“Moving mission critical applications to the cloud.”
Jeremy Hitchcock, CEO
Dynamic Network Services
Who cares? Why Relevant?
• Enterprises and service providers: “now what”?
• Desire to move business or mission critical apps
  – That’s most of them
• Clouds have an “unstable” feel
Who cares? Why Relevant?
• Still, benefits to virtualizing computing resources
• Most don’t care about raw hardware
• Becoming more software/resource integrators
  – Less concerned with software/hardware integration
• Better use of hardware resources
  – Most systems are pretty idle all the time
• Hardware is getting expensive (well, power is)
Where are Clouds?
[Figure: “You Are Here”]
Where we are going (or like to be)
• Cloud adoption going to be like this?
  – Limited to spiky demand or distributed processing
• Will more services move to cloud environments?
• Even between clouds and traditional hosting?
• No hardware?
  – Someone has to worry about infrastructure though
Background on me
• Internet infrastructure: DNS for other people
  – DynDNS.com, Dynect Platform
• Do traffic management, dynamic "routing" for clouds
• Work with a lot of cloud providers to get domain.com to node-19334 but not node-49291
• Background in networking, software engineering
• Use all unmanaged hosting (but do have a VPS offering for consumers; it was a dev project)
Terms
• Unmanaged hosting – corporate/outsourced datacenter, your own everything
• Managed hosting – Hardware is provided with ping, port, and power
• Cloud hosting – Using virtual resources to accomplish the same as the above two items
Goals with High Availability
• Availability: Users do not see outages
• Scaling: Not impossible or easy
  – Does not mean more resources available
  – Important when you think “on demand”
• Efficient use of resources (more on that)
• Institutionalized operations practices
  – Monitoring, security regimes
Highly Available What?
• Well, anything?
• Applications
• File systems
• CPU, I/O, and network
  – I/O is both storage space and retrieval
High Availability
Early Days of Hosting
• Been here before: mainframes to 1U servers
• Copy over redundancy in larger systems
  – “That’s how larger systems were so accessible”
• Expensive 1Us led to commodity hardware
• “We just take our application and move it over here”
• And that was when things took a turn…
Ouch!
• Lots of cheap hardware, gained efficiency
  – Most of the time anyway
• Applications were not available
  – Up and down all of the time
• DB admins, network admins, system admins all pointing fingers
Ouch!
• Needed more 1Us to do the job
• 1U equipment quality was not as good
• More people, more operations issues
• Security concerns, DB admins having system access
• Failures and scaling became a problem until…
Ah Ha Moment!
It’s ok if a 1U fails. It happens all the time!
Ah Ha Moment!
• Make the system more redundant, fault-tolerant
• Break apart units to create working spaces
  – N+1 redundancy, whatever your risk tolerance is
• Specialized hardware to maintain efficiency
• Monitor the units of work
  – Ping, port, power separately
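To put a number on "N+1 redundancy, whatever your risk tolerance is", here is a rough sketch of the arithmetic. It assumes every unit has the same availability and fails independently of the others, which real racks sharing power and network do not quite do. The function name and the sample figures are made up for illustration.

```python
from math import comb


def prob_enough_units(needed: int, spares: int, unit_availability: float) -> float:
    """Probability that at least `needed` of `needed + spares` units are up.

    Assumption: units fail independently and share one availability figure.
    """
    total = needed + spares
    return sum(
        comb(total, up)
        * unit_availability ** up
        * (1.0 - unit_availability) ** (total - up)
        for up in range(needed, total + 1)
    )


if __name__ == "__main__":
    # Illustrative only: 4 units needed, each assumed 99% available.
    for spares in (0, 1, 2):
        print(spares, prob_enough_units(4, spares, 0.99))
```

Running it with more spares shows how quickly one extra unit pays off, and how much less the second one adds. That is the risk-tolerance conversation in one loop.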
Ah Ha Moment!
• Separate DB/app/file into clusters
  – That makes scaling and failover easy
• Filers for DB and large scale storage
• Demand SLAs for network transit
• Get the NOC to work on cross system outages
Still Some Lingering Issues
• Architectures grew to match applications
  – Tightly coupled, is that good?
  – Makes it hard to move around
  – Specialized hardware pieces
• Do you look like Flickr?
  – If you do, their hosting platform will work for your app
Still Some Lingering Issues
• Systems are more complicated
  – Yahoo 9/11 Memorial site cascade failures
    • Fix was a load balancer/DNS tweak
• Lots of “glue” to make sure everything works
• Each architecture is [slightly] different
Finally: Some Lingering Issues
• Therefore:
  – Failure handling works, if the application is in shards
  – Scaling is application specific, different bottlenecks
  – Reasonable efficiency, limited specialized hardware
  – More people to maintain “the system” but secure
Now Onto Clouds…
• Promise:
  – On demand resources (true if you can use it)
  – Greater computer efficiency (all costs are internalized)
  – More flexibility for development and peak usage
  – Greater availability
• Reality:
  – Your responsibility to throw in more hardware
  – Trade specialization for generalization (bottlenecks)
  – Limited by tools provided and consumed
  – Maybe
Availability
Availability is Defined by Outages
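If availability is defined by outages, the conversion between the two is simple arithmetic. The sketch below assumes a 365-day year and treats every outage minute as equal; both helper names are invented for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def availability(outage_minutes: float, period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Fraction of the period the service was up, given total outage time."""
    return 1.0 - outage_minutes / period_minutes


def downtime_budget_minutes(target: float, period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Outage minutes allowed per period for a target availability."""
    return (1.0 - target) * period_minutes


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        print(target, downtime_budget_minutes(target))
```

A target only means something once someone is counting outage minutes, which is the problem with providers that do not publish them.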
Amazon/Cloud Outages?
• Not clear:
  – “There was this one in July 2008”
  – “Some DNS issues yesterday”
• How often? How regular?
• Out of 500,000 hard drives, x will fail in 3.243 years
• Out of 1 cloud provider? (or maybe 5)
  – We don’t know.
Cloud Realities
• “Best effort” to provide services
• Ever ask for an SLA?
  – I’m sure it’s coming but not soon enough for some
• Remember, Amazon is providing a service
  – Unmanaged environment
• Relax, that’s the Internet, we’ll figure it out
Cloud Realities
• No physical access to systems
• No guarantee for systems to be available
• No guarantee that new systems will be available
• No continuity guarantee
  – Great performance one moment, maybe not the next
  – Shared resources
• Everything is local, security is a lot different
But Clouds are Virtualized 1Us!
• Well, they are, but not really
• Used to be:
  – Ping, port, power – raw access
  – Hybrids: corporate datacenter, managed, unmanaged
• Now:
  – Ping, port, power, file I/O – virtual access
• Outsourcing network, hardware, and OS
Why is it different?
• Hardware becomes a service
  – Depending on the application, that may matter
• More vendors in the mix
  – Network, hardware, OS much more packaged
• Simpler presentation but complicated behind the scenes
• Library issues, security issues, OS upgrades?
Availability
• Goal: Eliminate single points of failure
  – Clouds are consolidations of services
  – Solution is to split it apart
• Achieve true diversity
  – Business continuity diversity
  – Geographic diversity
  – Network diversity
  – OS diversity
• More layers make interactions hard to predict
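The two ideas on this slide pull in opposite directions, and a back-of-the-envelope model shows it. The sketch assumes components fail independently, which is the optimistic case; correlated failures (the same network, the same vendor, the same bug) make both numbers worse. The function names and the 99% figures are illustrative, not measurements.

```python
def serial(*availabilities: float) -> float:
    """Stacked layers: the service is up only if every layer is up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities: float) -> float:
    """Diverse alternatives: the service is up if any one of them is up."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


if __name__ == "__main__":
    # Illustrative figures: three stacked layers vs. two diverse clouds.
    print("three layers in series:", serial(0.99, 0.99, 0.99))
    print("two clouds in parallel:", parallel(0.99, 0.99))
```

Every extra layer in series drags the total down; every independent alternative in parallel pulls it up. True diversity matters because it is what makes the independence assumption closer to honest.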
Eliminate Points of Failure
• Cloud diversity
• Cloud outages are typically binary
• Interoperability needed to make it easier
  – That will come in several ways
Failover Events
• Failure events happen (more frequently in clouds?)
• Trick is detecting and redirecting
  – “Once is a mistake, twice is jazz” – Miles Davis
• Needs to be seamless and automatic
• Good provisioning and monitoring in place
  – Server builds, revisioning, server configurations
  – Everything more modular
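"Detecting and redirecting" can be sketched in a few lines. This is a minimal illustration, not how any particular traffic manager works: the node names are the hypothetical ones used earlier in this talk, the TCP check is the crudest possible "port" test, and a real system would also publish the chosen target (for example through a DNS update) rather than just return it.

```python
import socket
from typing import Callable, Iterable, Optional


def tcp_check(host: str, port: int = 80, timeout: float = 2.0) -> bool:
    """Crude health check: can a TCP connection be opened at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_target(nodes: Iterable[str], is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first node, in priority order, that passes its health check."""
    for node in nodes:
        if is_healthy(node):
            return node
    return None  # nothing healthy: time for plan B


# Stubbed health status so the example runs without a network.
status = {"node-49291": False, "node-19334": True}
target = pick_target(["node-49291", "node-19334"], lambda n: status.get(n, False))
print(target)  # node-19334
```

Passing the health check in as a function keeps detection and redirection separate, so `tcp_check` can be swapped for a transaction-level test without touching the selection logic.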
Scaling
• Go from 1 to 2 to 4 to 10,000 units
• Split apart work units
• Have to do it sooner rather than later
• More sharding, less efficient
• Not all units are going to be equal nor constant
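One common way to split apart work units is to hash a key onto a shard. The sketch below uses plain modulo hashing, chosen here only because it is the simplest scheme to show; the key format and function names are invented. It also measures why resharding late hurts: when the shard count changes, most keys land somewhere new.

```python
import hashlib
from typing import Iterable


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard number with a stable hash (not Python's hash())."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def fraction_moved(keys: Iterable[str], old_count: int, new_count: int) -> float:
    """Share of keys whose shard changes when the shard count changes."""
    key_list = list(keys)
    moved = sum(
        1 for k in key_list if shard_for(k, old_count) != shard_for(k, new_count)
    )
    return moved / len(key_list)


if __name__ == "__main__":
    sample = [f"user-{i}" for i in range(1000)]  # hypothetical keys
    print("moved going from 4 to 5 shards:", fraction_moved(sample, 4, 5))
```

Schemes such as consistent hashing exist to reduce that movement, at the cost of more machinery.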
Provisioning
• Everything needs to be automatic (or at least close)
• As you grow, this hurts more and more
• Provisioning means lab, dev, and production
• This becomes a critical system
  – Monitoring and backups should work with provisioning
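One way to keep lab, dev, and production from drifting apart is to build all three from a single base definition plus small overrides. Everything in this sketch is hypothetical (the image name, the version, the settings); the point is only that monitoring and backups are part of the same definition as the server build.

```python
from typing import Any, Dict

# Hypothetical base build shared by every environment.
BASE: Dict[str, Any] = {
    "image": "app-server",
    "image_version": "1.4.2",
    "instances": 1,
    "monitoring": True,
    "backups": True,
}

# Only the differences are written down per environment.
OVERRIDES: Dict[str, Dict[str, Any]] = {
    "lab": {"backups": False},
    "dev": {},
    "production": {"instances": 4},
}


def render(environment: str) -> Dict[str, Any]:
    """Build the full configuration for one environment from BASE plus overrides."""
    if environment not in OVERRIDES:
        raise KeyError(f"unknown environment: {environment}")
    config = dict(BASE)  # copy so BASE itself is never modified
    config.update(OVERRIDES[environment])
    config["environment"] = environment
    return config


if __name__ == "__main__":
    for env in OVERRIDES:
        print(render(env))
```

Because the overrides are data, they can be versioned alongside the application, which is most of what "revisioning" asks for.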
Hardware Considerations
• Hardware optimized software packages may change
• Security patches
  – Default images v. custom images
• Physical access granted to others but not to you
  – Physical access means all access
  – Encrypted data on disk
  – Fewer recovery options
• Do you really have access to your data?
  – See backups
Host Issues
• Host system security vulnerabilities
• Everything is local
  – VLANing becoming more available
• Underlying systems need maintenance
  – Live migrations
Monitoring
• System related outages because units will fail
• Normal tools are based on physical limitations
• In cloud environments it is not always clear where the failure is
• Test from the last mile
• Performance testing important too
• System testing and transactions
• May not pinpoint problems but it does send pages
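Since last-mile probes will sometimes fail for reasons that have nothing to do with your system, a common trick is to page only after several failures in a row. The sketch below is one way to do that; the class name and the default threshold of three are assumptions, and the probe itself (ping, port check, or a full transaction) is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    """Tracks probe results and decides when a page should go out."""

    threshold: int = 3           # consecutive failures before paging (assumed default)
    consecutive_failures: int = 0
    paged: bool = False

    def record(self, ok: bool) -> bool:
        """Record one probe result; return True only when a new page is due."""
        if ok:
            # Recovery resets the counter and re-arms paging.
            self.consecutive_failures = 0
            self.paged = False
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.paged:
            self.paged = True
            return True
        return False


if __name__ == "__main__":
    tracker = ProbeTracker(threshold=3)
    for ok in [True, False, False, False, False, True]:
        print(ok, tracker.record(ok))
```

It pages once per outage rather than once per failed probe, which keeps the pager useful even when it cannot say where the problem is.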
Backups
• Incremental backups much more important
• Backup within the same cloud?
  – Probably not, but where?
• Data files, application files, configuration files
  – Version everything
  – Document how they all go together
• But you already do that so it’s ok
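For anyone who does not already do that, the core of an incremental backup is knowing which files changed since the last run. A minimal sketch, assuming content hashes are an acceptable change detector and that the whole tree fits the "read every file" approach; the function names are invented, and where the changed files get copied to (ideally outside the same cloud) is left out.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    manifest: Dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            manifest[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def changed_files(previous: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Files that are new or whose contents differ from the previous manifest."""
    return sorted(
        name for name, digest in current.items() if previous.get(name) != digest
    )
```

Keeping each manifest alongside its backup also documents what went together at that point in time, which helps with "version everything".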
Migrations
• Be able to take your data (server image)
  – Server import and export
• Live migration, underlying software provides it
• This is all interoperability needs
Disaster Planning
• When things go really wrong:
  – Need to communicate using other means
    • Social networking like Twitter (are they affected as well?)
  – Have a plan B, diversity of cloud providers
  – Seek SLAs?
Some Things External
• DNS
  – Point domain.com to your plan B
• Backups and files
  – When you want to publish content at plan B
• Customer communications
  – Tell customers and users what’s going on
• Last-mile monitoring
  – Everything might look ok in the cloud
• Want options if there is an outage
Key Points
• Clouds are great for applications, even mission critical ones
• Best practices for server farms aren’t always best practices for clouds
• Need to rely on software to make hardware assumptions work right
• Constant trade-off of cost and availability: what’s the risk tolerance?