High Availability Clouds-Cloud Computing Expo

Cloud Computing Expo

2009

High Availability Clouds
“Moving mission critical applications to the cloud.”

Jeremy Hitchcock, CEO
Dynamic Network Services


Who cares? Why Relevant?

• Enterprises and service providers: “now what”?
• Desire to move business or mission critical apps
  – That’s most of them
• Clouds have an “unstable” feel


Who cares? Why Relevant?

• Still, benefits to virtualizing computing resources
• Most don’t care about raw hardware
• Becoming more software/resource integrators
  – Less concerned with software/hardware integration
• Better use of hardware resources
  – Most systems are idle most of the time
• Hardware is getting expensive (well, power is)


Where are Clouds?

You Are Here


Where we are going (or would like to be)

• Cloud adoption going to be like this?
  – Limited to spiky demand or distributed processing
• Will more services move to cloud environments?
• Even between clouds and traditional hosting?
• No hardware?
  – Someone has to worry about infrastructure though


Background on me

• Internet infrastructure: DNS for other people
  – DynDNS.com, Dynect Platform
• Do traffic management, dynamic "routing" for clouds
• Work with a lot of cloud providers to get domain.com to node-19334 but not node-49291
• Background in networking, software engineering
• Use all unmanaged hosting (but do have a VPS offering for consumers; it was a dev project)


Terms

• Unmanaged hosting – corporate/outsourced datacenter, your own everything

• Managed hosting – hardware is provided with ping, port, and power

• Cloud hosting – Using virtual resources to accomplish the same as the above two items


Goals with High Availability

• Availability: Users do not see outages
• Scaling: Not impossible, but not easy
  – Does not mean more resources available
  – Important when you think “on demand”
• Efficient use of resources (more on that)
• Institutionalized operations practices
  – Monitoring, security regimes


Highly Available What?

• Well, anything?
• Applications
• File systems
• CPU, I/O, and network
  – I/O is both storage space and retrieval


High Availability


Early Days of Hosting

• Been here before: mainframes to 1U servers
• Copy over redundancy in larger systems
  – “That’s how larger systems were so accessible”
• Expensive 1Us led to commodity hardware
• “We just take our application and move it over here”
• And that was when things took a turn…


Ouch!

• Lots of cheap hardware, gained efficiency
  – Most of the time anyway
• Applications were not available
  – Up and down all the time
• DB admins, network admins, system admins all pointing fingers


Ouch!

• Needed more 1Us to do the job
• 1U equipment quality was not as good
• More people, more operations issues
• Security concerns, DB admins having system access
• Failures and scaling became a problem until…


Ah Ha Moment!

It’s ok if a 1U fails. It happens all the time!


Ah Ha Moment!

• Make the system more redundant, fault-tolerant
• Break apart units to create working spaces
  – N+1 redundancy, whatever your risk tolerance is
• Specialized hardware to maintain efficiency
• Monitor the units of work
  – Ping, port, power separately
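
To put rough numbers on the N+1 idea, here is a minimal sketch that is not from the talk. It assumes units fail independently and that each unit is up for the same fraction of time; the function name `group_availability` and the 99% figure are illustrative assumptions only.

```python
from math import comb


def group_availability(unit_availability: float, needed: int, total: int) -> float:
    """Probability that at least `needed` of `total` independent units are up."""
    if not 0.0 <= unit_availability <= 1.0:
        raise ValueError("unit_availability must be between 0 and 1")
    if not 0 <= needed <= total:
        raise ValueError("needed must be between 0 and total")
    a = unit_availability
    # Binomial sum over every count of healthy units that still does the job.
    return sum(
        comb(total, k) * a**k * (1 - a) ** (total - k)
        for k in range(needed, total + 1)
    )


# Illustrative numbers: the job needs four units, each assumed 99% available.
no_spare = group_availability(0.99, needed=4, total=4)   # N
one_spare = group_availability(0.99, needed=4, total=5)  # N+1
print(f"N: {no_spare:.4f}  N+1: {one_spare:.4f}")
```

Real units share power, network, and operators, so failures are rarely independent; treat the result as an upper bound when weighing risk tolerance.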


Ah Ha Moment!

• Separate DB/app/file into clusters
  – That makes scaling and failover easy
• Filers for DB and large-scale storage
• Demand SLAs for network transit
• Get the NOC to work on cross-system outages


Still Some Lingering Issues

• Architectures grew to match applications
  – Tightly coupled, is that good?
  – Makes it hard to move around
  – Specialized hardware pieces
• Do you look like Flickr?
  – If you do, their hosting platform will work for your app


Still Some Lingering Issues

• Systems are more complicated
  – Yahoo 9/11 Memorial site cascade failures
    • Fix was a load balancer/DNS tweak
• Lots of “glue” to make sure everything works
• Each architecture is [slightly] different


Finally: Some Lingering Issues

• Therefore:
  – Failure handling works, if the application is in shards
  – Scaling is application specific, different bottlenecks
  – Reasonable efficiency, limited specialized hardware
  – More people to maintain “the system” but secure


Now Onto Clouds…

• Promise:
  – On demand resources (true if you can use it)
  – Greater computer efficiency (all costs are internalized)
  – More flexibility for development and peak usage
  – Greater availability
• Reality:
  – Your responsibility to throw in more hardware
  – Trade specialization for generalization (bottlenecks)
  – Limited by tools provided and consumed
  – Maybe


Availability


Availability is Defined by Outages
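
One way to read this slide in numbers, as a rough sketch rather than anything the talk specifies: availability over a period is one minus the share of that period lost to outages. The function names and the 365-day default below are assumptions made for illustration.

```python
from typing import Iterable

MINUTES_PER_DAY = 24 * 60


def availability(outage_minutes: Iterable[float], period_days: float = 365.0) -> float:
    """Fraction of the period with no outage, given each outage's length in minutes."""
    period = period_days * MINUTES_PER_DAY
    down = sum(outage_minutes)
    if down < 0 or down > period:
        raise ValueError("total outage time must fit inside the period")
    return 1.0 - down / period


def allowed_downtime_minutes(target: float, period_days: float = 365.0) -> float:
    """Outage budget, in minutes, that still meets a target availability."""
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be between 0 and 1")
    return (1.0 - target) * period_days * MINUTES_PER_DAY


# A 99.9% target over one year leaves roughly 525.6 minutes of outage.
print(round(allowed_downtime_minutes(0.999), 1))
```

The calculation only works if outages are actually recorded, which is the point of the questions on the next slide.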


Amazon/Cloud Outages?

• Not clear:
  – “There was this one in July 2008”
  – “Some DNS issues yesterday”
• How often? How regular?
• Out of 500,000 hard drives, x will fail in 3.243 years
• Out of 1 cloud provider? (or maybe 5)
  – We don’t know.


Cloud Realities

• “Best effort” to provide services
• Ever ask for an SLA?
  – I’m sure it’s coming but not soon enough for some
• Remember, Amazon is providing a service
  – Unmanaged environment
• Relax, that’s the Internet, we’ll figure it out


Cloud Realities

• No physical access to systems
• No guarantee that systems will be available
• No guarantee that new systems will be available
• No continuity guarantee
  – Great performance one moment, maybe not the next
  – Shared resources
• Everything is local, security is a lot different


But Clouds are Virtualized 1Us!

• Well, they are, but not really
• Used to be:
  – Ping, port, power – raw access
  – Hybrids: corporate datacenter, managed, unmanaged
• Now:
  – Ping, port, power, file I/O – virtual access
• Outsourcing network, hardware, and OS


Why is it different?

• Hardware becomes a service
  – Depending on the application, that may matter
• More vendors in the mix
  – Network, hardware, OS much more packaged
• Simpler presentation but complicated behind the scenes
• Library issues, security issues, OS upgrades?


Availability

• Goal: Eliminate single points of failure
  – Clouds are consolidations of services
  – Solution is to split it apart
• Achieve true diversity
  – Business continuity diversity
  – Geographic diversity
  – Network diversity
  – OS diversity
• More layers make interactions hard to predict


Eliminate Points of Failure

• Cloud diversity
• Cloud outages are typically binary
• Interoperability needed to make it easier
  – That will come in several ways


Failover Events

• Failure events happen (more frequently in clouds?)
• Trick is detecting and redirecting
  – “Once is a mistake, twice is jazz” – Miles Davis
• Needs to be seamless and automatic
• Good provisioning and monitoring in place
  – Server builds, revisioning, server configurations
  – Everything more modular
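
The "detect and redirect" step could look something like the sketch below. It is a hypothetical illustration, not the speaker's system: the class name `FailoverMonitor`, the thresholds, and the `.example` host names are all assumptions. Requiring several consecutive failures before switching is one common way to avoid redirecting on a single bad probe.

```python
class FailoverMonitor:
    """Tracks health probes of a primary target and decides where to send traffic."""

    def __init__(self, primary: str, backup: str,
                 failures_to_trip: int = 3, successes_to_recover: int = 2):
        self.primary = primary
        self.backup = backup
        self.failures_to_trip = failures_to_trip
        self.successes_to_recover = successes_to_recover
        self.active = primary
        self._fail_streak = 0
        self._ok_streak = 0

    def record(self, primary_healthy: bool) -> str:
        """Feed in one probe result for the primary; returns the target to use now."""
        if primary_healthy:
            self._ok_streak += 1
            self._fail_streak = 0
            # Fail back only after the primary has looked healthy for a while.
            if self.active == self.backup and self._ok_streak >= self.successes_to_recover:
                self.active = self.primary
        else:
            self._fail_streak += 1
            self._ok_streak = 0
            # Redirect only after repeated failures, not a single blip.
            if self.active == self.primary and self._fail_streak >= self.failures_to_trip:
                self.active = self.backup
        return self.active


monitor = FailoverMonitor("primary.example", "plan-b.example")
for healthy in [True, False, False, False, True, True]:
    print(healthy, "->", monitor.record(healthy))
```

In practice the probe results would come from the monitoring system and the returned target would drive a DNS or load balancer change; both are left out here to keep the sketch self-contained.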


Scaling

• Go from 1 to 2 to 4 to 10,000 units
• Split apart work units
• Have to do it sooner rather than later
• More sharding, less efficient
• Not all units are going to be equal nor constant
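
As a small illustration of splitting work into units, the sketch below assigns keys to shards with a stable hash. This is an assumed approach for illustration, not something the talk prescribes; `shard_for` and the sample key are hypothetical.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to one of `shard_count` shards, stable across processes and restarts."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    # hashlib gives the same digest everywhere, unlike Python's salted built-in hash().
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# The same key lands somewhere different as the unit count grows from 1 to 2 to 4.
for count in (1, 2, 4):
    print(count, shard_for("customer-1234", count))
```

With plain modulo placement, doubling the shard count leaves each key either where it was or moves it to exactly one new shard, which keeps a 2-to-4 split manageable. Growing by other factors reshuffles most keys, which is one reason to plan the split early.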


Provisioning

• Everything needs to be automatic (or at least close)
• As you grow, this hurts more and more
• Provisioning means lab, dev, and production
• This becomes a critical system
  – Monitoring and backups should work with provisioning


Hardware Considerations

• Hardware-optimized software packages may change
• Security patches
  – Default images vs. custom images
• Physical access is granted to others, not to you
  – Physical access means all access
  – Encrypted data on disk
  – Fewer recovery options
• Do you really have access to your data?
  – See backups


Host Issues

• Host system security vulnerabilities
• Everything is local
  – VLANing becoming more available
• Underlying systems need maintenance
  – Live migrations


Monitoring

• System-related outages because units will fail
• Normal tools are based on physical limitations
• In cloud environments, it is not always clear where the failure is
• Test from the last mile
• Performance testing important too
• System testing and transactions
• May not pinpoint problems but it does send pages
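
A minimal sketch of why last-mile tests matter, under assumed names: it compares what the in-cloud monitor reports with probe results gathered from outside networks. The function `classify_outage`, its labels, and the vantage-point names are illustrative and not part of the talk.

```python
from typing import Mapping


def classify_outage(internal_ok: bool, last_mile: Mapping[str, bool]) -> str:
    """Compare the in-cloud view with probes run from outside vantage points."""
    if not last_mile:
        raise ValueError("need at least one last-mile probe result")
    failed = sorted(name for name, ok in last_mile.items() if not ok)
    if not failed:
        # Users can reach the service; an internal alarm alone is only a warning.
        return "ok" if internal_ok else "internal-warning"
    if len(failed) == len(last_mile):
        # Nobody outside can get in, whatever the in-cloud checks say.
        return "down-for-users"
    # Some vantage points fail: likely a network or regional problem.
    return "partial: " + ", ".join(failed)


# In-cloud checks pass, yet every outside probe fails.
print(classify_outage(True, {"nyc": False, "lon": False}))
```

This does not pinpoint the fault, but it separates "the cloud thinks it is fine" from "users can reach it", which is the page worth sending.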


Backups

• Incremental backups much more important
• Backup within the same cloud?
  – Probably not, but where?
• Data files, application files, configuration files
  – Version everything
  – Document how they all go together
• But you already do that so it’s ok
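
One possible shape for an incremental backup, sketched under assumptions: keep a manifest of content hashes from the previous run and copy only what changed. The names `fingerprint` and `plan_incremental`, and the in-memory file contents, are hypothetical stand-ins for real files and a real backup target.

```python
import hashlib
from typing import Dict, List, Tuple


def fingerprint(data: bytes) -> str:
    """Content hash used to tell whether a file changed since the last backup."""
    return hashlib.sha256(data).hexdigest()


def plan_incremental(
    previous: Dict[str, str], current_files: Dict[str, bytes]
) -> Tuple[List[str], List[str], Dict[str, str]]:
    """Return (paths to copy, paths deleted since last run, new manifest)."""
    manifest = {path: fingerprint(data) for path, data in current_files.items()}
    changed = sorted(p for p, digest in manifest.items() if previous.get(p) != digest)
    deleted = sorted(p for p in previous if p not in manifest)
    return changed, deleted, manifest


# First run copies everything; the second copies only the edited config file.
_, _, manifest = plan_incremental({}, {"app.conf": b"v1", "data.db": b"rows"})
changed, deleted, _ = plan_incremental(manifest, {"app.conf": b"v2", "data.db": b"rows"})
print(changed, deleted)
```

Storing each run's manifest alongside the copied files gives a simple version history, and the manifest itself should live outside the cloud being backed up.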


Migrations

• Be able to take your data (server image)
  – Server import and export
• Live migration, underlying software provides it
• This is all interoperability needs


Disaster Planning

• When things go really wrong:
  – Need to communicate using other means
    • Social networking like Twitter (are they affected as well?)
  – Have a plan B, diversity of cloud providers
  – Seek SLAs?


Some Things External

• DNS
  – Point domain.com to your plan B
• Backups and files
  – When you want to publish content at plan B
• Customer communications
  – Tell customers and users what’s going on
• Last-mile monitoring
  – Everything might look ok in the cloud
• Want options if there is an outage


Key Points

• Clouds are great for applications, even mission critical ones

• Best practices for server farms aren’t always best practices for clouds

• Need to rely on software to make hardware assumptions work right

• Constant trade-off between cost and availability: what’s your risk tolerance?


Questions

Jeremy Hitchcock
[email protected]
dyn-inc.com
