Getting more 9s from your Cloud operations

● Fail-proof architecture

● DevOps tools and utilities

● Monitoring - next level

● Backups and Disaster recovery

● Communication

● Best practices

● Group similar components

● Load distribution is important

● Network level isolation for each group or cluster

● Failover plan for every component

● Someone has to take care of failures

● Design for failures

● Unleash the chaos monkey

“Everything fails all the time” -- Werner Vogels (CTO, Amazon)

Source: http://dev.mysql.com/doc/refman/5.0/en/ha-overview.html
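The architecture points above (grouping similar components, distributing load, planning failover, and unleashing a chaos monkey) can be sketched as a small simulation. This is only an illustration under stated assumptions: `Node`, `Group`, and `chaos_monkey` are hypothetical names invented here, not part of the talk or of any real tool.

```python
import random


class Node:
    """One component in a group; hypothetical stand-in for a real server."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Group:
    """A group of similar nodes with round-robin load distribution and failover."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._next = 0

    def handle(self, request):
        # Try each node at most once, starting from the round-robin cursor.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next]
            self._next = (self._next + 1) % len(self.nodes)
            try:
                return node.handle(request)
            except ConnectionError:
                continue  # fail over to the next node in the group
        raise RuntimeError("all nodes in the group are down")


def chaos_monkey(group, rng):
    """Deliberately break one randomly chosen node and return it."""
    victim = rng.choice(group.nodes)
    victim.healthy = False
    return victim


if __name__ == "__main__":
    group = Group(Node(f"web-{i}") for i in range(3))
    victim = chaos_monkey(group, random.Random())
    print("chaos monkey killed", victim.name)
    for i in range(4):
        # Requests keep succeeding because the group fails over.
        print(group.handle(f"req-{i}"))
```

The point of the sketch is that the failure is injected on purpose, so the failover path is exercised routinely rather than for the first time during a real outage.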

● Every operation must be scripted and tested

● One click operations

● Verification tools are a must!

● Data collecting and reporting tools

● Tools to shorten the pipeline from Dev -> Prod

● Enforce standards

● Documentation has to be a part of tooling
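One way to read "every operation must be scripted" together with "verification tools are a must" is to pair each scripted action with its own check. The following is a minimal sketch, assuming a hypothetical `Operation` record and `run` helper; neither comes from the talk or from a real library.

```python
import logging
from dataclasses import dataclass
from typing import Callable

log = logging.getLogger("ops")


@dataclass
class Operation:
    """A scripted operation bundled with the check that proves it worked."""

    name: str
    action: Callable[[], None]
    verify: Callable[[], bool]


def run(op: Operation) -> bool:
    """Run the action, then verify it; raise if verification fails."""
    log.info("starting %s", op.name)
    op.action()
    if not op.verify():
        log.error("verification failed for %s", op.name)
        raise RuntimeError(f"verification failed for {op.name}")
    log.info("verified %s", op.name)
    return True


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    state = {}  # stands in for a real system being changed

    set_flag = Operation(
        name="enable-maintenance-mode",
        action=lambda: state.update(maintenance=True),
        verify=lambda: state.get("maintenance") is True,
    )
    run(set_flag)
```

Because the check travels with the action, a "one click" operation cannot report success without having been verified, and the log lines double as a record of what was run.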

● Are you happy with conventional tools?

● An alert such as "1m_load_avg > 5" is not enough

● Analytics is a part of monitoring

● Usage predictions and trend analysis

● Correlating incidents with logs is very useful

● Simulate user activities

● Be your own Xavier!
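As an illustration of why a fixed threshold is not enough, the sketch below does a simple trend analysis: it fits a straight line to recent usage samples and estimates how long until capacity is reached. The function name, the sample data, and the choice of a linear fit are assumptions made for this example, not something prescribed by the talk.

```python
def predict_hours_until_full(samples, capacity):
    """Estimate hours until `capacity` is reached from (hour, used) samples.

    Fits a least-squares line through the samples. Returns None when usage
    is flat or shrinking, since no exhaustion time can be predicted then.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    hour_full = (capacity - intercept) / slope
    # Report time remaining relative to the most recent sample.
    return max(0.0, hour_full - samples[-1][0])


if __name__ == "__main__":
    # Hypothetical disk usage in GB, sampled once per hour.
    usage = [(0, 50), (1, 52), (2, 54), (3, 56)]
    remaining = predict_hours_until_full(usage, capacity=100)
    print(f"disk predicted full in about {remaining:.0f} hours")
```

A prediction like this can raise an alert days before a static "disk > 90%" rule would fire, which is the kind of analytics the bullets above argue belongs in monitoring.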

● How frequently do you back up?

● Alerts for backups

● Verification is a MUST

● Practice DR plan frequently

● Align the DR plan with the deployment plan

● Documentation!

Source: http://blogger.srvnetwork.com/wp-content/uploads/2010/10/disaster_recovery_plan1.jpg

Source: http://www.accountanttown.com/site/wp-content/uploads/2010/08/sticky_note_backup_small.gif

Source: http://jenniferbrogee.files.wordpress.com/2011/03/backupyourcomputer1.jpg
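The "alerts for backups" and "verification is a MUST" points can be sketched as a check that reports a backup as bad when it is missing, stale, or different from its source. This is a minimal sketch under assumptions: `check_backup` and its parameters are hypothetical, and comparing checksums only suits backups that are plain file copies. A full restore test, as practised in a DR drill, is a stronger verification.

```python
import hashlib
import os
import shutil
import tempfile
import time


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_backup(source, backup, max_age_seconds, now=None):
    """Return a list of problems; an empty list means the backup looks good."""
    if not os.path.exists(backup):
        return ["backup file is missing"]
    problems = []
    now = time.time() if now is None else now
    age = now - os.path.getmtime(backup)
    if age > max_age_seconds:
        problems.append(f"backup is stale ({age:.0f}s old)")
    if sha256_of(source) != sha256_of(backup):
        problems.append("checksum mismatch between source and backup")
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "data.db")
        backup = os.path.join(tmp, "data.db.bak")
        with open(source, "wb") as f:
            f.write(b"important data")
        shutil.copyfile(source, backup)
        # An empty list here means no alert needs to be sent.
        print(check_backup(source, backup, max_age_seconds=24 * 3600))
```

Running a check like this on a schedule turns a silently failing backup job into an alert, instead of a surprise on the day the backup is needed.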

● Always sound human

● “Our web-monkeys can’t find the page you are looking for”

● Downtimes or failures can be turned into opportunities

● Be honest

● Users are always curious about what’s going on

● Separate communication channel

Source: http://www.transparentuptime.com/2010/06/video-of-my-talk-upside-of-downtime-at.html

● Run a staging setup in parallel

● Verification process after every operation

● Change log and maintenance log

● Use of configuration management

● Manage the complete application lifecycle (ALM)

● Knowledge sharing

● Culture

Source: http://www.cartoonstock.com/newscartoons/cartoonists/rmo/lowres/business-commerce-best_practice-business_venture-business_model-business_practice-bankrupt-rmon2464l.jpg

[email protected] | @gnuchami
