chamith-kumarage
DESCRIPTION
Managing a highly available and highly reliable cloud infrastructure has always been challenging. Apart from the technology, proper architecture, the right tool for the right task, discipline, intelligent monitoring, and effective communication are the key areas to focus on to obtain more 9’s from your cloud operations. This slide deck covers: mitigating risks with a fail-proof architecture; a Swiss Army knife for DevOps; next-generation monitoring; effective communication; and best practices and know-how.
● Fail-proof architecture
● DevOps tools and utilities
● Monitoring - next level
● Backups and Disaster recovery
● Communication
● Best practices
● Group similar components
● Load distribution is important
● Network level isolation for each group or cluster
● Failover plan for every component
● Someone has to take care of failures
● Design for failures
● Unleash the chaos monkey
“Everything fails all the time” -- Werner Vogels (CTO, Amazon)
Source: http://dev.mysql.com/doc/refman/5.0/en/ha-overview.html
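A failover plan for a component can be as small as an ordered list of candidates plus a health check. The sketch below is only an illustration of that idea: `pick_healthy`, the endpoint names, and the `is_healthy` callback are hypothetical and not taken from the slides.

```python
from typing import Callable, Optional, Sequence


def pick_healthy(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint that passes its health check, else None.

    `endpoints` is ordered by preference (primary first, then standbys).
    `is_healthy` is a caller-supplied check; both are assumptions made
    for this sketch.
    """
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A health check that raises is treated as an unhealthy endpoint.
            continue
    return None
```

Returning `None` rather than raising leaves the "someone has to take care of failures" decision to the caller, which can page an operator when no candidate is healthy.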
● Every operation must be scripted and tested
● One click operations
● Verification tools are a must!
● Data collecting and reporting tools
● Tools to shorten the pipeline from Dev -> Prod
● Enforce standards
● Documentation has to be a part of tooling
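As a rough illustration of a verification tool that runs after a scripted operation, the sketch below executes a set of named checks and reports the ones that failed. The function name `run_checks` and the check names in the usage note are hypothetical.

```python
from typing import Callable, Dict, List


def run_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every named check and return the names of the ones that failed.

    A check fails if it returns a falsy value or raises an exception.
    """
    failures = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failures.append(name)
    return failures
```

A one-click operation could end with something like `run_checks({"service_up": check_service, "config_valid": check_config})` (both callables assumed here) and refuse to report success unless the returned list is empty.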
● Are you happy with conventional tools?
● An alert on 1m_load_avg > 5 alone is not enough
● Analytics is a part of monitoring
● Usage predictions and trend analysis
● Correlating incidents with logs is very useful
● Simulate user activities
● Be your own Xavier!
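Usage prediction can start from something as simple as a straight-line fit over recent samples. The sketch below estimates how many days remain before a resource fills up, using an ordinary least-squares slope. It is a minimal illustration under the assumption of roughly linear growth; the function name and the sample data layout are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def days_until_full(
    samples: Sequence[Tuple[float, float]],
    capacity: float,
) -> Optional[float]:
    """Estimate days from the last sample until usage reaches `capacity`.

    `samples` holds (day, used) pairs in chronological order. Returns None
    when there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        return None
    # Least-squares slope: growth in usage per day.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / var_x
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    full_day = (capacity - intercept) / slope
    return max(0.0, full_day - samples[-1][0])
```

An alert on "fewer than N days until full" reacts to the trend rather than to a fixed threshold, which is the kind of signal a plain load-average rule cannot give.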
● How frequently do you back up?
● Alerts for backups
● Verification is a MUST
● Practice DR plan frequently
● Align the DR plan with the deployment plan
● Documentation!
Source: http://blogger.srvnetwork.com/wp-content/uploads/2010/10/disaster_recovery_plan1.jpg
Source: http://www.accountanttown.com/site/wp-content/uploads/2010/08/sticky_note_backup_small.gif
Source: http://jenniferbrogee.files.wordpress.com/2011/03/backupyourcomputer1.jpg
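Backup alerts and verification can be combined in one check: is the latest backup present, recent enough, and intact? The sketch below is one possible shape for such a check, assuming a checksum was recorded when the backup was taken. The function names, the problem labels, and the idea of comparing against a stored SHA-256 are assumptions for illustration, and a checksum match does not replace a real restore test.

```python
import hashlib
import os
import time
from typing import List, Optional


def sha256_of(path: str) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_problems(
    path: str,
    expected_sha256: str,
    max_age_seconds: float,
    now: Optional[float] = None,
) -> List[str]:
    """Return a list of problems found with a backup file (empty if none)."""
    if not os.path.isfile(path):
        return ["missing"]
    problems = []
    now = time.time() if now is None else now
    # Compare the file's modification time against the allowed age.
    if now - os.path.getmtime(path) > max_age_seconds:
        problems.append("stale")
    if sha256_of(path) != expected_sha256:
        problems.append("checksum mismatch")
    return problems
```

Any non-empty result is a reason to alert, so a backup job that silently stops running is caught by the "missing" or "stale" case.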
● Always sound human
● “Our web-monkeys can’t find the page you are looking for”
● Downtimes or failures can be turned into opportunities
● Be honest
● Users are always curious about what’s going on
● Separate communication channel
Source: http://www.transparentuptime.com/2010/06/video-of-my-talk-upside-of-downtime-at.html
● Run a staging setup in parallel
● Verification process after every operation
● Change log and maintenance log
● Use of configuration management
● Manage the complete application lifecycle (ALM)
● Knowledge sharing
● Culture
Source: http://www.cartoonstock.com/newscartoons/cartoonists/rmo/lowres/business-commerce-best_practice-business_venture-business_model-business_practice-bankrupt-rmon2464l.jpg
[email protected] | @gnuchami