Ramkumar Natarajan, Manager - NFT
Cognizant Technology Solutions
Resilience Validation
Abstract
With the growing complexity of application and infrastructure landscapes, the risk of failure is ever increasing. Building a robust,
highly available & fault tolerant software system is both a challenging task and an absolute necessity in today's world. Even a minute of
downtime can cost millions of dollars.
Domino effects
What if? analysis
Myths & Facts
Major Incidents
Key tenets
Feature Groups
Need for Resilience Validation
$500,000: outage of 5 minutes. Overall Internet traffic dropped by 40%.
$150 M: outage of 5 hours. About 1,000 flights cancelled.
$81 M: outage of 12 hours. About 2,100 flights cancelled.
$150 M: outage of 2.5 hours. Amazon S3 is used by around 148,213 websites.
Is our application resilient enough to handle unforeseen / unplanned events? Being prepared to handle disruptive events has
become an essential part of business success in the modern era.
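To make the outage figures above comparable, they can be normalized to a cost per minute. The sketch below uses only the cost and duration pairs quoted on this slide; the names `OUTAGES` and `cost_per_minute` are illustrative, and the result is a rough average, not a measured rate.

```python
# Outage figures quoted above: (estimated cost in USD, duration in minutes).
OUTAGES = [
    (500_000, 5),
    (150_000_000, 5 * 60),
    (81_000_000, 12 * 60),
    (150_000_000, 150),  # 2.5 hours
]


def cost_per_minute(cost_usd: float, duration_min: float) -> float:
    """Average cost of one minute of downtime for a single outage."""
    return cost_usd / duration_min


if __name__ == "__main__":
    for cost, minutes in OUTAGES:
        rate = cost_per_minute(cost, minutes)
        print(f"${cost:,} over {minutes} min is about ${rate:,.0f} per minute")
```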
Major Outages and Domino effects
Business Impacts
Service Level Agreement Penalties
Lost Opportunities & Revenue losses.
Impact on reputation & quality.
Impact to Employee Productivity.
Impact to customers and strategic partners.
Outage costs keep rising as the user base and system complexity grow.
Myths & Facts
MYTHS vs FACTS
1. Myth: A Disaster Recovery / passive system will protect against all kinds of failures without any impact.
   Fact: Impact analysis needs to be done to identify any challenges or issues during failover.
2. Myth: Active nodes in a cluster can handle failures of other nodes / data centers effectively.
   Fact: Failover testing needs to be performed to validate the system's ability to handle usage and the burden that additional use may place on the system.
3. Myth: Resiliency testing is a laborious process & requires the availability of multidisciplinary team members.
   Fact: Just like other testing, some of the steps, such as failure simulation, can be automated with little or no manual intervention.
4. Myth: Migrating to cloud makes applications resilient.
   Fact: Moving to cloud doesn't guarantee resiliency.
5. Myth: Off-the-shelf software products (like Oracle, Couchbase) are already tested and certified.
   Fact: Testing needs to be done in the client environment to find the limitations & right configurations.
Resilience Tenets
How should an application be?
Fault Tolerance: Handle hardware & component failures gracefully. Examples: rainy day scenarios that introduce CPU & memory spikes, node & process failures.
Load Tolerance: Manage load variations without any issues. Examples: Black Friday and product release peak load events.
Latency Tolerance: Manoeuvre network & downstream latencies efficiently. Examples: network & downstream system latencies.
Data Tolerance: Handle erroneous situations & data that is outside the expected range. Examples: errors, exceptions & bulk data from downstream systems.
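As one illustration of latency tolerance, a caller can bound how long it waits on a slow downstream system and serve a fallback value instead. This is a minimal sketch using only the Python standard library; `call_with_timeout`, `slow_downstream` and the "cached" fallback are hypothetical names, not part of any framework mentioned in this deck.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool for downstream calls (size chosen arbitrarily for the sketch).
_POOL = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn in a worker thread; return fallback if it exceeds timeout_s.

    Exceptions raised by fn itself are not handled here and propagate to the caller.
    The worker thread is not cancelled; it simply finishes in the background.
    """
    future = _POOL.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback


def slow_downstream():
    """Hypothetical downstream call that takes half a second."""
    time.sleep(0.5)
    return "fresh"


if __name__ == "__main__":
    print(call_with_timeout(slow_downstream, 0.05, "cached"))
    print(call_with_timeout(slow_downstream, 2.0, "cached"))
```

The timeout protects the caller's response time, but the slow call still occupies a worker, so a real system would also cap concurrency.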
Objectives of Resiliency Validation
Identify & fix Single Point of Failure.
Know the unknown – Engineering, Functional & Performance Flaws.
Enable resilient features & provide self-healing abilities.
Reduce MTTD (Mean Time to Detect) & MTTR (Mean Time To Recovery).
Improve high availability (fallback) & scalability by embracing next-gen HA technologies.
Reduce the effort required to restore system health.
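MTTD and MTTR can be tracked from incident timestamps. The sketch below assumes both are measured from the moment the failure started (some teams measure MTTR from detection instead); the `Incident` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime    # when the failure began
    detected: datetime   # when monitoring or a person noticed it
    recovered: datetime  # when service was restored


def _mean(deltas):
    """Arithmetic mean of a non-empty iterable of timedeltas."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean Time to Detect: average of (detected - started)."""
    return _mean(i.detected - i.started for i in incidents)


def mttr(incidents):
    """Mean Time To Recovery: average of (recovered - started)."""
    return _mean(i.recovered - i.started for i in incidents)


if __name__ == "__main__":
    t0 = datetime(2016, 1, 1, 9, 0)
    sample = [
        Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=30)),
        Incident(t0, t0 + timedelta(minutes=6), t0 + timedelta(minutes=50)),
    ]
    print("MTTD:", mttd(sample))
    print("MTTR:", mttr(sample))
```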
Resilience Feature Groups
High Availability, Scalability, Switchover, Dependency, DR, Recovery, Data Validation, CI
Application Resilience
Architecture & System Review
Incident Analysis
Near Neighbor Analysis
Data Analysis
FMEA Analysis
Vulnerability and prioritization matrix
Product Backlog
Resiliency Implementation
Resilience implementation & validation for different applications and streams was broadly categorized across three planes: Application, Platform & Operational Readiness.
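One common way to turn an FMEA into a prioritization matrix is the Risk Priority Number (RPN = severity x occurrence x detection, each typically scored 1 to 10). The deck does not say how its matrix was scored, so the following is only a sketch of that standard technique; the failure modes and scores are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (always detected) .. 10 (never detected)

    @property
    def rpn(self) -> int:
        """Risk Priority Number used to rank failure modes."""
        return self.severity * self.occurrence * self.detection


def prioritize(modes):
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


if __name__ == "__main__":
    # Hypothetical failure modes and scores, for illustration only.
    backlog = prioritize([
        FailureMode("database node failure", 9, 3, 4),
        FailureMode("downstream latency spike", 6, 7, 5),
        FailureMode("malformed payload", 4, 5, 2),
    ])
    for mode in backlog:
        print(f"{mode.rpn:4d}  {mode.name}")
```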
Validation - War gaming
War gaming is the process of simulating failures / production events and observing the system behavior.
War gaming helps augment operational efficiency and stability through discovery, practice and teamwork.
It aims to ensure that customers experience the least possible impact during unanticipated production events.
It validates the fault, load, latency & data tolerance levels of the application.
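A war-gaming round can be sketched as: inject a simulated failure into a call path, run traffic through it, and record how the system behaves. The example below is a toy illustration, not the Cognizant Chaos Framework; `with_chaos`, `with_retry` and `war_game` are hypothetical helper names.

```python
import random


class InjectedFailure(RuntimeError):
    """Raised by the chaos wrapper to simulate a component failure."""


def with_chaos(fn, failure_rate, rng):
    """Wrap fn so that each call fails with probability failure_rate."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated failure")
        return fn(*args, **kwargs)
    return wrapper


def with_retry(fn, retries):
    """Wrap fn so injected failures are retried up to `retries` extra times."""
    def wrapper(*args, **kwargs):
        for attempt in range(retries + 1):
            try:
                return fn(*args, **kwargs)
            except InjectedFailure:
                if attempt == retries:
                    raise
    return wrapper


def war_game(fn, rounds):
    """Call fn `rounds` times and count successes versus failures."""
    outcome = {"ok": 0, "failed": 0}
    for _ in range(rounds):
        try:
            fn()
            outcome["ok"] += 1
        except InjectedFailure:
            outcome["failed"] += 1
    return outcome


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so a run can be repeated
    flaky = with_chaos(lambda: "pong", failure_rate=0.3, rng=rng)
    print("without retry:", war_game(flaky, 100))
    print("with retry:   ", war_game(with_retry(flaky, 2), 100))
```

Comparing the two counts shows how much a resilience feature (here, a simple retry) reduces the visible impact of the injected failures.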
Key Industry Trends
Hystrix is a fault tolerance library developed by Netflix to stop cascading failures and enable resilience in distributed systems.
Key Features
Fault Throttling
Circuit Breaker
Fallback
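The circuit breaker and fallback ideas can be sketched in a few lines. This is not the Hystrix API (Hystrix is a Java library); it is a simplified Python illustration of the pattern, and the class name, parameters and thresholds are assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures, retry after a cool-down."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock          # injectable so tests can control time
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    @property
    def is_open(self):
        if self._opened_at is None:
            return False
        # After the cool-down the breaker is half-open and lets one trial call through.
        return self._clock() - self._opened_at < self.reset_timeout_s

    def call(self, fn, fallback):
        """Run fn unless the circuit is open; on failure or open circuit, use fallback."""
        if self.is_open:
            return fallback()        # short-circuit: do not touch the failing dependency
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=2, reset_timeout_s=5.0)

    def downstream():
        raise ConnectionError("downstream unavailable")

    for _ in range(3):
        print(breaker.call(downstream, lambda: "cached response"), breaker.is_open)
```

Once the breaker is open, callers get the fallback immediately instead of waiting on a failing dependency, which is what stops one failure from cascading.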
Cognizant Chaos Framework
Success Stories
YTD incidents: down 20% from 2015.
YTD incident duration (MTTR): down 42% from 2015.
Total outage by application platform reduced from 30% to 5%.
More than 10 outages due to downstream failures were avoided with Fault Throttling.
MTTR of downstream systems improved from 1-2 hours to 5-10 minutes.
[Figure: Defect counts by tolerance category (Availability, Load, Latency, Data), comparing Monkeys Executed with Defects Found; annotations: outage, self recovery]
Client: Largest cable company and home Internet service provider in the United States.
Services: Cable television, Internet, telephone service and home security to both residential and commercial customers.
Benefits
Ensures continuous response to end customers.
Resiliency testing avoids or reduces the domino effect.
Meets Quality of Service (QoS) beyond BCP and DR.
Reduces the cost of downtime.
Shorter downtime following a service disruption.
Improves customer satisfaction.
Takeaways
Expect the unexpected.
Key tenets of a resilient application.
Resiliency testing addresses not only downtime but also load, data and latency tolerance.
Testing and engineering approaches.
Failure simulations can be automated using the Chaos Framework.
Resilience testing has moved from good-to-have to must-have.
Author Biography
Ramkumar Natarajan has over 14 years of experience, specializing in performance testing and engineering. Ram is part of the Cognizant QE&A NFT CoE and leads its resiliency testing initiatives. He holds a Master's degree in Computer Applications from Madras University. Earlier, he worked with Wipro Technologies & Ramco Systems.
Questions & Answers
Thank You!!!